[htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0


Subject: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0
From: Peter L. Peres (plp@actcom.co.il)
Date: Fri May 05 2000 - 03:31:37 PDT


Hi,

here is everything. Please let me know if I broke something. IMHO the
Makefile of a production release should use -O (as high as you dare) and
no -g, and it should strip the binaries. The difference in runtime on small
machines is significant. I have modified the Makefile.in to remove -g
with gcc and g++, but it does not strip by itself (you have to do that by
hand, especially for htdig and htsearch).
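
For example, something like this should do it after the build (a sketch; the
binary paths assume you are at the top of the htdig-3.1.5 source tree and may
differ on your setup):

# strip the freshly built binaries by hand before installing them
strip htdig/htdig htsearch/htsearch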

Now, I'll run the new binaries overnight on the whole thing. I expect a
nice surprise in the morning ;-)

bye,

        Peter

diff -rcN tmp/htdig-3.1.5/README.plp.txt htdig-3.1.5/README.plp.txt
*** tmp/htdig-3.1.5/README.plp.txt Thu Jan 1 02:00:00 1970
--- htdig-3.1.5/README.plp.txt Fri May 5 13:03:50 2000
***************
*** 0 ****
--- 1,119 ----
+
+ About the patch to allow htdig to index soft-linked directories without
+ indexing the parent directories.
+
+ Applies to: htdig stable 3.1.5
+
+
+ 1. Description:
+
+ 1.1 The problem:
+
+ When, on an open system (e.g. Linux) used on an intranet (no direct
+ connection to the Internet), documentation is added to the HTML DocumentRoot
+ tree by placing symbolic links to it under the DocumentRoot, and htdig is
+ used to index this information, htdig (3.1.5) will enter an endless loop or
+ try to index the entire system.
+
+ It does this by reaping the URL of the 'parent directory' entry in
+ Apache-generated directory indexes (such as those of the directories that
+ are soft-linked under the DocumentRoot). The 'parent directory' chain of a
+ directory entered through a symbolic link leads back all the way to the
+ root '/'. If the patch is not applied, htdig will try to index the entire
+ system, and may loop if any cross-linking exists.
+
+ 1.2 The solution:
+
+ To avoid this, a mechanism is implemented in htdig, that prevents it from
+ reaping and indexing any URLs that are the direct parents of the currently
+ indexed document. For example:
+
+ If the document http://here/a/b/c is being indexed, then the following
+ URLs, if reaped from it, must not be added to the list of URLs to be
+ indexed:
+
+ http://
+ http://here
+ http://here/a
+ http://here/a/b
+
+ In particular the last one would appear as a 'previous directory' entry in
+ an Apache-generated directory index.
+
+
+ 2. Patch
+
+ 2.1 Patch description:
+
+ The patch modifies the htdig/Retriever class to add the required
+ functionality, and adds a new configuration option, that turns the new
+ feature ON or OFF.
+
+ The feature is turned OFF by default, and it needs to be turned ON by an
+ entry in the config file used with htdig using a line like:
+
+ prune_parent_dir_href: true
+
+ 2.2 Patch application:
+
+ Copy the patch file to the htdig-3.1.5 directory and then apply the patch
+ using the command:
+
+ patch -p1 <htdig-3.1.5+prune_parent_dir_href-0.0.patch
+
+ then recompile and reinstall the htdig (make; make install). Edit the config
+ file to turn on the new option, add a symbolic link to the DocumentRoot
+ (f.ex. cd /usr/local/httpd/htdocs/misc; ln -s /usr/doc .; on Suse systems),
+ and run htdig (rundig).
+
+ NOTE that if you upgrade Suse htdig to 3.1.5 then you have to edit the Suse
+ image, search and cgi-bin directories in CONFIG before compiling, as they
+ are not standard.
+
+ 2.3 Patch problems:
+
+ Note that the symbolic links under DocumentRoot have security implications.
+ While normal web sites have paranoid thoughts about security when serving
+ files from outside the DocumentRoot, open systems (Linux in particular),
+ should not have any. On a stock Linux installation, visitors can visit the
+ contents of a stock Linux installation, which is also available elsewhere
+ (presumably with larger bandwidth). FYI directory browse access for Apache
+ on Linux is disabled by having its global permissions reset (xxxxxx---).
+
+ The patch may prevent some sites that are not entered at the top from being
+ indexed properly. For example, if a site is started as:
+
+ http://somewhere/pub/someone/start/here.html
+
+ then anything not below http://somewhere/pub/someone/start will be omitted,
+ even if it is linked to from here.html
+
+ This is not a problem for most sites, which are entered at the top. If you
+ have funny sites, then you will need funny configurations. ;-)
+
+ 2.4 Patch function indication:
+
+ To see the patch working, run htdig with -v. The patch causes a bang
+ (ASCII '!') to be printed among the other progress characters, for each URL
+ that was pruned by the patch. I did not try to see what happens when more
+ than one -v is used. In theory it should print bangs then too, but I can't
+ tell with what text they will be mixed.
+
+ 3. Some statistics:
+
+ An i486/100MHz with 24MB RAM and EIDE disks (not UDMA) ran htdig -ilv with
+ the applied patch with niceness 10 in about 13 hours and htmerge -v in 2
+ hours. The doc db size reported was 310 MB with 36500 documents in it. The
+ machine was thoroughly usable during this time, for shell and compilation
+ use, as well as web server use (moderate). The kernel was 2.2.5 Suse Linux
+ (stock).
+
+ This means that a 'legacy' machine can be employed as intranet document
+ server and run htdig about twice a week from cron, without any problems.
+
+ 4. Who did this
+
+ Me, Peter Lorand Peres, plp@actcom.co.il, when I tried to index the
+ documentation (not only html) on my Suse 6.2 system in April/May 2000 and
+ failed, due to the looping problem described above.
+
diff -rcN tmp/htdig-3.1.5/configure htdig-3.1.5/configure
*** tmp/htdig-3.1.5/configure Fri Feb 25 04:28:58 2000
--- htdig-3.1.5/configure Fri May 5 12:26:31 2000
***************
*** 1061,1067 ****
    CFLAGS="$ac_save_CFLAGS"
  elif test $ac_cv_prog_cc_g = yes; then
    if test "$GCC" = yes; then
! CFLAGS="-g -O2"
    else
      CFLAGS="-g"
    fi
--- 1061,1069 ----
    CFLAGS="$ac_save_CFLAGS"
  elif test $ac_cv_prog_cc_g = yes; then
    if test "$GCC" = yes; then
! # plp: production code is optimized, yes ?
! #CFLAGS="-g -O2"
! CFLAGS="-O2"
    else
      CFLAGS="-g"
    fi
***************
*** 1204,1210 ****
    CXXFLAGS="$ac_save_CXXFLAGS"
  elif test $ac_cv_prog_cxx_g = yes; then
    if test "$GXX" = yes; then
! CXXFLAGS="-g -O2"
    else
      CXXFLAGS="-g"
    fi
--- 1206,1214 ----
    CXXFLAGS="$ac_save_CXXFLAGS"
  elif test $ac_cv_prog_cxx_g = yes; then
    if test "$GXX" = yes; then
! # plp: optimize for speed
! #CXXFLAGS="-g -O2"
! CXXFLAGS="-O2"
    else
      CXXFLAGS="-g"
    fi
diff -rcN tmp/htdig-3.1.5/htcommon/defaults.cc htdig-3.1.5/htcommon/defaults.cc
*** tmp/htdig-3.1.5/htcommon/defaults.cc Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htcommon/defaults.cc Thu May 4 23:14:48 2000
***************
*** 24,29 ****
--- 24,33 ----
      {"pdf_parser", PDF_PARSER " -toPostScript"},
      {"version", VERSION},
  
+
+ // plp
+ {"prune_parent_dir_href", "false"},
+
      //
      // General defaults
      //
diff -rcN tmp/htdig-3.1.5/htdig/HTML.cc htdig-3.1.5/htdig/HTML.cc
*** tmp/htdig-3.1.5/htdig/HTML.cc Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htdig/HTML.cc Mon May 4 01:11:01 1998
***************
*** 394,400 ****
                    head << word;
              }
  
! if (word.length() >= minimumWordLength && doindex)
              {
                retriever.got_word(word,
                                   int(offset * 1000 / totlength),
--- 394,400 ----
                    head << word;
              }
  
! if ((word.length() >= (unsigned)minimumWordLength) && doindex)
              {
                retriever.got_word(word,
                                   int(offset * 1000 / totlength),
diff -rcN tmp/htdig-3.1.5/htdig/HTML.h htdig-3.1.5/htdig/HTML.h
*** tmp/htdig-3.1.5/htdig/HTML.h Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htdig/HTML.h Mon May 4 01:22:34 1998
***************
*** 37,43 ****
  class Retriever;
  class URL;
  
-
  class HTML : public Parsable
  {
  public:
--- 37,42 ----
***************
*** 76,81 ****
--- 75,81 ----
      //
      void do_tag(Retriever &, String &);
      char *transSGML(char *);
+
  };
  
  #endif
diff -rcN tmp/htdig-3.1.5/htdig/Retriever.cc htdig-3.1.5/htdig/Retriever.cc
*** tmp/htdig-3.1.5/htdig/Retriever.cc Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htdig/Retriever.cc Fri May 5 01:58:58 2000
***************
*** 20,29 ****
  #include <stdio.h>
  #include "HtWordType.h"
  
  static WordList words;
  static int noSignal;
  
-
  //*****************************************************************************
  // Retriever::Retriever()
  //
--- 20,31 ----
  #include <stdio.h>
  #include "HtWordType.h"
  
+ // plp
+ #include <string.h>
+
  static WordList words;
  static int noSignal;
  
  //*****************************************************************************
  // Retriever::Retriever()
  //
***************
*** 34,39 ****
--- 36,44 ----
      currenthopcount = 0;
      max_hop_count = config.Value("max_hop_count", 999999);
                  
+ // plp
+ gus.hop_count = 0;
+
      //
      // Initialize the weight factors for words in the different
      // HTML headers
***************
*** 276,295 ****
              // There may be no more documents, or the server
              // has passed the server_max_docs limit
  
! //
! // We have a URL to index, now. We need to register the
! // fact that we are not done yet by setting the 'more'
! // variable.
! //
! more = 1;
!
! //
! // Deal with the actual URL.
! // We'll check with the server to see if we need to sleep()
! // before parsing it.
! //
! server->delay(); // This will pause if needed and reset the time
! parse_url(*ref);
              delete ref;
          }
      }
--- 281,306 ----
              // There may be no more documents, or the server
              // has passed the server_max_docs limit
  
! // plp: store and preprocess new url for parent dir stripping
! if (config.Boolean("prune_parent_dir_href", 0))
! store_url(ref->URL());
! else
! gus.hop_count = 0; // avoid chk config w every href
!
! //
! // We have a URL to index, now. We need to register the
! // fact that we are not done yet by setting the 'more'
! // variable.
! //
! more = 1;
!
! //
! // Deal with the actual URL.
! // We'll check with the server to see if we need to sleep()
! // before parsing it.
! //
! server->delay(); // This will pause if needed and reset the time
! parse_url(*ref);
              delete ref;
          }
      }
***************
*** 1147,1152 ****
--- 1158,1164 ----
      if (urls_seen)
          fprintf(urls_seen, "%s\n", url.get());
  
+
      //
      // Check if this URL falls within the valid range of URLs.
      //
***************
*** 1164,1169 ****
--- 1176,1189 ----
  
          url.normalize();
  
+ // plp: check whether it is a substring of the base URL
+ if((gus.hop_count > 0) && (url_is_parent_dir(url.get()) != 0)) {
+ // cout << "got_href: pruning (is substr of base url) " << url.get() << "\n"; // debug
+ if(debug > 0)
+ cout << "!"; // bang ! in the progress indicator characters
+ return;
+ }
+
          // If it is a backlink from the current document,
          // just update that field. Writing to the database
          // is meaningless, as it will be overwritten.
***************
*** 1521,1523 ****
--- 1541,1611 ----
      }
  }
  
+ // plp
+ // private function used to chop and store the url for substring comparison
+ void
+ Retriever::chop_url(ChoppedUrlStore &cus,char *c_url)
+ {
+ int l;
+
+ cus.url_store[0] = '\0';
+ cus.hop_count = 0;
+ l = strlen(c_url);
+ if((l == 0) || (l >= MAX_CAN_URL_LEN)) { // >= leaves room for the trailing '\0'
+ if(debug > 0)
+ cout << "chop_url: failed on bad length\n";
+ return;
+ }
+ strcpy(cus.url_store,c_url);
+ l = 0;
+ if((cus.url_store_chopped[l++] = strtok(cus.url_store,"/")) == NULL) {
+ cus.url_store[0] = '\0';
+ if(debug > 0)
+ cout << "chop_url: failed on NULL with " << c_url << "\n";
+ return;
+ }
+ while((cus.url_store_chopped[l++] = strtok(NULL,"/")) != NULL) {
+ if(l >= MAX_CAN_URL_HOPS) { // >= : one more token would overflow url_store_chopped
+ cus.url_store[0] = '\0';
+ return; // fail silently with a valid url, print a bang somewhere else
+ }
+ }
+ cus.hop_count = l - 1;
+ return; // success
+ }
+
+ // call this function to store the base URL of a document being indexed,
+ // when starting to index it (in HTML::parse or ExternalParser::parse)
+ void
+ Retriever::store_url(char *c_url)
+ {
+ chop_url(gus,c_url);
+ return;
+ }
+
+ // call this function to decide if a reaped URL is a direct parent of
+ // the URL being indexed. call in Retriever::got_href()
+ int
+ Retriever::url_is_parent_dir(char *c_url)
+ {
+ int j,k;
+ ChoppedUrlStore cus;
+
+ if(gus.hop_count == 0)
+ return 0;
+
+ chop_url(cus,c_url);
+ if(cus.hop_count == 0)
+ return 0;
+
+ // seek a matching last part, backwards
+ j = gus.hop_count - 1;
+ k = cus.hop_count - 1;
+ while(strcmp(gus.url_store_chopped[j],cus.url_store_chopped[k]) != 0)
+ if(--j < 0)
+ return 0; // not
+ while((--j >= 0)&&(--k >= 0))
+ if(strcmp(gus.url_store_chopped[j],cus.url_store_chopped[k]) != 0)
+ return 0; // not
+ return 1; // yes
+ }
diff -rcN tmp/htdig-3.1.5/htdig/Retriever.h htdig-3.1.5/htdig/Retriever.h
*** tmp/htdig-3.1.5/htdig/Retriever.h Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htdig/Retriever.h Thu May 4 22:32:05 2000
***************
*** 24,29 ****
--- 24,35 ----
      Retriever_Restart
  };
  
+ // plp 000503 - for prune_parent_href feature
+ // max length of URL, in chars, fail silently if exceeded
+ #define MAX_CAN_URL_LEN 256
+ // max no. of slashes in same + 1, fail silently if exceeded
+ #define MAX_CAN_URL_HOPS 32
+
  class Retriever
  {
  public:
***************
*** 64,79 ****
      // Allow for the indexing of protected sites by using a
      // username/password
      //
! void setUsernamePassword(char *credentials);
  
      //
      // Routines for dealing with local filesystem access
      //
      StringList * GetLocal(char *url);
      StringList * GetLocalUser(char *url, StringList *defaultdocs);
! int IsLocalURL(char *url);
!
  private:
      //
      // A hash to keep track of what we've seen
      //
--- 70,102 ----
      // Allow for the indexing of protected sites by using a
      // username/password
      //
! void setUsernamePassword(char *credentials);
  
      //
      // Routines for dealing with local filesystem access
      //
      StringList * GetLocal(char *url);
      StringList * GetLocalUser(char *url, StringList *defaultdocs);
! int IsLocalURL(char *url);
!
! // plp 000503 - for prune_parent_href feature
! void store_url(char *c_url);
! int url_is_parent_dir(char *c_url);
!
  private:
+
+ // plp 000503 - for prune_parent_href feature
+ typedef struct {
+ char url_store[MAX_CAN_URL_LEN];
+ char *url_store_chopped[MAX_CAN_URL_HOPS];
+ int hop_count; // the last valid index in url_store_chopped + 1 or zero
+ } ChoppedUrlStore;
+
+ ChoppedUrlStore gus; // Global chopped Url Store
+
+ void chop_url(ChoppedUrlStore &cus,char *c_url);
+ // /plp
+
      //
      // A hash to keep track of what we've seen
      //


About the patch to allow htdig to index soft-linked directories without
indexing the parent directories.

Applies to: htdig stable 3.1.5


1. Description:
  
1.1 The problem:

When, on an open system (e.g. Linux) used on an intranet (no direct connection
to the Internet), documentation is added to the HTML DocumentRoot tree by
placing symbolic links to it under the DocumentRoot, and htdig is used to
index this information, htdig (3.1.5) will enter an endless loop or try to
index the entire system.

It does this by reaping the URL of the 'parent directory' entry in
Apache-generated directory indexes (such as those of the directories that
are soft-linked under the DocumentRoot). The 'parent directory' chain of a
directory entered through a symbolic link leads back all the way to the root
'/'. If the patch is not applied, htdig will try to index the entire system,
and may loop if any cross-linking exists.

1.2 The solution:

To avoid this, a mechanism is implemented in htdig that prevents it from
reaping and indexing any URLs that are direct parents of the currently
indexed document. For example:

If the document http://here/a/b/c is being indexed, then the following URLs,
if reaped from it, must not be added to the list of URLs to be indexed:

http://
http://here
http://here/a
http://here/a/b

In particular, the last one would appear as a 'parent directory' entry in
an Apache-generated directory index.


2. Patch

2.1 Patch description:

The patch modifies the htdig/Retriever class to add the required
functionality, and adds a new configuration option that turns the new
feature ON or OFF.

The feature is turned OFF by default; to turn it ON, add a line like the
following to the config file used with htdig:

prune_parent_dir_href: true

2.2 Patch application:

Copy the patch file to the htdig-3.1.5 directory and then apply the patch
using the command:

patch -p1 <htdig-3.1.5+prune_parent_dir_href-0.0.patch

Then recompile and reinstall htdig (make; make install). Edit the config
file to turn on the new option, add a symbolic link under the DocumentRoot
(e.g. cd /usr/local/httpd/htdocs/misc; ln -s /usr/doc .; on SuSE systems),
and run htdig (rundig).
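
Put together, the whole procedure looks roughly like this (a sketch only; the
config file location and the DocumentRoot path are assumptions that differ
between installations):

cd htdig-3.1.5
patch -p1 <htdig-3.1.5+prune_parent_dir_href-0.0.patch
./configure                  # re-run configure with your usual options
make
make install
# turn the new feature on; the config file location depends on the installation
echo 'prune_parent_dir_href: true' >>/etc/htdig/htdig.conf
# expose some documentation through a symbolic link under the DocumentRoot
cd /usr/local/httpd/htdocs/misc && ln -s /usr/doc .
# rebuild the index
rundig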

NOTE that if you upgrade the SuSE htdig to 3.1.5, you have to edit the SuSE
image, search and cgi-bin directories in CONFIG before compiling, as they
are not the standard ones.

2.3 Patch problems:

Note that symbolic links under the DocumentRoot have security implications.
While normal web sites are rightly paranoid about serving files from outside
the DocumentRoot, open systems (Linux in particular) need not be: on a stock
Linux installation, visitors can only browse the contents of a stock Linux
installation, which is also available elsewhere (presumably with more
bandwidth). FYI, directory browse access for Apache on Linux is disabled by
resetting a directory's world permission bits (rwxrwx---).

The patch may prevent some sites that are not entered at the top from being
indexed properly. For example, if indexing of a site is started at:

http://somewhere/pub/someone/start/here.html

then anything not below http://somewhere/pub/someone/start will be omitted,
even if it is linked to from here.html.

This is not a problem for most sites, which are entered at the top. If you
have funny sites, then you will need funny configurations. ;-)

2.4 Patch function indication:

To see the patch working, run htdig with -v. The patch causes a bang
(ASCII '!') to be printed among the other progress characters for each URL
that was pruned by the patch. I did not try to see what happens when more
than one -v is used; in theory it should print bangs then too, but I can't
tell what text they will be mixed with.
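
A rough way to watch this and count the pruned URLs (a sketch; the config file
path is an assumption, and '!' may also occur in other debug output):

htdig -i -v -c /etc/htdig/htdig.conf 2>&1 | tee dig.log
tr -cd '!' <dig.log | wc -c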

3. Some statistics:

An i486/100MHz with 24MB RAM and EIDE disks (not UDMA) ran htdig -ilv with
the patch applied, at niceness 10, in about 13 hours, and htmerge -v in 2
hours. The doc db size reported was 310 MB, with 36500 documents in it. The
machine was thoroughly usable during this time, for shell and compilation
use as well as (moderate) web server use. The kernel was a stock SuSE Linux
2.2.5.

This means that a 'legacy' machine can be employed as an intranet document
server and run htdig about twice a week from cron, without any problems.
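
For example, with a crontab entry along these lines (a sketch; the rundig path
and the exact schedule are assumptions):

# re-index Wednesday and Saturday nights at low priority (niceness 10)
30 2 * * 3,6  nice -n 10 /usr/local/bin/rundig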

4. Who did this

Me, Peter Lorand Peres, plp@actcom.co.il. I did this when I tried to index
the documentation (not only HTML) on my SuSE 6.2 system in April/May 2000
and failed, due to the looping problem described above.
