Subject: [htdig] prune_parent_dir_href-0.1.patch
From: Peter L. Peres (plp@actcom.co.il)
Date: Wed May 10 2000 - 08:49:33 PDT


Hi,

here is the patch. It fixes 3 serious bugs, cuts down the number of blank
lines and other noise in the diff, and leaves the config file alone now.

Peter

diff -rcN tmp/htdig-3.1.5/README.plp.txt htdig-3.1.5/README.plp.txt
*** tmp/htdig-3.1.5/README.plp.txt Thu Jan 1 02:00:00 1970
--- htdig-3.1.5/README.plp.txt Sun May 10 20:37:42 1998
***************
*** 0 ****
--- 1,121 ----
+
+ About the patch to allow htdig to index soft-linked directories without
+ indexing the parent directories.
+
+ Applies to: htdig stable 3.1.5
+ Related files: the.patch htdig-3.1.5+prune_parent_dir_href-0.1.patch
+ Related URLs: http://www.htdig.org, the htdig mailing list archive
+
+ 1. Description:
+
+ 1.1 The problem:
+
+ When htdig indexes a page under a DocumentRoot, say http://a/b/c, and the
+ page contains an href pointing to http://a/d/e, where the latter page is
+ an Apache-generated directory index, htdig will also try to index
+ http://a/d and all of its descendants if that, too, is an Apache index.
+
+ 1.2 The solution:
+
+ To avoid this, a mechanism is implemented in htdig that prevents it from
+ reaping and indexing any URL that is a direct parent of the currently
+ indexed document. For example:
+
+ If the document http://here/a/b/c is being indexed, then the following
+ URLs, if reaped from it, must not be added to the list of URLs to be
+ indexed:
+
+ http://
+ http://here
+ http://here/a
+ http://here/a/b
+
+ In particular, the last one would appear as the 'Parent Directory' entry
+ in an Apache-generated directory index for http://here/a/b/c.
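+
+ A minimal standalone sketch of the idea follows (the names and types here
+ are illustrative only; the actual patch below works on the Retriever's
+ fixed-size ChoppedUrlStore with strtok() instead):
+
+     #include <iostream>
+     #include <string>
+     #include <vector>
+
+     // Split a URL into its '/'-separated components.
+     static std::vector<std::string> chop(const std::string &url)
+     {
+         std::vector<std::string> parts;
+         std::string::size_type pos = 0;
+         while (pos < url.size()) {
+             std::string::size_type slash = url.find('/', pos);
+             if (slash == std::string::npos)
+                 slash = url.size();
+             if (slash > pos)
+                 parts.push_back(url.substr(pos, slash - pos));
+             pos = slash + 1;
+         }
+         return parts;
+     }
+
+     // A reaped URL is pruned when all of its components match the leading
+     // components of the base URL, i.e. it is a parent of the base URL.
+     static bool is_parent_dir(const std::string &base, const std::string &reaped)
+     {
+         std::vector<std::string> b = chop(base), r = chop(reaped);
+         if (r.empty() || r.size() >= b.size())
+             return false;
+         for (std::vector<std::string>::size_type i = 0; i < r.size(); i++)
+             if (b[i] != r[i])
+                 return false;
+         return true;
+     }
+
+     int main()
+     {
+         // With base http://here/a/b/c: the parent is pruned (1),
+         // a URL in a different branch is kept (0).
+         std::cout << is_parent_dir("http://here/a/b/c", "http://here/a/b") << "\n";
+         std::cout << is_parent_dir("http://here/a/b/c", "http://here/a/other") << "\n";
+         return 0;
+     }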
+
+ 2. Patch
+
+ 2.1 Patch description:
+
+ The patch modifies the htdig/Retriever class to add the required
+ functionality, and adds a new configuration option that turns the new
+ feature ON or OFF.
+
+ The feature is OFF by default; to turn it ON, add a line like the
+ following to the config file used with htdig:
+
+ prune_parent_dir_href: true
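+
+ Internally, the Retriever checks the option once per document and only
+ then stores the document's base URL for the pruning check; roughly (see
+ the Retriever.cc hunk below):
+
+     if (config.Boolean("prune_parent_dir_href", 0))
+         store_url(ref->URL());   // remember the base URL of this document
+     else
+         gus.hop_count = 0;       // feature off: nothing stored, nothing pruned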
+
+ 2.2 Patch application:
+
+ Copy the patch file into the htdig-3.1.5 source directory and then apply
+ it using the command:
+
+ patch -p1 <htdig-3.1.5+prune_parent_dir_href-0.1.patch
+
+ Then recompile and reinstall htdig (make; make install). Edit the config
+ file to turn on the new option, add a symbolic link under the DocumentRoot
+ (e.g. cd /usr/local/httpd/htdocs/misc; ln -s /usr/doc . on Suse systems),
+ and run htdig (rundig).
+
+ The patch is distributed with the filename 'the.patch-1' or the long name, and
+ it is available in the htdig mailing list archives at http://www.htdig.org.
+ It should be renamed to htdig-3.1.5+prune_parent_dir_href-0.1.patch for
+ archival purposes.
+
+ NOTE that if you upgrade Suse htdig to 3.1.5 then you have to edit the Suse
+ image, search and cgi-bin directories in CONFIG before compiling, as they
+ are not standard.
+
+ 2.3 Patch problems:
+
+ Note that symbolic links under the DocumentRoot have security implications.
+ While normal web sites are rightly paranoid about serving files from
+ outside the DocumentRoot, open systems (Linux in particular) need not be:
+ on a stock Linux installation, visitors can only browse the contents of a
+ stock installation, which is available elsewhere anyway (presumably with
+ more bandwidth). FYI, directory browse access for Apache on Linux is
+ disabled by clearing a directory's world permission bits (xxxxxx---).
+
+ The patch may prevent some sites that are not entered at the top from being
+ indexed properly. For example, if a site is started as:
+
+ http://somewhere/pub/someone/start/here.html
+
+ then anything not below http://somewhere/pub/someone/start will be
+ omitted, even if it is linked to from here.html.
+
+ This is not a problem for most sites, which are entered at the top. If you
+ have funny sites, then you will need funny configurations. ;-)
+
+ 2.4 Patch function indication:
+
+ To see the patch working, run htdig with -v. The patch prints a bang
+ (ASCII '!') among the other progress characters for each URL that it
+ prunes. I did not check what happens when more than one -v is used; in
+ theory it should still print bangs, but I cannot tell what other output
+ they will be mixed with.
+
+ 3. Some statistics:
+
+ An i486/100MHz with 24MB RAM and EIDE disks (not UDMA) ran the patched
+ htdig -ilv at niceness 10 in about 13 hours, and htmerge -v in 2 hours.
+ The reported doc db size was 310 MB with 36500 documents in it. The
+ machine remained thoroughly usable during this time, for shell and
+ compilation use as well as (moderate) web server use. The kernel was a
+ stock Suse Linux 2.2.5.
+
+ This means that a 'legacy' machine can be employed as an intranet
+ document server and run htdig about twice a week from cron without any
+ problems.
+
+ New with 0.1: an optimized build of htdig runs 30 to 40% faster on the
+ same machine. It pays to optimize this program! Select all the available
+ speedup options in the Make config for best results, and add your own!
+
+ 4. Who did this
+
+ Me, Peter Lorand Peres, plp@actcom.co.il. I wrote it when I tried to
+ index the documentation (not only HTML) on my Suse 6.2 system in
+ April/May 2000 and failed, due to the looping problem described above.
+
diff -rcN tmp/htdig-3.1.5/htcommon/defaults.cc htdig-3.1.5/htcommon/defaults.cc
*** tmp/htdig-3.1.5/htcommon/defaults.cc Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htcommon/defaults.cc Thu May 4 23:14:48 2000
***************
*** 24,29 ****
--- 24,33 ----
      {"pdf_parser", PDF_PARSER " -toPostScript"},
      {"version", VERSION},
  
+
+     // plp
+     {"prune_parent_dir_href", "false"},
+
      //
      // General defaults
      //
diff -rcN tmp/htdig-3.1.5/htdig/Retriever.cc htdig-3.1.5/htdig/Retriever.cc
*** tmp/htdig-3.1.5/htdig/Retriever.cc Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htdig/Retriever.cc Sun May 10 20:45:53 1998
***************
*** 20,29 ****
  #include <stdio.h>
  #include "HtWordType.h"
  
  static WordList words;
  static int noSignal;
  
-
  //*****************************************************************************
  // Retriever::Retriever()
  //
--- 20,31 ----
  #include <stdio.h>
  #include "HtWordType.h"
  
+ // plp
+ #include <string.h>
+
  static WordList words;
  static int noSignal;
  
  //*****************************************************************************
  // Retriever::Retriever()
  //
***************
*** 34,39 ****
--- 36,44 ----
      currenthopcount = 0;
      max_hop_count = config.Value("max_hop_count", 999999);
                  
+     // plp
+     gus.hop_count = 0;
+
      //
      // Initialize the weight factors for words in the different
      // HTML headers
***************
*** 276,295 ****
              // There may be no more documents, or the server
              // has passed the server_max_docs limit
  
!            //
!            // We have a URL to index, now. We need to register the
!            // fact that we are not done yet by setting the 'more'
!            // variable.
!            //
!            more = 1;
!
!            //
!            // Deal with the actual URL.
!            // We'll check with the server to see if we need to sleep()
!            // before parsing it.
!            //
!            server->delay(); // This will pause if needed and reset the time
!            parse_url(*ref);
              delete ref;
          }
      }
--- 281,306 ----
              // There may be no more documents, or the server
              // has passed the server_max_docs limit
  
!            // plp: store and preprocess new url for parent dir stripping
!            if (config.Boolean("prune_parent_dir_href", 0))
!                store_url(ref->URL());
!            else
!                gus.hop_count = 0; // avoid checking config with every href
!
!            //
!            // We have a URL to index, now. We need to register the
!            // fact that we are not done yet by setting the 'more'
!            // variable.
!            //
!            more = 1;
!
!            //
!            // Deal with the actual URL.
!            // We'll check with the server to see if we need to sleep()
!            // before parsing it.
!            //
!            server->delay(); // This will pause if needed and reset the time
!            parse_url(*ref);
              delete ref;
          }
      }
***************
*** 1164,1169 ****
--- 1175,1188 ----
  
          url.normalize();
  
+         // plp: check whether it is a substring of the base URL
+         if ((gus.hop_count > 0) && (url_is_parent_dir(url.get()) != 0)) {
+             // cout << "got_href: pruning (is substr of base url) " << url.get() << "\n"; // debug
+             if (debug > 0)
+                 cout << "!"; // bang ! in the progress indicator characters
+             return;
+         }
+
          // If it is a backlink from the current document,
          // just update that field. Writing to the database
          // is meaningless, as it will be overwritten.
***************
*** 1521,1523 ****
--- 1540,1607 ----
      }
  }
  
+ // plp
+ // private function used to chop and store the url for substring comparison
+ void
+ Retriever::chop_url(ChoppedUrlStore &cus, char *c_url)
+ {
+     int l;
+
+     cus.url_store[0] = '\0';
+     cus.hop_count = 0;
+     l = strlen(c_url);
+     if ((l == 0) || (l >= MAX_CAN_URL_LEN)) {
+         if (debug > 0)
+             cout << "chop_url: failed on URL length check\n";
+         return;
+     }
+     strcpy(cus.url_store, c_url);
+     l = 0;
+     if ((cus.url_store_chopped[l++] = strtok(cus.url_store, "/")) == NULL) {
+         cus.url_store[0] = '\0';
+         if (debug > 0)
+             cout << "chop_url: failed on NULL with " << c_url << "\n";
+         return;
+     }
+     while ((cus.url_store_chopped[l++] = strtok(NULL, "/")) != NULL) {
+         if (l >= MAX_CAN_URL_HOPS) {
+             cus.url_store[0] = '\0';
+             return; // fail silently with a valid url, print a bang somewhere else
+         }
+     }
+     cus.hop_count = l - 1;
+     return; // success
+ }
+
+ // call this function to store the base URL of a document being indexed,
+ // when starting to index it (in HTML::parse or ExternalParser::parse)
+ void
+ Retriever::store_url(char *c_url)
+ {
+     chop_url(gus, c_url);
+     return;
+ }
+
+ // call this function to decide if a reaped URL is a direct parent of
+ // the URL being indexed. call in Retriever::got_href()
+ int
+ Retriever::url_is_parent_dir(char *c_url)
+ {
+     int j, k;
+     ChoppedUrlStore cus;
+
+     if (gus.hop_count == 0)
+         return 0;
+     chop_url(cus, c_url);
+     if (cus.hop_count == 0)
+         return 0;
+     // seek a matching first part (cus == leading substring of gus)
+     j = k = 0;
+     while (strcmp(gus.url_store_chopped[j++], cus.url_store_chopped[k++]) == 0) {
+         if (k == cus.hop_count)
+             return 1; // substring !
+         if (j == gus.hop_count)
+             break; // not
+     }
+     return 0; // not
+ }
diff -rcN tmp/htdig-3.1.5/htdig/Retriever.h htdig-3.1.5/htdig/Retriever.h
*** tmp/htdig-3.1.5/htdig/Retriever.h Fri Feb 25 04:29:10 2000
--- htdig-3.1.5/htdig/Retriever.h Sun May 10 20:06:31 1998
***************
*** 24,29 ****
--- 24,35 ----
      Retriever_Restart
  };
  
+ // plp 000503 - for prune_parent_href feature
+ // max length of URL, in chars, fail silently if exceeded
+ #define MAX_CAN_URL_LEN 256
+ // max no. of slashes in same + 1, fail silently if exceeded
+ #define MAX_CAN_URL_HOPS 32
+
  class Retriever
  {
  public:
***************
*** 64,79 ****
      // Allow for the indexing of protected sites by using a
      // username/password
      //
! void setUsernamePassword(char *credentials);
  
      //
      // Routines for dealing with local filesystem access
      //
      StringList * GetLocal(char *url);
      StringList * GetLocalUser(char *url, StringList *defaultdocs);
! int IsLocalURL(char *url);
!
  private:
      //
      // A hash to keep track of what we've seen
      //
--- 70,102 ----
      // Allow for the indexing of protected sites by using a
      // username/password
      //
! void setUsernamePassword(char *credentials);
  
      //
      // Routines for dealing with local filesystem access
      //
      StringList * GetLocal(char *url);
      StringList * GetLocalUser(char *url, StringList *defaultdocs);
!     int IsLocalURL(char *url);
!
!     // plp 000503 - for prune_parent_href feature
!     void store_url(char *c_url);
!     int url_is_parent_dir(char *c_url);
!
  private:
+
+     // plp 000503 - for prune_parent_href feature
+     typedef struct {
+         char url_store[MAX_CAN_URL_LEN];
+         char *url_store_chopped[MAX_CAN_URL_HOPS];
+         int hop_count; // the last valid index in url_store_chopped + 1 or zero
+     } ChoppedUrlStore;
+
+     ChoppedUrlStore gus; // Global chopped Url Store
+
+     void chop_url(ChoppedUrlStore &cus, char *c_url);
+     // /plp
+
      //
      // A hash to keep track of what we've seen
      //
