Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0


Subject: Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue May 09 2000 - 15:14:09 PDT


According to Peter L. Peres:
> + 1.1 The problem:
> +
> + When on an open system (ex: Linux) used on an intranet (no direct connection
> + to the Internet), documentation is added to the HTML DocumentRoot tree, by
> + adding symbolic links to the documentation under the DocumentRoot, and htdig
> + is used to index this information, then htdig (3.1.5) will enter an endless
> + loop or try to index the entire system.
> +
> + It does this by reaping the url of the 'parent directory' in
> + Apache-generated indexes of directories (such as, of the directories that
> + are soft-linked under the DocumentRoot). The 'parent directories' of a
> + directory entered by a symbolic link lead back all the way to root '/'. If
> + the patch is not applied, then htdig will try to index the entire system,
> + and may loop if any cross-linking exists.
...
> + then recompile and reinstall the htdig (make; make install). Edit the config
> + file to turn on the new option, add a symbolic link to the DocumentRoot
> + (f.ex. cd /usr/local/httpd/htdocs/misc; ln -s /usr/doc .; on Suse systems),
> + and run htdig (rundig).

I'm still having a great deal of trouble envisioning how htdig can follow
symbolic links in such a way as to make the entire file system visible,
unless it finds a symbolic link to the root directory of the file system.
To use your example, the link /usr/local/httpd/htdocs/misc/doc -> /usr/doc
would make the whole /usr/doc sub-tree appear under the URL
http://localhost/misc/doc/, but when you follow the parent directory link,
it should take you to http://localhost/misc/, not to /usr! What would a
URL that leads to /usr or / look like, given that URLs are supposed to be
relative to the DocumentRoot?
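
To make the point concrete, here is a minimal sketch of how a parent
directory link resolves purely in URL space, with no knowledge of the
symbolic links behind it. The helper name parent_url is hypothetical and
is not part of htdig; it only drops the last path component of a URL.

```cpp
#include <string>

// Hypothetical helper (not part of htdig): return the URL of the enclosing
// directory by dropping the last path component.  Works on the URL text
// only; it never looks at the file system behind the server.
std::string parent_url(const std::string &url)
{
    // Find where the path starts, after "scheme://host" if present.
    std::string::size_type scheme = url.find("://");
    std::string::size_type path_start =
        url.find('/', scheme == std::string::npos ? 0 : scheme + 3);
    if (path_start == std::string::npos)
        return url + "/";              // no path at all: treat as the root

    // Ignore a trailing slash, then cut just after the previous slash.
    std::string::size_type end = url.size();
    if (end > path_start + 1 && url[end - 1] == '/')
        --end;
    std::string::size_type cut = url.rfind('/', end - 1);
    if (cut == std::string::npos || cut < path_start)
        cut = path_start;              // never climb above the server root
    return url.substr(0, cut + 1);
}
```

With this sketch, parent_url("http://localhost/misc/doc/") gives
"http://localhost/misc/", and the parent of "http://localhost/" is itself,
so there is no URL that climbs above the DocumentRoot.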

I do understand, though, how cross links could lead to a file-system loop
when all symbolic links are followed, so I suspect that was the source of
your problem. Still, I can envision situations where mutual cross links
between two subtrees could lead to infinite loops even without following
upward (parent directory) links. Essentially, the spider would keep
descending deeper into a hierarchy that never ends, because the backward
links are disguised as downward links. It's a theoretical possibility, anyway.

> + To see the patch working, run htsearch with -v. The patch causes a bang
                                  ^^^^^^^^
I assume you mean htdig here.

> diff -rcN tmp/htdig-3.1.5/htdig/HTML.cc htdig-3.1.5/htdig/HTML.cc
> *** tmp/htdig-3.1.5/htdig/HTML.cc Fri Feb 25 04:29:10 2000
> --- htdig-3.1.5/htdig/HTML.cc Mon May 4 01:11:01 1998
> ***************
> *** 394,400 ****
> head << word;
> }
>
> ! if (word.length() >= minimumWordLength && doindex)
> {
> retriever.got_word(word,
> int(offset * 1000 / totlength),
> --- 394,400 ----
> head << word;
> }
>
> ! if ((word.length() >= (unsigned)minimumWordLength) && doindex)
> {
> retriever.got_word(word,
> int(offset * 1000 / totlength),

What does this change do? Were you getting warnings before?

> + void
> + Retriever::chop_url(ChoppedUrlStore &cus,char *c_url)
> + {
> +     int l;
> +
> +     cus.url_store[0] = '\0';
> +     cus.hop_count = 0;
> +     l = strlen(c_url);
> +     if((l == 0) || (l > MAX_CAN_URL_LEN)) {

You'll overrun the end of url_store if l == MAX_CAN_URL_LEN. Remember the
null terminator.
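
A minimal sketch of the length check with room left for the terminator,
assuming url_store is declared as a char array of MAX_CAN_URL_LEN bytes
(the declaration is not shown in the quoted part of the patch). The
helper name copy_url_checked and the value 16 are made up for
illustration only.

```cpp
#include <cstddef>
#include <cstring>

// Assumed value for illustration; the patch defines its own MAX_CAN_URL_LEN.
const std::size_t MAX_CAN_URL_LEN = 16;

// Hypothetical helper: copy src into a buffer of MAX_CAN_URL_LEN bytes only
// if the string AND its terminating '\0' both fit.  Returns true on success;
// on failure dst is left as an empty string.
bool copy_url_checked(char (&dst)[MAX_CAN_URL_LEN], const char *src)
{
    std::size_t len = std::strlen(src);
    if (len == 0 || len >= MAX_CAN_URL_LEN) {  // ">=" leaves room for '\0'
        dst[0] = '\0';
        return false;
    }
    std::memcpy(dst, src, len + 1);            // len + 1 copies the '\0' too
    return true;
}
```

The only difference from the test in the patch is ">=" instead of ">", so
a string of exactly MAX_CAN_URL_LEN characters is rejected rather than
copied one byte past the end of the buffer.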

> +         if(debug > 0)
> +             cout << "chop_url: failed on len==0\n";
> +         return;
> +     }
> +     strcpy(cus.url_store,c_url);
> +     l = 0;
> +     if((cus.url_store_chopped[l++] = strtok(cus.url_store,"/")) == NULL) {
> +         cus.url_store[0] = '\0';
> +         if(debug > 0)
> +             cout << "chop_url: failed on NULL with " << c_url << "\n";
> +         return;
> +     }
> +     while((cus.url_store_chopped[l++] = strtok(NULL,"/")) != NULL) {
> +         if(l > MAX_CAN_URL_HOPS) {
> +             cus.url_store[0] = '\0';
> +             return; // fail silently with a valid url, print a bang somewhere else
> +         }
> +     }
> +     cus.hop_count = l - 1;
> +     return; // success
> + }
> +
> + // call this function to store the base URL of a document being indexed,
> + // when starting to index it (in HTML::parse or ExternalParser::parse)
> + void
> + Retriever::store_url(char *c_url)
> + {
> +     chop_url(gus,c_url);
> +     return;
> + }
> +
> + // call this function to decide if a reaped URL is a direct parent of
> + // the URL being indexed. call in Retriever::got_href()
> + int
> + Retriever::url_is_parent_dir(char *c_url)
> + {
> +     int j,k;
> +     ChoppedUrlStore cus;
> +
> +     if(gus.hop_count == 0)
> +         return 0;
> +
> +     chop_url(cus,c_url);
> +     if(cus.hop_count == 0)
> +         return 0;
> +
> +     // seek a matching last part, backwards
> +     j = gus.hop_count - 1;
> +     k = cus.hop_count - 1;
> +     while(strcmp(gus.url_store_chopped[j],cus.url_store_chopped[k]) != 0)
> +         if(--j < 0)
> +             return 0; // not

What if a path component is repeated, e.g. /files/doc/html/doc/foo.html?
It seems this code could get confused by the repeated name, which could
cause a false match at the lower directory level.
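
One way to sidestep the repeated-name question, sketched here under the
assumption that only the path parts of the two URLs are compared, is to
match components from the front and by position instead of searching
backwards for a name. The names split_path and is_ancestor_dir are
hypothetical and do not come from the patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a URL path such as "/files/doc/html/" into its non-empty components.
std::vector<std::string> split_path(const std::string &path)
{
    std::vector<std::string> parts;
    std::string::size_type pos = 0;
    while (pos < path.size()) {
        std::string::size_type next = path.find('/', pos);
        if (next == std::string::npos)
            next = path.size();
        if (next > pos)                       // skip empty components
            parts.push_back(path.substr(pos, next - pos));
        pos = next + 1;
    }
    return parts;
}

// Hypothetical check: candidate is an ancestor directory of current only if
// it has fewer components and each one matches the component at the SAME
// position in current.
bool is_ancestor_dir(const std::string &current, const std::string &candidate)
{
    std::vector<std::string> cur = split_path(current);
    std::vector<std::string> cand = split_path(candidate);
    if (cand.size() >= cur.size())
        return false;
    for (std::size_t i = 0; i < cand.size(); ++i)
        if (cur[i] != cand[i])
            return false;
    return true;
}
```

Because the comparison is positional, it does not matter where a name
such as "doc" recurs: for /files/doc/html/doc/foo.html, the candidate
/files/doc/ is accepted as an ancestor and /other/doc/ is not.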

> +     while((--j >= 0)&&(--k >= 0))
> +         if(strcmp(gus.url_store_chopped[j],cus.url_store_chopped[k]) != 0)
> +             return 0; // not
> +     return 1; // yes
> + }

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
