Subject: Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0
From: Peter L. Peres
Date: Wed May 10 2000 - 15:17:15 PDT


htdig will reap the first URL from an Apache index and push it first.
The next document indexed on that server will then be the parent directory.
Thus, not only did htdig index almost everything (by climbing the parent
links first), but it got to index the place it was pointed to *last*. In
other words, with the javadoc, it would get the index for /doc/javadoc/api,
then /doc/javadoc, then /doc (by this time it was looking at indexing
about 3/4 of my total disk space). You get the idea.
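
To make the climbing problem concrete, here is a rough sketch in Python
(not htdig code, and not the patch itself) of the kind of filter I mean:
given the URL of an Apache index page and the hrefs found on it, drop
every href that resolves to a directory above the page. The function
names and the localhost URL are made up for illustration only.

```python
from urllib.parse import urljoin, urlsplit


def is_ancestor_dir(base_url, href):
    """Return True if href resolves to a directory strictly above base_url."""
    base = urlsplit(base_url)
    target = urlsplit(urljoin(base_url, href))
    # Links to another scheme or host are not "parent directory" links.
    if (target.scheme, target.netloc) != (base.scheme, base.netloc):
        return False
    # Directory that contains the page we are currently parsing.
    base_dir = base.path.rsplit("/", 1)[0] + "/"
    # Only directory-style URLs (trailing slash) can be ancestors.
    if not target.path.endswith("/"):
        return False
    return base_dir.startswith(target.path) and base_dir != target.path


def prune_parent_hrefs(base_url, hrefs):
    """Keep only the hrefs that do not climb above base_url."""
    return [h for h in hrefs if not is_ancestor_dir(base_url, h)]


if __name__ == "__main__":
    page = "http://localhost/doc/javadoc/api/"  # hypothetical start URL
    links = ["/doc/javadoc/", "../", "java/", "index.html", "/doc/"]
    print(prune_parent_hrefs(page, links))
```

With a filter like this in front of the URL queue, the digger would stay
at or below the directory it was pointed to, whatever order the links
appear in.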

I do not have any logged output because it got obscenely large and
I removed all copies, and made sure that I won't be making any more of

I may be exaggerating a bit when saying that 'it was indexing the whole
filesystem', but consider that in my case it was indexing about 3/4 of my
total disk capacity, which is near enough to 'all' for me. BTW the ls -lR
I sent you represents only one side of the problem. I have even more
soft-linked directories that are Apache indexes.

The idea to run find first and use a list is good, but I want to get rid
of the management problem, because I keep mounting and unmounting things I
work with, and it would be impossible to keep track of the lists of URLs
to prune.

Wrt. the checksum and 'not choosing the best access for a file', I think
that htdig ought to index the document when first found, and file the
canonical URL away (as it does). Then, if a 'same' document with a shorter
path (counted in components) is found, the URL of the already indexed
document would be replaced with the shorter one, and the long one dropped.
This would guarantee that the shortest path to the file is used for access
(which ought to be the one with the fewest symbolic links resulting from
cross-linking and such). What do you think?
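
A minimal sketch of that rule, in Python rather than htdig's own code:
documents are keyed by a checksum of their content, and for each checksum
only the path with the fewest components is kept. The class and function
names are hypothetical, and SHA-256 merely stands in for whatever checksum
htdig would actually use.

```python
import hashlib


def path_components(url_path):
    """Count the non-empty components of a URL path."""
    return len([part for part in url_path.split("/") if part])


class ShortestPathIndex:
    """Remember, per content checksum, the shortest path seen so far."""

    def __init__(self):
        self._by_digest = {}

    def add(self, url_path, content):
        """Record a document; return the path now kept for its content."""
        digest = hashlib.sha256(content).hexdigest()
        kept = self._by_digest.get(digest)
        # First sighting, or a strictly shorter path: replace the stored one.
        if kept is None or path_components(url_path) < path_components(kept):
            self._by_digest[digest] = url_path
        return self._by_digest[digest]

    def paths(self):
        """Return all kept paths, sorted."""
        return sorted(self._by_digest.values())


if __name__ == "__main__":
    index = ShortestPathIndex()
    body = b"<html>same document</html>"
    index.add("/mnt/work/link/doc/javadoc/api/index.html", body)
    index.add("/doc/javadoc/api/index.html", body)
    print(index.paths())
```

Ties keep the path found first, so the result does not flip back and
forth between two equally long paths to the same file.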


PS: I have no idea whether one can remove and replace an already stored
and indexed href. How would I go about doing that?

