Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0

From: Geoff Hutchison
Date: Wed May 10 2000 - 16:01:41 PDT

At 1:17 AM +0300 5/11/00, Peter L. Peres wrote:
>the htdig will reap the first URL from an Apache index, and push it first.
>The next document indexed on that server, will be the parent directory.
>Thus, not only did htdig index almost everything (by climbing the parent

Right, but as we've pointed out *several* times, this only happens if
you've set your limit_urls_to attribute to a very liberal value. If
it's limited to a set of directories, it won't climb up.
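As a rough illustration of why a tight limit stops the climb, here is
a minimal Python sketch. It assumes a plain substring test against
each pattern, which is a simplification and not htdig's actual
matching code; the function name and the URLs are made up for the
example.

```python
def url_allowed(url, limit_patterns):
    """Return True if the URL contains at least one limit pattern."""
    return any(pattern in url for pattern in limit_patterns)


# Hypothetical values, for illustration only.
limits = ["http://localhost/docs/"]

# A page inside the limited directory is accepted.
print(url_allowed("http://localhost/docs/manual/index.html", limits))  # True

# The "Parent Directory" link of /docs/ points here and is rejected,
# so the crawl never climbs above the limited directory.
print(url_allowed("http://localhost/", limits))  # False
```

With a liberal limit such as "http://localhost/", both URLs would
pass, which is the case where the parent directory gets indexed.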

>The idea to run find first and use a list is good, but I want to get rid
>of the management problem, because I keep mounting and unmounting things I
>work with, and it would be impossible to keep track of the lists of urls
>to prune.

I don't think this would be a problem. Simply run the find command in
your script before you call htdig. Each run will generate a "fresh"
list based on whatever volumes are mounted at that moment (and any
additional changes you have made), so there is no list to maintain
by hand.
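The same idea can be sketched in Python instead of find. This is only
a sketch under assumptions: the root directory, the base URL, and the
choice to emit one URL per directory are all hypothetical, and how the
resulting list gets fed to htdig is left to your script.

```python
import os


def list_directory_urls(root, base_url):
    """Walk root and return one URL per directory found beneath it.

    base_url is assumed to end with "/" and to map onto root.
    """
    urls = []
    for dirpath, dirnames, _filenames in os.walk(root):
        dirnames.sort()  # keep the traversal order deterministic
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            urls.append(base_url)
        else:
            # Convert the OS path separator into URL slashes.
            urls.append(base_url + "/".join(rel.split(os.sep)) + "/")
    return urls
```

Calling it at the top of the indexing script, right before htdig
runs, rebuilds the list from scratch each time, so anything mounted or
unmounted since the last run is picked up automatically.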

>Wrt. the checksum and 'not choosing the best access for a file', I think
>that htdig ought to index the document when first found, and file the
>canonical url away (as it does), then if a 'same' document with a shorter

My feeling is the metric should be based on hopcount. This seems the
fairest way to evaluate which URL should be the canonical one since
this is essentially a count of clicks from the start_url. So the URL
with the fewest clicks wins. This will probably roughly correspond to
your "number of component" metric as well.
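To make the metric concrete, here is a small Python sketch. It is not
htdig code; the function name and the (url, hopcount) pair layout are
assumptions for the example.

```python
def pick_canonical(candidates):
    """Pick the canonical URL among duplicates of one document.

    candidates is a non-empty list of (url, hopcount) pairs in the
    order they were found. The lowest hopcount wins; on a tie the
    URL found first is kept.
    """
    best_url, best_hops = candidates[0]
    for url, hops in candidates[1:]:
        if hops < best_hops:  # strictly fewer clicks from start_url
            best_url, best_hops = url, hops
    return best_url


# Hypothetical duplicates of the same document.
duplicates = [
    ("http://localhost/a/b/c/doc.html", 4),
    ("http://localhost/doc.html", 1),
    ("http://localhost/mirror/doc.html", 2),
]
print(pick_canonical(duplicates))  # http://localhost/doc.html
```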

In the 3.2 tree, URLs are indexed in order of hopcount, so the first
URL you see for a document is already the one with the lowest
hopcount. You would only want to store the URL for the first time you
see a document.
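A sketch of why "first seen" is enough when URLs are visited in
hopcount order, written as a plain breadth-first walk in Python. The
link graph and the checksum table are hypothetical stand-ins, and this
is not how the 3.2 code is actually structured.

```python
from collections import deque


def crawl_first_seen(start_url, links, checksum):
    """Breadth-first crawl over a link graph.

    links maps a URL to the URLs it links to; checksum maps a URL to
    the checksum of its content. Returns {checksum: (url, hopcount)}
    holding the first URL seen for each distinct document.
    """
    seen_urls = {start_url}
    canonical = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, hops = queue.popleft()
        # Only the first sighting of a document is recorded.
        canonical.setdefault(checksum[url], (url, hops))
        for target in links.get(url, []):
            if target not in seen_urls:
                seen_urls.add(target)
                queue.append((target, hops + 1))
    return canonical


# Hypothetical site: /doc.html and /a/copy.html have the same content.
links = {"/": ["/a/", "/doc.html"], "/a/": ["/a/copy.html"]}
checksum = {"/": "r", "/a/": "a", "/doc.html": "d", "/a/copy.html": "d"}
print(crawl_first_seen("/", links, checksum)["d"])  # ('/doc.html', 1)
```

Because the queue hands out URLs in non-decreasing hopcount, the copy
reached at hop 2 never displaces the one recorded at hop 1, and
nothing in the database has to be replaced later.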

It is not easy to "replace" the URL in the database. There are a few
pieces that need to be updated carefully. This is another reason for
sticking to the first URL...

-Geoff Hutchison
Williams Students Online

