Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0

Subject: Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0
From: Gilles Detillieux
Date: Thu May 11 2000 - 08:56:36 PDT

According to Peter L. Peres:
> the htdig will reap the first URL from an Apache index, and push it first.
> The next document indexed on that server, will be the parent directory.
> Thus, not only did htdig index almost everything (by climbing the parent
> links first), but it got to index the place it was pointed to *last*. In
> other words, with the javadoc, it would get the index for doc/javadoc/api,
> then /doc/javadoc, then /doc (by this time it was looking at indexing
> about 3/4 of my total disk space). You get the idea.

OK, so everything under /usr/doc, either physically or virtually
via symlink, was amounting to 3/4 of the disk space on your
system. That makes sense to me now. If you wanted to limit the
indexing to /doc/javadoc, though, why not just set limit_urls_to
to the URL of that directory and not add anything else to
limit_urls_to, so that you wouldn't get all the other stuff you didn't
want? If I recall from the htdig.conf you posted yesterday (or was it
Tuesday?), you had limit_urls_to set to accept any URL that contained
your host name.
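
To illustrate the difference, here is a rough sketch in Python (not
htdig's own code). It assumes limit_urls_to behaves as a list of
substrings, one of which must appear in a URL before it is followed.
The host name, the start URL and the function name are all made up
for the example.

```python
# Illustrative model only: limit_urls_to is treated as a list of substrings,
# and a URL is followed only if it contains at least one of them.
def within_limits(url, limit_urls_to):
    return any(pattern in url for pattern in limit_urls_to)

HOST = "http://localhost/"        # hypothetical host name
START = HOST + "doc/javadoc/"     # hypothetical start URL

candidates = [
    START + "api/index.html",     # below the start URL
    HOST + "doc/",                # the "Parent Directory" link
    HOST + "doc/other/README",    # reached only by climbing upward
]

broad = [HOST]     # accepts any URL containing the host name
narrow = [START]   # accepts only URLs under the start directory

accepted_broad = [u for u in candidates if within_limits(u, broad)]
accepted_narrow = [u for u in candidates if within_limits(u, narrow)]
```

With the broad setting all three URLs pass, so the parent directory
links get followed. With the narrow one only the URL under the start
directory survives.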

> The idea to run find first and use a list is good, but I want to get rid
> of the management problem, because I keep mounting and unmounting things I
> work with, and it would be impossible to keep track of the lists of urls
> to prune.

Sure, but if you know at any indexing run what you want to index (as
opposed to everything you want to leave out), then it's usually pretty
easy to specify that explicitly.

> Wrt. the checksum and 'not choosing the best access for a file', I think
> that htdig ought to index the document when first found, and file the
> canonical url away (as it does), then if a 'same' document with a shorter
> path (counted in components) is found, the filename of the already indexed
> file will be replaced with the shorter one, and the long one dropped. This
> would guarantee that the shortest path to the file is used for access
> (which ought to be the one with the least symbolic links resulted from
> cross-linking and such). What do you think ?
> Peter
> PS: I have no idea if one can remove and replace an already stored and
> indexed href. How would I go about doing that?

I think that sort of pathname optimisation for duplicate elimination
would be an excellent idea. I'm just not sure how you'd implement it.
The 3.1.x series uses a different database structure, so anything you'd
work out for it wouldn't transport easily to 3.2. As 3.1.x is strictly in
maintenance mode now, I'd recommend that if you're serious about this, you
implement it for 3.2 only. There, the db.docdb database has a field for
each file that contains the URL, so it should be possible to change that
field. The tricky part is that there is also a second database,
which maps URLs to DocIDs, which are used as keys for the db.docdb. So,
you'd need to remove the entry for the old URL in that database,
and add a new entry to map the preferred URL to the existing DocID.
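
In rough terms, the update might look like the Python sketch below.
This is only a model of the idea, not htdig code: the two dictionaries
stand in for the document database and the URL-to-DocID mapping, and
the function names, the record layout and the example URLs are all
assumptions made for illustration.

```python
from urllib.parse import urlsplit

def path_components(url):
    """Count the non-empty components in the path part of a URL."""
    return len([part for part in urlsplit(url).path.split("/") if part])

def prefer_shorter_url(docdb, url_index, doc_id, new_url):
    """Re-point an already indexed document at new_url if its path is shorter.

    docdb maps DocID -> record (with a "url" field); url_index maps URL -> DocID.
    Returns True if the stored URL was replaced.
    """
    old_url = docdb[doc_id]["url"]
    if path_components(new_url) >= path_components(old_url):
        return False                  # keep the URL already on file
    del url_index[old_url]            # drop the mapping for the long URL
    url_index[new_url] = doc_id       # map the preferred URL to the same DocID
    docdb[doc_id]["url"] = new_url    # update the URL field in the record
    return True

# Hypothetical data: one document first found through a longer, symlinked path.
long_url = "http://localhost/doc/javadoc/link/api/index.html"
short_url = "http://localhost/doc/api/index.html"
docdb = {7: {"url": long_url}}
url_index = {long_url: 7}

changed = prefer_shorter_url(docdb, url_index, 7, short_url)
```

The point is that both structures have to change together, or a lookup
by the old URL would still find the document.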

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
