Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0

From: Geoff Hutchison
Date: Tue May 09 2000 - 19:39:37 PDT

On Wed, 10 May 2000, Peter L. Peres wrote:

> Without the patch it looped, and the first URL after /usr/doc/javadoc/api
> (an Apache index) was /usr/doc/javadoc/ again, which is also an Apache
> index. I don't know what makes the loop happen.
> With the patch, it went into /usr/doc/javadoc/api, and did it exactly once
> over, and that's that. There are many more places like that in the
> documentation tree that I am trying to index. I'll provide a ls-1 or
> something of that subtree for you to see if you want.

This would be useful. Our guess (myself and Gilles) is that there is a
recursive softlink in your documentation tree. Still, I would like to see
the URL log and an ls -lR from your tree so I can take a look through both
and see what's going on. The patch is nice, but without more information,
it is hard to gain a deeper understanding. :-)
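
For illustration, here is a minimal Python sketch of one way to look for a
recursive softlink in a tree such as /usr/doc. It is not part of htdig; the
function name find_recursive_symlinks is hypothetical, and the sketch assumes
that "recursive" means a symlinked directory that resolves to its own parent
or to another ancestor.

```python
import os


def find_recursive_symlinks(root):
    """Return (link, target) pairs for symlinked directories that resolve
    to one of their own ancestors, which would make a crawl loop."""
    loops = []
    # os.walk does not follow symlinks by default, so the scan itself
    # cannot get caught in the loop it is looking for.
    for dirpath, dirnames, _filenames in os.walk(root):
        parent = os.path.realpath(dirpath)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            # A loop exists when the target is the containing directory
            # itself or any directory above it.
            prefix = target.rstrip(os.sep) + os.sep
            if parent == target or parent.startswith(prefix):
                loops.append((path, target))
    return loops


if __name__ == "__main__":
    for link, target in find_recursive_symlinks("/usr/doc"):
        print(link, "->", target)
```

A link found this way would explain a crawler returning to /usr/doc/javadoc/
after visiting /usr/doc/javadoc/api, though other causes are possible.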

> Anyway, I think that if htdig was to index a certain part of a site, say,
> http://a/b/c then it should not climb links higher than its starting point

I think I've been through this before. The limit_urls_to attribute stops
this. It is (AFAIK) not possible to circumvent this in the current code.
Of course bugs have been discovered in the pattern matching code in the
past, but without your URL log, it is impossible for us to reproduce.
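
As a rough illustration of the idea behind such a limit, the Python sketch
below accepts a URL only if it contains at least one of the configured
patterns. This is not htdig's code: the function name url_allowed is
hypothetical, and case-insensitive substring matching is an assumption that
may differ from what the pattern matching code actually does.

```python
def url_allowed(url, limit_patterns):
    """Return True if the URL contains at least one limit pattern.

    Assumption: matching is a case-insensitive substring test.
    """
    lowered = url.lower()
    return any(pattern.lower() in lowered for pattern in limit_patterns)


if __name__ == "__main__":
    limits = ["http://a/b/c"]
    # A page below the starting point is accepted.
    print(url_allowed("http://a/b/c/index.html", limits))
    # A link that climbs above the starting point is rejected.
    print(url_allowed("http://a/b/", limits))
```

Under these assumptions a link from http://a/b/c back up to http://a/b/ is
rejected, because the parent URL does not contain the pattern.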

So I'll say it one more time. Please, *please* send me the URL log and an
ls -lR of your /usr/doc tree. I understand they'll be large, but I will be
willing to take a look. It would also be nice to have a copy of your
htdig.conf. This data will let us get a deeper understanding of your

> that appear under various names. I have an idea: why not md5sum each file,
> and refuse to look at it again if it hits the sum and perhaps the file
> size, regardless of the name (but if from the same host) ? I know that

Yup, this has been brought up again and again. It is almost assuredly a
good idea. Actually, I'd probably consider date a good ID as well. This is
one of the features holding up the 3.2 release. If you would be willing to
make an additional patch to Retriever to work on this, it would be
*greatly* appreciated. BTW, I would be quite surprised if md5sum failed.
It may not catch "near-duplicates," but it would be *very* improbable
to identify a false duplicate.
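
A minimal Python sketch of the checksum idea follows, keyed on host, size and
MD5 digest as suggested above. It is only an illustration and not a patch to
Retriever; the class name DuplicateFilter and the method first_seen are
hypothetical, and the document body is assumed to be available as bytes.

```python
import hashlib


class DuplicateFilter:
    """Remember documents by (host, size, MD5) and report repeats."""

    def __init__(self):
        # Maps (host, size, md5 hex digest) to the first URL seen with it.
        self._seen = {}

    def first_seen(self, host, url, body):
        """Return the URL that first had this content on this host,
        or None if the content is new (in which case it is recorded)."""
        key = (host, len(body), hashlib.md5(body).hexdigest())
        original = self._seen.get(key)
        if original is None:
            self._seen[key] = url
        return original


if __name__ == "__main__":
    dupes = DuplicateFilter()
    body = b"<html>same page</html>"
    print(dupes.first_seen("a", "http://a/doc/x.html", body))
    print(dupes.first_seen("a", "http://a/doc/alias/x.html", body))
```

Because the host is part of the key, identical files on different servers are
not treated as duplicates, and near-duplicates that differ by even one byte
get different digests and are not caught.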

Host issues are a bit tricky. Someone proposed a "signature" method for
adding server aliases, but it also has not been tried. This would probably
need to be an option since it might misidentify "duplicate" servers.

-Geoff Hutchison
Williams Students Online


This archive was generated by hypermail 2b28 : Tue May 09 2000 - 17:27:06 PDT