Re: [htdig] htdig-3.1.5 +prune_parent_dir_href patch version 0.0


From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Thu May 11 2000 - 16:15:09 PDT


At 2:38 PM -0500 5/11/00, Gilles Detillieux wrote:
>Actually, it might make sense to have (optionally) cross-host duplicate
>elimination too. This would be implemented by using only the MD5 sum
>as a lookup key, rather than a server name and MD5 sum combination.

Yes. My suggestion is to take it one step at a time. :-)

Let's get an MD5 sum technique working, try it out, and see whether it
eliminates what we want. Then we can go from there.
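
To make the lookup-key choice concrete, here is a minimal sketch in Python (not ht://Dig's own C++ code). The names dedup_key, find_duplicates and the cross_host flag are hypothetical, chosen only for illustration. The default mode keys on the server name plus the MD5 sum; the optional cross-host mode keys on the MD5 sum alone.

```python
import hashlib


def dedup_key(server, body, cross_host=False):
    """Build the lookup key for a retrieved document body (bytes).

    Hypothetical helper: per-host mode uses (server, md5); cross-host
    mode uses the md5 digest alone.
    """
    digest = hashlib.md5(body).hexdigest()
    return digest if cross_host else (server, digest)


def find_duplicates(docs, cross_host=False):
    """docs is an iterable of (server, url, body) tuples.

    Returns a list of (duplicate_url, first_seen_url) pairs, in the
    order the duplicates were encountered.
    """
    seen = {}   # lookup key -> first URL seen with that key
    dups = []
    for server, url, body in docs:
        key = dedup_key(server, body, cross_host)
        if key in seen:
            dups.append((url, seen[key]))
        else:
            seen[key] = url
    return dups


docs = [
    ("a.example", "http://a.example/x", b"hello"),
    ("a.example", "http://a.example/y", b"hello"),
    ("b.example", "http://b.example/x", b"hello"),
]
per_host = find_duplicates(docs)
all_hosts = find_duplicates(docs, cross_host=True)
```

With the sample data, the per-host run flags only the second URL on a.example, while the cross-host run also flags the copy on b.example.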

>htdig does hold on to some HTTP header info, but this is separate from
>the file contents. Of course, the md5sum would be only on the contents

I'll simply note that in 3.2, you have more than just HTTP protocols
active. But the point is the same--the headers are already removed.

> > The second step will be to augment the code of the first step, with a
> > 'replace-href-in-db-if-lower-hop-count' rule, that will apply to 3.2 code
> > (I intend to do the first step on 3.1.5).
> >
> > So, do you have any pointers on this ?
>
>Not beyond what I said before. Geoff seemed to suggest this isn't very
>feasible, and I'd trust his judgement on this issue. As he said, 3.2
>should reach the lower hop count path first, but this may be different
>than the lowest nested directory count. This may be too much effort for
>too little gain, as you may find that the first encountered URL is OK
>in almost all cases.

At the moment, I'm not sure the current CVS code ensures the lower
hopcount, but this is a bug I have to deal with. Let's just say I
will guarantee that the final version of 3.2.0 will uphold hopcount.
So if you're parsing a document, you can be sure there isn't a
document with lower hopcount further along.
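
One way a crawler can give this kind of guarantee is a breadth-first traversal, where each URL is first reached along a shortest link path. The sketch below is in Python and only illustrates that idea; it assumes nothing about how the 3.2 code actually schedules URLs, and the function name crawl_order and the sample link graph are made up for the example.

```python
from collections import deque


def crawl_order(links, start):
    """Breadth-first walk over a link graph.

    links maps a URL to the list of URLs it links to. Returns a dict of
    {url: hop_count}, in the order the URLs were first reached.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in hops:          # first visit is via a shortest path
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical site: the deeply nested page is linked from two places.
links = {
    "/": ["/a/", "/b"],
    "/a/": ["/a/deep/page"],
    "/b": ["/a/deep/page", "/c"],
}
hops = crawl_order(links, "/")
```

Note that "/a/deep/page" gets hop count 2 even though it sits three directories down, which is the distinction between hop count and nested directory count raised in the quoted text.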

So my thought is that after retrieving a document, if the md5sum
matches that of a document you've seen before, you stop indexing it.
The code for this should be in the Retriever and should essentially
just mark the duplicate URL as _notfound or _noindex.
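
As a rough illustration of that flow, here is a small Python sketch. It is not the Retriever class from ht://Dig; SketchRetriever, got_document and the "indexed"/"noindex" status strings are hypothetical stand-ins for whatever the real code uses.

```python
import hashlib


class SketchRetriever:
    """Toy stand-in for a retriever that skips duplicate documents."""

    def __init__(self):
        self.seen = {}     # md5 digest -> first URL retrieved with that content
        self.status = {}   # url -> "indexed" or "noindex" (hypothetical flags)

    def got_document(self, url, body):
        """Called after a document body (bytes, headers already stripped)
        has been retrieved. Returns True if the document should be indexed."""
        digest = hashlib.md5(body).hexdigest()
        if digest in self.seen:
            # Same contents as an earlier document: mark it and stop here.
            self.status[url] = "noindex"
            return False
        self.seen[digest] = url
        self.status[url] = "indexed"
        return True


r = SketchRetriever()
first = r.got_document("http://www.example.com/docs/", b"<html>same</html>")
second = r.got_document("http://www.example.com/docs/index.html", b"<html>same</html>")
third = r.got_document("http://www.example.com/other.html", b"<html>different</html>")
```

Here the first URL retrieved with a given body is kept, so the result depends on retrieval order; the second URL is marked "noindex" and the third is indexed normally.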

(Any additional discussion on this is probably best moved to the
htdig3-dev mailing list.)

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



