Re: htdig: Re: htdig 4


Geoff Hutchison (Geoffrey.R.Hutchison@williams.edu)
Wed, 08 Jul 1998 08:03:10 -0400 (EDT)


I noticed through CVSweb that you now have the htdig.org pages in CVS
and appear to have the server up and running. It seems to work, since I
could check out htdig4 and htdig3, though I couldn't check out the web
pages. If I get a chance, I'll add a start at a FAQ.html file from the
mailing list archives.

In the meantime, I'll check in some small patches to htdig3 to add support
for the META description tag and the new META robots spec. I'm also
looking at the old Mosaic/X client, which supposedly supports compressed
transfers. I know Apache has support on the server end, so it might be
nice to test whether compressed transfers improve speed.

I also remembered the discussion on the mailing list about checksums to
detect duplicate files. I think I now know the fastest way to do this,
i.e. one that computes as few checksums as possible (a rough sketch in
Java follows the list):
1) If the URL is unique (i.e. we _think_ we don't have the file), grab it.
2) If the _size_ of the document exactly matches a document we already
have, checksum the file. (If the doc in the DB doesn't have a checksum
yet, compute one and store it for future use.)
3) If the checksums are equal, add the URL to a duplicate_url field.
4) Otherwise, proceed as usual.
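
Here is a rough sketch of that flow in Java. Names like DocRecord and the
Hashtable "database" are hypothetical stand-ins, not ht://Dig's actual
structures, and a real index would keep a list of documents per size
rather than a single record:

    import java.security.MessageDigest;
    import java.util.Hashtable;
    import java.util.Vector;

    // Sketch of the duplicate-detection steps above. Everything here
    // is a stand-in for whatever the real indexer uses.
    class DupCheck {
        static class DocRecord {
            byte[] body;
            byte[] checksum;                      // computed lazily in step 2
            Vector duplicateUrls = new Vector();  // step 3 appends here
        }

        Hashtable byUrl = new Hashtable();   // URL -> DocRecord
        Hashtable bySize = new Hashtable();  // Integer(size) -> DocRecord

        void index(String url, byte[] body) throws Exception {
            if (byUrl.containsKey(url))
                return;                               // 1) URL already known
            DocRecord match =
                (DocRecord) bySize.get(new Integer(body.length));
            if (match != null) {                      // 2) size collision
                if (match.checksum == null)           //    checksum on demand,
                    match.checksum = md5(match.body); //    store for future use
                if (MessageDigest.isEqual(match.checksum, md5(body))) {
                    match.duplicateUrls.addElement(url); // 3) true duplicate
                    return;
                }
            }
            DocRecord rec = new DocRecord();          // 4) proceed as usual
            rec.body = body;
            byUrl.put(url, rec);
            bySize.put(new Integer(body.length), rec);
        }

        static byte[] md5(byte[] data) throws Exception {
            return MessageDigest.getInstance("MD5").digest(data);
        }
    }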

Since Java 1.1.x can do MD5 checksums easily, MD5 is the natural choice.
Taking a checksum is faster than parsing the document, so we'd rather
checksum a suspected duplicate than re-index it, but we'd still like to
compute very few checksums. I'm assuming a mirror would produce a
duplicate file with _different_ dates. If we don't care about that case,
comparing modification dates first, and only checksumming when they match,
would significantly reduce the number of checksums (since two documents
are unlikely to share a timestamp to the second).
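
For reference, the Java 1.1 API in question is java.security.MessageDigest;
something like this prints the 32-hex-character MD5 digest of its first
command-line argument:

    import java.security.MessageDigest;

    // Minimal MD5 example using the java.security API from JDK 1.1.
    class Md5Demo {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(args[0].getBytes());
            StringBuffer hex = new StringBuffer();
            for (int i = 0; i < digest.length; i++) {
                int b = digest[i] & 0xff;            // byte -> unsigned
                if (b < 16) hex.append('0');         // zero-pad each byte
                hex.append(Integer.toHexString(b));
            }
            System.out.println(hex);                 // 16 bytes -> 32 hex chars
        }
    }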

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/


