Re: htdig: virtual hosts revisited


Geoff Hutchison (ghutchis@wso.williams.edu)
Mon, 14 Dec 1998 15:02:58 -0500


At 5:28 AM -0500 12/14/98, Walter Hafner wrote:
>1) The lack of support for German umlauts

My suggestion would be to look at the locale option.

>2) The somewhat limited queries.

I think you'll have to be more specific. I'd say we easily cover the 80/20
rule. Judging from my search logs, most people just type in plain text.

>3) The inability to distinguish virtual hosts from mere CNAMEs.
>I think that ht://Dig could 'borrow' a simple yet clever method to solve
>problem (3). As I wrote, I evaluate alternatives to ht://Dig. Currently
>I am looking at Netscape's Compass Server. NCS gives the possibility
>for a "site probe". Here is a screen snippet:

Actually NCS is being pretty naive in just using the size. The best way to
detect exact duplicates is with a checksum (e.g. md5sum). Since it's pretty
quick to generate a checksum, this isn't too slow.
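
To illustrate why size alone is a weak test, here is a minimal sketch in
Python (not ht://Dig code). The two page bodies and the helper name md5_of
are made up for the example; the only real API used is the standard
hashlib module.

```python
import hashlib

def md5_of(body: bytes) -> str:
    """Return the hex MD5 digest of a document body (hypothetical helper)."""
    return hashlib.md5(body).hexdigest()

# Two invented root documents that differ by a single character.
page_a = b"<html><body>Welcome to host A</body></html>"
page_b = b"<html><body>Welcome to host B</body></html>"

# A size-only probe would call these the same server...
same_size = len(page_a) == len(page_b)

# ...while a checksum comparison tells them apart.
same_checksum = md5_of(page_a) == md5_of(page_b)

print("same size:", same_size)
print("same checksum:", same_checksum)
```

Equal size is at best a hint; equal checksums are what I'd actually treat
as "same document".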

Though checking the root documents for checksums to determine duplicate
servers is an interesting idea, my personal approach would be to add in
checksumming in general for HTTP transactions and detect duplicate documents
no matter where they appear. There's a patch around to detect duplicate
files based on inodes for filesystem digging, but I hesitate to add it
before adding an HTTP version.

We have lots of links on our website and it's annoying to see duplicates in
search results. But the problem with duplicate detection is deciding which
duplicate to use! My current thought is to use the document with the lower
hopcount.
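
Roughly what I have in mind, as a Python sketch rather than anything
that exists in ht://Dig today. The Doc record, its field names and the
dedupe function are all assumptions for illustration; a tie on hopcount
simply keeps whichever copy was seen first.

```python
import hashlib
from typing import Iterable, List, NamedTuple

class Doc(NamedTuple):
    """Hypothetical record for a fetched document."""
    url: str
    hopcount: int   # number of links followed from the start URL
    body: bytes

def dedupe(docs: Iterable[Doc]) -> List[Doc]:
    """Keep one document per checksum, preferring the lower hopcount."""
    best = {}
    for doc in docs:
        key = hashlib.md5(doc.body).hexdigest()
        kept = best.get(key)
        # Replace the kept copy only if this one is strictly closer.
        if kept is None or doc.hopcount < kept.hopcount:
            best[key] = doc
    return list(best.values())

# Made-up example: the same page reachable under two host names.
docs = [
    Doc("http://www.example.org/a.html", 3, b"same content"),
    Doc("http://alias.example.org/a.html", 1, b"same content"),
    Doc("http://www.example.org/b.html", 2, b"other content"),
]
unique = dedupe(docs)
for doc in unique:
    print(doc.hopcount, doc.url)
```

This works the same whether the duplicates come from a CNAME, a virtual
host, or just a copied file on one server.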

Does this make sense?

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

