Re: htdig: virtual hosts revisited


Webmaster (webmaster@vallnet.com)
Mon, 14 Dec 1998 15:21:58 -0600


Earlier part of message deleted for brevity's sake
>Actually NCS is being pretty naive in just using the size. The best way to
>detect exact duplicates is with a checksum (e.g. md5sum). Since it's pretty
>quick to generate a checksum, this isn't too slow.
>
>Though checking the root documents for checksums to determine duplicate
>servers is an interesting idea, my personal approach would be to add in
>checksumming in general for HTTP transactions and detect duplicate documents
>no matter where they appear. There's a patch around to detect duplicate
>files based on inodes for filesystem digging, but I hesitate to add it
>before adding an HTTP version.
>
>We have lots of links on our website and it's annoying to see duplicates in
>search results. But the problem with duplicate detection is deciding which
>duplicate to use! My current thought is to use the document with the lower
>hopcount.
>
>Does this make sense?
>
>-Geoff Hutchison
>Williams Students Online
>http://wso.williams.edu/
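
To make the quoted idea concrete, here is a rough sketch (not htdig code) of
checksum-based duplicate detection that keeps the copy with the lowest
hopcount. The function names, the (url, hopcount, body) tuple layout, and the
example URLs are assumptions made up for illustration only.

```python
import hashlib


def pick_canonical(docs):
    """Map each MD5 digest to the (url, hopcount) of its preferred copy.

    docs is an iterable of (url, hopcount, body_bytes) tuples. This layout is
    an assumption for the sketch, not how htdig stores documents.
    """
    best = {}
    for url, hopcount, body in docs:
        digest = hashlib.md5(body).hexdigest()
        current = best.get(digest)
        # Keep the first copy seen unless a later one has a lower hopcount.
        if current is None or hopcount < current[1]:
            best[digest] = (url, hopcount)
    return best


def canonical_urls(docs):
    """Return the sorted URLs that survive duplicate removal."""
    return sorted(url for url, _ in pick_canonical(docs).values())
```

Something like canonical_urls([("http://a.example/index.html", 2, b"hello"),
("http://b.example/", 1, b"hello")]) would keep only the b.example copy,
since both bodies hash to the same digest and it has the lower hopcount.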
Shortest hopcount seems somewhat reasonable.
You might also want to prefer the copy with the shortest URL (or is that what
you mean by lowest hopcount?), or maybe add some kind of 'server ranking'.
For example, personal.vallnet.com, newt.vallnet.com, and tip.vallnet.com are
all the same machine (long story). We would prefer that personal web pages
show up in search results as located on personal.vallnet.com, so it would be
nice to be able to give pages on personal.vallnet.com 'precedence' over
newt.vallnet.com.
(For the convenience of the people whose pages are on the server, we need to
keep personal, newt, and tip working for those who use absolute URLs in their
web pages, and htdig needs to follow all of them for thoroughness' sake.)
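
A rough sketch of the 'server ranking' idea, in case it helps the discussion:
among copies already known to be duplicates, prefer the best-ranked host, then
the lowest hopcount, then the shortest URL. This is not htdig code. The
SERVER_RANK list, the function names, and the example paths are hypothetical,
and the position of tip relative to newt is an arbitrary assumption.

```python
from urllib.parse import urlsplit

# Hypothetical precedence list: hosts earlier in the list win.
# Only "personal before newt" comes from the message; the rest is assumed.
SERVER_RANK = ["personal.vallnet.com", "newt.vallnet.com", "tip.vallnet.com"]


def preference_key(url, hopcount, ranking=SERVER_RANK):
    """Sort key: smaller tuples are preferred."""
    host = urlsplit(url).hostname or ""
    # Hosts not in the list rank after every listed host.
    rank = ranking.index(host) if host in ranking else len(ranking)
    return (rank, hopcount, len(url))


def choose_duplicate(candidates, ranking=SERVER_RANK):
    """Pick one URL from (url, hopcount) pairs that are the same document."""
    best = min(candidates, key=lambda c: preference_key(c[0], c[1], ranking))
    return best[0]
```

With this ordering, a page reached as http://newt.vallnet.com/~user/ at
hopcount 1 would still lose to http://personal.vallnet.com/~user/ at
hopcount 3, because host rank is compared first.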

    Eric Esslinger
        Webmaster, Valley Internet Inc, Tnco Internet llc.



