Re: htdig: virtual hosts revisited


Walter Hafner (hafner@informatik.tu-muenchen.de)
Tue, 15 Dec 1998 11:28:20 +0100 (MET)


Geoff Hutchison writes:
> At 5:28 AM -0500 12/14/98, Walter Hafner wrote:
> >1) The lack of support for German umlauts (ä, ö, ü)
>
> My suggestion would be to look at the locale option.

Oops, sorry. I stand corrected. Missed that one.

> >2) The somewhat limited queries.
>
> I think you'll have to be more specific. I'd say we easily cover the 80/20
> rule. From my search logs, most people put in text.

I'd like to have real substring search and case-sensitive search. And
while I'm dreaming, a regexp subset would be nice. :-)

Simple prefix search is sometimes just too restrictive. Consider the
words "prefix-search" vs. "prefix search": in the indexing step you'll
end up with the database entry "prefixsearch" vs. the two entries
"prefix" and "search", depending on the setting of valid_punctuation,
of course.
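To make the point concrete, here is a tiny C sketch of my own (not
ht://Dig code; the punctuation set ".-_" is just an example):

#include <stdio.h>
#include <string.h>

/*
 * Sketch: remove the characters listed in valid_punctuation from a word
 * before it is indexed.  "prefix-search" collapses into "prefixsearch",
 * while "prefix search" is split on the blank first and so ends up as
 * the two separate entries "prefix" and "search".
 */
static void strip_punctuation(const char *word, const char *punct,
                              char *out, size_t outlen)
{
    size_t j = 0;
    const char *p;

    for (p = word; *p != '\0' && j + 1 < outlen; p++)
        if (strchr(punct, *p) == NULL)
            out[j++] = *p;
    out[j] = '\0';
}

int main(void)
{
    char buf[64];

    strip_punctuation("prefix-search", ".-_", buf, sizeof buf);
    printf("%s\n", buf);        /* prints "prefixsearch" */
    return 0;
}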

> >3) The inability to distinguish virtual hosts from mere CNAMEs.
> >I think that ht://Dig could 'borrow' a simple yet clever method to solve
> >problem (3). As I wrote, I'm evaluating alternatives to ht://Dig. At the
> >moment I'm looking at Netscape's Compass Server. NCS offers a "site
> >probe". Here is a screen snippet:
>
> Actually NCS is being pretty naive in just using the size. The best way to
> detect exact duplicates is with a checksum (e.g. md5sum). Since it's pretty
> quick to generate a checksum, this isn't too slow.

I don't know what algorithm NCS uses. I just did a "site probe" and
noticed accesses for both names (actual and alias). I have no idea what
NCS does with this information. However, I think you're right in
suggesting MD5 checksums (I'm a FreeBSD admin, after all ... :-)
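Something along these lines is what I picture (a quick sketch of my own
using FreeBSD's libmd, not anything from ht://Dig; it assumes the root
document has already been fetched under both names and saved to the two
made-up local files named below; compile with -lmd):

#include <stdio.h>
#include <string.h>
#include <md5.h>                 /* FreeBSD libmd */

/*
 * Sketch: treat an alias and the "real" server name as the same virtual
 * host when the root documents fetched under both names have identical
 * MD5 checksums.
 */
static int md5_of_file(const char *path, unsigned char digest[16])
{
    MD5_CTX ctx;
    unsigned char buf[8192];
    size_t n;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    MD5Init(&ctx);
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        MD5Update(&ctx, buf, (unsigned int)n);
    fclose(fp);
    MD5Final(digest, &ctx);
    return 0;
}

int main(void)
{
    unsigned char real[16], alias[16];

    if (md5_of_file("root.www.example.org", real) == 0 &&
        md5_of_file("root.alias.example.org", alias) == 0) {
        if (memcmp(real, alias, 16) == 0)
            printf("identical root documents -- same virtual host\n");
        else
            printf("different root documents -- separate hosts\n");
    }
    return 0;
}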

> Though checking the root documents for checksums to determine duplicate
> servers is an interesting idea, my personal approach would be to add in
> checksumming in general for HTTP transactions and detect duplicate documents
> no matter where they appear. There's a patch around to detect duplicate
> files based on inodes for filesystem digging, but I hesitate to add it
> before adding an HTTP version.

That would be great, of course. As I already wrote, I don't know C++,
but I imagine that holding checksums for ~130,000 URLs (in my case)
results in HUGE memory consumption. ht://Dig 3.1.0b2 already wants 120
MB on my machine. :-)
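(Then again, a quick back-of-the-envelope: 130,000 URLs x 16 bytes per
raw MD5 digest is only about 2 MB, so the real cost would be in whatever
structure holds them rather than in the digests themselves.)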

> We have lots of links on our website and it's annoying to see duplicates in
> search results. But the problem with duplicate detection is deciding which
> duplicate to use! My current thought is to use the document with the lower
> hopcount.
>
> Does this make sense?

As I wrote in another mail: why not use the lower hopcount, _unless_ the
name is explicitly stated in server_aliases?
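Roughly this, in C terms (just a sketch of the rule I mean; the Doc
structure and the server_aliases lookup are made up for illustration,
not ht://Dig's real internals):

#include <stdio.h>
#include <string.h>

/* Made-up stand-ins, only for illustration. */
struct Doc {
    const char *host;      /* server name the document was fetched from */
    int         hopcount;  /* link distance from the start_url */
};

static const char *server_aliases[] = { "www.example.org", NULL };

static int in_server_aliases(const char *host)
{
    int i;

    for (i = 0; server_aliases[i] != NULL; i++)
        if (strcmp(server_aliases[i], host) == 0)
            return 1;
    return 0;
}

/*
 * The rule: of two documents with the same checksum, keep the one whose
 * name the admin listed explicitly in server_aliases; only if neither
 * (or both) is listed, fall back to the lower hopcount.
 */
static const struct Doc *pick_duplicate(const struct Doc *a,
                                        const struct Doc *b)
{
    int a_listed = in_server_aliases(a->host);
    int b_listed = in_server_aliases(b->host);

    if (a_listed != b_listed)
        return a_listed ? a : b;
    return (a->hopcount <= b->hopcount) ? a : b;
}

int main(void)
{
    struct Doc a = { "alias.example.org", 1 };
    struct Doc b = { "www.example.org",   3 };

    printf("keep %s\n", pick_duplicate(&a, &b)->host);  /* www.example.org */
    return 0;
}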

Regards,

-Walter

-- 
Walter Hafner_______________________________ hafner@in.tum.de
       <A href=http://www.tum.de/~hafner/>*CLICK*</A>
 The best observation I can make is that the BSD Daemon logo
 is _much_ cooler than that Penguin :-)   (Donald Whiteside)
----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-request@sdsu.edu containing the single word "unsubscribe" in
the body of the message.


