htdig: virtual hosts revisited

Walter Hafner
Mon, 14 Dec 1998 11:28:21 +0100 (MET)


I'm in the process of evaluating Webcrawler software for full-text
indexing purposes.

Currently we use ht://Dig 3.1.0b2 for indexing the whole
* domain. The domain consists of ~300 WWW servers that
answer to ~540 names (virtual hosts _and_ server aliases). All in all
there are ~130,000 documents to index. You can have a look at: (German).

This amount of data shows the limits of ht://Dig. :-)

While I'm quite happy with ht://Dig in general, there are a few things
that annoy me:

1) The lack of support for German umlauts.
2) The somewhat limited queries.
3) The inability to distinguish virtual hosts from mere CNAMEs.

Unfortunately I don't know C++ at all, so I can't supply patches. If the
code was in C ...

I think that ht://Dig could 'borrow' a simple yet clever method to solve
problem (3). As I wrote, I am evaluating alternatives to ht://Dig. Currently
I am looking at Netscape's Compass Server. NCS offers a "site probe".
Here is a screen snippet:


[x] Show advanced DNS information

Checking URL: Doing DNS lookup....

GetHostByName() results for '':
h_error: 0 - successful DNS query
length: 4

Result: is a valid name.
Note: appears to be an alias for the machine named .

Checking URL for Redirect: Trying to Connect to Site...
Result: No Server redirect detected at

Checking host for Virtual Server: Trying to Connect to Site...
Result: is really a virtual server being hosted on the server
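
The DNS step of such a probe could be sketched roughly as follows. This is
only an illustration, not how NCS or ht://Dig actually do it: the function
name describe_name and the injectable resolver parameter are assumptions
made for this sketch, and only socket.gethostbyname_ex is a real library
call.

```python
import socket


def describe_name(name, resolver=socket.gethostbyname_ex):
    """Return a small dict describing what DNS reports for `name`.

    `resolver` is a hypothetical hook so the function can be tried
    without network access; by default the real lookup is used.
    """
    canonical, aliases, addresses = resolver(name)

    def norm(host):
        # Compare names case-insensitively and ignore a trailing dot.
        return host.lower().rstrip(".")

    return {
        "name": name,
        "canonical": canonical,
        "aliases": aliases,
        "addresses": addresses,
        # A differing canonical name suggests `name` is an alias (e.g. a CNAME).
        "is_alias": norm(canonical) != norm(name),
    }
```

Note that a name with its own address record pointing at the same machine
would not be reported as an alias by this check, so the DNS step alone
cannot settle the question.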

To distinguish virtual hosts from server aliases, NCS simply contacts
the two addresses that were returned by "GetHostByName()":

gi.access_log: - - [11/Dec/1998:12:54:37 +0100] "GET / HTTP/1.0" 200 911

tum.access_log: - - [11/Dec/1998:12:54:37 +0100] "GET / HTTP/1.0" 200 2301

(NCS runs on "sunhalle1"). NCS simply compares the root documents of the
two addresses. If they are the same, the alias is _possibly_ a server
alias of the server; if they are different, the alias is a virtual host.
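
The comparison described above might look something like this minimal
sketch. It assumes plain HTTP on port 80 and a byte-for-byte comparison of
the root documents; the names fetch_root and classify_alias are
hypothetical and are not taken from NCS or ht://Dig.

```python
import http.client


def fetch_root(host_name, port=80, timeout=10):
    """GET / from `host_name` and return (status, body)."""
    conn = http.client.HTTPConnection(host_name, port, timeout=timeout)
    try:
        # http.client sends a Host header for `host_name` automatically.
        conn.request("GET", "/")
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


def classify_alias(alias, canonical, fetch=fetch_root):
    """Compare the root documents served under two names.

    `fetch` is an injectable hook (an assumption of this sketch) so the
    logic can be exercised without contacting real servers.
    """
    if fetch(alias) == fetch(canonical):
        # Identical root pages do not prove the sites are identical.
        return "possible server alias"
    return "virtual host"
```

A root page with dynamic content (a date or a counter, say) would make two
true aliases look different, so a real implementation would probably need a
more tolerant comparison.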

There might be some problems if one machine hosts several virtual
hosts, but in general that's a feature I'd _love_ to see in
ht://Dig. The last time I checked, ht://Dig indexed ~280,000 documents
in our domain, where 130,000 is a more realistic number. The
"server_aliases" directive didn't help much either. There are simply far
too many hosts to deal with manually!
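
For several hundred names, one way to avoid the manual work might be to
group the names by a digest of their root documents and review only the
groups with more than one member. The sketch below assumes a caller-supplied
fetch function that returns the root document of a name as bytes; that
function and the name group_by_root_document are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_by_root_document(names, fetch):
    """Group host names whose root documents are byte-for-byte identical.

    `fetch(name)` is assumed to return the root document as bytes.
    Returns a list of groups, each a sorted list of names.
    """
    groups = defaultdict(list)
    for name in names:
        digest = hashlib.sha256(fetch(name)).hexdigest()
        groups[digest].append(name)
    return [sorted(members) for members in groups.values()]
```

Groups with more than one name are only candidates for "server_aliases"
entries; as noted above, identical root documents make an alias possible,
not certain.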

Any comments on my suggestion?

-Walter Hafner

Walter Hafner_______________________________
 The best observation I can make is that the BSD Daemon logo
 is _much_ cooler than that Penguin :-)   (Donald Whiteside)
