htdig: virtual hosts revisited


Walter Hafner (hafner@informatik.tu-muenchen.de)
Mon, 14 Dec 1998 11:28:21 +0100 (MET)


Hi!

I'm in the process of evaluating web crawler software for full-text
indexing purposes.

Currently we use ht://Dig 3.1.0b2 for indexing the whole
*.tu-muenchen.de domain. The domain consists of ~300 WWW servers that
answer to ~540 names (virtual hosts _and_ server aliases). All in all
there are ~130,000 documents to index. You can have a look at:
http://tum-index.ze.tu-muenchen.de/ (in German).

This amount of data shows the limits of ht://Dig. :-)

While I'm quite happy with ht://Dig in general, there are a few things
that annoy me:

1) The lack of support for German umlauts (ä, ö, ü).
2) The somewhat limited queries.
3) The inability to distinguish virtual hosts from mere CNAMEs.

Unfortunately I don't know C++ at all, so I can't supply patches. If the
code was in C ...

I think that ht://Dig could 'borrow' a simple yet clever method to solve
problem (3). As I wrote, I'm evaluating alternatives to ht://Dig.
Currently I'm having a look at Netscape's Compass Server (NCS). NCS
offers a "site probe". Here is a screen snippet:

------------------------------------------------------------
Site: http://gi.vo.tum.de:80/

[x] Show advanced DNS information

Checking URL: Doing DNS lookup....

GetHostByName() results for 'gi.vo.tum.de':
h_error: 0 - successful DNS query
Name: w3proj1.ze.tu-muenchen.de
aliases: gi.vo.tum.de
addrtype:2
length: 4
ip: 129.187.102.4

Result: gi.vo.tum.de is a valid name.
Note: gi.vo.tum.de appears to be an alias for the machine named w3proj1.ze.tu-muenchen.de .

Checking URL for Redirect: Trying to Connect to Site...
Result: No Server redirect detected at http://gi.vo.tum.de:80/

Checking host for Virtual Server: Trying to Connect to Site...
Result: http://gi.vo.tum.de:80/ is really a virtual server being hosted on the server
w3proj1.ze.tu-muenchen.de.
------------------------------------------------------------
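
The DNS part of such a probe can be approximated in a few lines. The
following is only a minimal sketch in Python, not the NCS implementation:
the function name probe_name and its resolver parameter are hypothetical,
and the demo at the end replays the lookup result shown above instead of
querying DNS, so it runs offline.

```python
import socket


def probe_name(name, resolver=socket.gethostbyname_ex):
    """Look up `name` and report whether it is an alias for another host.

    `resolver` must return (canonical_name, alias_list, ip_list), which is
    what socket.gethostbyname_ex does. It is a parameter only so that the
    sketch can be tried without network access (an assumption of this
    example, not something NCS or ht://Dig provides).
    """
    canonical, aliases, addresses = resolver(name)

    def norm(host):
        # Host names are case-insensitive; ignore a trailing root dot.
        return host.lower().rstrip(".")

    return {
        "name": canonical,
        "aliases": list(aliases),
        "ips": list(addresses),
        # The queried name is an alias if DNS reports a different
        # canonical name for it.
        "is_alias": norm(canonical) != norm(name),
    }


# Offline demo: replay the lookup result from the site probe above.
def replay(_name):
    return ("w3proj1.ze.tu-muenchen.de", ["gi.vo.tum.de"], ["129.187.102.4"])


print(probe_name("gi.vo.tum.de", resolver=replay))
```

Dropping the resolver argument makes the function perform a real lookup.
Note that this step only detects that a name is an alias; it cannot tell
a virtual host from a plain server alias.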

To distinguish virtual hosts from server aliases, NCS simply contacts
the two addresses that were returned by "GetHostByName()":

============================================================
gi.access_log:
sunhalle1.informatik.tu-muenchen.de - - [11/Dec/1998:12:54:37 +0100] "GET / HTTP/1.0" 200 911

tum.access_log:
sunhalle1.informatik.tu-muenchen.de - - [11/Dec/1998:12:54:37 +0100] "GET / HTTP/1.0" 200 2301
============================================================

(NCS runs on "sunhalle1"). NCS simply compares the root documents of the
two addresses. If they are the same, the alias is _possibly_ a server
alias of the server; if they are different, the alias is a virtual host.
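
The comparison described above could look roughly like this. It is a
sketch under assumptions, not what NCS actually does internally: the
names fetch_root and classify_alias are hypothetical, the documents are
compared by SHA-256 digest, and the fetch parameter exists only so the
logic can be tried without contacting any server.

```python
import hashlib
import urllib.request


def fetch_root(host, port=80, timeout=10):
    """Fetch the root document of http://host:port/.

    urllib sends a Host header derived from the URL, which is what lets
    a server pick the right virtual host. It also follows redirects,
    which the NCS probe reports as a separate check.
    """
    url = f"http://{host}:{port}/"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def classify_alias(alias, canonical, fetch=fetch_root):
    """Compare the root documents of an alias and its canonical name.

    Identical documents only suggest a server alias (two virtual hosts
    could serve the same page); different documents mean the alias is
    served as its own virtual host.
    """
    alias_digest = hashlib.sha256(fetch(alias)).hexdigest()
    canonical_digest = hashlib.sha256(fetch(canonical)).hexdigest()
    if alias_digest == canonical_digest:
        return "possible server alias"
    return "virtual host"


# Offline demo with made-up page bodies; only the sizes (911 and 2301
# bytes) are taken from the access logs above.
pages = {
    "gi.vo.tum.de": b"a" * 911,
    "w3proj1.ze.tu-muenchen.de": b"b" * 2301,
}
print(classify_alias("gi.vo.tum.de", "w3proj1.ze.tu-muenchen.de",
                     fetch=pages.__getitem__))
```

A byte-for-byte comparison is deliberately strict: pages that embed a
timestamp or a hit counter would differ on every request, so a real
crawler might need a more tolerant comparison.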

There might be some problems if one machine hosts several virtual
hosts, but in general that's a feature I'd _love_ to see in
ht://Dig. The last time I checked, ht://Dig indexed ~280,000 documents
in our domain, where 130,000 is a more realistic number. The
"server_aliases" directive didn't help much either. There are simply far
too many hosts to deal with manually!
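
With hundreds of names, the pairwise check would have to be applied to
every name at once. One possible way to do that, sketched here under
assumptions, is to group all names by the digest of their root document.
The function name group_by_root and the example.org host names are
hypothetical, and the root documents are assumed to have been fetched
already.

```python
import hashlib
from collections import defaultdict


def group_by_root(root_docs):
    """Group host names whose root documents are byte-identical.

    `root_docs` maps host name -> root document (bytes). The result is a
    sorted list of sorted host-name lists, one list per distinct root
    document. A list with more than one name is a set of candidate
    server aliases; a list with a single name is a host of its own.
    """
    groups = defaultdict(list)
    for host, body in root_docs.items():
        groups[hashlib.sha256(body).hexdigest()].append(host)
    return sorted(sorted(hosts) for hosts in groups.values())


# Offline demo with hypothetical hosts and made-up page bodies.
docs = {
    "www.example.org": b"<html>main site</html>",
    "web.example.org": b"<html>main site</html>",
    "project.example.org": b"<html>project pages</html>",
}
for names in group_by_root(docs):
    print(names)
```

Because the grouping looks only at the served content, it also covers
the case of one machine hosting several virtual hosts. Each group with
more than one name would still need a quick manual check before being
entered as aliases.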

Any comments on my suggestion?

-Walter Hafner

-- 
Walter Hafner_______________________________ hafner@in.tum.de
       <A href=http://www.tum.de/~hafner/>*CLICK*</A>
 The best observation I can make is that the BSD Daemon logo
 is _much_ cooler than that Penguin :-)   (Donald Whiteside)


