[htdig3-dev] Re: Architecture overview


Subject: [htdig3-dev] Re: Architecture overview
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Sat Feb 26 2000 - 12:45:19 PST


I promised that after 3.2.0b1, I'd start writing some architecture
overviews. This is the first installment--I'll post them to the list
and then add them to dev.htdig.org. I hope other developers will
write a few themselves, and I'd love feedback on these--are they too
broad, too specific, should I use code examples, etc.?

-Geoff

Overview of htdig: Indexing in General

Let's start off with an overview of what happens during indexing. I'm
obviously not going to go into gory detail, at least in this
installment, but I'll give a rough walk-through of the classes involved
and the calling sequence. Given the recent questions about performance
in the 3.2 code, I'll also mention some particular areas that I know
are holding things up.

Obviously the indexing starts with htdig/htdig.cc, the main
program. This has a variety of command-line options, but the basics
remain the same. The program starts by loading the config file,
setting up a few program-wide variables like the Configuration object
and loading the initial list of URLs to be indexed. This list can now
come from many sources, including a file specified on the command line
with -m, STDIN, the traditional start_url, and any URLs left in the
log file, if started with -l. (At this point, we may want to make the -l
flag obsolete and make this the default behavior. After all, it seems many
people expect they can kill the indexer and still have usable databases.)
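
To make that concrete, here is a minimal sketch of how the initial URL
list could be assembled from those sources. This is not the actual
htdig.cc code--the function name and plain std::vector are stand-ins
for illustration, and how STDIN is selected ("-m -") is just an
assumption for the sketch; only -m, -l, and start_url come from the
description above.

    // Sketch only: gather initial URLs from a -m file, STDIN, the
    // start_url attribute, or the log file when restarting with -l.
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    std::vector<std::string> initial_urls(const std::string& minus_m_file,
                                          const std::string& start_url,
                                          const std::string& log_file,
                                          bool restart_from_log)
    {
        std::vector<std::string> urls;
        auto read_lines = [&urls](std::istream& in) {
            std::string line;
            while (std::getline(in, line))
                if (!line.empty())
                    urls.push_back(line);
        };

        if (minus_m_file == "-") {
            // read the list from STDIN; the exact "-m -" spelling is an
            // assumption for this sketch
            read_lines(std::cin);
        } else if (!minus_m_file.empty()) {
            std::ifstream f(minus_m_file);      // -m <file>
            read_lines(f);
        } else {
            urls.push_back(start_url);          // traditional start_url
        }

        if (restart_from_log) {                 // -l: pick up leftover URLs
            std::ifstream f(log_file);
            read_lines(f);
        }
        return urls;
    }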

This initial list is passed off to a new Retriever object, which runs the
whole indexing process. The Retriever creates a Server object for each
new server, where a server is identified by the first portion of the URL:
scheme://user@host:port/
The Server object is responsible for keeping its own queue of URLs to
be indexed, the current state of the server, and a few other bits of
bookkeeping. When created, it first grabs the appropriate robots.txt
file if the Server uses the HTTP or HTTPS protocol. Internally, the
Servers keep the queue of URLs as an HtHeap, otherwise known as a
priority queue. The priority is based first on the hopcount, ensuring
that htdig performs a breadth-first search and accurately represents
the hopcount of each document.
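
As a rough illustration of that per-server queue--this is not the real
HtHeap class; std::priority_queue and the QueuedUrl struct are stand-ins
I'm inventing for the example:

    #include <queue>
    #include <string>
    #include <vector>

    struct QueuedUrl {
        std::string url;
        int hopcount;   // links followed from a start_url to reach this URL
    };

    // Lower hopcount = higher priority, which gives the breadth-first
    // order described above and keeps each document's hopcount accurate.
    struct ByHopcount {
        bool operator()(const QueuedUrl& a, const QueuedUrl& b) const {
            return a.hopcount > b.hopcount;   // min-heap on hopcount
        }
    };

    using ServerQueue =
        std::priority_queue<QueuedUrl, std::vector<QueuedUrl>, ByHopcount>;

    // Usage: q.push({"http://example.com/page.html", 2}); q.top() is
    // always the shallowest URL still pending for that server.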

After the Retriever builds up its list of initial URLs, it enters its
Start() method, which loops until all active servers have emptied their
queues of URLs. When it gets a URL, it first checks to see if it is
already in the database. If it is, it uses the stored date to send an
If-Modified-Since header to avoid wasting packets (and time!). If not,
it creates a DocumentRef for it. Unfortunately, this check is SLOW--it
must first look up the URL in the db.docs.index file and then retrieve
the document record itself. One way to speed up initial indexing would
be to use an Exists test simply to learn whether this is a new document.
Also, keeping the URL->DocID list in memory somehow would greatly speed
things up.
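
For the "keep the URL->DocID list in memory" idea, here is a sketch of
what such a cache might look like. This is purely hypothetical--nothing
like it exists in the code yet, and the class and method names are made
up for the example:

    #include <string>
    #include <unordered_map>

    class UrlDocIdCache {
    public:
        // Return true and fill in docid if the URL is already known,
        // without touching db.docs.index at all.
        bool lookup(const std::string& url, int& docid) const {
            auto it = ids_.find(url);
            if (it == ids_.end())
                return false;
            docid = it->second;
            return true;
        }

        // Record a freshly assigned DocID for a new URL.
        void insert(const std::string& url, int docid) { ids_[url] = docid; }

    private:
        std::unordered_map<std::string, int> ids_;
    };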

At this point, the Retriever passes control over to another singleton,
the Document class. In some sense, this class serves as a jump-point
for other classes. If the URL has a local_urls rule, the Retriever
calls Document::RetrieveLocal, otherwise, the Retrieve method decides
which Transport object is appropriate based on the URL scheme. After
the file is transferred, the Document class creates an appropriate
Parsable class to parse the document. These Parsable classes report
back to the Retriever, calling various got_*() methods, which the
Retriever uses to fill in the appropriate fields in the file's
DocumentRef.
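
Here's a rough sketch of that dispatch-and-callback shape. The names
echo the classes above (Transport, got_word, got_href), but the
signatures are simplified assumptions for illustration, not the actual
interfaces in the source:

    #include <memory>
    #include <string>

    // The Retriever-side interface the parsers call back into.
    struct RetrieverCallbacks {
        virtual void got_word(const std::string& word, int flags) = 0;
        virtual void got_href(const std::string& url,
                              const std::string& text) = 0;
        virtual ~RetrieverCallbacks() {}
    };

    // One Transport per URL scheme; each knows how to fetch a document.
    struct Transport {
        virtual std::string fetch(const std::string& url) = 0;
        virtual ~Transport() {}
    };
    struct HttpTransport : Transport {
        std::string fetch(const std::string&) { return std::string(); }  // real fetching elided
    };
    struct FileTransport : Transport {
        std::string fetch(const std::string&) { return std::string(); }  // real fetching elided
    };

    // Pick a Transport from the URL scheme, roughly what Document::Retrieve
    // is described as doing above.
    std::unique_ptr<Transport> transport_for(const std::string& url) {
        if (url.compare(0, 7, "http://") == 0 ||
            url.compare(0, 8, "https://") == 0)
            return std::unique_ptr<Transport>(new HttpTransport);
        return std::unique_ptr<Transport>(new FileTransport);  // e.g. file://
    }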

Two of these methods are particularly important: got_word and got_href.
The former sends a word off to the word database (which I'll skip for now)
with flags set for the word's context. The latter is the key part of
indexing. First off, the URL is normalized and checked against the various
limits to make sure it's a valid URL to be indexed.
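
To illustrate the context flags, something along these lines--the flag
names and values here are invented for the example, not the real word
flags in the database:

    enum WordContext {
        IN_TEXT    = 1 << 0,   // ordinary body text
        IN_TITLE   = 1 << 1,   // inside <title>
        IN_HEADING = 1 << 2,   // inside <h1>..<h6>
        IN_ANCHOR  = 1 << 3    // link text
    };

    // A parser would emit something like
    //     got_word("architecture", IN_TITLE);
    // and the Retriever would record the word, the current DocID and the
    // OR'd context bits in the word database.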

Next the URL is looked up in the document DB again (remember that slow
URL->DocID lookup?), and either a new record is created or the old document
is updated with a new backlink. Finally, the method calls Need2Get to see
whether the document has already been retrieved and, if not, pushes it onto
the appropriate server's queue.
The Need2Get method simply checks the URL against a hash table, but the
visited hash table will obviously grow quite large--we should probably come
up with a more efficient data structure and/or pass it off to disk.
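
A minimal sketch of a Need2Get-style check over an in-memory hash set--
this mirrors the behavior described above; the real data structure (and
any spill-to-disk strategy) is left open:

    #include <string>
    #include <unordered_set>

    class VisitedUrls {
    public:
        // Returns true exactly once per URL: the first caller should fetch
        // it, later callers should skip it.
        bool need_to_get(const std::string& normalized_url) {
            return visited_.insert(normalized_url).second;
        }

    private:
        std::unordered_set<std::string> visited_;   // grows with the crawl
    };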

Obviously the process repeats itself until all the URLs have been indexed.
I'll pick up next time with an overview of htsearch, the flip side of this
process.



