Overview of htdig: Indexing in General

by Geoff Hutchison Copyright © 2000 Geoffrey Hutchison

Let's start off with an overview of what happens during indexing. I'm obviously not going to go into gory detail, at least in this installment, but I will give a rough overview of the classes involved and the calling sequence. Given the recent questions about performance in the 3.2 code, I'll also point out some particular areas that I know are holding things up.

Obviously the indexing starts with htdig/htdig.cc, the main program. This has a variety of command-line options, but the basics remain the same. The program starts by loading the config file, setting up a few program-wide variables like the Configuration object, and loading the initial list of URLs to be indexed. This list can now come from many sources: a file specified on the command line with -m, STDIN, the traditional start_url, and any URLs left in the log file if started with -l. (At this point, we may want to retire the -l flag and make this the default behavior. After all, it seems many people expect they can kill the indexer and still have usable databases.)
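In rough terms, assembling that initial list looks something like the sketch below. The names here (loadInitialURLs and so on) are simplified stand-ins for illustration, not the actual htdig interfaces, and the convention of "-" meaning STDIN is just for this example.

    // Sketch of how the initial URL list might be assembled at start-up.
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Collect start URLs from the -m file (or STDIN, shown here as "-"),
    // falling back to the start_url attribute from the config file.
    std::vector<std::string> loadInitialURLs(const std::string &minusMFile,
                                             const std::string &startURLAttr)
    {
        std::vector<std::string> urls;
        std::string url;

        if (!minusMFile.empty()) {
            if (minusMFile == "-") {            // read URLs from STDIN
                while (std::cin >> url)
                    urls.push_back(url);
            } else {                            // -m file: whitespace-separated URLs
                std::ifstream in(minusMFile.c_str());
                while (in >> url)
                    urls.push_back(url);
            }
        } else {                                // no -m: use start_url from the config
            std::istringstream attr(startURLAttr);
            while (attr >> url)
                urls.push_back(url);
        }
        return urls;
    }

    int main()
    {
        std::vector<std::string> seeds =
            loadInitialURLs("", "http://www.htdig.org/ http://www.htdig.org/docs/");
        std::cout << seeds.size() << " start URLs\n";
        return 0;
    }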

This initial list is passed off to a new Retriever object, which runs the whole indexing process. The Retriever creates a Server object for each new server, where a "server" corresponds to the first portion of the URL: scheme://user@host:port/. Each Server object is responsible for keeping its own queue of URLs to be indexed, the current state of the server, and a few other pieces of per-server state. When created, it first grabs the appropriate robots.txt file if the Server uses the HTTP or HTTPS protocol. Internally, the Servers keep their queues of URLs as an HtHeap, which is a priority queue. The priority is based first on the hopcount, ensuring that htdig performs a breadth-first search and accurately records the hopcount of each document.
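To make the queue ordering concrete, here is a minimal sketch of a hopcount-ordered URL queue. It uses std::priority_queue purely for illustration; the real code uses htdig's own HtHeap, and the types here are simplified stand-ins.

    // A per-server URL queue ordered by hopcount, so shallower documents
    // are always fetched first (breadth-first order).
    #include <iostream>
    #include <queue>
    #include <string>
    #include <vector>

    struct QueuedURL {
        std::string url;
        int         hopcount;   // number of links followed from a start_url
    };

    // Lower hopcount == higher priority, so invert the comparison.
    struct ByHopcount {
        bool operator()(const QueuedURL &a, const QueuedURL &b) const {
            return a.hopcount > b.hopcount;
        }
    };

    int main()
    {
        std::priority_queue<QueuedURL, std::vector<QueuedURL>, ByHopcount> queue;

        QueuedURL deep  = {"http://example.com/deep/page.html", 3};
        QueuedURL root  = {"http://example.com/", 0};
        QueuedURL about = {"http://example.com/about.html", 1};
        queue.push(deep);
        queue.push(root);
        queue.push(about);

        // URLs come back out shallowest-first: hopcount 0, then 1, then 3.
        while (!queue.empty()) {
            std::cout << queue.top().hopcount << "  " << queue.top().url << "\n";
            queue.pop();
        }
        return 0;
    }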

After the Retriever builds up its list of initial URLs, it enters its Start() method, which loops until every active Server has emptied its queue of URLs. When it gets a URL, it first checks to see if it has it in the database already. If it does, it uses the stored date to send an If-Modified-Since header to avoid wasting packets (and time!). If not, it creates a DocumentRef for it. Unfortunately, this check is SLOW: it must first look up the URL in the db.docs.index file and then retrieve the document record itself. One way to speed up initial indexing would be to use an Exists test simply to learn whether this is a new document. Also, keeping the URL->DocID mapping in memory somehow would greatly speed things up.
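The in-memory URL->DocID idea might look something like the sketch below. This is a possible optimization, not what the 3.2 code actually does, and the class and method names are hypothetical.

    // Load the URL -> DocID mapping once, then answer "have we seen this
    // URL before?" without touching the document database each time.
    #include <iostream>
    #include <map>
    #include <string>

    class URLIndexCache {
    public:
        // Record a known URL and its document ID (e.g., while walking
        // db.docs.index once at start-up).
        void add(const std::string &url, int docID) { index_[url] = docID; }

        // Cheap existence test: returns the DocID, or -1 for a new URL.
        int lookup(const std::string &url) const {
            std::map<std::string, int>::const_iterator it = index_.find(url);
            return it == index_.end() ? -1 : it->second;
        }

    private:
        std::map<std::string, int> index_;
    };

    int main()
    {
        URLIndexCache cache;
        cache.add("http://www.htdig.org/", 1);

        // Known URL: fetch its DocumentRef and send If-Modified-Since.
        // Unknown URL: skip the database and create a fresh DocumentRef.
        std::cout << cache.lookup("http://www.htdig.org/") << "\n";         // 1
        std::cout << cache.lookup("http://www.htdig.org/new.html") << "\n"; // -1
        return 0;
    }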

At this point, the Retriever passes control over to another singleton, the Document class. In some sense, this class serves as a jump-point for other classes. If the URL matches a local_urls rule, the Retriever calls Document::RetrieveLocal; otherwise, the Retrieve method decides which Transport object is appropriate based on the URL scheme. After the file is transferred, the Document class creates an appropriate Parsable class to parse the document. These Parsable classes report back to the Retriever by calling various got_*() methods, which the Retriever uses to fill in the appropriate fields in the document's DocumentRef.
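The scheme dispatch can be pictured with a schematic like this one. The real Document and Transport classes do considerably more; this only shows the shape of the decision.

    // Pick a transport from the URL scheme (the part before "://").
    #include <iostream>
    #include <string>

    enum TransportKind { TRANSPORT_HTTP, TRANSPORT_FILE, TRANSPORT_UNKNOWN };

    TransportKind pickTransport(const std::string &url)
    {
        if (url.compare(0, 7, "http://") == 0 || url.compare(0, 8, "https://") == 0)
            return TRANSPORT_HTTP;
        if (url.compare(0, 7, "file://") == 0)
            return TRANSPORT_FILE;
        return TRANSPORT_UNKNOWN;
    }

    const char *describe(TransportKind kind)
    {
        switch (kind) {
            case TRANSPORT_HTTP: return "HTTP transport";
            case TRANSPORT_FILE: return "local file transport";
            default:             return "no transport available";
        }
    }

    int main()
    {
        std::cout << describe(pickTransport("http://www.htdig.org/")) << "\n";
        std::cout << describe(pickTransport("file:///var/www/index.html")) << "\n";
        return 0;
    }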

Two of these methods are particularly important: got_word and got_href. The former sends a word off to the word database (which I'll skip for now) with flags set for the context in which the word appeared. The latter is obviously the key part of indexing. First off, the URL is normalized and checked against the various limits to make sure it's a valid URL to be indexed.
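The kind of filtering got_href performs looks roughly like the sketch below. The normalization and attribute handling are greatly simplified here; the real checks involve several configuration attributes (limit_urls_to, exclude_urls, and others) and much more thorough normalization.

    // Normalize a URL and check it against simple prefix/substring rules.
    #include <iostream>
    #include <string>
    #include <vector>

    // Drop the #fragment part; real normalization also lower-cases the host,
    // resolves relative paths, removes default ports, and so on.
    std::string normalize(const std::string &url)
    {
        std::string::size_type hash = url.find('#');
        return hash == std::string::npos ? url : url.substr(0, hash);
    }

    bool withinLimits(const std::string &url,
                      const std::vector<std::string> &limits,
                      const std::vector<std::string> &excludes)
    {
        bool ok = false;
        for (size_t i = 0; i < limits.size(); ++i)       // must match a limit prefix
            if (url.compare(0, limits[i].size(), limits[i]) == 0)
                ok = true;
        for (size_t i = 0; i < excludes.size(); ++i)     // must not match an exclusion
            if (url.find(excludes[i]) != std::string::npos)
                ok = false;
        return ok;
    }

    int main()
    {
        std::vector<std::string> limits(1, "http://www.htdig.org/");
        std::vector<std::string> excludes(1, "/cgi-bin/");

        std::string url = normalize("http://www.htdig.org/docs/index.html#top");
        std::cout << url << " -> "
                  << (withinLimits(url, limits, excludes) ? "index" : "skip") << "\n";
        return 0;
    }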

Next the URL is looked up in the document DB again (remember that slow URL->DocID lookup?) and either a new record is created or the existing document is updated with a new backlink. Finally, the method calls Need2Get to see whether the document still needs to be retrieved and, if so, pushes it onto the appropriate Server's queue. The Need2Get method simply checks the URL against a hash table, but this visited-URL table will obviously grow quite large; we should probably come up with a more efficient data structure and/or push it off to disk.
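A minimal sketch of the Need2Get idea, with std::set standing in for the hash table, looks like this. Note that it still keeps every visited URL in memory, which is exactly the scaling concern just mentioned.

    // Track every URL already queued or retrieved, consulted before
    // pushing a URL onto a Server's queue.
    #include <iostream>
    #include <set>
    #include <string>

    class VisitedURLs {
    public:
        // Returns true the first time a URL is seen (i.e., we still need
        // to get it) and false on every later call.
        bool need2Get(const std::string &url) {
            return visited_.insert(url).second;
        }

    private:
        std::set<std::string> visited_;
    };

    int main()
    {
        VisitedURLs seen;
        std::cout << seen.need2Get("http://www.htdig.org/") << "\n";  // 1: fetch it
        std::cout << seen.need2Get("http://www.htdig.org/") << "\n";  // 0: already queued
        return 0;
    }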

Obviously the process repeats itself until all the URLs have been indexed. I'll pick up next time with an overview of htsearch, the flip side of this process.


Last modified: $Date: 2001/01/22 01:21:58 $