Re: [htdig] Databases -- Read-access modules. (3.1.5)


Subject: Re: [htdig] Databases -- Read-access modules. (3.1.5)
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Tue Mar 21 2000 - 11:00:11 PST


On Tue, 21 Mar 2000 Sphboc@aol.com wrote:

> db.words.db
> db.docdb
> db.docs.index
>
> Presumably, these are in some fairly-standard database format; if I could
> determine what this is, and obtain field lists, it would be a major step
> forward.

You'll be *much* happier parsing db.wordlist for the word database, which
is an ASCII file. You'll also be much happier using the -t flag for htdig
and parsing the resulting db.docs text file.

Both files hold one record per line (records are separated by \n
characters). Fields are separated by tabs, with a label before each field
(label:field).

The wordlist format is:
word <tab> i:DocID <tab> l:location <tab> w:weight <tab> c:count <tab> a:anchor

Note that count and anchor are optional: each is omitted from the record
when it holds its default value.
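
For illustration, here is a minimal Python sketch of reading one db.wordlist
record in the format above. The function name parse_wordlist_line is made up
for this example and is not part of ht://Dig. Values are kept as strings, and
a missing c: or a: field comes back as None, since the defaults are not
spelled out here.

```python
def parse_wordlist_line(line):
    """Split one db.wordlist record into the word and its labelled fields.

    Hypothetical helper: field labels follow the format described above
    (i, l, w, c, a). Optional fields that are absent are returned as None.
    """
    parts = line.rstrip("\n").split("\t")
    record = {"word": parts[0], "i": None, "l": None,
              "w": None, "c": None, "a": None}
    for field in parts[1:]:
        # Each remaining field looks like "label:value".
        label, sep, value = field.partition(":")
        if sep:  # skip anything that is not in label:value form
            record[label] = value
    return record


if __name__ == "__main__":
    # Made-up sample record, not taken from a real index.
    print(parse_wordlist_line("apple\ti:12\tl:345\tw:3\n"))
```

Reading the whole file is then a matter of calling this once per line and
converting whichever values you need to integers.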

The fields in db.docs are a bit more complex, but if you're willing to
read the source, they're in DocumentDB.cc under "CreateSearchDB". The key
fields are the DocID and the URL (the first two).
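
As a rough sketch of how a db.docs record could be taken apart without
knowing every label in advance, the Python below splits a line on tabs and
then splits each field at its first colon only, so that a URL value keeps
its own colons. The function name split_labelled_fields and the labels in
the sample line ("u", "t") are assumptions for this example; check the real
labels against CreateSearchDB. A field with no colon is returned with a
label of None, and an unlabelled value that itself contains a colon would be
mis-split.

```python
def split_labelled_fields(line):
    """Return the fields of one tab-separated record as (label, value) pairs.

    Hypothetical helper: splits each field at the FIRST colon only, so
    values such as URLs keep their remaining colons intact.
    """
    pairs = []
    for field in line.rstrip("\n").split("\t"):
        label, sep, value = field.partition(":")
        if sep:
            pairs.append((label, value))
        else:
            # No colon at all: treat the whole field as an unlabelled value.
            pairs.append((None, field))
    return pairs


if __name__ == "__main__":
    # Made-up sample record; the labels here are placeholders.
    sample = "7\tu:http://example.com/a.html\tt:Example page\n"
    for label, value in split_labelled_fields(sample):
        print(label, value)
```

Keeping the pairs in order means the first two entries of the result are the
DocID and URL fields mentioned above.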

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



