Re: [htdig] i need help on htdig database format


Subject: Re: [htdig] i need help on htdig database format
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Nov 25 1999 - 12:39:25 PST


According to ronald:
> when htdig exports results from an index as textformat it generates two
> files. The files look like this :
>
> file1:
> 0 u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software a:0 m:936027636 s:373 h: h: l:940510479 L:2 I:373 d:http://www.htdig.org/\1www.htdig.org\1ht://Dig Search Software (yes, the developers use it)ht://DigParent Directory A:

First field: doc ID
u: URL of doc
t: doc title
a: doc state (refer to source)
m: date/time last modified, sec since 1970-01-01 00:00:00 UTC
s: doc size in bytes
h: doc head (excerpt of first max_head_length bytes of doc)
h: (2nd) meta description contents
                (reusing h for this is a bug - it really should be a
                 unique letter like D or something)
l: date/time document was indexed (sec since 1970)
L: no. of links doc has to other docs
I: "docImageSize" - has nothing to do with images, but seems to
                contain document size, and may be cumulative in some
                circumstances - can anyone else make any sense of this?
d: link descriptions - text of links to this doc, ^A separated
A: anchor names (bookmarks) in doc, ^A separated

All fields are tab (^I) separated. Sub-fields of d & A use a ^A separator
(shown as \1 in the sample above). In the doc head field, all runs of white
space (space, tab, newline, etc.) are collapsed to single spaces.
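
For anyone who wants to read these records from a script, here is a rough
Python sketch based only on the field list above. The names parse_doc_record
and to_utc are made up for illustration and are not part of ht://Dig. It
assumes one record per line, tab-separated fields, and a literal ^A (0x01)
byte between the sub-fields of d and A.

```python
from datetime import datetime, timezone

INT_FIELDS = ("a", "m", "s", "l", "L", "I")   # numeric fields from the list above
LIST_FIELDS = ("d", "A")                      # ^A-separated sub-fields


def parse_doc_record(line):
    """Parse one tab-separated document record into a dict (illustrative only)."""
    fields = line.rstrip("\n").split("\t")
    rec = {"id": int(fields[0]), "h": []}
    for field in fields[1:]:
        key, sep, value = field.partition(":")
        if not sep:
            continue  # not a key:value field, skip it
        if key == "h":
            # h appears twice: 1st is the doc head, 2nd the meta description
            rec["h"].append(value)
        elif key in LIST_FIELDS:
            rec[key] = value.split("\x01") if value else []
        elif key in INT_FIELDS:
            rec[key] = int(value) if value else None
        else:
            rec[key] = value  # u and t are kept as plain strings
    return rec


def to_utc(seconds):
    """Convert an m: or l: value (seconds since 1970-01-01 UTC) to a datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

Note that the key is split off at the first colon only, so values that
contain colons themselves (such as the URL in u:) come through intact.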

> file2:

This is db.wordlist...

> 01oct99 i:115 l:0 w:100998 c:2
> 01oct99 i:116 l:0 w:100998 c:2
> 01oct99 i:45 l:6 w:100381 c:2
> 01oct99 i:46 l:0 w:100998 c:2
> 02aug1999 i:48 l:361 w:639 a:2
> 02jun1999 i:50 l:262 w:1382 c:2 a:2
> 02mar1999 i:53 l:378 w:622 a:2
> 02may1999 i:51 l:280 w:1349 c:2 a:2

First field: indexed word (lower case)
i: doc ID (to match up with records from above)
l: location of word in doc (0-1000, i.e. in units of a tenth of a percent)
w: weight of word in searches
c: no. of occurrences of word in document, if > 1
a: index into "A:" list above, to indicate which anchor name,
                if any, preceded this word
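
The word records can be picked apart the same way. Again this is only a
sketch in Python: parse_word_record is a made-up name, and splitting on any
white space is an assumption, since the separator for this file is not
stated above. A missing c: is taken to mean a single occurrence, as the
field list implies, and a missing a: is stored as None.

```python
def parse_word_record(line):
    """Parse one word record such as '02jun1999 i:50 l:262 w:1382 c:2 a:2'."""
    parts = line.split()
    # c is only written when > 1, so default to 1; a is absent if no anchor
    rec = {"word": parts[0], "c": 1, "a": None}
    for part in parts[1:]:
        key, sep, value = part.partition(":")
        if sep and value:
            rec[key] = int(value)
    return rec


def location_percent(rec):
    """Convert l (0-1000, tenths of a percent) to a percentage of the doc."""
    return rec["l"] / 10.0
```

To tie a word back to its document, match the i value against the doc ID
in the first field of the document records.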

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



