Geoff Hutchison (ghutchis@wso.williams.edu)
Sun, 11 Jul 1999 01:00:03 -0400 (EDT)
OK, I have the beginnings of the new word database code on my drive. I
haven't updated htmerge or htsearch yet, so I'm not going to commit it to
the tree just yet. Hopefully I'll have time to do that tomorrow.
The key benefits of this code are that no sorting is needed, every word of
every document is indexed with location (for phrase searching), and the
databases don't require a separate merge phase to prepare them for
searching. (In principle you could even dig on live databases, but without
the updated htsearch, I can't really test that ;-)
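To illustrate the idea, here is a minimal sketch of an index with one small record per word occurrence. It is not the actual ht://Dig record layout: the WordIndex class, its method names, and the (word, doc_id, location) key are assumptions made up for the example, and a sorted Python list stands in for the database's ordered storage. Because keys are kept in order as they are inserted, there is no separate sort or merge step, and the stored locations make a simple two-word phrase check possible.

```python
# Hypothetical sketch: one record per word occurrence, keyed by
# (word, doc_id, location). Not the real ht://Dig on-disk format.
from bisect import bisect_left, insort


class WordIndex:
    def __init__(self):
        # Sorted list of keys; stands in for an ordered (B-tree style) database.
        self.keys = []

    def add_document(self, doc_id, text):
        """Index every word of the document together with its position."""
        for location, word in enumerate(text.lower().split()):
            insort(self.keys, (word, doc_id, location))

    def occurrences(self, word):
        """Return (doc_id, location) pairs for word, in key order."""
        start = bisect_left(self.keys, (word,))
        result = []
        for w, doc_id, location in self.keys[start:]:
            if w != word:
                break
            result.append((doc_id, location))
        return result

    def phrase(self, first, second):
        """Return doc IDs where `second` immediately follows `first`."""
        following = set(self.occurrences(second))
        return sorted({doc for doc, loc in self.occurrences(first)
                       if (doc, loc + 1) in following})


index = WordIndex()
index.add_document(1, "the new word database code")
index.add_document(2, "the database uses new word code")
print(index.phrase("word", "database"))
```

The index is usable after every add_document call, which is the property that would allow searching while a dig is still running.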
Here are the stats on the database sizes for indexing the first 100 pages
of www.htdig.org. I don't have timings, but the 3.2 prototype feels
significantly slower. I hope that's just the difference between compiling
with -g and -O3, but I'll take a look for performance problems tomorrow...
Digging (and merging) with 3.1.2:
-rw-rw-r-- 1 ghutchis ghutchis 1591296 Jul 11 00:41 db.docdb
-rw-rw-r-- 1 ghutchis ghutchis 8192 Jul 11 00:41 db.docs.index
-rw-rw-r-- 1 ghutchis ghutchis 846477 Jul 11 00:41 db.wordlist
-rw-rw-r-- 1 ghutchis ghutchis 1052672 Jul 11 00:41 db.words.db
Total (K): 3436
Total w/o wordlist (K): 2604
Digging with 3.2 prototype:
-rw-rw-r-- 1 ghutchis ghutchis 687104 Jul 11 00:39 db.docdb
-rw-rw-r-- 1 ghutchis ghutchis 328704 Jul 11 00:39 db.docs.index
-rw-rw-r-- 1 ghutchis ghutchis 583680 Jul 11 00:39 db.excerpts
-rw-rw-r-- 1 ghutchis ghutchis 394240 Jul 11 00:39 db.words.db
Total (K): 1777
(deleting db.docs.index is possible, but not a big savings)
I'm rather surprised by this. I thought that storing every word would
bloat the word db... Instead, it's about 40% of the size of the original! I'm
hoping I don't have a blatant bug, but my guess is that the database can
compress the separate words more efficiently since each record is shorter
(remember, the previous version used a list of document ID/weights as each
record).
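For anyone who wants to check the ratio, a quick sketch of the arithmetic using the byte sizes from the two listings above. The variable names are just for illustration. The word db works out to a little over 37% of the 3.1.2 size, which is where the "about 40%" comes from.

```python
# File sizes in bytes, copied from the two directory listings above.
old = {"db.docdb": 1591296, "db.docs.index": 8192,
       "db.wordlist": 846477, "db.words.db": 1052672}
new = {"db.docdb": 687104, "db.docs.index": 328704,
       "db.excerpts": 583680, "db.words.db": 394240}

# Size of the 3.2 prototype's word db relative to the 3.1.2 word db.
word_db_ratio = new["db.words.db"] / old["db.words.db"]
print(f"db.words.db: {word_db_ratio:.1%} of the 3.1.2 size")
```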
I hope to wrap this up quickly so we can start hammering on it and looking
for performance problems. If this isn't the right direction, we need to
decide that soon.
-Geoff