Subject: Re: [htdig] db sizes
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Tue Aug 08 2000 - 11:50:53 PDT
According to justin:
> I have got htdig running perfectly now. It is updating the index without
> re-reading all files:) The only problem I am having is that the db
> files are very large. These are the db files for ~600M of archived html
> 591M db.docdb
> 591M db.docdb.work
> 11M db.docs.index
> 1.3G db.wordlist.work
> 1.6G db.words.db
> 4.1G total
> Will changing
> search_algorithm: exact:1 synonyms:0.5 endings:0.1
> to just exact:1 make the db any smaller?
No, for two reasons. First of all, only htsearch uses search_algorithm,
so changing it won't affect htdig. Secondly, the databases that htsearch
uses to support the synonyms and endings algorithms, which are generated
by htfuzzy, are relatively small, static files that aren't affected by
the words in your word database built by htdig and htmerge.
> I am also thinking the db are large not because of htdig but because of
> the email. I had used postal, a smtp benchmark to send the 600M of
> mail. Postal does not send english words but random ASCII garbage,
> Could this be why the db files are so large?
Well, indexing 600 MB of random ASCII garbage is not the way to get a
small, clean database. There are a few reasons why you have more than
600 MB of database, though. First of all, you have two copies of your
docdb database - the main one and the .work copy. Both of these will
contain the first "max_head_length" bytes of data from each document,
for excerpts in search results. Secondly, you have all your garbage
"words" in both db.wordlist.work and db.words.db (which is generated
from the former). Each "word" in these files will carry some overhead
as well, probably worsened by the fact that many of the words will be
quite small, so each of these files is bigger than the whole set of data
you indexed. So, overall, you have well over 4 times the size of the
data you indexed tied up in database files.

In a real-life scenario, your overhead would likely be smaller.
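
As a rough check of the arithmetic above, here is a minimal Python sketch
that adds up the file sizes from the quoted listing and compares them with
the ~600 MB of source data. It assumes the listing uses "M" for megabytes
and "G" for 1024 megabytes, as ls -h style output normally does; the helper
names (to_mb, total_mb, overhead_ratio) are made up for this illustration
and are not part of ht://Dig.

```python
# Sizes copied from the quoted listing; the units are an assumption
# (M = megabytes, G = 1024 megabytes, as in "ls -lh" style output).
REPORTED = {
    "db.docdb": "591M",
    "db.docdb.work": "591M",
    "db.docs.index": "11M",
    "db.wordlist.work": "1.3G",
    "db.words.db": "1.6G",
}

# Approximate size of the indexed mail archive, in MB, from the question.
SOURCE_MB = 600.0

UNITS = {"M": 1.0, "G": 1024.0}


def to_mb(size):
    """Convert a string such as '591M' or '1.3G' to megabytes."""
    return float(size[:-1]) * UNITS[size[-1]]


def total_mb(sizes):
    """Sum a {filename: size-string} mapping, in megabytes."""
    return sum(to_mb(s) for s in sizes.values())


def overhead_ratio(sizes, source_mb):
    """Ratio of total database size to the size of the indexed data."""
    return total_mb(sizes) / source_mb


if __name__ == "__main__":
    work = {name: s for name, s in REPORTED.items() if name.endswith(".work")}
    print("all db files : %.1f MB" % total_mb(REPORTED))
    print(".work copies : %.1f MB" % total_mb(work))
    print("ratio        : %.1fx the indexed data" % overhead_ratio(REPORTED, SOURCE_MB))
```

With the figures as listed, the total comes out close to the 4.1G reported,
and the two .work files account for a bit under half of it.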
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930