Re: [htdig] db sizes


Subject: Re: [htdig] db sizes
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Aug 08 2000 - 11:50:53 PDT


According to justin:
> I have got htdig running perfectly now. It is updating the index without
> re-reading all files:) The only problem I am having is that the db
> files are very large. These are the db files for ~600M of archived html
> mail:
>
> 591M db.docdb
> 591M db.docdb.work
> 11M db.docs.index
> 1.3G db.wordlist.work
> 1.6G db.words.db
> 4.1G total
>
> Will changing
> search_algorithm: exact:1 synonyms:0.5 endings:0.1
> to just exact:1 make the db any smaller?

No, for two reasons. First of all, only htsearch uses search_algorithm,
so changing it won't affect htdig. Secondly, the databases that htsearch
uses to support the synonyms and endings algorithms, which are generated
by htfuzzy, are relatively small, static files that aren't affected by
the words in your word database built by htdig and htmerge.

> I am also thinking the db are large not because of htdig but because of
> the email. I had used postal, an SMTP benchmark, to send the 600M of
> mail. Postal does not send English words but random ASCII garbage.
> Could this be why the db files are so large?

Well, indexing 600 MB of random ASCII garbage is not the way to get a
small, clean database. There are a few reasons why you have more than
600 MB of database, though. First of all, you have two copies of your
docdb database - the main one and the .work copy. Both of these will
contain the first "max_head_length" bytes of data from each document,
for excerpts in search results. Secondly, you have all your garbage
"words" in both db.wordlist.work and db.words.db (which is generated
from the former). Each "word" in these files will carry some overhead
as well, probably worsened by the fact that many of the words will be
quite short, so each of these two files is more than twice the size of
the data you indexed. Overall, your databases come to nearly 7 times
the size of the original data.
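
As a rough check on that arithmetic, here is a small Python sketch that
adds up the sizes from the listing and compares each file to the source
data. The sizes come straight from the listing; the conversion of "G" to
1024 "M" is an assumption, since the listing does not say which unit
convention was used.

```python
# Sizes as reported in the listing, in megabytes.
# Assumption: "G" means 1024 "M"; the listing does not say.
GB = 1024

db_files_mb = {
    "db.docdb": 591,
    "db.docdb.work": 591,
    "db.docs.index": 11,
    "db.wordlist.work": 1.3 * GB,
    "db.words.db": 1.6 * GB,
}
source_mb = 600  # "~600M of archived html mail"

total_mb = sum(db_files_mb.values())

# Print each file's size and its ratio to the indexed data.
for name, size in db_files_mb.items():
    print(f"{name:18s} {size:8.0f} MB  {size / source_mb:4.1f}x source")
print(f"{'total':18s} {total_mb:8.0f} MB  {total_mb / source_mb:4.1f}x source")
```

The two docdb copies each come out just under the size of the source,
while the two word files are each more than double it.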

In a real-life scenario, your overhead would likely be smaller.
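
One plausible contributor, offered here as an assumption rather than a
statement about htdig's on-disk format, is that real mail reuses a
limited vocabulary while random garbage almost never repeats a "word".
The toy Python sketch below counts distinct tokens in two made-up
streams of the same length. The function names, the vocabulary size and
the token lengths are all hypothetical and chosen only for illustration;
this does not model htdig itself.

```python
import random
import string

rng = random.Random(0)  # fixed seed so the run is repeatable


def random_garbage(n_words):
    """Return n_words tokens of 3 to 8 random lowercase letters each."""
    return [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 8)))
        for _ in range(n_words)
    ]


def repetitive_text(n_words, vocab_size=200):
    """Return n_words tokens drawn from a small made-up vocabulary."""
    vocab = [f"word{i}" for i in range(vocab_size)]
    return [rng.choice(vocab) for _ in range(n_words)]


n = 5000
garbage_entries = len(set(random_garbage(n)))
text_entries = len(set(repetitive_text(n)))

print("distinct words, random garbage: ", garbage_entries)
print("distinct words, repetitive text:", text_entries)
```

If an index keeps something like one entry per distinct word, the
garbage stream needs far more entries for the same amount of input.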

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



