Re: htdig: DB2 problem

Geoff Hutchison
Mon, 16 Nov 1998 16:46:41 -0500

At 3:09 PM -0500 11/16/98, Jeff Hill wrote:
>I can't remember, but it seems larger by 50MB or so (could be we just
>keep adding so much). I am, however, running htdig-3.1.0b2, installed
>Nov. 6.

Two possibilities for larger DB: 1) You're adding more (I have several
mailing list archives that grow exponentially). 2) The DB bug was hiding
the actual size of your data.

>So, your dbs are actually slightly larger than your document base? Well,
>if htdig didn't fail, I suppose mine might be slightly larger too,
>although it should still have enough space.

This depends significantly on the max_head_length you use (i.e. the size of
the excerpts you store). When I get pinched for disk space, I cut this down.
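
As a rough illustration, the excerpt size is a single attribute in the htdig
configuration file. The value here is an arbitrary example, not a
recommendation, and the file name is an assumption:

```
# htdig.conf fragment (sketch; 256 is an arbitrary illustrative value)
max_head_length: 256
```

A smaller value stores a shorter excerpt per document, so the change should
only show up in the database size after the next full dig.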

>Am I right in assuming that running "htdig -i -v -s" isn't creating a
>temporary set of databases and then writing them to the db directory?
>Because if it did, I'd need over 500MB free on the hard disk, and I
>wouldn't have that much space free.

I don't believe htdig does this. On the other hand, htmerge uses temporary
sets plus sort files. :-(

>> It's true, with a badwords list where I put in all meaningless words I was
>> able to spot using contrib/wordfreq/. That almost halved database size.
>I'll have to take a look at that, thanks.

I can't attest to halving, but it does help. I didn't use wordfreq, but I
used "cut -f 1 db.wordlist | uniq -c | sort -r" to determine how many
documents each word was in, then I took the top 500 and edited the list.
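
A slightly more defensive sketch of the same pipeline follows. It assumes the
word list is tab-separated with the word in the first field and one line per
(word, document) pair, which may not match every db.wordlist. The function
name top_words and the default of 500 are only for illustration. Compared with
the one-liner above, it sorts before uniq in case the list is not already
grouped by word, and uses a numeric sort so the ordering does not depend on how
uniq pads its counts.

```shell
#!/bin/sh
# Hypothetical helper: list the words that appear on the most lines.
# Assumes a tab-separated word list with the word in field 1 and one
# line per (word, document) pair; the real db.wordlist layout may differ.
top_words() {
    wordlist=$1
    limit=${2:-500}
    cut -f 1 "$wordlist" |
        sort |               # uniq -c only merges adjacent duplicates
        uniq -c |            # prefix each word with its line count
        sort -rn |           # numeric sort, largest count first
        head -n "$limit" |
        awk '{ print $2 }'   # drop the count, keep the word
}
```

Something like "top_words db.wordlist 500 > candidates.txt" gives a starting
list. It still needs editing by hand, since frequent words are not always
meaningless ones.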

-Geoff Hutchison
Williams Students Online

