[htdig] htmerge/rundig functionality & disk space usage

Subject: [htdig] htmerge/rundig functionality & disk space usage
From: Susan Alderman (Susan_Alderman@brown.edu)
Date: Wed Dec 15 1999 - 15:15:38 PST

Hi -

As a newbie with htdig, I'm looking for some confirmation and clarification
of how htmerge/rundig work, and some assistance battling low disk space.
I'm afraid I've packed in a lot of questions here - most of the answers
should be yes/no, I think.

Here's what I want to do: run htdig against one webserver
(call it main) once a day, and run htdig against some other webservers
once a week, merging all these results into one database. So,
with the appropriate flags, here's the plan (and my understanding of
how this works):

htdig -c main.conf (main webserver goes into database)
htmerge -c main.conf (indexing of main webserver data)
htdig -c sub1.conf (dig of webserver sub1)
htmerge -m main.conf -c sub1.conf (merge sub1 into main database)

[the next night]

htdig -c main.conf
htmerge -c main.conf [Is sub1 still in the main database?]
htdig -c sub2.conf
htmerge -m main.conf -c sub2.conf [Now I have 3 servers in
                                        the main index, right?]

[and so on and so forth]
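
For what it's worth, the schedule above could be driven by a small wrapper
script run once a day. This is only a sketch: the sub1-on-Monday,
sub2-on-Tuesday rotation is a made-up assumption, the htdig/htmerge command
lines are copied verbatim from the plan above (including the -m/-c order,
which has not been checked against the htmerge documentation), and by default
it only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch of the daily/weekly schedule described in the plan above.
# RUN defaults to "echo", so the commands are only printed (dry run).
# Setting RUN to an empty string makes the script execute them instead.
RUN=${RUN-echo}

# dig_plan DAY_OF_WEEK
# DAY_OF_WEEK is 1..7 as printed by `date +%u` (1 = Monday).
dig_plan() {
    day=$1

    # Every day: dig and index the main webserver.
    $RUN htdig -c main.conf
    $RUN htmerge -c main.conf

    # Hypothetical weekly rotation: one subsidiary server per chosen day.
    case $day in
        1) sub=sub1.conf ;;
        2) sub=sub2.conf ;;
        *) sub= ;;
    esac

    if [ -n "$sub" ]; then
        $RUN htdig -c "$sub"
        # Merge line copied from the plan; the -m/-c direction is unverified.
        $RUN htmerge -m main.conf -c "$sub"
    fi
}

# Print (or run) the plan for today.
dig_plan "$(date +%u)"
```

Saved under a hypothetical name such as plan.sh, `RUN= sh plan.sh` would
execute the commands instead of printing them.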

I had rundig set up to index the main server and the subsidiary servers
every night, but I've been running out of disk space (corrupted indices and
all sorts of ugliness have resulted). My thinking is that by spreading out
the indexing of the subsidiary webservers, I may reduce the amount of disk
space required for the merging. Am I out to lunch here - is there
something I'm missing?
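
Since the corruption came from running out of space partway through, one
guard would be to check free space before starting a dig at all. A minimal
sketch, assuming a POSIX df; the DBDIR default and the 500000 KB threshold
are placeholders, not measured requirements:

```shell
#!/bin/sh
# free_kb DIR: print the free space, in kilobytes, on the filesystem holding DIR.
# `df -Pk` prints POSIX-format output in 1024-byte units; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_space DIR NEEDED_KB: succeed only if at least NEEDED_KB is free.
enough_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Hypothetical usage: DBDIR and the threshold are placeholders.
DBDIR=${DBDIR:-.}
if enough_space "$DBDIR" 500000; then
    echo "enough space in $DBDIR to start a dig"
else
    echo "skipping dig: less than 500000 KB free in $DBDIR"
fi
```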

The standard rundig has some (to me) cryptic messages about saving disk space:

># If you're low on disk space and you don't mind completely reindexing
># every time you run this script, you can always
># rm $DBDIR/db.wordlist

BUT - does this mean that no one can run a search query while I'm reindexing?

># OR
># If you'd rather run update digs all the time with the minimal databases
># Keep only the following files (and don't call htdig with -i):
># db.docdb, db.docdb.work, db.docs.index, db.wordlist.work, db.words.db

Let me see if I've got this straight: htdig creates db.docdb and db.docdb.work.
htmerge creates db.docs.index, db.wordlist.work and db.words.db. htsearch
uses db.docdb, db.docs.index, and db.words.db. If I want to have my indices
searchable at the same time as I'm running htdig/htmerge, I'll need working
copies of the (three, assuming I have that right) databases that
htdig/htmerge create and htsearch uses. (Including this info on the page
would be very helpful.)
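
To make the "working copies" idea concrete, here is a rough sketch of
publishing freshly built databases from a work directory into the directory
htsearch reads. It assumes my list of three files above is right and that the
dig has been pointed at a separate work directory; the function name and both
directory arguments are hypothetical.

```shell
#!/bin/sh
# publish_dbs WORKDIR LIVEDIR
# Copy the three files listed above as used by htsearch from WORKDIR into
# LIVEDIR, then rename each one into place. A rename within one filesystem
# is atomic, so a searcher sees either the old or the new file, never a
# half-written one. The three files are still swapped one after another,
# not as a single unit.
publish_dbs() {
    work=$1
    live=$2

    # Refuse to touch the live copies unless all three new files exist.
    for f in db.docdb db.docs.index db.words.db; do
        [ -f "$work/$f" ] || { echo "missing $work/$f" >&2; return 1; }
    done

    for f in db.docdb db.docs.index db.words.db; do
        cp "$work/$f" "$live/$f.new" || return 1
        mv "$live/$f.new" "$live/$f" || return 1
    done
}
```

Note the cost: while each file is being copied, the live directory briefly
holds both the old and the new version of it, which works against the goal of
saving disk space.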

There was some note on the mailing list about doing separate merges of some
database files, and how that saved disk space. (I use htmerge -m to specify

Also, this means no running htfuzzy, right? If I don't run htfuzzy, I
don't get the (VERY NICE) feature of ending expansion? I've had a look, and the
databases formed by htfuzzy (db.metaphone.db and db.soundex.db) are some of the
smaller ones - does this really gain me that much?
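
For comparing the database sizes, something like the following lists the db.*
files in a directory by size, largest first. It is just a sketch; db_sizes is
a made-up helper name, and sizes are byte counts from wc rather than disk
blocks.

```shell
#!/bin/sh
# db_sizes DIR: print "<bytes><TAB><path>" for each db.* file in DIR,
# sorted numerically with the largest file first.
db_sizes() {
    for f in "$1"/db.*; do
        [ -f "$f" ] || continue   # skip the unexpanded pattern if nothing matches
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done | sort -rn
}
```

With $DBDIR set as in rundig, `db_sizes "$DBDIR"` would show at a glance
whether the htfuzzy databases are worth worrying about.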

Thanks in advance for the assistance!


Susan Alderman Susan_Alderman AT brown.edu
Box 1885 vox: 401-863-9466
CIS, Brown University fax: 401-863-7329
Providence, RI 02912


This archive was generated by hypermail 2b28 : Wed Dec 15 1999 - 15:26:00 PST