Re: htdig: Problems with using htdig -a


Joe R. Jah (jjah@cloud.ccsf.cc.ca.us)
Thu, 17 Sep 1998 22:23:56 -0700 (PDT)


On Thu, 17 Sep 1998, Geoff Hutchison wrote:

> Date: Thu, 17 Sep 1998 23:47:47 -0400
> From: Geoff Hutchison <Geoffrey.R.Hutchison@williams.edu>
> To: htdig@sdsu.edu
> Subject: htdig: Problems with using htdig -a
>
> Hi,
>
> I consider the following a bug, since it's not documented. Fortunately
> there's an easy workaround.
>
> I normally run the dig with the switch -a to use alternate files (allowing
> others to search as I'm digging). Usually I don't use the switch -i, so it
> should do an "update" dig and index only the changed or new files (which
> should be a small subset of the 50,000 pages). Then the script moves the
> files into place at the end of the run.
>
> However, when using "-a" I wasn't seeing an update of the database.
> Essentially htdig looks at the db.docs.work file and finds it empty. So it
> updates the empty db by doing a full initial dig. :-(
>
> Here's an example solution: (yes, you might want to ignore the first cp
> commands and change the first two mv commands to cp)
>
> BASEDIR=/opt/htdig
> cp $BASEDIR/db/db.wordlist $BASEDIR/db/db.wordlist.work
> cp $BASEDIR/db/db.docdb $BASEDIR/db/db.docdb.work
> $BASEDIR/bin/htdig -a -s
> $BASEDIR/bin/htmerge -a -s
> mv $BASEDIR/db/db.wordlist.work $BASEDIR/db/db.wordlist
> mv $BASEDIR/db/db.docdb.work $BASEDIR/db/db.docdb
> mv $BASEDIR/db/db.docs.index.work $BASEDIR/db/db.docs.index
> mv $BASEDIR/db/db.words.db.work $BASEDIR/db/db.words.db
>
> This changed a 1 hr. 30 min. dig into a 15 min dig, even counting the
> shuffling of files. Faster is better. :-)

I have 2809 documents on a local server; I also use the -a switch; it
normally takes about 12 minutes to run rundig. I tried your easy workaround
and got the following results:

According to the report I have 3128 documents; it took about 14 minutes to
rundig. The size of my db files increased by about 30%:

-rw-r--r-- 1 jjah www 13281280 Sep 17 21:36 db.docdb
-rw-r--r-- 1 jjah www 10482688 Sep 17 02:33 db.docdb.old
-rw-r--r-- 1 jjah www 398336 Sep 17 21:35 db.docs.index
-rw-r--r-- 1 jjah www 343040 Sep 17 02:33 db.docs.index.old
-rw-r--r-- 1 jjah www 22928417 Sep 17 21:36 db.wordlist
-rw-r--r-- 1 jjah www 17329728 Sep 17 02:32 db.wordlist.old
-rw-r--r-- 1 jjah www 19543040 Sep 17 21:34 db.words.db
-rw-r--r-- 1 jjah www 15352832 Sep 17 02:32 db.words.db.old

I assume this increase in the size of the db files and the increase in the
reported number of documents will be cumulative over time if one uses this
workaround. It will probably increase the actual search time as well ;(
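
For anyone who wants to check the growth on their own databases, the
per-file figures can be worked out from a listing like the one above with a
few lines of shell. This is only a rough sketch: pct_growth is a made-up
helper name, not part of ht://Dig, and the byte counts are simply copied
from my listing.

```shell
#!/bin/sh
# pct_growth OLD NEW
# Hypothetical helper (not part of ht://Dig): prints the percentage
# growth from OLD to NEW, rounded to one decimal place.
pct_growth() {
    awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - old) * 100 / old }'
}

# Byte counts copied from the listing above: old size first, new size second.
echo "db.docdb      $(pct_growth 10482688 13281280)%"
echo "db.docs.index $(pct_growth 343040 398336)%"
echo "db.wordlist   $(pct_growth 17329728 22928417)%"
echo "db.words.db   $(pct_growth 15352832 19543040)%"

# Reported document count: 2809 before the workaround, 3128 after.
echo "documents     $(pct_growth 2809 3128)%"
```

To compare live files instead of pasted numbers, the sizes could be read
with something like `wc -c < db.docdb.old` and `wc -c < db.docdb` and passed
to the same function.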

Joe

     _/ _/_/_/ _/ ____________ __o
     _/ _/ _/ _/ ______________ _-\<,_
 _/ _/ _/_/_/ _/ _/ ......(_)/ (_)
  _/_/ oe _/ _/. _/_/ ah jjah@cloud.ccsf.cc.ca.us




This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:27:45 PST