[htdig3-dev] StringMatch and duplicate documents


Geoff Hutchison (ghutchis@wso.williams.edu)
Wed, 20 Jan 1999 08:55:06 -0400


* List: htdig3-dev@sob.htdig.org

Here's a summary of rebuilding my databases from scratch, before and after
the StringMatch changes.

Before:
-rw-r--r-- 1 htdig htdig 199081984 Jan 19 05:44 db.docdb
-rw-r--r-- 1 htdig htdig 199081984 Jan 19 05:40 db.docdb.work
-rw-r--r-- 1 htdig htdig 8492032 Jan 19 05:40 db.docs.index
-rw-r--r-- 1 htdig htdig 122348000 Jan 19 05:31 db.wordlist.work
-rw-r--r-- 1 htdig htdig 112433152 Jan 19 05:31 db.words.db

(No run output is available for this run, but both htdig and htmerge
reported around 57,000 documents.)

After:
-rw-r--r-- 1 htdig htdig 90511360 Jan 20 07:45 db.docdb
-rw-r--r-- 1 htdig htdig 90511360 Jan 20 07:44 db.docdb.work
-rw-r--r-- 1 htdig htdig 3305472 Jan 20 07:43 db.docs.index
-rw-r--r-- 1 htdig htdig 38475835 Jan 20 07:41 db.wordlist.work
-rw-r--r-- 1 htdig htdig 37135360 Jan 20 07:41 db.words.db

htdig: Run complete
htdig: 1 server seen:
htdig: wso.williams.edu:80 52906 documents
htdig: Errors to take note of:

htmerge: Total word count: 86809
htmerge: Total documents: 22320
htmerge: Total doc db size (in K): 114880

While I doubt there are any duplicate documents in the dbs after htmerge,
there seem to be *missing* documents: htdig saw 52906 documents, but
htmerge reports only 22320. Is anyone else concerned about the huge
difference between htdig and htmerge?
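
To put rough numbers on the gap, the figures quoted above can be compared
with a few lines of Python. This is only an illustrative sketch: the
variable names and the shrink_percent helper are made up for the example
and are not part of ht://Dig, and the numbers are copied by hand from the
listings and run output in this message.

```python
# Document counts copied from the "After" run output above.
htdig_docs = 52906     # "wso.williams.edu:80 52906 documents"
htmerge_docs = 22320   # "htmerge: Total documents: 22320"

# File sizes in bytes, copied from the two ls listings above.
sizes_before = {
    "db.docdb": 199081984,
    "db.docs.index": 8492032,
    "db.words.db": 112433152,
}
sizes_after = {
    "db.docdb": 90511360,
    "db.docs.index": 3305472,
    "db.words.db": 37135360,
}


def shrink_percent(before, after):
    """Return how much smaller `after` is than `before`, in percent."""
    return 100.0 * (before - after) / before


# Documents that htdig saw but htmerge did not report.
missing_docs = htdig_docs - htmerge_docs
missing_percent = shrink_percent(htdig_docs, htmerge_docs)
print(f"missing after htmerge: {missing_docs} ({missing_percent:.1f}%)")

# Size reduction of each database file between the two runs.
for name, before in sizes_before.items():
    after = sizes_after[name]
    print(f"{name}: {before} -> {after} bytes "
          f"({shrink_percent(before, after):.1f}% smaller)")
```

The comparison only restates the numbers above in relative terms; it says
nothing about whether the drop comes from the StringMatch changes or from
something else in htmerge.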

-Geoff


