[htdig3-dev] Re: StringMatch and duplicate documents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 20 Jan 1999 13:18:24 -0600 (CST)


* List: htdig3-dev@sob.htdig.org

According to Geoff Hutchison:
> > > While I doubt there are any duplicate documents in the dbs after htmerge,
> > > there seem to be *missing* documents. Is anyone else concerned about the
> > > huge difference between htdig and htmerge?
> >
> > Houston, we have a problem... :) Did you try the StringMatch patches in
> > isolation? I'm wondering if the first or second patch is the problem, or
> > both.
>
> Alas, I tried them at the same time--I'm running the current CVS tree.
> I'm going to start debugging by running just htdig, which returned a
> number of documents in the right ballpark (I know I have around 50,000
> webpages based on link checking.)
>
> Then I'm going to take a look at the db and put some debugging code into
> htmerge.
>
> Has anyone else noticed missing pages?

Not me. See my results below...

> BTW, I did a "ls -lR | grep htm" on my webserver and found 70,000+ files.
> So 50,000 is even a low number--I'm assuming there aren't 20,000 files that
> aren't linked to anything. Tonight I'm going to compare the "ls -lR" output
> to the dump of the database. If anyone can beat me to a solution, I'll be
> very happy.

A "grep htm" could find a lot more than just *.htm or *.html files. Also,
does your server's robots.txt exclude any of these files? On my system,
I only index a bit more than a quarter of all the html files I have under
/home/httpd/html.
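
To illustrate how far a bare substring match can overcount, here is a minimal Python sketch. The sample file names and the rule that only names ending in .htm or .html count as pages are assumptions made for illustration, not anything taken from Geoff's server.

```python
import re

# Assumption: a "real" HTML page is a name ending in .htm or .html
# (case-insensitive). Adjust to whatever the server actually serves.
HTML_EXT = re.compile(r"\.html?$", re.IGNORECASE)

def compare_counts(names):
    """Return (substring_hits, extension_hits) for a list of file names."""
    # Roughly what "grep htm" does: any line containing the substring.
    substring_hits = [n for n in names if "htm" in n]
    # Stricter test: the name must actually end in .htm or .html.
    extension_hits = [n for n in names if HTML_EXT.search(n)]
    return substring_hits, extension_hits

# Hypothetical sample names, including backups and a directory.
sample = ["index.html", "page.htm", "notes.html.bak",
          "htmldocs", "old.htm~", "logo.gif"]
loose, strict = compare_counts(sample)
print(len(loose), len(strict))
```

On "ls -lR" output the substring match would also catch directory header lines and any owner, group or path name that happens to contain "htm", so the gap could be wider still.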

Anyway, here are my test results:

*** 3.1.0b4 ***
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 13042
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482
total 7208
-rw-r--r-- 1 root root 1947648 Jan 20 12:09 db.docdb
-rw-r--r-- 1 root root 58368 Jan 20 12:09 db.docs.index
-rw-r--r-- 1 root root 430080 Jan 20 12:09 db.metaphone.db
-rw-r--r-- 1 root root 322560 Jan 20 12:09 db.soundex.db
-rw-r--r-- 1 root root 1990766 Jan 20 12:09 db.wordlist
-rw-r--r-- 1 root root 2593792 Jan 20 12:09 db.words.db
*** 3.1.0dev-011799 ***
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 12912
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482
total 7078
-rw-r--r-- 1 root root 1946624 Jan 20 12:13 db.docdb
-rw-r--r-- 1 root root 56320 Jan 20 12:13 db.docs.index
-rw-r--r-- 1 root root 316416 Jan 20 12:13 db.metaphone.db
-rw-r--r-- 1 root root 313344 Jan 20 12:13 db.soundex.db
-rw-r--r-- 1 root root 1989674 Jan 20 12:12 db.wordlist
-rw-r--r-- 1 root root 2587648 Jan 20 12:12 db.words.db
*** 3.1.0dev-011799 with H-P's first StringMatch patch ***
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 12912
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482
total 7078
-rw-r--r-- 1 root root 1946624 Jan 20 12:36 db.docdb
-rw-r--r-- 1 root root 56320 Jan 20 12:36 db.docs.index
-rw-r--r-- 1 root root 316416 Jan 20 12:37 db.metaphone.db
-rw-r--r-- 1 root root 313344 Jan 20 12:37 db.soundex.db
-rw-r--r-- 1 root root 1989674 Jan 20 12:36 db.wordlist
-rw-r--r-- 1 root root 2587648 Jan 20 12:36 db.words.db
*** 3.1.0dev-011799 with H-P's third version of his 2nd StringMatch patch ***
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 12912
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482
total 7078
-rw-r--r-- 1 root root 1946624 Jan 20 12:39 db.docdb
-rw-r--r-- 1 root root 56320 Jan 20 12:39 db.docs.index
-rw-r--r-- 1 root root 316416 Jan 20 12:39 db.metaphone.db
-rw-r--r-- 1 root root 313344 Jan 20 12:39 db.soundex.db
-rw-r--r-- 1 root root 1989674 Jan 20 12:39 db.wordlist
-rw-r--r-- 1 root root 2587648 Jan 20 12:39 db.words.db

I don't know why the total word count dropped from 13042 in b4 to 12912
in dev-011799, but maybe Didier's patch to the db.wordlist field order
had something to do with it. In any case, Hans-Peter's StringMatch
patches didn't seem to affect my htdig/htmerge stats at all. Maybe some
other change to the source tree since the 011799 snapshot is to blame?
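
For anyone repeating this comparison on a larger site, a small script can pull the htmerge totals out of two runs and report only what changed. This is a rough sketch that assumes the stats lines look exactly like the ones pasted above; the function names are made up for the example.

```python
import re

# Assumption: each stat line has the form "htmerge: Total <name>: <number>",
# as in the output pasted above.
STAT_LINE = re.compile(r"^htmerge: (Total [^:]+): (\d+)$")

def parse_stats(output):
    """Collect the htmerge totals from one run into a dict."""
    stats = {}
    for line in output.splitlines():
        m = STAT_LINE.match(line.strip())
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def diff_stats(old, new):
    """Return only the stats whose values differ between two runs."""
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}

# Totals copied from the 3.1.0b4 and 3.1.0dev-011799 runs above.
b4 = """htmerge: Total word count: 13042
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482"""
dev = """htmerge: Total word count: 12912
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482"""

changed = diff_stats(parse_stats(b4), parse_stats(dev))
print(changed)
```

With the numbers above, only the word count differs between the two runs; the document count and doc db size are unchanged.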

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


