[htdig3-dev] Re: [htdig3-dev] Re: StringMatch and duplicate documents


Geoff Hutchison (ghutchis@wso.williams.edu)
Wed, 20 Jan 1999 14:33:20 -0500 (EST)


* List: htdig3-dev@sob.htdig.org

> A "grep htm" could find a lot more than just *.htm or *.html files. Also,
> does your server's robots.txt exclude any of these files? On my system,
> I only index a bit more than a quarter of all the html files I have under
> /home/httpd/html.

Yes, I agree the grep will over-estimate. I'm going to try several ways of
estimating the number. Our server doesn't have a robots.txt file, excludes
only "cgi-bin ?" and doesn't have any (significant--maybe 100 pages)
password-protected areas.

Another way a filesystem count over-estimates is that it ignores links:
there may be lots of files that no page links to, so htdig never finds them.
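
A rough sketch of a tighter filesystem estimate than "grep htm", assuming
Python is available. The function name count_html_files and the directory
passed to it are hypothetical, not part of ht://Dig. It counts names that
merely contain "htm" next to names that actually end in .htm or .html, so the
two figures can be compared. It still cannot tell which files are reachable by
links, so even the stricter count remains an upper bound.

```python
import os

def count_html_files(root):
    """Return (exact, loose) counts of files under root.

    exact: names ending in .htm or .html (case-insensitive)
    loose: names containing "htm" anywhere, roughly what "grep htm"
           would match on a file listing
    """
    exact = 0
    loose = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lower = name.lower()
            if "htm" in lower:
                loose += 1
            if lower.endswith((".htm", ".html")):
                exact += 1
    return exact, loose
```

For example, count_html_files("/home/httpd/html") would return both counts for
the document root quoted above; the gap between them is the part of the
over-estimate that comes from loose name matching alone.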

> htdig: www.scrc.umanitoba.ca:80 410 documents
> htmerge: Total word count: 13042
> htmerge: Total documents: 419

Have you ever wondered why htmerge sees more documents than htdig? You
clearly don't see the same problem that I do, but I still wonder about
your results. Have you ever compared the db before and after merging?

> maybe Didier's patch to the db.wordlist field order had something to do

Yes, Didier's patch helps eliminate more duplicate word entries.

> source tree since the 011799 snapshot is to blame?

Possibly--I'll take a look at recent changes. But the difference isn't
from the snapshot. I rebuild the source every night and reindex using the
latest CVS source. So it would be changes I made yesterday, which were
basically only Hans-Peter's patches.

-Geoff


