Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 20 Jan 1999 16:11:42 -0600 (CST)
* List: htdig3-dev@sob.htdig.org
According to me:
> According to Geoff Hutchison:
> > > htdig: www.scrc.umanitoba.ca:80 410 documents
> > > htmerge: Total word count: 13042
> > > htmerge: Total documents: 419
> >
> > Have you ever wondered why htmerge sees more documents than htdig? You
> > clearly don't see the same problem that I do, but I still wonder about
> > your results. Have you ever compared db before and after merging?
>
> Yeah, I did wonder about that. However, it was doing the same thing
> even in 3.1.0b4, so it didn't seem to be a recent problem.
A few trace prints in htmerge/docs.cc revealed the source of the 9 extra
documents. They were 9 documents disallowed by robots.txt. Because they
had no DocHead, they were deleted from the DB, but because of a missing
"else", they were still indexed and counted. Here's the fix:
--- ./htmerge/docs.cc.elsebug Wed Jan 6 21:13:50 1999
+++ ./htmerge/docs.cc Wed Jan 20 15:53:57 1999
@@ -80,15 +80,16 @@
if (strlen(ref->DocHead()) == 0)
{
// For some reason, this document doesn't have an excerpt
- // (probably because of a noindex directive) Remove it
+ // (probably because of a noindex directive, or disallowed
+ // by robots.txt or server_max_docs). Remove it
db.Delete(url->get());
}
- if ((ref->DocState()) == Reference_noindex)
+ else if ((ref->DocState()) == Reference_noindex)
{
// This document has been marked with a noindex tag. Remove it
db.Delete(url->get());
}
- if (remove_unused && discard_list.Exists(id))
+ else if (remove_unused && discard_list.Exists(id))
{
// This document is not valid anymore. Remove it
db.Delete(url->get());
@@ -104,7 +105,7 @@
cout << "htmerge: " << document_count << '\n';
cout.flush();
}
- }
+ }
delete ref;
}
if (verbose)
With the patch applied, the results are:
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 12912
htmerge: Total documents: 410
htmerge: Total doc db size (in K): 2482
total 8762
-rw-r--r-- 1 root root 1946624 Jan 20 16:03 db.docdb
-rw-r--r-- 1 root root 59392 Jan 20 16:03 db.docs.index
-rw-r--r-- 1 root root 336896 Jan 20 16:03 db.metaphone.db
-rw-r--r-- 1 root root 328704 Jan 20 16:03 db.soundex.db
-rw-r--r-- 1 root root 1950242 Jan 20 16:03 db.wordlist
-rw-r--r-- 1 root root 2534400 Jan 20 16:03 db.words.db
The DB sizes are slightly different from before because I realised I
was mistakenly working with the 011299 snapshot earlier, not the 011799
snapshot. However, further testing showed no significant differences
between the two, with or without Hans-Peter's StringMatch patches.
-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930