Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 20 Jan 1999 13:18:24 -0600 (CST)
* List: htdig3-dev@sob.htdig.org
According to Geoff Hutchison:
> > > While I doubt there are any duplicate documents in the dbs after htmerge,
> > > there seem to be *missing* documents. Is anyone else concerned about the
> > > huge difference between htdig and htmerge?
> >
> > Houston, we have a problem... :) Did you try the StringMatch patches in
> > isolation? I'm wondering if the first or second patch is the problem, or
> > both.
>
> Alas, I tried them at the same time--I'm running the current CVS tree.
> I'm going to start debugging by running just htdig, which returned a
> number of documents in the right ballpark (I know I have around 50,000
> webpages based on link checking.)
>
> Then I'm going to take a look at the db and put some debugging code into
> htmerge.
>
> Has anyone else noticed missing pages?
Not me. See my results below...
> BTW, I did a "ls -lR | grep htm" on my webserver and found 70,000+ files.
> So 50,000 is even a low number--I'm assuming there aren't 20,000 files that
> aren't linked to anything. Tonight I'm going to compare the "ls -lR" output
> to the dump of the database. If anyone can beat me to a solution, I'll be
> very happy.
A "grep htm" could find a lot more than just *.htm or *.html files. Also,
does your server's robots.txt exclude any of these files? On my system,
I only index a bit more than a quarter of all the html files I have under
/home/httpd/html.
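To illustrate the over-counting point, here is a minimal C++ sketch comparing a substring match (roughly what "grep htm" does to a file name) with a strict suffix match. The function names and the sample file names in the note below are hypothetical, chosen only for illustration; they are not part of ht://Dig.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Substring test: roughly what piping file names through "grep htm" does.
// (grep on "ls -l" output would also match other columns; ignored here.)
bool contains_htm(const std::string& name)
{
    return name.find("htm") != std::string::npos;
}

// Stricter test: the name must end in ".htm" or ".html".
bool has_html_suffix(const std::string& name)
{
    auto ends_with = [&name](const std::string& suffix) {
        return name.size() >= suffix.size() &&
               name.compare(name.size() - suffix.size(),
                            suffix.size(), suffix) == 0;
    };
    return ends_with(".htm") || ends_with(".html");
}

// Count the names accepted by a given predicate.
template <typename Pred>
std::size_t count_matching(const std::vector<std::string>& names, Pred pred)
{
    std::size_t n = 0;
    for (const auto& name : names)
        if (pred(name))
            ++n;
    return n;
}
```

With names such as index.html, old.htm, index.html.bak, page.shtml and htmerge.log, the substring test accepts all five while the suffix test accepts only the first two.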
Anyway, here are my test results:
*** 3.1.0b4 ***
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 13042
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482
total 7208
-rw-r--r-- 1 root root 1947648 Jan 20 12:09 db.docdb
-rw-r--r-- 1 root root 58368 Jan 20 12:09 db.docs.index
-rw-r--r-- 1 root root 430080 Jan 20 12:09 db.metaphone.db
-rw-r--r-- 1 root root 322560 Jan 20 12:09 db.soundex.db
-rw-r--r-- 1 root root 1990766 Jan 20 12:09 db.wordlist
-rw-r--r-- 1 root root 2593792 Jan 20 12:09 db.words.db
*** 3.1.0dev-011799 ***
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 12912
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482
total 7078
-rw-r--r-- 1 root root 1946624 Jan 20 12:13 db.docdb
-rw-r--r-- 1 root root 56320 Jan 20 12:13 db.docs.index
-rw-r--r-- 1 root root 316416 Jan 20 12:13 db.metaphone.db
-rw-r--r-- 1 root root 313344 Jan 20 12:13 db.soundex.db
-rw-r--r-- 1 root root 1989674 Jan 20 12:12 db.wordlist
-rw-r--r-- 1 root root 2587648 Jan 20 12:12 db.words.db
*** 3.1.0dev-011799 with H-P's first StringMatch patch ***
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 12912
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482
total 7078
-rw-r--r-- 1 root root 1946624 Jan 20 12:36 db.docdb
-rw-r--r-- 1 root root 56320 Jan 20 12:36 db.docs.index
-rw-r--r-- 1 root root 316416 Jan 20 12:37 db.metaphone.db
-rw-r--r-- 1 root root 313344 Jan 20 12:37 db.soundex.db
-rw-r--r-- 1 root root 1989674 Jan 20 12:36 db.wordlist
-rw-r--r-- 1 root root 2587648 Jan 20 12:36 db.words.db
*** 3.1.0dev-011799 with H-P's third version of his 2nd StringMatch patch ***
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 12912
htmerge: Total documents: 419
htmerge: Total doc db size (in K): 2482
total 7078
-rw-r--r-- 1 root root 1946624 Jan 20 12:39 db.docdb
-rw-r--r-- 1 root root 56320 Jan 20 12:39 db.docs.index
-rw-r--r-- 1 root root 316416 Jan 20 12:39 db.metaphone.db
-rw-r--r-- 1 root root 313344 Jan 20 12:39 db.soundex.db
-rw-r--r-- 1 root root 1989674 Jan 20 12:39 db.wordlist
-rw-r--r-- 1 root root 2587648 Jan 20 12:39 db.words.db
I don't know why the total word count dropped from b4 to dev-011799, but
maybe Didier's patch to the db.wordlist field order had something to do
with it. In any case, Hans-Peter's StringMatch patches didn't seem to
affect my htdig/htmerge stats at all. Maybe some other change to the
source tree since the 011799 snapshot is to blame?
--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

Geoff Hutchison (ghutchis@wso.williams.edu)
Wed, 20 Jan 1999 14:33:20 -0500 (EST)
Subject: [htdig3-dev] Re: [htdig3-dev] Re: StringMatch and duplicate documents
* List: htdig3-dev@sob.htdig.org
> A "grep htm" could find a lot more than just *.htm or *.html files. Also,
> does your server's robots.txt exclude any of these files? On my system,
> I only index a bit more than a quarter of all the html files I have under
> /home/httpd/html.
Yes, I agree the grep will over-estimate. I'm going to try several ways of estimating the number. Our server doesn't have a robots.txt file, excludes only "cgi-bin ?" and doesn't have any (significant--maybe 100 pages) password-protected areas.
Another way the filesystem over-estimates is by ignoring the links. So there may be lots of files that have no links to them.
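The comparison between the file listing and the database dump amounts to a set difference. A minimal C++ sketch follows, assuming both lists have already been normalised to the same form (for example, URL paths). The function name is hypothetical, and the sketch does not read ht://Dig's actual database format.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Return the entries of on_disk that do not appear in indexed.
// Both inputs are std::set, so they are already sorted as
// std::set_difference requires.
std::vector<std::string> not_indexed(const std::set<std::string>& on_disk,
                                     const std::set<std::string>& indexed)
{
    std::vector<std::string> missing;
    std::set_difference(on_disk.begin(), on_disk.end(),
                        indexed.begin(), indexed.end(),
                        std::back_inserter(missing));
    return missing;
}
```

Entries in the result are either files nothing links to, files excluded from indexing, or documents that really did go missing, so the list still has to be checked by hand.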
> htdig: www.scrc.umanitoba.ca:80 410 documents
> htmerge: Total word count: 13042
> htmerge: Total documents: 419
Have you ever wondered why htmerge sees more documents than htdig? You clearly don't see the same problem that I do, but I still wonder about your results. Have you ever compared db before and after merging?
> maybe Didier's patch to the db.wordlist field order had something to do
Yes, Didier's patch helps eliminate more duplicate word entries.
> source tree since the 011799 snapshot is to blame?
Possibly--I'll take a look at recent changes. But the difference isn't from the snapshot. I rebuild the source every night and reindex using the latest CVS source. So it would be changes I made yesterday, which were basically only Hans-Peter's patches.
-Geoff

Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 20 Jan 1999 13:47:23 -0600 (CST)
Subject: [htdig3-dev] Re: StringMatch and duplicate documents
* List: htdig3-dev@sob.htdig.org
According to Geoff Hutchison:
> > htdig: www.scrc.umanitoba.ca:80 410 documents
> > htmerge: Total word count: 13042
> > htmerge: Total documents: 419
>
> Have you ever wondered why htmerge sees more documents than htdig? You
> clearly don't see the same problem that I do, but I still wonder about
> your results. Have you ever compared db before and after merging?
Yeah, I did wonder about that. However, it was doing the same thing even in 3.1.0b4, so it didn't seem to be a recent problem.
> > source tree since the 011799 snapshot is to blame?
>
> Possibly--I'll take a look at recent changes. But the difference isn't
> from the snapshot. I rebuild the source every night and reindex using the
> latest CVS source. So it would be changes I made yesterday, which were
> basically only Hans-Peter's patches.
Hmmm. Were there other changes to StringMatch that would have caused Hans-Peter's patches not to apply properly to your source? Did you use only his 1st, and the 3rd version of his 2nd patch?
--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 20 Jan 1999 16:11:42 -0600 (CST)
Subject: [htdig3-dev] Re: StringMatch and duplicate documents
* List: htdig3-dev@sob.htdig.org
According to me:
> According to Geoff Hutchison:
> > > htdig: www.scrc.umanitoba.ca:80 410 documents
> > > htmerge: Total word count: 13042
> > > htmerge: Total documents: 419
> >
> > Have you ever wondered why htmerge sees more documents than htdig? You
> > clearly don't see the same problem that I do, but I still wonder about
> > your results. Have you ever compared db before and after merging?
>
> Yeah, I did wonder about that. However, it was doing the same thing
> even in 3.1.0b4, so it didn't seem to be a recent problem.
A few trace prints in htmerge/docs.cc revealed the source of the 9 extra documents. These were 9 documents that were disallowed by robots.txt, which were deleted from the DB, because they had no DocHead, but because of a missing "else", they were still indexed and counted. Here's the fix:
--- ./htmerge/docs.cc.elsebug Wed Jan 6 21:13:50 1999
+++ ./htmerge/docs.cc Wed Jan 20 15:53:57 1999
@@ -80,15 +80,16 @@
         if (strlen(ref->DocHead()) == 0)
         {
             // For some reason, this document doesn't have an excerpt
-            // (probably because of a noindex directive) Remove it
+            // (probably because of a noindex directive, or disallowed
+            // by robots.txt or server_max_docs). Remove it
             db.Delete(url->get());
         }
-        if ((ref->DocState()) == Reference_noindex)
+        else if ((ref->DocState()) == Reference_noindex)
         {
             // This document has been marked with a noindex tag. Remove it
             db.Delete(url->get());
         }
-        if (remove_unused && discard_list.Exists(id))
+        else if (remove_unused && discard_list.Exists(id))
         {
             // This document is not valid anymore. Remove it
             db.Delete(url->get());
@@ -104,7 +105,7 @@
                     cout << "htmerge: " << document_count << '\n';
                     cout.flush();
                 }
-            }
+        }
         delete ref;
     }
     if (verbose)
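The effect of the missing "else" can be reproduced in isolation. The C++ sketch below is a simplified model, not the htmerge source: the Doc struct, its fields and both function names are hypothetical, and it assumes the counting happens in an "else" branch attached to the last test, which is what the description above suggests.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a document record; not ht://Dig's real class.
struct Doc {
    std::string head;   // excerpt; empty if the page was never retrieved
    bool noindex;       // marked with a noindex tag
    bool discarded;     // on the discard list
};

// Before the fix: three independent tests. The final "else" pairs only
// with the last "if", so a document rejected by an earlier test still
// falls through and is counted.
int count_with_separate_ifs(const std::vector<Doc>& docs)
{
    int counted = 0;
    for (const Doc& d : docs) {
        if (d.head.empty()) {
            // document would be deleted from the database here
        }
        if (d.noindex) {
            // document would be deleted from the database here
        }
        if (d.discarded) {
            // document would be deleted from the database here
        }
        else {
            ++counted;  // reached even for empty-head and noindex documents
        }
    }
    return counted;
}

// After the fix: a single if / else-if chain, so only documents that
// pass every test reach the counting branch.
int count_with_else_chain(const std::vector<Doc>& docs)
{
    int counted = 0;
    for (const Doc& d : docs) {
        if (d.head.empty()) {
            // deleted, not counted
        }
        else if (d.noindex) {
            // deleted, not counted
        }
        else if (d.discarded) {
            // deleted, not counted
        }
        else {
            ++counted;
        }
    }
    return counted;
}
```

Given one valid document, one with an empty head, one marked noindex and one on the discard list, the first function counts three documents and the second counts one.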
Now, the results are:
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 410 documents
htmerge: Total word count: 12912
htmerge: Total documents: 410
htmerge: Total doc db size (in K): 2482
total 8762
-rw-r--r-- 1 root root 1946624 Jan 20 16:03 db.docdb
-rw-r--r-- 1 root root 59392 Jan 20 16:03 db.docs.index
-rw-r--r-- 1 root root 336896 Jan 20 16:03 db.metaphone.db
-rw-r--r-- 1 root root 328704 Jan 20 16:03 db.soundex.db
-rw-r--r-- 1 root root 1950242 Jan 20 16:03 db.wordlist
-rw-r--r-- 1 root root 2534400 Jan 20 16:03 db.words.db
The DB sizes are slightly different than before, because I realised I was mistakenly working with the 011299 snapshot before, not the 011799 snapshot. However, further testing showed no significant differences between the two, with or without Hans-Peter's StringMatch patches.
--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

Geoff Hutchison (ghutchis@wso.williams.edu)
Wed, 20 Jan 1999 18:59:06 -0400
Subject: [htdig3-dev] Re: [htdig3-dev] Re: StringMatch and duplicate documents
* List: htdig3-dev@sob.htdig.org
At 6:11 PM -0400 1/20/99, Gilles Detillieux wrote:
>A few trace prints in htmerge/docs.cc revealed the source of the 9 extra
>documents. These were 9 documents that were disallowed by robots.txt,
>which were deleted from the DB, because they had no DocHead, but because
>of a missing "else", they were still indexed and counted. Here's the fix:
I don't know if I believe it. That seemed to do it... After patching, recompiling and re-running htmerge, I get:
htmerge: Total documents: 58193
htmerge: Total doc db size (in K): 330586
No complaints here. Leo, are you still seeing duplicate URLs?
-Geoff
This archive was generated by hypermail 2.0b3 on Thu Feb 04 1999 - 22:13:08 PST