[htdig3-dev] Re: [htdig3-dev] Status


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 20 Jan 1999 09:25:46 -0600 (CST)


* List: htdig3-dev@sob.htdig.org

According to Geoff Hutchison:
> I also recall Gilles sending me an e-mail about a ChangeLog entry that I
> wrote for him, but I can't find it. :-(

Here it is again:

While I'm making minor corrections, I noticed that in htlib/URL.cc and
in ChangeLog, in 3.1.0dev-011799, you mentioned that my patch was to
"Fix looping in query string caused by slashes." It wasn't actually
a looping problem, but rather the patch was to strip off the query
string before trying to determine the parent path to an URL, so that
relative references from a CGI script expand correctly to a fully
qualified URL, without the superfluous junk that was being included
when the query string had slashes in it. I.e. the parent directory
must be determined by the last slash before the query string, and
slashes within the query string are to be ignored.

So, maybe a more appropriate description would be:

        Fix parent path logic to ignore slashes in query string.
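
A minimal sketch of the behaviour described above, written in Python rather than the C++ of htlib/URL.cc. The function names and the example URL are hypothetical and only illustrate the idea; this is not the actual ht://Dig code, and it handles only simple relative references.

```python
def parent_path(url):
    """Return the URL up to and including the last slash before any query string."""
    # Strip the query string first so that slashes inside it are ignored.
    base = url.split("?", 1)[0]
    # Keep everything up to and including the last remaining slash.
    return base[: base.rfind("/") + 1]


def expand_relative(base_url, relative):
    """Join a simple relative reference onto the parent path of base_url."""
    return parent_path(base_url) + relative


if __name__ == "__main__":
    cgi_url = "http://www.example.com/cgi-bin/script.cgi?path=/a/b"
    print(expand_relative(cgi_url, "next.html"))
```

Without the split on "?", the last slash would be found inside the query string, and the expanded URL would carry part of the query along with it.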

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

[htdig3-dev] Re: [htdig3-dev] StringMatch and duplicate documents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 20 Jan 1999 09:59:02 -0600 (CST)


* List: htdig3-dev@sob.htdig.org

According to Geoff Hutchison:
> Here's a summary of rebuilding my databases from scratch, before and after
> the StringMatch changes.
>
> Before:
> ..
> (No run output available, around 57,000 documents from both htdig and htmerge)
>
> After:
> ..
> htdig: Run complete
> htdig: 1 server seen:
> htdig: wso.williams.edu:80 52906 documents
> htdig: Errors to take note of:
>
> htmerge: Total word count: 86809
> htmerge: Total documents: 22320
> htmerge: Total doc db size (in K): 114880
>
>
> While I doubt there are any duplicate documents in the dbs after htmerge,
> there seem to be *missing* documents. Is anyone else concerned about the
> huge difference between htdig and htmerge?

Houston, we have a problem... :) Did you try the StringMatch patches in isolation? I'm wondering if the first or second patch is the problem, or both.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

[htdig3-dev] Re: [htdig3-dev] Re: [htdig3-dev] StringMatch and duplicate documents


Geoff Hutchison (ghutchis@wso.williams.edu)
Wed, 20 Jan 1999 11:13:53 -0500 (EST)


* List: htdig3-dev@sob.htdig.org

On Wed, 20 Jan 1999, Gilles Detillieux wrote:

> > While I doubt there are any duplicate documents in the dbs after htmerge,
> > there seem to be *missing* documents. Is anyone else concerned about the
> > huge difference between htdig and htmerge?
>
> Houston, we have a problem... :) Did you try the StringMatch patches in
> isolation? I'm wondering if the first or second patch is the problem, or
> both.

Alas, I tried them at the same time--I'm running the current CVS tree. I'm going to start debugging by running just htdig, which returned a number of documents in the right ballpark (I know I have around 50,000 webpages based on link checking.)

Then I'm going to take a look at the db and put some debugging code into htmerge.

Has anyone else noticed missing pages?

-Geoff

[htdig3-dev] More db savings


Geoff Hutchison (ghutchis@wso.williams.edu)
Wed, 20 Jan 1999 13:03:46 -0400


* List: htdig3-dev@sob.htdig.org

I haven't figured out what the exact savings are...

I put a check into WordList::Word to bail quickly if we're trying to add a word with no weight. Duh. :-) Since it's a two-line change, I'm not bothering to send it to the list; it's in the CVS tree and will be in the snapshot I make sometime today or early tomorrow morning.

I realized this fixes the "setting text_factor to 0 doesn't exclude the text" problem. This was reported a long time ago, and I never figured out what was going on.
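
To illustrate the idea (not the actual two-line change, which lives in the C++ WordList::Word), here is a rough Python sketch. The WordIndex class, its add_word method and the factor values are hypothetical stand-ins; the only point is that a word whose weight works out to zero is never stored.

```python
class WordIndex:
    """Hypothetical stand-in for a word list; not the ht://Dig WordList class."""

    def __init__(self):
        self.entries = []

    def add_word(self, word, weight):
        # Bail out early: a word with no weight adds nothing to scoring,
        # so storing it would only make the database larger.
        if weight == 0:
            return False
        self.entries.append((word, weight))
        return True


if __name__ == "__main__":
    text_factor = 0      # assumed setting: body text should not be indexed
    title_factor = 100   # assumed setting for words in the title
    index = WordIndex()
    index.add_word("body", 1 * text_factor)
    index.add_word("title", 1 * title_factor)
    print(index.entries)
```

With a check like this in place, setting a factor to 0 keeps the corresponding words out of the word list entirely, which is the text_factor behaviour described above.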

BTW, I did a "ls -lR | grep htm" on my webserver and found 70,000+ files. So 50,000 is even a low number--I'm assuming there aren't 20,000 files that aren't linked to anything. Tonight I'm going to compare the "ls -lR" output to the dump of the database. If anyone can beat me to a solution, I'll be very happy.
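
One possible way to do that comparison, sketched in Python. It assumes the database dump has already been reduced to a set of paths relative to the web root, that file names contain no spaces, and that the listing was taken from the web root; the function names are made up for this example.

```python
def files_from_ls_lr(listing, marker="htm"):
    """Collect paths of regular files whose names contain `marker` from `ls -lR` output."""
    paths = set()
    current_dir = "."
    for line in listing.splitlines():
        line = line.rstrip()
        if line.endswith(":"):
            # `ls -R` prints each directory as "path:" before its entries.
            current_dir = line[:-1]
        elif line.startswith("-"):
            # Regular file entry; assume the name is the last field (no spaces).
            name = line.split()[-1]
            if marker in name:
                paths.add(current_dir + "/" + name)
    return paths


def missing_from_index(on_disk, indexed):
    """Return the files present on disk that never made it into the index."""
    return on_disk - indexed
```

The resulting set also contains files that are simply not linked from anywhere, so it is an upper bound on what the merge step may have dropped.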

-Geoff



This archive was generated by hypermail 2.0b3 on Thu Feb 04 1999 - 22:13:08 PST