Hans-Peter Nilsson (hans-peter.nilsson@axis.com)
Tue, 19 Jan 1999 17:50:18 +0100
* List: htdig3-dev@sob.htdig.org
> Date: Tue, 19 Jan 1999 17:22:19 +0100 (MEZ)
> From: Alexander Bergolth <leo@strike.wu-wien.ac.at>
> docState is an enum and enums can only be converted to integers but not
> the other way around.
I know, I did kind of let this slip :-( even though gcc-2.7.2
(gasp! antique!) gave a warning and me being aware of issues
with current and past C++ standards. I was just lazy or
something. Sorry.
I'll find a fix for it within 12h.
There's trickiness in that I cannot assign (as above) and cannot
do a straightforward memcpy since I cannot assume the size of
any specific enumerated type; it may vary with the enumerations.
And don't tell me about a templated assignment operator
function, because templates are IMHO out of question for
ht://Dig.
Anyway, it's doable with a little memcpy-ugliness (but fully
standard-conformant, portable and IMO TRT) as long as the enum
only takes the sizes of char, short or int.
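
A minimal sketch of the memcpy approach described above, for illustration only. The enum `SketchState` and the helper `assign_enum_bits` are hypothetical names, not taken from the ht://Dig sources; the sketch assumes the enum is stored as a plain integer of char, short or int size.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the real docState enum.
enum SketchState { SKETCH_NORMAL = 0, SKETCH_NOT_FOUND = 1, SKETCH_NO_INDEX = 2 };

// Store `value` into `target` without assigning an integer to an enum:
// convert the value to an integer type of the same size as the enum,
// then copy the bytes. Returns false if the enum has an unhandled size.
bool assign_enum_bits(SketchState &target, int value)
{
    if (sizeof(target) == sizeof(int)) {
        int tmp = value;
        std::memcpy(&target, &tmp, sizeof(tmp));
    } else if (sizeof(target) == sizeof(short)) {
        short tmp = static_cast<short>(value);
        std::memcpy(&target, &tmp, sizeof(tmp));
    } else if (sizeof(target) == sizeof(char)) {
        char tmp = static_cast<char>(value);
        std::memcpy(&target, &tmp, sizeof(tmp));
    } else {
        return false;   // size not covered by this sketch
    }
    return true;
}
```

Because the value is first converted to a type of matching size, the byte copy is independent of endianness.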
brgds, H-P
From - Thu Feb 4 22:09:14 1999
Return-Path: <hans-peter.nilsson@axis.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id LAA25380
for <andrew@contigo.com>; Tue, 19 Jan 1999 11:57:29 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA12821;
Tue, 19 Jan 1999 12:05:59 -0800 (PST)
From: Hans-Peter Nilsson <hans-peter.nilsson@axis.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A4E5B9.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 20:56:28 +0100
Subject: [htdig3-dev] Patch 1 for StringMatch bugs
* List: htdig3-dev@sob.htdig.org
When putting a "found" note in the state for the last "ad" in
the pattern "adcf|aded|ad" it is misplaced and clobbers the
last "d" in "aded" so it falsely reports matching "ad" for
e.g. "adedf" (but still with length 4).
However, this bug also causes "ad" *not* to be found in smaller
strings not containing "aded", like "adg".
This only happens if you put smaller, ambiguating, matches
after larger matches in a pattern.
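
To make the intended behaviour concrete, here is a small sketch that builds a trie for '|'-separated alternatives and attaches the "found" note to the node reached by the last character of each alternative. It is not the StringMatch code: the real class packs states and flags into `table[chr][state]`, while `TrieNode`, `build_trie` and `match_prefix` here are hypothetical names using a simpler layout.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct TrieNode {
    int next[256];   // child node index per byte, 0 = no transition
    int which;       // index of the alternative ending here, -1 = none
    TrieNode() : which(-1) { for (int i = 0; i < 256; ++i) next[i] = 0; }
};

// Build a trie from '|'-separated alternatives.
std::vector<TrieNode> build_trie(const std::string &pattern)
{
    std::vector<TrieNode> nodes(1);   // node 0 is the root
    int state = 0;
    int index = 0;
    for (std::string::size_type i = 0; i <= pattern.size(); ++i) {
        if (i == pattern.size() || pattern[i] == '|') {
            // End of an alternative: mark the node we are in now.
            // Transitions out of this node are left untouched.
            if (state != 0 && nodes[state].which == -1)
                nodes[state].which = index;
            ++index;
            state = 0;
            continue;
        }
        unsigned char chr = static_cast<unsigned char>(pattern[i]);
        if (nodes[state].next[chr] == 0) {
            nodes.push_back(TrieNode());
            nodes[state].next[chr] = static_cast<int>(nodes.size()) - 1;
        }
        state = nodes[state].next[chr];
    }
    return nodes;
}

// Return the index of the first alternative matching a prefix of `text`
// (shortest match wins), or -1. `length` receives the match length.
int match_prefix(const std::vector<TrieNode> &nodes,
                 const std::string &text, int &length)
{
    int state = 0;
    for (std::string::size_type pos = 0; pos < text.size(); ++pos) {
        state = nodes[state].next[static_cast<unsigned char>(text[pos])];
        if (state == 0)
            break;
        if (nodes[state].which != -1) {
            length = static_cast<int>(pos) + 1;
            return nodes[state].which;
        }
    }
    return -1;
}
```

With the pattern "adcf|aded|ad", this sketch reports alternative 2 ("ad") with length 2 for both "adg" and "adedf", since it stops at the shortest match.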
Tue Jan 19 12:55:36 1999 Hans-Peter Nilsson <hp@axis.se>
* htlib/StringMatch.cc (Pattern): Always set PreviousState before
checking PreviousValue.
*** ../../cvs_latest_pure/htdig3/htlib/StringMatch.cc Wed Dec 2 03:45:58 1998
--- ./StringMatch.cc Tue Jan 19 12:55:22 1999
*************** StringMatch::Pattern(char *pattern)
*** 139,144 ****
--- 139,145 ----
else
{
previousValue = table[chr][state];
+ previousState = state;
if (previousValue)
{
if (previousValue & FINAL)
*************** StringMatch::Pattern(char *pattern)
*** 150,156 ****
else
{
table[chr][state] |= ++totalStates;
- previousState = state;
state = totalStates;
}
}
--- 151,156 ----
*************** StringMatch::Pattern(char *pattern)
*** 162,168 ****
else
{
table[chr][state] = ++totalStates;
- previousState = state;
state = totalStates;
}
}
--- 162,167 ----
brgds, H-P
From - Thu Feb 4 22:09:14 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id LAA25467
for <andrew@contigo.com>; Tue, 19 Jan 1999 11:59:11 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA12835;
Tue, 19 Jan 1999 12:08:02 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A4E623.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 14:58:33 -0500 (EST)
In-Reply-To: <36A4A85B.BeroList-2.5.5@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: [htdig3-dev] Re: [htdig3-dev] Re: [htdig3-dev] Re: Zlib compression
* List: htdig3-dev@sob.htdig.org
On Tue, 19 Jan 1999, Gilles Detillieux wrote:
> I think I can speculate on the 2nd question. For every href to a given
> URL, htdig will fetch, modify and store the DocumentRef for that URL.
> That means a Deserialize and a Serialize for each href, plus one for
> the document itself.
So this is a side-effect of the AddDescription? I wonder if there's a way
we can only do the Deserialize/Serialize when we're actually adding the
description.
Or, as Didier points out, we can only compress parts of the DocumentRef.
This would escape some of the slowdown in deflate(). In other words, maybe
we compress DocHead. Then we have the methods that access DocHead do the
compression/decompression *only* when DocHead is needed.
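
A rough sketch of that idea, under stated assumptions: `SketchDocRef`, `LoadPackedHead` and `Head` are hypothetical names, and a trivial run-length codec (`sketch_pack`/`sketch_unpack`) stands in for zlib so the example stays self-contained. The point is only that adding a description never unpacks the head.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in codec (run-length encoding); a real version would call zlib.
std::string sketch_pack(const std::string &in)
{
    std::string out;
    std::string::size_type i = 0;
    while (i < in.size()) {
        std::string::size_type run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255)
            ++run;
        out += static_cast<char>(static_cast<unsigned char>(run));
        out += in[i];
        i += run;
    }
    return out;
}

std::string sketch_unpack(const std::string &in)
{
    std::string out;
    for (std::string::size_type i = 0; i + 1 < in.size(); i += 2) {
        std::string::size_type run = static_cast<unsigned char>(in[i]);
        out.append(run, in[i + 1]);
    }
    return out;
}

class SketchDocRef {
public:
    SketchDocRef() : head_unpacked(false), unpack_calls(0) {}

    // Take the head in packed form, as it would come out of the database.
    void LoadPackedHead(const std::string &packed)
    {
        packed_head = packed;
        head_unpacked = false;
    }

    // Adding a description never touches the head, so it costs no unpack.
    void AddDescription(const std::string &text) { descriptions.push_back(text); }

    // Unpack on first access only; later calls reuse the plain copy.
    const std::string &Head()
    {
        if (!head_unpacked) {
            plain_head = sketch_unpack(packed_head);
            head_unpacked = true;
            ++unpack_calls;
        }
        return plain_head;
    }

    int UnpackCalls() const { return unpack_calls; }

private:
    std::string packed_head;
    std::string plain_head;
    std::vector<std::string> descriptions;
    bool head_unpacked;
    int unpack_calls;
};
```

The counter is there only to show how many times the unpack step runs; it could double as the kind of debugging trace discussed in this thread.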
> I'd guess Didier's site is averaging 42 hrefs per URL, though that still
> seems rather high!
That was my assumption--that there's too many calls to be readily
explained. Maybe we can figure out some sort of debugging trace for calls
to [] (which will then go to Deserialize).
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
From - Thu Feb 4 22:09:14 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA25540
for <andrew@contigo.com>; Tue, 19 Jan 1999 12:00:11 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA12849;
Tue, 19 Jan 1999 12:08:58 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A4E65B.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 14:59:25 -0500 (EST)
In-Reply-To: <36A4A9E8.BeroList-2.5.5@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: [htdig3-dev] [htdig3-dev] Patch to get rid of 18% of db.words.db
* List: htdig3-dev@sob.htdig.org
On Tue, 19 Jan 1999, Gilles Detillieux wrote:
> really care) that makes the writing of the count out to the DB conditional.
Agreed. Is this OK with you Hans-Peter?
-Geoff
From - Thu Feb 4 22:09:14 1999
Return-Path: <hans-peter.nilsson@axis.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA25642
for <andrew@contigo.com>; Tue, 19 Jan 1999 12:02:06 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA12858;
Tue, 19 Jan 1999 12:10:33 -0800 (PST)
From: Hans-Peter Nilsson <hans-peter.nilsson@axis.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A4E6BA.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 21:01:01 +0100
Subject: [htdig3-dev] Patch 2 for StringMatch bugs
* List: htdig3-dev@sob.htdig.org
My previous bugfix exposes the non-greediness of StringMatch.
But (watch out, here it comes), greed is good. :-) You *do*
want it to match "area" instead of just "a" in HTML.cc, right?
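
As an illustration of what "greedy" means here, the following sketch does a leftmost-longest search by brute force over a list of alternatives. It does not use the StringMatch state table at all; `longest_match_at` and `find_first_greedy` are hypothetical helpers that only show the expected results.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Longest alternative that matches `text` exactly at offset `start`.
// Returns its index in `alts`, or -1; `length` receives the match length.
int longest_match_at(const std::vector<std::string> &alts,
                     const std::string &text, std::size_t start,
                     std::size_t &length)
{
    int which = -1;
    length = 0;
    for (std::size_t i = 0; i < alts.size(); ++i) {
        const std::string &alt = alts[i];
        if (alt.size() > length &&
            text.compare(start, alt.size(), alt) == 0) {
            which = static_cast<int>(i);
            length = alt.size();
        }
    }
    return which;
}

// Leftmost-longest search: the first offset where any alternative
// matches, preferring the longest alternative at that offset.
// Returns the offset, or -1 if nothing matches.
int find_first_greedy(const std::vector<std::string> &alts,
                      const std::string &text, int &which,
                      std::size_t &length)
{
    for (std::size_t start = 0; start < text.size(); ++start) {
        which = longest_match_at(alts, text, start, length);
        if (which != -1)
            return static_cast<int>(start);
    }
    which = -1;
    length = 0;
    return -1;
}
```

With the alternatives "a" and "area", the text "<area shape=rect>" matches "area" at offset 1 with length 4, while "<a href>" still matches "a" with length 1.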
Tue Jan 19 13:39:53 1999 Hans-Peter Nilsson <hp@axis.se>
* htlib/StringMatch.cc (FindFirst): Be "greedy"; match longest.
(Compare): Ditto.
(Watch out, edited patch - will apply with "fuzz").
*** ../../cvs_latest_pure/htdig3/htlib/StringMatch.cc Wed Dec 2 03:45:58 1998
--- ./StringMatch.cc Tue Jan 19 14:04:34 1999
*************** int StringMatch::FindFirst(char *string,
*** 211,216 ****
--- 210,219 ----
//
if (state)
{
+ // But we may already have a match, and are just being greedy.
+ if (which != -1)
+ return start_pos;
+
pos = start_pos + 1;
}
else
*************** int StringMatch::FindFirst(char *string,
*** 227,236 ****
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
! return start_pos;
}
pos++;
}
return -1;
}
--- 230,248 ----
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
! state &= STATE_MASK;
!
! // Continue to find the longest, if there is one.
! if (state == 0)
! return start_pos;
}
pos++;
}
+
+ // Maybe we were too greedy.
+ if (which != -1)
+ return start_pos;
+
return -1;
}
*************** int StringMatch::Compare(char *string, i
*** 260,265 ****
--- 272,281 ----
{
if (state == 0)
{
+ // We may already have a match, and are just being greedy.
+ if (which != -1)
+ return start_pos;
+
start_pos = pos;
}
}
*************** int StringMatch::Compare(char *string, i
*** 275,284 ****
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
! return 1;
}
pos++;
}
return 0;
}
--- 291,309 ----
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
!
! // Continue to find the longest, if there is one.
! state &= STATE_MASK;
! if (state == 0)
! return 1;
}
pos++;
}
+
+ // Maybe we were too greedy.
+ if (which != -1)
+ return start_pos;
+
return 0;
}
brgds, H-P
From - Thu Feb 4 22:09:14 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA26039
for <andrew@contigo.com>; Tue, 19 Jan 1999 12:08:55 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA12887;
Tue, 19 Jan 1999 12:17:30 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A4E85B.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 15:08:00 -0500 (EST)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: [htdig3-dev] Re: Duplicate Keys in db.docs.index
* List: htdig3-dev@sob.htdig.org
A few days ago, Leo reported that he dumped his db.doc.db file and
found duplicate keys. Since we actually store the DocumentRefs in this
file and we do so using URLs, this seemed weird to both of us.
I have an idea about this. The Berkeley DB allows duplicate keys to be
added if DB_DUP is set as a flag. Several parts of htlib/DB2_db.cc make
notes of this (and shouldn't set it). So I'm wondering if the API changes
since our original (2.4.x) copy of the database code have turned on
DB_DUP. When I upgraded to 2.6.x, I had to make changes for the cursors.
If other people can confirm duplicate URLs in their databases, it's a
real problem--I'm considering it a showstopper for 3.1.0.
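
For anyone who wants to check their own database, one possible way is to dump the keys with whatever tool is at hand and count repeats. The sketch below assumes the keys have already been read into a vector of strings; `find_duplicate_keys` is a hypothetical helper and does not touch Berkeley DB itself.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Return every key that occurs more than once, in sorted order.
std::vector<std::string> find_duplicate_keys(const std::vector<std::string> &keys)
{
    std::map<std::string, int> seen;
    for (std::vector<std::string>::size_type i = 0; i < keys.size(); ++i)
        ++seen[keys[i]];

    std::vector<std::string> dups;
    for (std::map<std::string, int>::const_iterator it = seen.begin();
         it != seen.end(); ++it) {
        if (it->second > 1)
            dups.push_back(it->first);
    }
    return dups;
}
```

An empty result means every dumped key was unique.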
-Geoff
From - Thu Feb 4 22:09:14 1999
Return-Path: <hans-peter.nilsson@axis.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id NAA28864
for <andrew@contigo.com>; Tue, 19 Jan 1999 13:00:31 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA13112;
Tue, 19 Jan 1999 13:09:17 -0800 (PST)
From: Hans-Peter Nilsson <hans-peter.nilsson@axis.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A4F47F.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 21:59:45 +0100
Subject: [htdig3-dev] Replacement for "Patch 2 for StringMatch bugs"
* List: htdig3-dev@sob.htdig.org
Oops. I cut-and-pasted the greediness into Compare with my
brain in "off" position. Here's a replacement for patch 2.
The ChangeLog note was correct...
BTW, note that the CompareWord and FindFirstWord already handle
the greediness in their own way (so it seems to me) and do not
need to be tinkered with.
Index: StringMatch.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htlib/StringMatch.cc,v
retrieving revision 1.4
diff -p -c -r1.4 StringMatch.cc
*** StringMatch.cc 1998/12/02 02:45:58 1.4
--- StringMatch.cc 1999/01/19 21:06:03
*************** int StringMatch::FindFirst(char *string,
*** 211,216 ****
--- 211,220 ----
//
if (state)
{
+ // But we may already have a match, and are just being greedy.
+ if (which != -1)
+ return start_pos;
+
pos = start_pos + 1;
}
else
*************** int StringMatch::FindFirst(char *string,
*** 227,236 ****
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
! return start_pos;
}
pos++;
}
return -1;
}
--- 231,249 ----
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
! state &= STATE_MASK;
!
! // Continue to find the longest, if there is one.
! if (state == 0)
! return start_pos;
}
pos++;
}
+
+ // Maybe we were too greedy.
+ if (which != -1)
+ return start_pos;
+
return -1;
}
*************** int StringMatch::Compare(char *string, i
*** 265,270 ****
--- 278,287 ----
}
else
{
+ // We may already have a match, and are just being greedy.
+ if (which != -1)
+ return start_pos;
+
return 0;
}
state = new_state;
*************** int StringMatch::Compare(char *string, i
*** 275,284 ****
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
! return 1;
}
pos++;
}
return 0;
}
--- 292,310 ----
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
!
! // Continue to find the longest, if there is one.
! state &= STATE_MASK;
! if (state == 0)
! return 1;
}
pos++;
}
+
+ // Maybe we were too greedy.
+ if (which != -1)
+ return start_pos;
+
return 0;
}
Now, how do I set it to "on"? :-)
brgds, H-P
From - Thu Feb 4 22:09:14 1999
Return-Path: <hans-peter.nilsson@axis.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id NAA29572
for <andrew@contigo.com>; Tue, 19 Jan 1999 13:13:53 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA13160;
Tue, 19 Jan 1999 13:22:13 -0800 (PST)
From: Hans-Peter Nilsson <hans-peter.nilsson@axis.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A4F786.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 22:12:40 +0100
Subject: [htdig3-dev] Second replacement for "Patch 2 for StringMatch bugs"...
* List: htdig3-dev@sob.htdig.org
Well, I'm blushing with embarrassment in public. Compare is
supposed to return "1", never "0", as start_pos may be.
Apologies for wasting everybody's time with firing off incorrect
patches. Hope this is correct and the last. Lesson learned(?):
*never* forget to scrutinize *any* of your patches.
Index: StringMatch.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htlib/StringMatch.cc,v
retrieving revision 1.4
diff -p -c -r1.4 StringMatch.cc
*** StringMatch.cc 1998/12/02 02:45:58 1.4
--- StringMatch.cc 1999/01/19 21:20:18
*************** int StringMatch::FindFirst(char *string,
*** 211,216 ****
--- 211,220 ----
//
if (state)
{
+ // But we may already have a match, and are just being greedy.
+ if (which != -1)
+ return start_pos;
+
pos = start_pos + 1;
}
else
*************** int StringMatch::FindFirst(char *string,
*** 227,236 ****
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
! return start_pos;
}
pos++;
}
return -1;
}
--- 231,249 ----
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
! state &= STATE_MASK;
!
! // Continue to find the longest, if there is one.
! if (state == 0)
! return start_pos;
}
pos++;
}
+
+ // Maybe we were too greedy.
+ if (which != -1)
+ return start_pos;
+
return -1;
}
*************** int StringMatch::Compare(char *string, i
*** 265,270 ****
--- 278,287 ----
}
else
{
+ // We may already have a match, and are just being greedy.
+ if (which != -1)
+ return 1;
+
return 0;
}
state = new_state;
*************** int StringMatch::Compare(char *string, i
*** 275,284 ****
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
! return 1;
}
pos++;
}
return 0;
}
--- 292,310 ----
//
which = ((state & MATCH_INDEX_MASK) >> INDEX_SHIFT) - 1;
length = pos - start_pos + 1;
!
! // Continue to find the longest, if there is one.
! state &= STATE_MASK;
! if (state == 0)
! return 1;
}
pos++;
}
+
+ // Maybe we were too greedy.
+ if (which != -1)
+ return 1;
+
return 0;
}
brgds, H-P
From - Thu Feb 4 22:09:14 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id NAA30528
for <andrew@contigo.com>; Tue, 19 Jan 1999 13:33:00 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA13220;
Tue, 19 Jan 1999 13:41:35 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A4FC10.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 15:31:57 -0600 (CST)
In-Reply-To: <36A4E5B9.BeroList-2.5.5@sob.htdig.org> from "Hans-Peter Nilsson" at Jan 19, 99 08:56:28 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] Re: [htdig3-dev] Patch 1 for StringMatch bugs
* List: htdig3-dev@sob.htdig.org
According to Hans-Peter Nilsson:
> When putting a "found" note in the state for the last "ad" in
> the pattern "adcf|aded|ad" it is misplaced and clobbers the
> last "d" in "aded" so it falsely reports matching "ad" for
> e.g. "adedf" (but still with length 4).
> However, this bug also causes "ad" *not* to be found in smaller
> strings not containing "aded", like "adg".
> This only happens if you put smaller, ambiguating, matches
> after larger matches in a pattern.
Ah! This seems like exactly the sort of problem Benoit Majeau reported
back on Dec. 9, in his "StringMatch bug?" message.
http://www.htdig.org/mail/1998-12/0160.html
Good work!
--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)     Fax:    (204)789-3930
From - Thu Feb 4 22:09:14 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id NAA31158
for <andrew@contigo.com>; Tue, 19 Jan 1999 13:44:00 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA13241;
Tue, 19 Jan 1999 13:52:49 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A4FEB3.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 15:43:09 -0600 (CST)
In-Reply-To: <36A4E623.BeroList-2.5.5@sob.htdig.org> from "Geoff Hutchison" at Jan 19, 99 02:58:33 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] Re: Zlib compression
* List: htdig3-dev@sob.htdig.org
According to Geoff Hutchison:
> On Tue, 19 Jan 1999, Gilles Detillieux wrote:
> > I think I can speculate on the 2nd question. For every href to a given
> > URL, htdig will fetch, modify and store the DocumentRef for that URL.
> > That means a Deserialize and a Serialize for each href, plus one for
> > the document itself.
>
> So this is a side-effect of the AddDescription? I wonder if there's a way
> we can only do the Deserialize/Serialize when we're actually adding the
> description.
>
> Or, as Didier points out, we can only compress parts of the DocumentRef.
> This would escape some of the slowdown in deflate(). In other words, maybe
> we compress DocHead. Then we have the methods to access DocHead to the
> compression/decompression *only* when DocHead is needed.
That sounds reasonable to me. I'd bet that the other fields are too small to get decent compression anyway, but I may be wrong. In any case, if we only compress/decompress the DocHead as needed, that would greatly cut down the number of times we'd need to do that (once per document).
> > I'd guess Didier's site is averaging 42 hrefs per URL, though that still
> > seems rather high!
>
> That was my assumption--that there's too many calls to be readily
> explained. Maybe we can figure out some sort of debugging trace for calls
> to [] (which will then go to Deserialize).
Well, that depends on the files he's indexing. E.g. cf_byname.html has way more than 42 hrefs to attrs.html. Maybe Didier can comment on this. Is the number of Serialize/Deserialize calls way too high for the stuff he's indexing? If so, then yes, some debugging traces would be in order.
--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)     Fax:    (204)789-3930
From - Thu Feb 4 22:09:14 1999
Return-Path: <hans-peter.nilsson@axis.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id OAA00371
for <andrew@contigo.com>; Tue, 19 Jan 1999 14:12:39 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id OAA13295;
Tue, 19 Jan 1999 14:21:26 -0800 (PST)
From: Hans-Peter Nilsson <hans-peter.nilsson@axis.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A50568.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 23:11:53 +0100
CC: htdig3-dev@htdig.org
Subject: [htdig3-dev] Re: Patch to get rid of 18% of db.words.db
* List: htdig3-dev@sob.htdig.org
> From: Geoff Hutchison <ghutchis@wso.williams.edu>
> Date: Tue Jan 19 20:59:25 CET 1999
> On Tue, 19 Jan 1999, Gilles Detillieux wrote:
>
> > really care) that makes the writing of the count out to the DB
> > conditional.
>
> Agreed. Is this OK with you Hans-Peter?
Sure. By the way, I think it needs to be a compile-time conditional (as in wrap-my-patches-in-ifdefs and a configure-option), since sizeof(WordRecord) is written and there's no note on the size of the records written to db.words.db. Or a much more elaborate patch is needed.
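
To illustrate why this has to be decided at compile time, here is a sketch with a hypothetical record layout and a hypothetical configure-time macro, `SKETCH_NO_WORD_COUNT`. The field names are assumptions and may not match the real WordRecord. Since records are written as raw `sizeof` bytes with no size note in the file, a reader and a writer built with different settings would disagree on the record size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical configure-time switch; uncomment to drop the count field.
// #define SKETCH_NO_WORD_COUNT 1

// Hypothetical record layout; field names are assumptions.
struct SketchWordRecord {
#ifndef SKETCH_NO_WORD_COUNT
    int count;
#endif
    int id;
    int weight;
    int anchor;
    int location;
};

// Size of one record as this build would write it.
std::size_t sketch_record_size()
{
    return sizeof(SketchWordRecord);
}

// A database written with a different setting has a different record
// size, so a reader can only trust data whose size matches its own build.
bool record_size_matches(std::size_t stored_size)
{
    return stored_size == sketch_record_size();
}
```

Storing the record size in the database itself would allow a run-time check, which is presumably part of what a more elaborate patch would involve.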
brgds, H-P
From - Thu Feb 4 22:09:14 1999
Return-Path: <hans-peter.nilsson@axis.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id OAA00863
for <andrew@contigo.com>; Tue, 19 Jan 1999 14:22:04 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id OAA13328;
Tue, 19 Jan 1999 14:30:53 -0800 (PST)
From: Hans-Peter Nilsson <hans-peter.nilsson@axis.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A5079E.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 23:21:19 +0100
CC: htdig3-dev@htdig.org
In-Reply-To: <36A4E85B.BeroList-2.5.5@sob.htdig.org>
Subject: [htdig3-dev] Re: Duplicate Keys in db.docs.index
* List: htdig3-dev@sob.htdig.org
> From: Geoff Hutchison <ghutchis@wso.williams.edu>
> Date: Tue Jan 19 21:08:00 CET 1999
> If other people can confirm duplicate URLs in their databases, it's a
> real problem--I'm considering it a showstopper for 3.1.0.
People with these problems might want to try my StringMatch patches, just don't forget to use version *3* of the second patch. :-( I think I saw one or two duplicates gone when I tested them. Not sure if it was at the same level of duplication, though.
brgds, H-P
From - Thu Feb 4 22:09:15 1999
Return-Path: <hans-peter.nilsson@axis.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id OAA02586
for <andrew@contigo.com>; Tue, 19 Jan 1999 14:54:06 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id PAA13396;
Tue, 19 Jan 1999 15:02:45 -0800 (PST)
From: Hans-Peter Nilsson <hans-peter.nilsson@axis.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A50F16.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 23:53:08 +0100
CC: hans-peter.nilsson@axis.com
In-reply-to: <Pine.A41.4.02.9901191751580.30814-100000@strike.wu-wien.ac.at>
Subject: [htdig3-dev] enum-assigment fixed (was: Re: DocumentRef::Serialize)
* List: htdig3-dev@sob.htdig.org
> Date: Tue, 19 Jan 1999 18:09:29 +0100 (MEZ)
> From: Alexander Bergolth <leo@strike.wu-wien.ac.at>
> On Tue, 19 Jan 1999, Hans-Peter Nilsson wrote:
> > There's trickiness in that I cannot assign (as above) and cannot
> > do a straightforward memcpy since I cannot assume the size of
> > any specific enumerated type; it may vary with the enumerations.
Anyway, here's a patch that should fix this. It looks ugly, but I could think of no better and portable way (thinking of endian, compiler and type-size issues).
Tue Jan 19 23:44:49 1999 Hans-Peter Nilsson <hp@axis.se>
* htcommon/DocumentRef.cc (MEMCPY_ASSIGN, NUM_ASSIGN): New macros for
assigning portably to some possibly-enum numeric type.
(getnum): Use them.
Index: DocumentRef.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htcommon/DocumentRef.cc,v
retrieving revision 1.18
diff -p -c -r1.18 DocumentRef.cc
*** DocumentRef.cc 1999/01/18 23:15:36 1.18
--- DocumentRef.cc 1999/01/19 22:44:14
*************** void DocumentRef::Deserialize(String &st
*** 390,405 ****
int x;
String *str;
#define getnum(type, in, var) \
if (type & CHARSIZE_MARKER_BIT) \
{ \
! var = (int) *(unsigned char *) in; \
in += sizeof(unsigned char); \
} \
else if (type & SHORTSIZE_MARKER_BIT) \
{ \
! var = (int) *(unsigned short int *) in; \
in += sizeof(unsigned short int); \
} \
else \
--- 384,422 ----
int x;
String *str;
+ // There is a problem with getting a numeric value into a
+ // numeric unknown type that may be an enum (the other way
+ // around is simply by casting (int)).
+ // Supposedly the enum incarnates as a simple type, so we can
+ // just check the size and copy the bits.
+ #define MEMCPY_ASSIGN(to, from, type) \
+ do { \
+ type _tmp = (type) (from); \
+ memcpy((char *) &(to), (char *) &_tmp, sizeof(to)); \
+ } while (0)
+
+ #define NUM_ASSIGN(to, from) \
+ do { \
+ if (sizeof(to) == sizeof(long int)) \
+ MEMCPY_ASSIGN(to, from, long int); \
+ else if (sizeof(to) == sizeof(int)) \
+ MEMCPY_ASSIGN(to, from, int); \
+ else if (sizeof(to) == sizeof(short int)) \
+ MEMCPY_ASSIGN(to, from, short int); \
+ else if (sizeof(to) == sizeof(char)) \
+ MEMCPY_ASSIGN(to, from, char); \
+ /* else fatal error here? */ \
+ } while (0)
#define getnum(type, in, var) \
if (type & CHARSIZE_MARKER_BIT) \
{ \
! NUM_ASSIGN(var, *(unsigned char *) in); \
in += sizeof(unsigned char); \
} \
else if (type & SHORTSIZE_MARKER_BIT) \
{ \
! NUM_ASSIGN(var, *(unsigned short int *) in); \
in += sizeof(unsigned short int); \
} \
else \
brgds, H-P
From - Thu Feb 4 22:09:15 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id RAA12336
for <andrew@contigo.com>; Tue, 19 Jan 1999 17:38:14 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id RAA13959;
Tue, 19 Jan 1999 17:46:58 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A5359F.BeroList-2.5.5@sob.htdig.org>
Date: Tue, 19 Jan 1999 20:37:20 -0500 (EST)
In-Reply-To: <36A4FEB3.BeroList-2.5.5@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: [htdig3-dev] Re: [htdig3-dev] Re: Zlib compression
* List: htdig3-dev@sob.htdig.org
On Tue, 19 Jan 1999, Gilles Detillieux wrote:
> That sounds reasonable to me. I'd bet that the other fields are too small
> to get decent compression anyway, but I may be wrong. In any case, if we
> only compress/decompress the DocHead as needed, that would greatly cut
> down the number of times we'd need to do that (once per document).
The only other candidate IMHO is the DocMetaDsc. In general, I doubt that this field would get much compression, but with Hans-Peter's short int patch and his url_alias patch, the others will be better.
So let's try switching compression to the DocHead methods and see how compression ratios and speed compare.
Another note is that we can consider (post 3.1.0) compressing db.wordlist.work. Of course, I'd like to scrap that file entirely, but that's another topic.
-Geoff
From - Thu Feb 4 22:09:15 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id UAA20493
for <andrew@contigo.com>; Tue, 19 Jan 1999 20:49:03 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id UAA14697;
Tue, 19 Jan 1999 20:58:04 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A5625F.BeroList-2.5.5@sob.htdig.org>
by williams.edu (PMDF V5.1-10 #24595) with ESMTP id <0F5U00B2TDCL6J@williams.edu>
for htdig3-dev@htdig.org; Tue, 19 Jan 1999 23:48:22 -0500 (EST)
Date: Tue, 19 Jan 1999 23:49:57 -0400
MIME-version: 1.0
Content-type: text/plain; charset="us-ascii"
Subject: [htdig3-dev] Status
* List: htdig3-dev@sob.htdig.org
Well, I've just about cleaned through my backlog. I apologize for not hitting the pile of patches from this weekend sooner.
I have three remaining patches to add.
* Marjolein's Translate entities patch--I'm adding this in largely as she submitted it to the list. It needs some work, but I like the idea (it also fixes that nagging problem with > ->> tag that we kludged a fix for in 3.1.0b3)
* Marjolein's Anchor patch--she submitted some changes to me that I have to merge in by hand.
* Hans-Peter's url_part_alias patch--I want to go to sleep now and I don't have time to check it.
I also recall Gilles sending me an e-mail about a ChangeLog entry that I wrote for him, but I can't find it. :-(
I should have these in tomorrow sometime. I'll make a snapshot when I do so.
-Geoff
This archive was generated by hypermail 2.0b3 on Thu Feb 04 1999 - 22:13:08 PST