[htdig3-dev] Re: [htdig3-dev] Using ${VAR}


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 21 Jan 1999 13:38:33 -0600 (CST)


* List: htdig3-dev@sob.htdig.org

According to Geoff Hutchison:
> Last week, Gilles noted a difference between the variable expansion
> documentation and the actual code--braces wouldn't be expanded.
>
> Now I may be naive, but wouldn't this fix the problem? This patch makes
> braces equivalent to parentheses.

Seems like it should do the trick! The only thing is it will also allow
improper nesting. E.g. ${VAR) or $(VAR}. No big deal, and a lot easier
than adding two or three extra states to deal with the braces. If you
add this patch, then you should also add one of the two following patches
to my wrapper enhancement, so it allows braces as well.

This one allows improper nesting:
--- ./htsearch/Display.cc.wrapper3 Mon Jan 18 17:01:34 1999
+++ ./htsearch/Display.cc Thu Jan 21 13:17:27 1999
@@ -222,8 +222,10 @@
                     header = h;
                     p[-1] = '\0';
                 }
- else if (p > h+1 && p[-1] == '(' && p[-2] == '$' &&
- p[strlen(wrap_sepr)] == ')')
+ else if (p > h+1 && p[-2] == '$' &&
+ (p[-1] == '(' || p[-1] == '{') &&
+ (p[strlen(wrap_sepr)] == ')' ||
+ p[strlen(wrap_sepr)] == '}'))
                 {
                     footer = p + strlen(wrap_sepr) + 1;
                     header = h;

This one does not allow improper nesting:
--- ./htsearch/Display.cc.wrapper3 Mon Jan 18 17:01:34 1999
+++ ./htsearch/Display.cc Thu Jan 21 13:15:00 1999
@@ -222,8 +222,9 @@
                     header = h;
                     p[-1] = '\0';
                 }
- else if (p > h+1 && p[-1] == '(' && p[-2] == '$' &&
- p[strlen(wrap_sepr)] == ')')
+ else if (p > h+1 && p[-2] == '$' &&
+ (p[-1] == '(' && p[strlen(wrap_sepr)] == ')' ||
+ p[-1] == '{' && p[strlen(wrap_sepr)] == '}'))
                 {
                     footer = p + strlen(wrap_sepr) + 1;
                     header = h;
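
For comparison, here is a minimal standalone sketch of the stricter check from the second patch. The helper name `is_wrapper_marker` and its parameters are hypothetical and not part of Display.cc; it only assumes, as the patch does, that `p` points at the separator inside the NUL-terminated buffer that starts at `h`.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not from Display.cc): returns true when the
// separator at p is enclosed as $(SEP) or ${SEP}, and rejects mixed
// pairs such as ${SEP) or $(SEP}.
bool is_wrapper_marker(const char *h, const char *p, const char *sepr)
{
    assert(h && p && sepr);
    if (p <= h + 1 || p[-2] != '$')
        return false;                       // no room for "$(" or "${"
    char open = p[-1];
    char close = p[std::strlen(sepr)];      // character just past the separator
    return (open == '(' && close == ')') ||
           (open == '{' && close == '}');
}
```

Comparing the opening and closing characters as a pair is what rules out the mixed forms; the looser patch tests each end independently.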

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:15 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id LAA16064
	for <andrew@contigo.com>; Thu, 21 Jan 1999 11:46:03 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id LAA31098;
	Thu, 21 Jan 1999 11:56:00 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A78651.BeroList-2.5.5@sob.htdig.org>
Date: Thu, 21 Jan 1999 13:45:23 -0600 (CST)
In-Reply-To: <36A730E8.BeroList-2.5.5@sob.htdig.org> from "Geoff Hutchison" at Jan 21, 99 08:40:31 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] Re: [htdig3-dev] Finally....

* List: htdig3-dev@sob.htdig.org

According to Geoff Hutchison:
> I finally finished off my backlog of patches. So I made a CVS snapshot of
> the "post feature-freeze" tree. At this point, I'd like to start focusing
> on bugs, though I'm holding two possibilities open to communal vote:
>
> * A patch from me to move zlib compression into separate files. It will
> then only be called from accesses to DocHead.

Seems to me this would really boost performance of the compression stuff, so I think it's worth a shot.

> * A patch from me or Hans-Peter to add his common_url_parts and
> url_part_aliases to htmerge/docs.cc
>
> And finally... Hans-Peter and I weren't sure about the naming of the option
> for his patch. He included it as "url_part_aliases" for now, but I prefer
> the less accurate "url_aliases"

I can't really comment as I'm not familiar with this patch. I tend to prefer the more accurate name, especially if common_url_parts & url_part_aliases are interrelated somehow.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:15 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id OAA24306
	for <andrew@contigo.com>; Thu, 21 Jan 1999 14:51:16 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id PAA31516;
	Thu, 21 Jan 1999 15:01:08 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A7B1B8.BeroList-2.5.5@sob.htdig.org>
Date: Thu, 21 Jan 1999 16:50:22 -0600 (CST)
In-Reply-To: <36A730E8.BeroList-2.5.5@sob.htdig.org> from "Geoff Hutchison" at Jan 21, 99 08:40:31 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] patch for config file include statement

* List: htdig3-dev@sob.htdig.org

Hi everyone. It's always seemed odd to me that htdig's config file didn't allow you to include other files. This would be really handy for sites that maintain multiple configs, but want to centralize the common stuff in one file. So, I've added the feature (to the 011799 source on my system), and it seems to work like a charm. The patch is below. I know it's past the feature freeze, but it's a fairly simple addition. (As John Cleese would say, "it's wah-fer thin.")

--- ./htlib/Configuration.cc.include Sat Jan 16 21:21:20 1999
+++ ./htlib/Configuration.cc Thu Jan 21 15:38:13 1999
@@ -339,6 +339,25 @@
                 len--;
             }
 
+            if (mystrcasecmp(name, "include") == 0)
+            {
+                ParsedString ps(value);
+                String str(ps.get(dict));
+                if (str[0] != '/')      // Given file name not fully qualified
+                {
+                    str = filename;     // so strip dir. name from current one
+                    len = str.lastIndexOf('/') + 1;
+                    if (len > 0)
+                        str.chop(str.length() - len);
+                    else
+                        str = "";       // No slash in current filename
+                    str << ps.get(dict);
+                }
+                Read(str.get());
+                line = 0;
+                continue;
+            }
+
             Add(name, value);
             line = 0;
         }

Example:

include: common.conf
search_results_wrapper: ${common_dir}/mywrapper.html

If the given file name is not fully qualified, it's taken relative to the directory in which the config file that uses the include statement is found. Variable expansion is permitted in the file name. Multiple includes, and nested includes are permitted as well.
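
The path rule described above can be sketched in isolation as follows. This is only an illustration using std::string rather than ht://Dig's own String class; the function name `resolve_include` and the paths in the usage note are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the rule described above: an include name
// that is not fully qualified is resolved against the directory of the
// config file that contains the include statement.
std::string resolve_include(const std::string &current_file,
                            const std::string &included)
{
    assert(!included.empty());
    if (included[0] == '/')
        return included;                    // already fully qualified
    std::string::size_type slash = current_file.rfind('/');
    if (slash == std::string::npos)
        return included;                    // current file has no directory part
    return current_file.substr(0, slash + 1) + included;
}
```

For instance, under this sketch an include of common.conf from /etc/htdig/site.conf would resolve to /etc/htdig/common.conf, while an include from a config file given without any directory is left as is.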

I was going to document it, but I couldn't decide whether I should describe it as another attribute, in attrs.html, or just add a note in cf_general.html. Suggestions? (It's not really an attribute, even though it looks like one, because it's only allowed in config files, and it's interpreted right away.)

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:15 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id QAA27738
	for <andrew@contigo.com>; Thu, 21 Jan 1999 16:12:10 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id QAA31719;
	Thu, 21 Jan 1999 16:22:10 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A7C4B4.BeroList-2.5.5@sob.htdig.org>
Date: Thu, 21 Jan 1999 17:32:27 -0600 (CST)
In-Reply-To: <36A7B1B8.BeroList-2.5.5@sob.htdig.org> from "Gilles Detillieux" at Jan 21, 99 04:50:22 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] patch for config file include statement (docs)

* List: htdig3-dev@sob.htdig.org

OK, I made up my own mind. Here's the patch for htdoc/cf_general.html:

--- ./htdoc/cf_general.html.include2 Thu Dec 10 21:26:25 1998
+++ ./htdoc/cf_general.html Thu Jan 21 17:27:03 1999
@@ -56,6 +56,23 @@
       the configuration file, it will use the default value which is
       defined in <tt>htcommon/defaults.cc</tt>.
     </p>
+    <p>
+      A configuration file can include another file, by using the special
+      &lt;name&gt;, <tt>include</tt>. The &lt;value&gt; is taken as
+      the file name of another configuration file to be read in at
+      this point. If the given file name is not fully qualified, it is
+      taken relative to the directory in which the current configuration
+      file is found. Variable expansion is permitted in the file name.
+      Multiple include statements, and nested includes are also permitted.
+    </p>
+    <dl>
+      <dt>
+        <em>Example:</em>
+      </dt>
+      <dd>
+        <tt>include: common.conf</tt>
+      </dd>
+    </dl>
     <hr noshade size="4">
     <p>
 <!-- hhmts start -->
 Last modified: Wed Jan 1 20:46:34 PST

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:15 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id SAA01560
	for <andrew@contigo.com>; Thu, 21 Jan 1999 18:48:21 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id SAA32255;
	Thu, 21 Jan 1999 18:58:23 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A7E952.BeroList-2.5.5@sob.htdig.org>
Date: Thu, 21 Jan 1999 21:47:39 -0500 (EST)
In-Reply-To: <36A7B1B8.BeroList-2.5.5@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: [htdig3-dev] Re: [htdig3-dev] patch for config file include statement

* List: htdig3-dev@sob.htdig.org

On Thu, 21 Jan 1999, Gilles Detillieux wrote:

> and it seems to work like a charm. The patch is below. I know it's past
> the feature freeze, but it's a fairly simple addition. (As John Cleese
> would say, "it's wah-fer thin.")

I'm opening this up to the first post-freeze vote. Either send me a message with your vote, or send your vote and comments to the list. +1=for, -1=against, 0=abstain

Geoff +1 (Gilles +1)

-Geoff

From - Thu Feb  4 22:09:15 1999
Return-Path: <leo@strike.wu-wien.ac.at>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id CAA15440
	for <andrew@contigo.com>; Fri, 22 Jan 1999 02:07:11 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id CAA00899;
	Fri, 22 Jan 1999 02:17:22 -0800 (PST)
From: Alexander Bergolth <leo@strike.wu-wien.ac.at>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A85036.BeroList-2.5.5@sob.htdig.org>
Date: Fri, 22 Jan 1999 11:06:27 +0100 (MEZ)
	documents
In-Reply-To: <36A6719A.BeroList-2.5.5@sob.htdig.org>
	by williams.edu (PMDF V5.1-10 #24595) with ESMTP id <0F5V00E2LUWJGB@williams.edu>
	for htdig3-dev@htdig.org; Wed, 20 Jan 1999 19:05:11 -0500 (EST)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: [htdig3-dev] Re: [htdig3-dev] Re: [htdig3-dev] Re: StringMatch and duplicate

* List: htdig3-dev@sob.htdig.org

On Wed, 20 Jan 1999, Geoff Hutchison wrote:

> At 6:11 PM -0400 1/20/99, Gilles Detillieux wrote:
>
> >A few trace prints in htmerge/docs.cc revealed the source of the 9 extra
> >documents. These were 9 documents that were disallowed by robots.txt,
> >which were deleted from the DB, because they had no DocHead, but because
> >of a missing "else", they were still indexed and counted. Here's the fix:
>
> I don't know if I believe it. That seemed to do it... After patching,
> recompiling and re-running htmerge, I get:
>
> htmerge: Total documents: 58193
> htmerge: Total doc db size (in K): 330586
>
> No complaints here. Leo, are you still seeing duplicate URLs?

Yes. :(

OK, maybe I did something wrong, I'll explain the test procedure:

I'm using db_dump from Berkeley DB to print the contents of the docs.index file and extract the URLs from this file using the following perl script:

---------- snipp! ----------
#!/usr/local/bin/perl

# Skip the db_dump header.
while ($_ ne "HEADER=END\n") { $_= <>; }

# Records are a key line followed by a data line; print only the data line.
while (<>) { $_= <>; print; }
---------- snipp! ----------

db_dump -p wu.docs.index | dump-docs.pl > wu-index.1999-01-22

sort wu-index.1999-01-22 > wu-index.1999-01-22-sorted

wc -l wu-index.1999-01-22-sorted
 125273 wu-index.1999-01-22-sorted

uniq -c wu-index.1999-01-22-sorted > wu-index.1999-01-22-uniq

wc -l wu-index.1999-01-22-uniq
  78695 wu-index.1999-01-22-uniq

:(

- Leo -

P.S.: Multiple entries are distributed as follows:

#docs   appearances
60050    1
12042    2
 2109    3
  587    4
  136    5
  146    6
  724    7
  906    8
 1431    9
  502   10
   52   11
    9   12
    1   13
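
A distribution like the one in the P.S. (how many distinct URLs appear once, twice, and so on) can also be computed directly from the list of dumped URLs. The following is only a sketch of that counting step, equivalent in spirit to sort | uniq -c followed by a second tally; the function name `appearance_histogram` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Maps "number of appearances" to "how many distinct URLs appear that often".
std::map<std::size_t, std::size_t>
appearance_histogram(const std::vector<std::string> &urls)
{
    std::map<std::string, std::size_t> counts;
    for (const std::string &u : urls)
        ++counts[u];                        // same tally as sort | uniq -c

    std::map<std::size_t, std::size_t> histogram;
    for (const auto &entry : counts)
        ++histogram[entry.second];

    // Sanity check: the buckets together cover every distinct URL once.
    std::size_t distinct = 0;
    for (const auto &bucket : histogram)
        distinct += bucket.second;
    assert(distinct == counts.size());
    return histogram;
}
```

The #docs column of such a histogram sums to the number of distinct URLs, which for the figures above is 78695, the same as the uniq line count.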

-----------------------------------------------------------------------
Alexander (Leo) Bergolth                        leo@leo.wu-wien.ac.at
WU-Wien - Zentrum fuer Informatikdienste     http://leo.wu-wien.ac.at
Info Center
In a world without walls and fences, who needs windows and gates?

From - Thu Feb  4 22:09:15 1999
Return-Path: <leo@strike.wu-wien.ac.at>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id DAA17352
	for <andrew@contigo.com>; Fri, 22 Jan 1999 03:08:17 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id DAA01062;
	Fri, 22 Jan 1999 03:18:30 -0800 (PST)
From: Alexander Bergolth <leo@strike.wu-wien.ac.at>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A85E87.BeroList-2.5.5@sob.htdig.org>
Date: Fri, 22 Jan 1999 12:07:33 +0100 (MEZ)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: [htdig3-dev] Duplicate entries in docs.index

* List: htdig3-dev@sob.htdig.org

Hi!

I just tried 2 runs with servers that produced duplicate URLs in a test-database but I didn't remove the files after the first run.

I dug using the following options:
htdig/htdig -v -i -t -s -c /scratch/leo/htdig/htdig/conf/test.conf
htmerge/htmerge -v -s -c /scratch/leo/htdig/htdig/conf/test.conf

After the second run, I found one document from the first run in the docs.index file that wasn't removed correctly.

The .docs file that htdig produces is OK, so htmerge must be the problem.

Cheers, Leo

P.S.: I did a third run without removing the databases using the first server again (having a smaller URL count than the second) and 340 of 411 URLs remained from the previous run!

-----------------------------------------------------------------------
Alexander (Leo) Bergolth                        leo@leo.wu-wien.ac.at
WU-Wien - Zentrum fuer Informatikdienste     http://leo.wu-wien.ac.at
Info Center
In a world without walls and fences, who needs windows and gates?

From - Thu Feb  4 22:09:15 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id SAA28338
	for <andrew@contigo.com>; Fri, 22 Jan 1999 18:01:29 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id SAA04702;
	Fri, 22 Jan 1999 18:11:41 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A92FF5.BeroList-2.5.5@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 22 Jan 1999 20:54:10 -0400
Subject: [htdig3-dev] Removing $Log$ from source files

* List: htdig3-dev@sob.htdig.org

Hi,

Hans-Peter and I were discussing various schemes and intrigues for future ht://Dig development. He suggested removing the CVS messages at the tops of all the source files. I agreed, though it would obviously require going through all the files and ripping them out.

Aside from that drawback, it would probably make the patch a bit bigger (but the tar file smaller). It doesn't change the source itself much.

So... Another vote, same deal as last time, either send me a message personally or to the list.

Geoff +1 (Hans-Peter +1)

BTW, the results of the previous vote were positive. Gilles' patch goes in tonight.

Have a great weekend everyone,
-Geoff

From - Thu Feb  4 22:09:15 1999
Return-Path: <leo@strike.wu-wien.ac.at>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id DAA13084
	for <andrew@contigo.com>; Sat, 23 Jan 1999 03:32:54 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id DAA05743;
	Sat, 23 Jan 1999 03:38:24 -0800 (PST)
From: Alexander Bergolth <leo@strike.wu-wien.ac.at>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36A9B4C0.BeroList-2.5.5@sob.htdig.org>
X-Sender: leo@strike.wu-wien.ac.at
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.1
Date: Sat, 23 Jan 1999 12:28:00 +0100
In-Reply-To: <36A85036.BeroList-2.5.5@sob.htdig.org>
References: <36A6719A.BeroList-2.5.5@sob.htdig.org>
	<0F5V00E2LUWJGB@williams.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Subject: [htdig3-dev] Re: StringMatch and duplicate

* List: htdig3-dev@sob.htdig.org

At 11:06 22.01.99 , Alexander Bergolth wrote:
>db_dump -p wu.docs.index | dump-docs.pl > wu-index.1999-01-22
>
>sort wu-index.1999-01-22 > wu-index.1999-01-22-sorted
>
>wc -l wu-index.1999-01-22-sorted
> 125273 wu-index.1999-01-22-sorted
>
>uniq -c wu-index.1999-01-22-sorted > wu-index.1999-01-22-uniq
>
>wc -l wu-index.1999-01-22-uniq
> 78695 wu-index.1999-01-22-uniq

Tonight I removed the old docs.index file before doing an initial dig and now the urls are unique:

speth08:/<1>htdig/db > wc -l wu-index.1999-01-23-sorted
75849 wu-index.1999-01-23-sorted
speth08:/<1>htdig/db > uniq -c wu-index.1999-01-23-sorted | wc -l
75849

Looks like some old URLs are not deleted from this database...

Btw. I noticed a significant speed decrease of the current CVS version in comparison to the CVS-tree from Dec 27th.

The last initial dig on Jan 15th completed in 3:45 hours with a max_doc_size of 1MB, the current version took 4:51 hours to complete with a max_doc_size of 512k.

I tried both versions several times and the run-time didn't vary more than 10 minutes. There are currently no known or noticeable network problems. (We even changed the ATM interface yesterday.)

Does anyone have similar experiences?

Fri Jan 15 03:07:00 MEZ 1999: htdig started, args: -t -i
Fri Jan 15 06:52:07 MEZ 1999: htdig completed
Fri Jan 15 07:29:44 MEZ 1999: htmerge completed

htdig: accounting.wu-wien.ac.at:80 411 documents
htdig: challenger.wu-wien.ac.at:80 66 documents
htdig: empire.wu-wien.ac.at:80 1183 documents
htdig: fgr.wu-wien.ac.at:80 286 documents
htdig: force.wu-wien.ac.at:80 355 documents
htdig: indi.wu-wien.ac.at:80 266 documents
htdig: miss.wu-wien.ac.at:80 16404 documents
htdig: wigeoweb.wu-wien.ac.at:80 86 documents
htdig: www.wu-wien.ac.at:80 59152 documents
htdig: wwwai.wu-wien.ac.at:80 3501 documents
htdig: wwwi.wu-wien.ac.at:80 6009 documents
htdig: zas.wu-wien.ac.at:80 60 documents
htmerge: Total documents: 79804
htmerge: Total doc db size (in K): 747715

-rw-rw-r-- 1 htdig harvest 198603776 Jan 15 07:29 /var/htdig/db/wu.docdb
-rw-rw-r-- 1 htdig harvest 128932544 Jan 15 06:51 /var/htdig/db/wu.docs
-rw-rw-r-- 1 htdig harvest  19973120 Jan 15 07:29 /var/htdig/db/wu.docs.index
-rw-rw-r-- 1 htdig harvest 303634377 Jan 15 07:23 /var/htdig/db/wu.wordlist
-rw-rw-r-- 1 htdig harvest 267509760 Jan 15 07:23 /var/htdig/db/wu.words.db

Sat Jan 23 03:07:00 MEZ 1999: htdig started, args: -t -i
Sat Jan 23 07:58:21 MEZ 1999: htdig completed
Sat Jan 23 08:30:03 MEZ 1999: htmerge completed

htdig: accounting.wu-wien.ac.at:80 412 documents
htdig: challenger.wu-wien.ac.at:80 66 documents
htdig: empire.wu-wien.ac.at:80 1188 documents
htdig: fgr.wu-wien.ac.at:80 296 documents
htdig: force.wu-wien.ac.at:80 338 documents
htdig: indi.wu-wien.ac.at:80 268 documents
htdig: miss.wu-wien.ac.at:80 11934 documents
htdig: wigeoweb.wu-wien.ac.at:80 83 documents
htdig: www.wu-wien.ac.at:80 60239 documents
htdig: wwwai.wu-wien.ac.at:80 3398 documents
htdig: wwwi.wu-wien.ac.at:80 6136 documents
htdig: zas.wu-wien.ac.at:80 60 documents
htmerge: Total documents: 75865
htmerge: Total doc db size (in K): 572179

-rw-rw-r-- 1 htdig harvest 173246464 Jan 23 08:29 /var/htdig/db/wu.docdb
-rw-rw-r-- 1 htdig harvest 119229661 Jan 23 07:58 /var/htdig/db/wu.docs
-rw-rw-r-- 1 htdig harvest  10503168 Jan 23 08:29 /var/htdig/db/wu.docs.index
-rw-rw-r-- 1 htdig harvest 286257559 Jan 23 08:25 /var/htdig/db/wu.wordlist
-rw-rw-r-- 1 htdig harvest 256347136 Jan 23 08:25 /var/htdig/db/wu.words.db
-----------------------------------------------------------------------
Alexander (Leo) Bergolth                        leo@leo.wu-wien.ac.at
WU-Wien - Zentrum fuer Informatikdienste     http://leo.wu-wien.ac.at
Info Center
In a world without walls and fences, who needs windows and gates?

From - Thu Feb  4 22:09:15 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id JAA24439
	for <andrew@contigo.com>; Sat, 23 Jan 1999 09:58:26 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id KAA06629;
	Sat, 23 Jan 1999 10:09:09 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AA1051.BeroList-2.5.5@sob.htdig.org>
In-Reply-To: <36A85E87.BeroList-2.5.5@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 23 Jan 1999 12:56:22 -0400
Subject: [htdig3-dev] Re: [htdig3-dev] Duplicate entries in docs.index

* List: htdig3-dev@sob.htdig.org

At 7:07 AM -0400 1/22/99, Alexander Bergolth wrote:
>After the second run, I found one document from the first run in the
>docs.index file that wasn't removed correctly.
>
>The .docs file that htdig produces is OK, so htmerge must be the problem.

>P.S.: I did a third run without removing the databases using the first
>server again (having a smaller URL count than the second) and 340 of 411
>URLs remained from the previous run!

Thanks Leo, I think I have it nailed now. This is similar to the bug with the db.words.db (the word version of docs.index) that we nailed for 3.1.0b3.

The fix is easy and it explains why I'm not seeing it. I do all of my digs with -a and I never keep the db.docs.index.work file. So I essentially do what you did during testing--I remove the file before doing a dig.

So that's the fix! We unlink the db.docs.index file before htmerge does anything. This way we generate a clean version, free of duplicates. I'll put it in the tree tonight. I bet it will be 1-2 lines to fix :-(.
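
As a rough illustration of the "unlink before merging" idea (not the actual htmerge change, which is not shown in this thread), a helper along these lines would do. The name `remove_stale_index` is hypothetical, and the errno check assumes a POSIX-like system where a failed remove of a missing file sets ENOENT.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper: delete a stale index file so the merge step starts
// from a clean file. Returns true if the file is gone afterwards, whether
// it was removed now or never existed.
bool remove_stale_index(const std::string &path)
{
    assert(!path.empty());
    if (std::remove(path.c_str()) == 0)
        return true;                        // file existed and was removed
    return errno == ENOENT;                 // already absent counts as success
}
```

Treating a missing file as success matters here, since a first run has no old index to delete.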

-Geoff

From - Thu Feb  4 22:09:15 1999
Return-Path: <dgautheron@magic.fr>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id NAA30260
	for <andrew@contigo.com>; Sat, 23 Jan 1999 13:04:53 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA06895;
	Sat, 23 Jan 1999 13:15:45 -0800 (PST)
From: Didier Gautheron <dgautheron@magic.fr>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AA3C04.BeroList-2.5.5@sob.htdig.org>
Sender: didier@venise.magic.fr
Date: Sat, 23 Jan 1999 20:27:03 +0000
X-Mailer: Mozilla 4.05 [en] (X11; I; Linux 2.0.33 i686)
MIME-Version: 1.0
References: <36A6719A.BeroList-2.5.5@sob.htdig.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] Re: [htdig3-dev] Re: StringMatch and duplicate

* List: htdig3-dev@sob.htdig.org

Alexander Bergolth wrote:
> The last initial dig on Jan 15th completed in 3:45 hours with a
> max_doc_size of 1MB, the current version took 4:51 hours to complete with a
> max_doc_size of 512k.
>
> I tried both versions several times and the run-time didn't vary more than
> 10 minutes. There are currently no known or noticeable network problems. (We
> even changed the ATM interface yesterday.)
>
> Does anyone have similar experiences?

Could you rerun the old version, maybe with profiling on and/or with time?

>
> Fri Jan 15 03:07:00 MEZ 1999: htdig started, args: -t -i
> Fri Jan 15 06:52:07 MEZ 1999: htdig completed
> Fri Jan 15 07:29:44 MEZ 1999: htmerge completed
>
> htmerge: Total documents: 79804
> htmerge: Total doc db size (in K): 747715
>
> Sat Jan 23 03:07:00 MEZ 1999: htdig started, args: -t -i
> Sat Jan 23 07:58:21 MEZ 1999: htdig completed
> Sat Jan 23 08:30:03 MEZ 1999: htmerge completed
>
> htdig: zas.wu-wien.ac.at:80 60 documents
> htmerge: Total documents: 75865
> htmerge: Total doc db size (in K): 572179

And you have a smaller db :(

Didier

From - Thu Feb  4 22:09:15 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id RAA07258
	for <andrew@contigo.com>; Sat, 23 Jan 1999 17:56:07 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id SAA08688;
	Sat, 23 Jan 1999 18:07:07 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AA8052.BeroList-2.5.5@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 23 Jan 1999 20:55:14 -0400
Subject: [htdig3-dev] What's left for 3.1.0

* List: htdig3-dev@sob.htdig.org

Ok, now that we're essentially feature-frozen, I'll list some remaining work before 3.1.0 goes out the door and I'll be taking a break. If anyone has questions or can think of something I've forgotten, let me know.

REPORTED SHOWSTOPPERS:
* htdig loops forever when the server sends a message-length different from
  what's sent.
* htdig coredumps when calling strftime (PR#81)
* htsearch can coredump if a file in template_map doesn't exist
* Add config option "omit_default_doc" to decide whether we strip off
  index.html (or local_default_doc) since some servers wreak havoc with this
  behavior.

OTHER BUGS:
* URLs are translated to lowercase before being stored in the database
* Double slashes are eliminated even if they're part of a CGI query string.
* The characters '-")|' when seen in <title> tags show up in excerpts.
* Problems with valid_punctuation and excerpt highlighting (e.g. "I'll" isn't
  highlighted in excerpts)

ISSUES:
* Remove $Log$ from source files (Geoff +1, Hans-Peter +1, Joe +1)
* Fix SGMLEntities to use StringMatch
* Move DocumentRef compression to DocHead methods
* Conditional elimination of word counts in WordRecord and db.wordlist
* Run db merge code with sort -m for performance boost
* If a server ignores the If-Modified-Since header, we should compare the
  timestamp with DocTime() to see if we have the current version of the doc

-Geoff

From - Thu Feb  4 22:09:15 1999
Return-Path: <dgautheron@magic.fr>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id QAA22438
	for <andrew@contigo.com>; Sun, 24 Jan 1999 16:53:38 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id RAA15361;
	Sun, 24 Jan 1999 17:04:56 -0800 (PST)
From: Didier Gautheron <dgautheron@magic.fr>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36ABC34C.BeroList-2.5.5@sob.htdig.org>
Sender: didier@venise.magic.fr
Date: Mon, 25 Jan 1999 01:07:58 +0000
X-Mailer: Mozilla 4.05 [en] (X11; I; Linux 2.0.33 i686)
MIME-Version: 1.0
References: <36A6719A.BeroList-2.5.5@sob.htdig.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] Re: [htdig3-dev] Re: StringMatch and duplicate

* List: htdig3-dev@sob.htdig.org

Alexander Bergolth wrote: > > * List: htdig3-dev@sob.htdig.org > > At 11:06 22.01.99 , Alexander Bergolth wrote: > >db_dump -p wu.docs.index | dump-docs.pl > wu.index.1999-01-22 > > > >sort wu-index.1999-01-22 > wu-index.1999-01-22-sorted > > > >wc -l wu-index.1999-01-22-sorted > > 125273 wu-index.1999-01-22-sorted > > > >uniq -c wu-index.1999-01-22-sorted > wu-index.1999-01-22-uniq > > > >wc -l wu-index.1999-01-22-uniq > > 78695 wu-index.1999-01-22-uniq > > Tonight I removed the old docs.index file before doing an initial dig and > now the urls are unique: > > speth08:/<1>htdig/db > wc -l wu-index.1999-01-23-sorted > 75849 wu-index.1999-01-23-sorted > speth08:/<1>htdig/db > uniq -c wu-index.1999-01-23-sorted | wc -l > 75849 > > Looks like some old URLs are not deleted from this database... > > Btw. I noticed a significant speed decrease of the current CVS version in > comparison to the CVS-tree from Dec 27th. > > The last initial dig on Jan 15th completed in 3:45 hours with a > max_doc_size of 1MB, the current Version took 4:51 hours to complete with a > max_doc_size of 512k. > > I tried both versions several times and the run-time didn't vary more than > 10 minutes. There are currently no known or noticable network problems. (We > even changed the ATM interface yesterday.) > > Does anyone have similar experiences? The problen is in HTML::parse the skip_start stuff have to be declare static and move out of the loop. Didier ------------------------------------ To unsubscribe from the htdig3-dev mailing list, send a message to htdig3-dev@htdig.org containing the single word "unsubscribe" in the SUBJECT of the message. 
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id LAA28241
	for <andrew@contigo.com>; Mon, 25 Jan 1999 11:18:38 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id LAA24188;
	Mon, 25 Jan 1999 11:30:03 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36ACC657.BeroList-2.5.5@sob.htdig.org>
Date: Mon, 25 Jan 1999 13:17:17 -0600 (CST)
In-Reply-To: <36ABC34C.BeroList-2.5.5@sob.htdig.org> from "Didier Gautheron" at Jan 25, 99 01:07:58 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] Re: StringMatch and duplicate

* List: htdig3-dev@sob.htdig.org

According to Didier Gautheron:
> Alexander Bergolth wrote:
> > Btw. I noticed a significant speed decrease of the current CVS version in
> > comparison to the CVS-tree from Dec 27th.
> >
> > The last initial dig on Jan 15th completed in 3:45 hours with a
> > max_doc_size of 1MB, the current Version took 4:51 hours to complete with a
> > max_doc_size of 512k.
> >
> > I tried both versions several times and the run-time didn't vary more than
> > 10 minutes. There are currently no known or noticeable network problems. (We
> > even changed the ATM interface yesterday.)
> >
> > Does anyone have similar experiences?
> The problem is in HTML::parse: the skip_start stuff has to be declared
> static and moved out of the loop.
> Didier

I've noticed a few changes recently where config attribute lookups are buried deep in loops. I can understand the desire to keep changes localised, but that must be balanced against a desire to keep performance good. Any config lookups should be considered expensive (take a look at what Configuration::Find() does!), and moved out of loops to static variables, as Didier suggested.
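
The cost difference can be illustrated with a minimal sketch. Everything in it is hypothetical: `FakeConfig`, `g_config`, and the two counting functions are stand-ins invented for illustration, not ht://Dig's `Configuration` class or any real call site. The attribute name is only an example key.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a global configuration object (not ht://Dig's class).
struct FakeConfig {
    std::map<std::string, std::string> values;
    std::size_t lookups = 0;  // how many times find() has been called

    std::string find(const std::string &key) {
        ++lookups;
        auto it = values.find(key);
        return it == values.end() ? std::string() : it->second;
    }
};

FakeConfig g_config;  // assumed global, set up before either function is called

// Lookup buried in the loop: one find() and one string conversion per word.
std::size_t countLongWordsSlow(const std::vector<std::string> &words) {
    std::size_t count = 0;
    for (const std::string &w : words) {
        std::size_t minLen = std::stoul(g_config.find("minimum_word_length"));
        if (w.size() >= minLen)
            ++count;
    }
    return count;
}

// Lookup hoisted into a function-local static: find() runs once, on the
// first call, and every later iteration and call reuses the cached value.
std::size_t countLongWordsFast(const std::vector<std::string> &words) {
    static const std::size_t minLen =
        std::stoul(g_config.find("minimum_word_length"));
    std::size_t count = 0;
    for (const std::string &w : words)
        if (w.size() >= minLen)
            ++count;
    return count;
}
```

The trade-off of the static is that a later change to the configuration value is not picked up, which is acceptable only when the configuration is fixed for the life of the process.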

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id LAA30127
	for <andrew@contigo.com>; Mon, 25 Jan 1999 11:55:47 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA25705;
	Mon, 25 Jan 1999 12:07:47 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36ACCF14.BeroList-2.5.9@sob.htdig.org>
Date: Mon, 25 Jan 1999 14:55:06 -0500 (EST)
In-Reply-To: <36ACC657.BeroList-2.5.5@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Re: [htdig3-dev] Re: StringMatch and duplicate

On Mon, 25 Jan 1999, Gilles Detillieux wrote:

> I've noticed a few changes recently where config attribute lookups
> are buried deep in loops. I can understand the desire to keep changes
> localised, but that must be balanced against a desire to keep performance
> good. Any config lookups should be considered expensive (take a look
> at what Configuration::Find() does!), and moved out of loops to static
> variables, as Didier suggested.

Agreed. I took the two Didier mentioned out of the loop in HTML.cc. If anyone wants to undertake this, I'll be happy to see it.

On a similar note, someone asked what I will accept during feature-freeze. Bug-fixes... But in addition to bug reports (on the list and to the bug database), I include most performance improvements and cleanup efforts as bug-fixes. So if someone wants to take variables out of loops for any files, I'll take the patches.

On that note, I haven't heard any complaints against removing $Log$ from files. If I don't hear any before tonight, I'm going to remove them from all files. (That's a big patch!) This would go in the "cleanup effort" realm.

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from wso.williams.edu (wso.williams.edu [137.165.37.207])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id LAA30191
	for <andrews@contigo.com>; Mon, 25 Jan 1999 11:57:08 -0800
Received: from localhost (ghutchis@localhost)
	by wso.williams.edu (8.9.2/8.9.2/Debian/GNU) with SMTP id OAA10003
	for <andrews@contigo.com>; Mon, 25 Jan 1999 14:57:07 -0500 (EST)
Date: Mon, 25 Jan 1999 14:57:06 -0500 (EST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
To: andrews@contigo.com
Subject: Re: [htdig3-dev] Re: StringMatch and duplicate (fwd)
Message-ID: <Pine.LNX.3.96.990125145543.9643B-100000@wso.williams.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Andrew,

This was sent using the source-compiled BeroList 2.5.9. Note it went through, it doesn't have multiple subject prefixes and it doesn't have the List: line. Should we move htdig@htdig.org now?

BTW, perhaps we want to clean out the old message archive before we move it. :-)

-Geoff

---------- Forwarded message ----------
Date: Mon, 25 Jan 1999 14:55:06 -0500 (EST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Subject: Re: [htdig3-dev] Re: StringMatch and duplicate

On Mon, 25 Jan 1999, Gilles Detillieux wrote:

> I've noticed a few changes recently where config attribute lookups
> are buried deep in loops. I can understand the desire to keep changes
> localised, but that must be balanced against a desire to keep performance
> good. Any config lookups should be considered expensive (take a look
> at what Configuration::Find() does!), and moved out of loops to static
> variables, as Didier suggested.

Agreed. I took the two Didier mentioned out of the loop in HTML.cc. If anyone wants to undertake this, I'll be happy to see it.

On a similar note, someone asked what I will accept during feature-freeze. Bug-fixes... But in addition to bug reports (on the list and to the bug database), I include most performance improvements and cleanup efforts as bug-fixes. So if someone wants to take variables out of loops for any files, I'll take the patches.

On that note, I haven't heard any complaints against removing $Log$ from files. If I don't hear any before tonight, I'm going to remove them from all files. (That's a big patch!) This would go in the "cleanup effort" realm.

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <andrews@contigo.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA31548
	for <andrew@contigo.com>; Mon, 25 Jan 1999 12:15:18 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA25787;
	Mon, 25 Jan 1999 12:27:16 -0800 (PST)
From: Andrew Scherpbier <andrews@contigo.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36ACD3A5.BeroList-2.5.9@sob.htdig.org>
Sender: turtle@contigo.com
Date: Mon, 25 Jan 1999 12:14:35 -0800
Organization: Contigo Software
X-Mailer: Mozilla 4.5 [en] (X11; I; Linux 2.1.131 i686)
X-Accept-Language: en
MIME-Version: 1.0
References: <36ACCF14.BeroList-2.5.9@sob.htdig.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Re: StringMatch and duplicate

Geoff Hutchison wrote:
>
> On that note, I haven't heard any complaints against removing $Log$ from
> files. If I don't hear any before tonight, I'm going to remove them from
> all files. (That's a big patch!) This would go in the "cleanup effort"
> realm.

The following is my $.02:

I have to apologize for originally adding the $Log$ crap; I claim inexperience at the time I added those darn things... :-) Anyway, we went through the same thing at Contigo Software (again, it was my fault!) but we decided not to remove them all at once. Instead, we decided that anyone who made a change to a file would be responsible for taking out the $Log$ stuff. That way, we didn't have to change all files (our software contained at least an order of magnitude more files than ht://Dig...) but it would still get done eventually. I went through a similar exercise with ht://Dig when I reformatted all source files to use 4 space indents instead of tabs that were assumed to be 4 spaces. (I did this long before the CVS tree was public...)

So, my suggestion would be to *not* go through all files at once, but instead do it incrementally.

-- 
Andrew Scherpbier <andrews@contigo.com>
Contigo Software <http://www.contigo.com/>
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from wso.williams.edu (wso.williams.edu [137.165.37.207])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA32582
	for <andrews@contigo.com>; Mon, 25 Jan 1999 12:35:19 -0800
Received: from localhost (ghutchis@localhost)
	by wso.williams.edu (8.9.2/8.9.2/Debian/GNU) with SMTP id PAA12155
	for <andrews@contigo.com>; Mon, 25 Jan 1999 15:35:18 -0500 (EST)
Date: Mon, 25 Jan 1999 15:35:18 -0500 (EST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
To: Andrew Scherpbier <andrews@contigo.com>
Subject: Re: [htdig3-dev] Re: StringMatch and duplicate (fwd)
In-Reply-To: <36ACD134.A055DFF5@contigo.com>
Message-ID: <Pine.LNX.3.96.990125153451.11855B-100000@wso.williams.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Mon, 25 Jan 1999, Andrew Scherpbier wrote:

> Okay. Do you want to do the honors of sending the message to the folks on the
> list?

Done.

> When that is done, I'll send a message to the sdsu folks about having them
> redirect to htdig.org.

> Most definitely!!!! :-)

Done. :-)

Cheers,
-Geoff
From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA00403
	for <andrew@contigo.com>; Mon, 25 Jan 1999 12:40:13 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA25948;
	Mon, 25 Jan 1999 12:52:09 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36ACD97F.BeroList-2.5.9@sob.htdig.org>
Date: Mon, 25 Jan 1999 15:39:28 -0500 (EST)
In-Reply-To: <36ACD3A5.BeroList-2.5.9@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: [htdig3-dev] $Log$ was (Re: StringMatch and duplicate)

On Mon, 25 Jan 1999, Andrew Scherpbier wrote:

> So, my suggestion would be to *not* go through all files at once, but instead
> do it incrementally.

Ooo. Sounds like a nice plan. This way we also clean up the longest $Log$ messages first.

BTW, I probably would have made the move to include $Log$ too. It *sounds* like a great idea, especially when updates are infrequent. It's only looking at say, HTML.cc or Display.cc right now that it shows.

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <andrew@contigo.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA00765
	for <andrew@contigo.com>; Mon, 25 Jan 1999 12:47:07 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA25970;
	Mon, 25 Jan 1999 12:59:04 -0800 (PST)
From: Andrew Scherpbier <andrew@contigo.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36ACDB19.BeroList-2.5.9@sob.htdig.org>
Sender: turtle@contigo.com
Date: Mon, 25 Jan 1999 12:46:23 -0800
Organization: Contigo Software
X-Mailer: Mozilla 4.5 [en] (X11; I; Linux 2.1.131 i686)
X-Accept-Language: en
MIME-Version: 1.0
References: <36ACD97F.BeroList-2.5.9@sob.htdig.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] $Log$ was (Re: StringMatch and duplicate)

Geoff Hutchison wrote:
>
> On Mon, 25 Jan 1999, Andrew Scherpbier wrote:
>
> > So, my suggestion would be to *not* go through all files at once, but instead
> > do it incrementally.
>
> Ooo. Sounds like a nice plan. This way we also clean up the longest $Log$
> messages first.
>
> BTW, I probably would have made the move to include $Log$ too. It *sounds*
> like a great idea, especially when updates are infrequent. It's only
> looking at say, HTML.cc or Display.cc right now that it shows.
>
> -Geoff

We (Contigo Software) actually went through an intermediate step: we moved the $Log$ to the bottom of files. This made it somewhat less annoying, but when these logs started to get *much* longer than the actual code, we decided that enough was enough and started to remove them. :-) Our current template for new Java source files includes just $Id$ at the top. (plus javadoc style docs, of course!)

-- 
Andrew Scherpbier <andrews@contigo.com>
Contigo Software <http://www.contigo.com/>
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <leo@strike.wu-wien.ac.at>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id DAA00762
	for <andrew@contigo.com>; Tue, 26 Jan 1999 03:29:40 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id DAA30429;
	Tue, 26 Jan 1999 03:29:07 -0800 (PST)
From: Alexander Bergolth <leo@strike.wu-wien.ac.at>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36ADA713.BeroList-2.5.9@sob.htdig.org>
X-Sender: leo@strike.wu-wien.ac.at
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.1 
Date: Tue, 26 Jan 1999 12:27:47 +0100
Cc: htdig3-dev@htdig.org
In-Reply-To: <36ABC34C.BeroList-2.5.5@sob.htdig.org>
References: <36A6719A.BeroList-2.5.5@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Subject: Re: [htdig3-dev] Re: [htdig3-dev] Re: StringMatch and duplicate

At 02:07 25.01.99 , Didier Gautheron wrote:
>Alexander Bergolth wrote:
>> Btw. I noticed a significant speed decrease of the current CVS version in
>> comparison to the CVS-tree from Dec 27th.
>> Does anyone have similar experiences?

>The problem is in HTML::parse: the skip_start stuff has to be declared
>static and moved out of the loop.

Thanx! Now it's dashing at warp speed again...

Cheers, Leo

-----------------------------------------------------------------------
Alexander (Leo) Bergolth                          leo@leo.wu-wien.ac.at
WU-Wien - Zentrum fuer Informatikdienste       http://leo.wu-wien.ac.at
Info Center
In a world without walls and fences, who needs windows and gates?

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA25178
	for <andrew@contigo.com>; Tue, 26 Jan 1999 12:06:13 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA05324;
	Tue, 26 Jan 1999 12:06:08 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AE2031.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36AE1773.BeroList-2.5.9@sob.htdig.org>
References: <36AA8052.BeroList-2.5.5@sob.htdig.org> from "Geoff Hutchison" at Jan 23, 99 08:55:14 pm
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 26 Jan 1999 15:03:12 -0400
Subject: Re: [htdig3-dev] Re: What's left for 3.1.0

>According to Geoff Hutchison:
>Hmmm. That's strange, as htdig doesn't even look at the Content-length
>header when retrieving from the HTTP server. It just reads until the
>read() request returns 0 bytes (an EOF). Maybe this particular server,
>at M.I.T. according to the bugs DB, wasn't closing the socket properly?

That's what I was thinking. Since I don't have an actual address, I'm kinda stuck. I think our behavior should *avoid* the problem mentioned.

>> * htdig coredumps when calling strftime (PR#81)
>which oddly has an address that's different than the mmap call - this leads
>me to think that the memory corruption happened while processing the
>zoneinfo file, so maybe he has a corrupt /usr/lib/zoneinfo/localtime?

Now that's a good point. I could understand the prior problems when we got back a NULL and sent it on its way to blow up in our faces. But that's not happening (and I have a conditional to prevent it).

>> * htsearch can coredump if a file in template_map doesn't exist
>here. If the person who reported this problem can be persuaded to test
>out the current snapshot or CVS tree, great, but otherwise I think this
>problem is solved already.

I would tend to agree here. I included the remark simply because I thought it needed another testing round before I was happy. I did that as well and it looks fine.

>pattern would be wrong. I think this second usage should be changed over
>to a separate attribute, e.g. remove_default_doc, which would be a string
>list, and if empty, nothing would be removed. local_default_doc would
>then revert to its previous local_urls only function. E.g.:

That about mirrors my thinking as well. I'd like to get Retriever to use a StringList, but it's not as easy as I'd like and I haven't had a chance to do it.

Thanks, -Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id LAA23158
	for <andrew@contigo.com>; Tue, 26 Jan 1999 11:28:57 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id LAA05135;
	Tue, 26 Jan 1999 11:28:33 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AE1773.BeroList-2.5.9@sob.htdig.org>
Date: Tue, 26 Jan 1999 13:27:52 -0600 (CST)
Cc: grdetil@scrc.umanitoba.ca (Gilles Detillieux)
In-Reply-To: <36AA8052.BeroList-2.5.5@sob.htdig.org> from "Geoff Hutchison" at Jan 23, 99 08:55:14 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] Re: What's left for 3.1.0

According to Geoff Hutchison:
> Ok, now that we're essentially feature-frozen, I'll list some remaining
> work before 3.1.0 goes out the door and I'll be taking a break. If anyone
> has questions or can think of something I've forgotten, let me know.
>
> REPORTED SHOWSTOPPERS:
> * htdig loops forever when the server sends a message-length different from
> what's sent.

Hmmm. That's strange, as htdig doesn't even look at the Content-length header when retrieving from the HTTP server. It just reads until the read() request returns 0 bytes (an EOF). Maybe this particular server, at M.I.T. according to the bugs DB, wasn't closing the socket properly?
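
The read-until-EOF behaviour described above can be sketched as follows. This is an illustration only, assuming a POSIX system; `readUntilEof` is a hypothetical helper, not the code htdig uses. It shows why a peer that never closes its end would leave such a loop waiting: nothing but a `read()` result of 0 (or an error) ends it, and no length header is consulted.

```cpp
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical helper: collect everything from a descriptor until read()
// reports end-of-file. No Content-length is consulted, so the loop ends
// only when the other side closes the connection or an error occurs.
std::string readUntilEof(int fd) {
    std::string data;
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0)  // 0 means EOF; a negative value means an error
            break;
        data.append(buf, static_cast<std::size_t>(n));
    }
    return data;
}
```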

> * htdig coredumps when calling strftime (PR#81)

I've been scratching my head over this one for some time. The code looks fine to me. I can't imagine why strftime would segfault. The buffer it's given is fine, as is the pointer to the tm structure. The strace that the user gave suggests the error happened during or just after the munmap call, which oddly has an address that's different than the mmap call - this leads me to think that the memory corruption happened while processing the zoneinfo file, so maybe he has a corrupt /usr/lib/zoneinfo/localtime?

> * htsearch can coredump if a file in template_map doesn't exist

I think there may have been a problem in the past, before the check to make sure the template_map string list had a multiple of 3 entries. If the file name was missing from the last triad, Template::readFile() would have called fopen() with a char * NULL as the file name, which could cause a core dump on some systems. A file that doesn't exist would just cause the template strings not to be set, so they'd remain empty. The Display.cc code seems to handle that situation correctly - expandVariables treats a char * NULL as equivalent to "", and with the new changes to String.cc, a String::get() now returns "" instead of NULL when the String is empty, so there's even less of a chance of a problem here. If the person who reported this problem can be persuaded to test out the current snapshot or CVS tree, great, but otherwise I think this problem is solved already.

> * Add config option "omit_default_doc" to decide whether we strip off
> index.html (or local_default_doc) since some servers wreak havoc with this
> behavior.

There are a couple of problems with the local_default_doc stuff as it stands now. 1) htdig/Retriever.cc treats this attribute as a single string, and it's only used for local_urls there, whereas the recent change to htlib/URL.cc treats local_default_doc as a string list. 2) the Join() function is given a lower-case "l" instead of a vertical bar "|" as the separator, so if local_default_doc ever was used as a string list, the pattern would be wrong. I think this second usage should be changed over to a separate attribute, e.g. remove_default_doc, which would be a string list, and if empty, nothing would be removed. local_default_doc would then revert to its previous local_urls-only function. E.g.:

--- ./htcommon/defaults.cc.defdoc Thu Jan 21 07:41:50 1999
+++ ./htcommon/defaults.cc Tue Jan 26 13:23:31 1999
@@ -239,6 +239,7 @@
     {"prefix_match_character", "*"},
     {"prev_page_text", "[prev]"},
     {"remove_bad_urls", "true"},
+    {"remove_default_doc", "index.html"},
     {"robotstxt_name", "htdig"},
     {"search_algorithm", "exact:1"},
     {"search_results_footer", "${common_dir}/footer.html"},
--- ./htlib/URL.cc.defdoc Thu Jan 14 22:37:17 1999
+++ ./htlib/URL.cc Tue Jan 26 13:22:29 1999
@@ -462,7 +462,7 @@
 
 //*****************************************************************************
 // void URL::removeIndex(String &path)
-//   Attempt to remove the local_default_doc from the end of a URL path.
+//   Attempt to remove the remove_default_doc from the end of a URL path.
 //   This needs to be done to normalize the paths and make .../ the
 //   same as .../index.html
 //
@@ -479,11 +479,14 @@
 
     if (! defaultdoc)
     {
-      StringList l(config["local_default_doc"], " \t");
+      StringList l(config["remove_default_doc"], " \t");
       defaultdoc = new StringMatch();
-      defaultdoc->Pattern(l.Join('l'));
+      defaultdoc->IgnoreCase();
+      defaultdoc->Pattern(l.Join('|'));
+      l.Release();
     }
-    if (defaultdoc->FindFirstWord(path.sub(filename)) >= 0)
+    if (defaultdoc->hasPattern() &&
+            defaultdoc->FindFirstWord(path.sub(filename)) >= 0)
       path.chop(path.length() - filename);
 }
 

I don't know if the IgnoreCase() and l.Release() are needed or not, but I put them in for good measure. Feel free to change this as you see fit.
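
To make the separator mix-up concrete, here is a small self-contained sketch. The `join` function is a hypothetical stand-in written for this illustration; it is not ht://Dig's `StringList::Join()`, whose exact behaviour is not shown in this thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a list-join helper: concatenates the items
// with a single separator character between neighbours.
std::string join(const std::vector<std::string> &items, char sep) {
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i > 0)
            out += sep;
        out += items[i];
    }
    return out;
}
```

With the two entries `index.html` and `default.htm`, joining on `'|'` gives `index.html|default.htm`, which reads as two alternatives, while joining on `'l'` gives `index.htmlldefault.htm`, a single string that matches neither file name. An empty list joins to an empty string, which is the case the hasPattern() test in the patch guards against.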

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA25599
	for <andrew@contigo.com>; Tue, 26 Jan 1999 12:17:57 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA05416;
	Tue, 26 Jan 1999 12:17:51 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AE22F0.BeroList-2.5.9@sob.htdig.org>
Date: Tue, 26 Jan 1999 14:17:02 -0600 (CST)
In-Reply-To: <36AA8052.BeroList-2.5.5@sob.htdig.org> from "Geoff Hutchison" at Jan 23, 99 08:55:14 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] Re: What's left for 3.1.0

According to Geoff Hutchison:
> OTHER BUGS:
..
> * Double slashes are eliminated even if they're part of a CGI query string.

How's this?

--- ./htlib/URL.cc.slashbug Tue Jan 26 13:22:29 1999
+++ ./htlib/URL.cc Tue Jan 26 14:05:05 1999
@@ -384,7 +384,10 @@
     // We will rewrite the path to be the minimal.
     //
     int i, limit;
-    while ((i = _path.indexOf("/../")) >= 0)
+    int pathend = _path.indexOf('?'); // Don't mess up query strings.
+    if (pathend < 0)
+        pathend = _path.length();
+    while ((i = _path.indexOf("/../")) >= 0 && i < pathend)
     {
         if ((limit = _path.lastIndexOf('/', i - 1)) >= 0)
         {
@@ -397,39 +400,51 @@
         {
             _path = _path.sub(i + 3).get();
         }
+        pathend = _path.indexOf('?');
+        if (pathend < 0)
+            pathend = _path.length();
     }
 
     //
     // Also get rid of redundent "/./". This could cause infinite
     // loops.
     //
-    while ((i = _path.indexOf("/./")) >= 0)
+    while ((i = _path.indexOf("/./")) >= 0 && i < pathend)
     {
         String newPath;
         newPath << _path.sub(0, i).get();
         newPath << _path.sub(i + 2).get();
         _path = newPath;
+        pathend = _path.indexOf('?');
+        if (pathend < 0)
+            pathend = _path.length();
     }
 
     //
     // Furthermore, get rid of "//". This could also cause loops
     //
-    while ((i = _path.indexOf("//")) >= 0)
+    while ((i = _path.indexOf("//")) >= 0 && i < pathend)
     {
         String newPath;
         newPath << _path.sub(0, i).get();
         newPath << _path.sub(i + 1).get();
         _path = newPath;
+        pathend = _path.indexOf('?');
+        if (pathend < 0)
+            pathend = _path.length();
     }
 
     // Finally change all "%7E" to "~" for sanity
-    while ((i = _path.indexOf("%7E")) >= 0)
+    while ((i = _path.indexOf("%7E")) >= 0 && i < pathend)
     {
         String newPath;
         newPath << _path.sub(0, i).get();
         newPath << "~";
         newPath << _path.sub(i + 3).get();
         _path = newPath;
+        pathend = _path.indexOf('?');
+        if (pathend < 0)
+            pathend = _path.length();
     }
 }
 
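
The idea behind the patch, limiting each rewrite to the part of the path before the first '?', can be shown in isolation. The sketch below uses `std::string` rather than ht://Dig's `String` class, handles only the "//" case, and `collapseDoubleSlashes` is a hypothetical name. Unlike the patch, which recomputes `pathend` with indexOf() after every rewrite, it simply decrements the boundary, since each erase removes exactly one character before it.

```cpp
#include <string>

// Collapse "//" to "/" in the path portion only; everything from the
// first '?' onward (the query string) is left untouched.
std::string collapseDoubleSlashes(std::string path) {
    std::string::size_type pathEnd = path.find('?');
    if (pathEnd == std::string::npos)
        pathEnd = path.size();

    std::string::size_type i;
    while ((i = path.find("//")) != std::string::npos && i < pathEnd) {
        path.erase(i, 1);  // drop one of the two slashes
        --pathEnd;         // the query string moved one position left
    }
    return path;
}
```

For example, `/a//b?x=//y` becomes `/a/b?x=//y`: the double slash in the path is collapsed and the one in the query string survives.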

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id MAA25695
	for <andrew@contigo.com>; Tue, 26 Jan 1999 12:20:10 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA05433;
	Tue, 26 Jan 1999 12:20:01 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AE2372.BeroList-2.5.9@sob.htdig.org>
Date: Tue, 26 Jan 1999 14:19:25 -0600 (CST)
In-Reply-To: <36A92FF5.BeroList-2.5.5@sob.htdig.org> from "Geoff Hutchison" at Jan 22, 99 08:54:10 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Removing $Log$ from source files

According to Geoff Hutchison:
> Hans-Peter and I were discussing various schemes and intrigues for future
> ht://Dig development. He suggested removing the CVS messages at the tops of
> all the source files. I agreed, though it would obviously require going
> through all the files and ripping them out.
>
> Aside from that drawback, it would probably make the patch a bit bigger
> (but the tar file smaller). It doesn't change the source itself much.
>
> So... Another vote, same deal as last time, either send me a message
> personally or to the list.
>
> Geoff +1
> (Hans-Peter +1)

My vote is probably too late to matter, but I like Andrew's idea of incremental removals.

> BTW, the results of the previous vote were positive. Gilles patch goes in
> tonight. Have a great weekend everyone,

Yipeee! :)

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id NAA29069
	for <andrew@contigo.com>; Tue, 26 Jan 1999 13:29:23 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA05634;
	Tue, 26 Jan 1999 13:29:11 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AE33AA.BeroList-2.5.9@sob.htdig.org>
Date: Tue, 26 Jan 1999 15:28:29 -0600 (CST)
Cc: grdetil@scrc.umanitoba.ca (Gilles Detillieux)
In-Reply-To: <36AE2031.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 26, 99 03:03:12 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Re: What's left for 3.1.0

According to Geoff Hutchison:
> >Hmmm. That's strange, as htdig doesn't even look at the Content-length
> >header when retrieving from the HTTP server. It just reads until the
> >read() request returns 0 bytes (an EOF). Maybe this particular server,
> >at M.I.T. according to the bugs DB, wasn't closing the socket properly?
>
> That's what I was thinking. Since I don't have an actual address, I'm kinda
> stuck. I think our behavior should *avoid* the problem mentioned.

OK, how about the patch below?

> >> * htdig coredumps when calling strftime (PR#81)
> >which oddly has an address that's different than the mmap call - this leads
> >me to think that the memory corruption happened while processing the
> >zoneinfo file, so maybe he has a corrupt /usr/lib/zoneinfo/localtime?
>
> Now that's a good point. I could understand the prior problems when we got
> back a NULL and sent it on its way to blow up in our faces. But that's not
> happening (and I have a conditional to prevent it).

I was wondering about that conditional. If you declare "struct tm tm;", then isn't &tm guaranteed to be non-NULL? The variable is automatically allocated, so I don't see how its address could be NULL, regardless of what mystrptime puts into the structure. On the other hand, checking tm2, set by gmtime(), would make sense because it's a pointer. Mind you, I don't think gmtime would ever return NULL either.
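To make the distinction concrete, here is a minimal sketch (not taken from the ht://Dig source; the function names are made up for illustration). It shows why a NULL test on the address of an automatic struct tm can never fire, while the pointer returned by gmtime() is the value that can meaningfully be checked. The C standard does permit gmtime() to return a null pointer when the time cannot be converted, even if that is rare in practice.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: convert a time_t to broken-down UTC time and report
// failure instead of dereferencing a null pointer.
bool to_utc(std::time_t t, std::tm &out)
{
    std::tm *p = std::gmtime(&t);   // pointer result: this is the one to check
    if (p == nullptr)
        return false;
    out = *p;                       // copy out of gmtime's internal buffer
    return true;
}

// An automatic variable always has a valid, non-null address, so a test
// such as "if (&local == NULL)" is dead code.
bool address_of_local_is_null()
{
    std::tm local = {};
    std::tm *addr = &local;
    return addr == nullptr;
}
```

Calling to_utc(0, out) should succeed and leave out.tm_year at 70, whereas address_of_local_is_null() is always false.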

> >> * htsearch can coredump if a file in template_map doesn't exist
> >here. If the person who reported this problem can be persuaded to test
> >out the current snapshot or CVS tree, great, but otherwise I think this
> >problem is solved already.
>
> I would tend to agree here. I included the remark simply because I thought
> it needed another testing round before I was happy. I did that as well and
> it looks fine.
>
> >pattern would be wrong. I think this second usage should be changed over
> >to a separate attribute, e.g. remove_default_doc, which would be a string
> >list, and if empty, nothing would be removed. local_default_doc would
> >then revert to it's previous local_urls only function. E.g.:
>
> That about mirrors my thinking as well. I'd like to get Retriever to use a
> StringList, but it's not as easy as I'd like and I haven't had a chance to
> do it.

Yeah, handling multiple default documents for the local_urls stuff would be a little trickier (though not much), because you'd need to test each file name to see if it exists before going on to the next.

Here's my patch for the Content-Length header. What do you think?

--- ./htdig/Document.h.contlen Thu Dec 3 22:14:50 1998
+++ ./htdig/Document.h Tue Jan 26 14:24:59 1999
@@ -130,6 +130,7 @@
 String contentType;
 String authorization;
 String referer;
+ int contentLength;
 int document_length;
 time_t modtime;
 int max_doc_size;
--- ./htdig/Document.cc.contlen Mon Jan 18 16:58:35 1999
+++ ./htdig/Document.cc Tue Jan 26 14:47:37 1999
@@ -159,6 +159,7 @@
 contents.allocate(max_doc_size + 100);
 contentType = "";
+ contentLength = -1;
 if (u)
 {
 Url(u);
@@ -193,6 +194,7 @@
 Document::Reset()
 {
 contentType = 0;
+ contentLength = -1;
 if (url)
 delete url;
 url = 0;
@@ -515,16 +517,20 @@
 contents = 0;
 char docBuffer[8192];
 int bytesRead;
+ int bytesToGo = contentLength;
- while ((bytesRead = c.read(docBuffer, sizeof(docBuffer))) > 0)
- {
+ if (bytesToGo < 0 || bytesToGo > max_doc_size)
+ bytesToGo = max_doc_size;
+ while (bytesToGo > 0)
+ {
+ int len = bytesToGo<sizeof(docBuffer) ? bytesToGo : sizeof(docBuffer);
+ bytesRead = c.read(docBuffer, len);
+ if (bytesRead <= 0)
+ break;
 if (debug > 2)
 cout << "Read " << bytesRead << " from document\n";
- if (contents.length() + bytesRead > max_doc_size)
- bytesRead = max_doc_size - contents.length();
 contents.append(docBuffer, bytesRead);
- if (contents.length() >= max_doc_size)
- break;
+ bytesToGo -= bytesRead;
 }
 c.close();
 document_length = contents.length();
@@ -597,6 +603,12 @@
 strtok(line, " \t");
 modtime = getdate(strtok(0, "\n\t"));
 }
+ else if (contentLength == -1
+ && mystrncasecmp(line, "content-length:", 15) == 0)
+ {
+ strtok(line, " \t");
+ contentLength = atoi(strtok(0, "\n\t"));
+ }
 else if (mystrncasecmp(line, "content-type:", 13) == 0)
 {
 strtok(line, " \t");
@@ -676,6 +688,7 @@
 }
 fclose(f);
 document_length = contents.length();
+ contentLength = document_length;
 if (debug > 2)
 cout << "Read a total of " << document_length << " bytes\n";
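The idea behind the read loop in this patch can be tried in isolation. The sketch below is only an illustration under stated assumptions: FakeSource is a hypothetical stand-in for the connection object, and read_body is a made-up name, not a function in htdig. It reads at most the smaller of the advertised Content-Length and max_doc_size, so it stops without waiting for an EOF that a misbehaving server may never send.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the connection; read() returns the number of
// bytes copied into buf, or 0 once the data is exhausted.
struct FakeSource
{
    std::string data;
    std::size_t pos = 0;

    int read(char *buf, int len)
    {
        std::size_t n = std::min<std::size_t>(len, data.size() - pos);
        std::memcpy(buf, data.data() + pos, n);
        pos += n;
        return static_cast<int>(n);
    }
};

// Read at most min(content_length, max_doc_size) bytes. A negative
// content_length means no Content-Length header was seen.
std::string read_body(FakeSource &src, int content_length, int max_doc_size)
{
    int bytes_to_go = content_length;
    if (bytes_to_go < 0 || bytes_to_go > max_doc_size)
        bytes_to_go = max_doc_size;

    std::string contents;
    char buffer[8];                 // deliberately tiny so the loop iterates
    while (bytes_to_go > 0)
    {
        int len = std::min(bytes_to_go, static_cast<int>(sizeof(buffer)));
        int bytes_read = src.read(buffer, len);
        if (bytes_read <= 0)        // EOF or error: stop early
            break;
        contents.append(buffer, bytes_read);
        bytes_to_go -= bytes_read;
    }
    return contents;
}
```

With a 19-byte source, a content_length of 16 yields the first 16 bytes, a content_length of -1 yields everything up to max_doc_size, and a content_length larger than max_doc_size is clamped to max_doc_size.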

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id NAA29700
	for <andrew@contigo.com>; Tue, 26 Jan 1999 13:39:40 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA05682;
	Tue, 26 Jan 1999 13:39:35 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AE3619.BeroList-2.5.9@sob.htdig.org>
Date: Tue, 26 Jan 1999 15:38:56 -0600 (CST)
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] pdf_parser support broken in 012499 snapshot

My apologies if this has already been addressed, but in testing the 012499 snapshot of htdig, I found the pdf_parser support broken. Here's my fix:

--- ./htdig/PDF.cc.pdfbug Sun Jan 17 15:12:05 1999
+++ ./htdig/PDF.cc Tue Jan 26 15:34:36 1999
@@ -111,13 +111,13 @@
 acroread = "acroread";
 // Check for existance of acroread program! (if not, return)
- struct stat stat_buf;
+ //struct stat stat_buf;
 // Check that it exists, and is a regular file.
- if ((stat(acroread, &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode))
- {
- printf("PDF::parse: cannot find acroread\n");
- return;
- }
+ //if ((stat(acroread, &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode))
+ // {
+ // printf("PDF::parse: cannot find acroread\n");
+ // return;
+ // }
 // Write the pdf contents in a temp file to give it to acroread
@@ -151,7 +151,12 @@
 // acroread << " -toPostScript " << pdfName << " " << tmpdir << " 2>&1";
 acroread << " " << pdfName << " " << psName << " 2>&1";
- system(acroread);
+ if (system(acroread))
+ {
+ printf("PDF::parse: error running pdf_parser on %s\n", url.get());
+ unlink(pdfName);
+ return;
+ }
 FILE* psFile = fopen(psName, "r");
 if (!psFile)
 {
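The second hunk replaces an up-front stat() test with a check of the status returned by system(). A minimal sketch of that pattern follows; run_converter is a hypothetical name, and the sketch assumes a POSIX-like environment where system() hands the string to a shell.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Returns true only if the command could be started and exited with
// status 0. A missing or failing converter then shows up as a non-zero
// status rather than being ignored.
bool run_converter(const std::string &command)
{
    int status = std::system(command.c_str());
    return status == 0;
}
```

One point in favour of this design is that the command string may include arguments or rely on a PATH lookup, in which case a stat() on the whole string would fail even though the command itself runs fine.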

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id OAA31004
	for <andrew@contigo.com>; Tue, 26 Jan 1999 14:04:44 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id OAA05783;
	Tue, 26 Jan 1999 14:04:36 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AE3BF9.BeroList-2.5.9@sob.htdig.org>
Date: Tue, 26 Jan 1999 16:03:54 -0600 (CST)
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] missing patches from 012499 snapshot

With all the patches I just sent, maybe this isn't the best time to resend patches from earlier. However, I noticed that three of my patches from early last week were not included in the 012499 snapshot, so I thought I'd point it out. Here they are:

This one corresponds to the change to expandVariables, to treat braces as equivalent to parentheses, and is for the handling of the HTSEARCH_RESULTS pseudo-variable.

--- ./htsearch/Display.cc.wrapper3 Mon Jan 18 17:01:34 1999
+++ ./htsearch/Display.cc Thu Jan 21 13:17:27 1999
@@ -222,8 +222,10 @@
                     header = h;
                     p[-1] = '\0';
                 }
- else if (p > h+1 && p[-1] == '(' && p[-2] == '$' &&
- p[strlen(wrap_sepr)] == ')')
+ else if (p > h+1 && p[-2] == '$' &&
+ (p[-1] == '(' || p[-1] == '{') &&
+ (p[strlen(wrap_sepr)] == ')' ||
+ p[strlen(wrap_sepr)] == '}'))
                 {
                     footer = p + strlen(wrap_sepr) + 1;
                     header = h;
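As noted when this patch was first posted, this version accepts mismatched delimiters such as ${VAR) or $(VAR}. The two small predicates below (hypothetical names, for illustration only) contrast the lax test used in this patch with the strict pairing from the alternative patch posted on 21 Jan.

```cpp
#include <cassert>

// Lax check, as in the patch above: any opener with any closer.
bool delimiters_match_lax(char open, char close)
{
    return (open == '(' || open == '{') && (close == ')' || close == '}');
}

// Strict check: the closer has to pair with the opener.
bool delimiters_match_strict(char open, char close)
{
    return (open == '(' && close == ')') || (open == '{' && close == '}');
}
```

Both accept $(VAR) and ${VAR}; only the lax one accepts ${VAR).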

This is a small addition to the documentation for my earlier wrapper patch.

--- ./htdoc/hts_general.html.wrapper2 Thu Dec 10 21:26:25 1998
+++ ./htdoc/hts_general.html Mon Jan 18 17:11:10 1999
@@ -44,6 +44,13 @@
 The default search results footer file
 </dd>
 <dt>
+ COMMON_DIR/wrapper.html
+ </dt>
+ <dd>
+ The default search results wrapper file, that contains the
+ header and footer together in one file
+ </dd>
+ <dt>
 COMMON_DIR/nomatch.html
 </dt>
 <dd>
--- ./htdoc/hts_templates.html.wrapper2 Mon Jan 18 17:01:40 1999
+++ ./htdoc/hts_templates.html Mon Jan 18 17:13:17 1999
@@ -33,6 +33,10 @@
 search_results_footer</a>
 </li>
 <li>
+ <a href="attrs.html#search_results_wrapper">
+ search_results_wrapper</a>
+ </li>
+ <li>
 <a href="attrs.html#nothing_found_file">
 nothing_found_file</a>
 </li>

This one changes the label for the new sort option to something which I think is more grammatically correct.

--- ./htdoc/config.html.sort4 Mon Jan 18 17:01:40 1999
+++ ./htdoc/config.html Mon Jan 18 17:06:19 1999
@@ -222,7 +222,7 @@
 &lt;option value=builtin-long&gt;Long
 &lt;option value=builtin-short&gt;Short
 &lt;/select&gt;
-Sort: &lt;select name=sort&gt;
+Sort by: &lt;select name=sort&gt;
 &lt;option value=score&gt;Score
 &lt;option value=time&gt;Time
 &lt;option value=title&gt;Title
@@ -283,7 +283,7 @@
 &lt;input type=hidden name=exclude value="$(EXCLUDE)"&gt;
 Match: $(METHOD)
 Format: $(FORMAT)
-Sort: $(SORT)
+Sort by: $(SORT)
 &lt;br&gt;
 Refine search:
 &lt;input type="text" size="30" name="words" value="$(WORDS)"&gt;
@@ -356,7 +356,7 @@
 &lt;input type=hidden name=exclude value="$(EXCLUDE)"&gt;
 Match: $(METHOD)
 Format: $(FORMAT)
-Sort: $(SORT)
+Sort by: $(SORT)
 &lt;br&gt;
 Refine search:
 &lt;input type="text" size="30" name="words" value="$(WORDS)"&gt;
@@ -397,7 +397,7 @@
 &lt;input type=hidden name=exclude value="$(EXCLUDE)"&gt;
 Match: $(METHOD)
 Format: $(FORMAT)
-Sort: $(SORT)
+Sort by: $(SORT)
 &lt;br&gt;
 Refine search:
 &lt;input type="text" size="30" name="words" value="$(WORDS)"&gt;
--- ./installdir/header.html.sort4 Mon Jan 18 17:01:34 1999
+++ ./installdir/header.html Mon Jan 18 17:05:58 1999
@@ -10,7 +10,7 @@
 <input type=hidden name=exclude value="$(EXCLUDE)">
 Match: $(METHOD)
 Format: $(FORMAT)
-Sort: $(SORT)
+Sort by: $(SORT)
 <br>
 Refine search:
 <input type="text" size="30" name="words" value="$(WORDS)">
--- ./installdir/nomatch.html.sort4 Mon Jan 18 17:01:34 1999
+++ ./installdir/nomatch.html Mon Jan 18 17:06:03 1999
@@ -23,7 +23,7 @@
 <input type=hidden name=exclude value="$(EXCLUDE)">
 Match: $(METHOD)
 Format: $(FORMAT)
-Sort: $(SORT)
+Sort by: $(SORT)
 <br>
 Refine search:
 <input type="text" size="30" name="words" value="$(WORDS)">
--- ./installdir/search.html.sort4 Mon Jan 18 17:01:34 1999
+++ ./installdir/search.html Mon Jan 18 17:06:06 1999
@@ -21,7 +21,7 @@
 <option value=builtin-long>Long
 <option value=builtin-short>Short
 </select>
-Sort: <select name=sort>
+Sort by: <select name=sort>
 <option value=score>Score
 <option value=time>Time
 <option value=title>Title
--- ./installdir/wrapper.html.sort4 Mon Jan 18 17:01:34 1999
+++ ./installdir/wrapper.html Mon Jan 18 17:06:10 1999
@@ -10,7 +10,7 @@
 <input type=hidden name=exclude value="$(EXCLUDE)">
 Match: $(METHOD)
 Format: $(FORMAT)
-Sort: $(SORT)
+Sort by: $(SORT)
 <br>
 Refine search:
 <input type="text" size="30" name="words" value="$(WORDS)">

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id RAA07379
	for <andrew@contigo.com>; Tue, 26 Jan 1999 17:00:09 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id QAA06556;
	Tue, 26 Jan 1999 16:59:55 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AE651B.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 26 Jan 1999 19:58:06 -0400
Subject: [htdig3-dev] Fast progress

Thanks mostly to Gilles and his bunch of patches this afternoon, the list of issues is much shorter.

As we're going to turn towards documentation now (as if we haven't done some already), I need some... Didier, could you give me a brief writeup of the suspend/resume feature? Leo, could you give me a brief summary of bad_querystr and allow_in_form (and anything else I forgot)?

Here's the current list (I marked two of the issues as I'll fix them):

REPORTED SHOWSTOPPERS:
(none at the moment)

OTHER BUGS:
* Problems with valid_punctuation and excerpt highlighting (e.g. "I'll"
  isn't highlighted in excerpts)

ISSUES:
* Fix SGMLEntities to use StringMatch
* Move DocumentRef compression to DocHead methods (Geoff)
* Run db merge code with sort -m for performance boost (Geoff)

DOCUMENTATION:
* TODO
* RELEASE
* THANKS
* Various new options -> attrs, cf_byname, cf_byprog

BTW, the excerpt highlighting problem can be solved by taking the *original* user query (with punctuation) and using that for the search in the excerpt. If no one gets to it in a day or two, I'll see what I wrote down about that.

-Geoff

From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id NAA19488
	for <andrew@contigo.com>; Wed, 27 Jan 1999 13:33:54 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA14037;
	Wed, 27 Jan 1999 13:33:35 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AF8652.BeroList-2.5.9@sob.htdig.org>
Date: Wed, 27 Jan 1999 15:32:09 -0600 (CST)
In-Reply-To: <36AE651B.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 26, 99 07:58:06 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] patches to speed up htdig

Hi again. I went through the code looking for places to take config dictionary lookups out of loops, to speed up htdig. Here's the patch I came up with.

It includes another little change I made to fix what looked like a bug to me. The handling of the keywords_meta_tag_names attributes was not consistent with the other string list handling. As it was, it seems it needed to have the list elements separated by no more than one space, as each space was replaced with a '|'. Maybe someone can correct me if I'm wrong, but I think using the StringList class for this attribute, and allowing any number of spaces or tabs as separators, is a better way of dealing with it. I don't currently allow commas as separators, to be consistent with other string list attributes, but that could easily be added to the list of separators if someone feels it should be.
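The difference between the two ways of building the match pattern can be seen with a short sketch. It uses std::string rather than the ht://Dig String and StringList classes, and the function names are made up for illustration. old_pattern mimics the replace-then-remove approach described above, while split_list plus join mimics splitting on any run of spaces or tabs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Old approach: every single space becomes '|', then tabs, commas and line
// ends are dropped. Two adjacent separators leave an empty alternative.
std::string old_pattern(std::string names)
{
    for (std::size_t i = 0; i < names.size(); ++i)
        if (names[i] == ' ')
            names[i] = '|';
    std::string out;
    for (std::size_t i = 0; i < names.size(); ++i)
        if (std::string(",\t\r\n").find(names[i]) == std::string::npos)
            out += names[i];
    return out;
}

// New approach, step 1: split on any run of separator characters.
std::vector<std::string> split_list(const std::string &text,
                                    const std::string &seps = " \t")
{
    std::vector<std::string> items;
    std::string::size_type start = text.find_first_not_of(seps);
    while (start != std::string::npos)
    {
        std::string::size_type end = text.find_first_of(seps, start);
        items.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(seps, end);
    }
    return items;
}

// New approach, step 2: join the items with a single separator character.
std::string join(const std::vector<std::string> &items, char sep)
{
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i)
    {
        if (i > 0)
            out += sep;
        out += items[i];
    }
    return out;
}
```

For the input "keywords \t htdig-keywords", old_pattern gives "keywords||htdig-keywords", with an empty alternative in the middle, while join(split_list(...), '|') gives "keywords|htdig-keywords".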

--- ./htcommon/WordList.cc.speed Wed Jan 20 19:59:12 1999
+++ ./htcommon/WordList.cc Tue Jan 26 16:26:56 1999
@@ -144,11 +144,12 @@
 int control = 0;
 int alpha = 0;
 static int allow_numbers = config.Boolean("allow_numbers", 0);
+ static int minimum_word_length = config.Value("minimum_word_length", 3);
 if (badwords.Exists(word))
 return 0;
- if (strlen(word) < config.Value("minimum_word_length"))
+ if (strlen(word) < minimum_word_length)
 return 0;
 while (word && *word)
@@ -266,6 +267,7 @@
 char *word;
 String new_word;
 char *valid_punctuation = config["valid_punctuation"];
+ int minimum_word_length = config.Value("minimum_word_length", 3);
 while (fl && fgets(buffer, sizeof(buffer), fl))
 {
@@ -277,7 +279,7 @@
 new_word = word; // We need to clean it up before we add it
 new_word.lowercase(); // Just in case someone enters an odd one
 new_word.remove(valid_punctuation);
- if (new_word.length() >= config.Value("minimum_word_length", 3))
+ if (new_word.length() >= minimum_word_length)
 badwords.Add(new_word, 0);
 }
 }
--- ./htcommon/DocumentRef.cc.speed Sat Jan 23 21:13:42 1999
+++ ./htcommon/DocumentRef.cc Tue Jan 26 17:22:46 1999
@@ -311,7 +311,7 @@
 addstring(DOC_NOTIFICATION, s, docNotification);
 addstring(DOC_SUBJECT, s, docSubject);
 #ifdef HAVE_LIBZ
- int cf=config.Value("compression_level",0);
+ static int cf=config.Value("compression_level",0);
 if (cf) {
 //
 // Now compress s into c_s
@@ -363,7 +363,8 @@
 char *s;
 char *end;
 String c_s;
- if (config.Value("compression_level",0)) {
+ static int cf=config.Value("compression_level",0);
+ if (cf) {
 // Decompress stream
 z_stream d_stream; /* decompression stream */
@@ -594,9 +595,11 @@
 words->DocumentID(docID);
 // Parse words, taking care of valid_punctuation.
- char *p = desc;
- char *valid_punctuation = config["valid_punctuation"];
- int minimum_word_length = config.Value("minimum_word_length", 3);
+ char *p = desc;
+ static char *valid_punctuation = config["valid_punctuation"];
+ static int minimum_word_length = config.Value("minimum_word_length", 3);
+ static double description_factor = config.Double("description_factor");
+ static int max_descriptions = config.Value("max_descriptions", 5);
 // Not restricted to this size, just used as a hint.
 String word(MAX_WORD_LENGTH);
@@ -616,7 +619,7 @@
 if (word.length() >= minimum_word_length)
 // The wordlist takes care of lowercasing; just add it.
- words->Word(word, 0, 0, config.Double("description_factor"));
+ words->Word(word, 0, 0, description_factor);
 // No need to count in valid_punctuation for the beginning-char.
 while (*p && !isalnum(*p))
@@ -627,7 +630,7 @@
 words->Flush();
 // Now are we at the max_description limit?
- if (descriptions.Count() >= config.Value("max_descriptions", 5))
+ if (descriptions.Count() >= max_descriptions)
 return;
 descriptions.Start_Get();
--- ./htdig/HTML.cc.speed Thu Jan 14 22:52:19 1999
+++ ./htdig/HTML.cc Tue Jan 26 17:23:37 1999
@@ -100,6 +100,7 @@
 #include <Configuration.h>
 #include <ctype.h>
 #include <StringMatch.h>
+#include <StringList.h>
 #include <URL.h>
 static StringMatch tags;
@@ -131,11 +132,15 @@
 hrefMatch.IgnoreCase();
 hrefMatch.Pattern("href");
- String keywordNames = config["keywords_meta_tag_names"];
- keywordNames.replace(' ', '|');
- keywordNames.remove(",\t\r\n");
+ //String keywordNames = config["keywords_meta_tag_names"];
+ //keywordNames.replace(' ', '|');
+ //keywordNames.remove(",\t\r\n");
+ //keywordsMatch.IgnoreCase();
+ //keywordsMatch.Pattern(keywordNames);
+ StringList keywordNames(config["keywords_meta_tag_names"], " \t");
 keywordsMatch.IgnoreCase();
- keywordsMatch.Pattern(keywordNames);
+ keywordsMatch.Pattern(keywordNames.Join('|'));
+ keywordNames.Release();
 word = 0;
 href = 0;
@@ -203,8 +208,8 @@
 // Filter out section marked to be ignored for indexing.
 // This can contain any HTML.
 //
- char *skip_start = config["noindex_start"];
- char *skip_end = config["noindex_end"];
+ static char *skip_start = config["noindex_start"];
+ static char *skip_end = config["noindex_end"];
 if (strncmp((char *)position, skip_start, strlen(skip_start)) == 0)
 {
 q = (unsigned char*)strstr((char *)position, skip_end);
--- ./htdig/ExternalParser.cc.speed Wed Jan 20 12:08:29 1999
+++ ./htdig/ExternalParser.cc Tue Jan 26 16:47:00 1999
@@ -239,13 +239,15 @@
 // (or class). Which should not stop anybody from
 // finding a better solution.
 // For now, there is duplicated code.
- StringMatch keywordsMatch;
- String keywordNames = config["keywords_meta_tag_names"];
-
- keywordNames.replace(' ', '|');
- keywordNames.remove(",\t\r\n");
- keywordsMatch.IgnoreCase();
- keywordsMatch.Pattern(keywordNames);
+ static StringMatch *keywordsMatch = 0;
+ if (!keywordsMatch)
+ {
+ StringList kn(config["keywords_meta_tag_names"], " \t");
+ keywordsMatch = new StringMatch();
+ keywordsMatch->IgnoreCase();
+ keywordsMatch->Pattern(kn.Join('|'));
+ kn.Release();
+ }
 // <URL:http://www.w3.org/MarkUp/html-spec/html-spec_5.html#SEC5.2.5>
 // says that the "name" attribute defaults to
@@ -280,7 +282,7 @@
 //
 if (*name != '\0' && *content != '\0')
 {
- if (keywordsMatch.CompareWord(name))
+ if (keywordsMatch->CompareWord(name))
 {
 char *w = strtok(content, " ,\t\r");
 while (w)
--- ./htdig/SGMLEntities.cc.speed Tue Jan 19 23:41:20 1999
+++ ./htdig/SGMLEntities.cc Tue Jan 26 17:03:15 1999
@@ -215,6 +215,9 @@
 {
 String entity;
 unsigned char *orig = entityStart;
+ static int translate_quot = config.Boolean("translate_quot");
+ static int translate_amp = config.Boolean("translate_amp");
+ static int translate_lt_gt = config.Boolean("translate_lt_gt");
 if (*entityStart == '&')
 entityStart++; // Don't need the '&' that starts the entity
@@ -225,7 +228,7 @@
 entity << *entityStart++;
 }
- if ( !config.Boolean("translate_quot") )
+ if ( !translate_quot )
 {
 //
 // Do NOT translate entities for '"' (quote).
@@ -238,7 +241,7 @@
 }
 }
- if ( !config.Boolean("translate_amp") )
+ if ( !translate_amp )
 {
 //
 // Do NOT translate entities for '&' since they can
@@ -252,7 +255,7 @@
 }
 }
- if ( !config.Boolean("translate_lt_gt") )
+ if ( !translate_lt_gt )
 {
 //
 // Do NOT translate entities for '<' and '>' since they can
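The technique used throughout this patch is to cache a configuration value in a function-local static, so the dictionary lookup happens once instead of on every call. The sketch below illustrates it with a hypothetical FakeConfig that counts lookups; it is a stand-in for illustration, not the ht://Dig Configuration class.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the configuration dictionary; counts lookups.
struct FakeConfig
{
    std::map<std::string, int> values;
    int lookups = 0;

    int Value(const std::string &name, int fallback)
    {
        ++lookups;
        std::map<std::string, int>::const_iterator it = values.find(name);
        return it == values.end() ? fallback : it->second;
    }
};

FakeConfig config;

// One dictionary lookup per call.
bool long_enough_uncached(const std::string &word)
{
    return static_cast<int>(word.length())
           >= config.Value("minimum_word_length", 3);
}

// One dictionary lookup for the lifetime of the process: the static is
// initialised the first time control passes through its declaration.
bool long_enough_cached(const std::string &word)
{
    static int minimum_word_length = config.Value("minimum_word_length", 3);
    return static_cast<int>(word.length()) >= minimum_word_length;
}
```

The trade-off is that a cached value never notices a later change to the configuration, which is harmless for a program that reads its configuration once at start-up but would matter if the configuration were reloaded.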

The results of these patches were surprising. I expected a speed-up, but this about halved the user CPU time in htdig! It reduced the total elapsed time by about 25%. The effect on other utilities (htmerge, htfuzzy) was negligible.

This was the 011799 snapshot:
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 392 documents
49.00user 4.32system 1:08.25elapsed 78%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (20747major+18666minor)pagefaults 0swaps

This was the 012499 snapshot:
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 392 documents
50.89user 4.37system 1:06.18elapsed 83%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (19989major+18902minor)pagefaults 0swaps

This was the 012499 snapshot, with my speed-up patches:
htdig: Run complete
htdig: 1 server seen:
htdig: www.scrc.umanitoba.ca:80 392 documents
22.25user 4.64system 0:43.28elapsed 62%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (20668major+19009minor)pagefaults 0swaps

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:16 1999
Return-Path: <andrew@contigo.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id OAA21767
	for <andrew@contigo.com>; Wed, 27 Jan 1999 14:12:40 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id OAA14159;
	Wed, 27 Jan 1999 14:12:48 -0800 (PST)
From: Andrew Scherpbier <andrew@contigo.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AF8F72.BeroList-2.5.9@sob.htdig.org>
Sender: turtle@contigo.com
Date: Wed, 27 Jan 1999 14:11:41 -0800
Organization: Contigo Software
X-Mailer: Mozilla 4.5 [en] (X11; I; Linux 2.1.131 i686)
X-Accept-Language: en
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Subject: [htdig3-dev] Uhm...  We're in 1999 now...

I just noticed that all the docs have

"ht://Dig © 1995-1998 Andrew Scherpbier"

Any suggestions on what to do about this? It would be ok with me if the copyright stuff was removed from the pages. (I still get lots of email directly because of that! :-))

At Contigo we solved the copyright year stuff by making the last year the current year through SSI.

-- 
Andrew Scherpbier <andrews@contigo.com>
Contigo Software <http://www.contigo.com/>
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id OAA23908
	for <andrew@contigo.com>; Wed, 27 Jan 1999 14:58:04 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id OAA14284;
	Wed, 27 Jan 1999 14:58:29 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AF9A17.BeroList-2.5.9@sob.htdig.org>
Date: Wed, 27 Jan 1999 16:57:13 -0600 (CST)
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] various htdoc fixes

OK, I didn't do anything about copyright dates, or even last-modified dates in the htdoc files, but here are a number of fixes I've put together.

--- ./htdoc/attrs.html.docfix1 Wed Jan 20 21:28:21 1999
+++ ./htdoc/attrs.html Wed Jan 27 16:33:40 1999
@@ -1004,7 +1004,7 @@
 </dt>
 <dd>
 <a href="htfuzzy.html">htfuzzy</a> and <a href=
- "htsearch.html" target="_top"></a>
+ "htsearch.html" target="_top">htsearch</a>
 </dd>
 <dt>
 <em>default:</em>
@@ -1056,7 +1056,7 @@
 </dt>
 <dd>
 <a href="htfuzzy.html">htfuzzy</a> and <a href=
- "htsearch.html" target="_top"></a>
+ "htsearch.html" target="_top">htsearch</a>
 </dd>
 <dt>
 <em>default:</em>
@@ -1900,7 +1900,7 @@
 <em>type:</em>
 </dt>
 <dd>
- list
+ string list
 </dd>
 <dt>
 <em>used by:</em>
@@ -2859,7 +2859,7 @@
 <em>type:</em>
 </dt>
 <dd>
- string list
+ quoted string list
 </dd>
 <dt>
 <em>used by:</em>
@@ -3252,6 +3252,75 @@
 <hr>
 <dl>
 <dt>
+ <strong><a name="no_page_number_text">
+ no_page_number_text</a></strong>
+ </dt>
+ <dd>
+ <dl>
+ <dt>
+ <em>type:</em>
+ </dt>
+ <dd>
+ quoted string list
+ </dd>
+ <dt>
+ <em>used by:</em>
+ </dt>
+ <dd>
+ <a href="htsearch.html" target="_top">htsearch</a>
+ </dd>
+ <dt>
+ <em>default:</em>
+ </dt>
+ <dd>
+ <em>&lt;empty&gt;</em>
+ </dd>
+ <dt>
+ <em>description:</em>
+ </dt>
+ <dd>
+ The text strings in this list will be used when putting
+ together the PAGELIST variable, for use in templates or
+ the <a href="#search_results_footer">search_results_footer</a>
+ file, when search results fit on more than page. The PAGELIST
+ is the list of links at the bottom of the search results page.
+ There should be as many strings in the list as there are
+ pages allowed by the <a href="#maximum_pages">maximum_pages</a>
+ attribute. If there are not enough, or the list is empty,
+ the page numbers alone will be used as the text for the links.
+ An entry from this list is used for the current page, as the
+ current page is shown in the page list without a hypertext link,
+ while entries from the <a href="#page_number_text">
+ page_number_text</a> list are used for the links to other pages.
+ The text strings can contain HTML tags to highlight page numbers
+ or embed images. The strings need to be quoted if they contain
+ spaces.
+ </dd>
+ <dt>
+ <em>example:</em>
+ </dt>
+ <dd>
+ <table border="0">
+ <tr>
+ <td valign="top">
+ no_page_number_text:
+ </td>
+ <td>
+ &lt;strong&gt;1&lt;/strong&gt; &lt;strong&gt;2&lt;/strong&gt; \<br>
+ &lt;strong&gt;3&lt;/strong&gt; &lt;strong&gt;4&lt;/strong&gt; \<br>
+ &lt;strong&gt;5&lt;/strong&gt; &lt;strong&gt;6&lt;/strong&gt; \<br>
+ &lt;strong&gt;7&lt;/strong&gt; &lt;strong&gt;8&lt;/strong&gt; \<br>
+ &lt;strong&gt;9&lt;/strong&gt; &lt;strong&gt;10&lt;/strong&gt;
+ </td>
+ </tr>
+ </table>
+ </dd>
+ </dl>
+ </dd>
+ </dl>
+ <hr>
+ <dl>
+ <dt>
 <strong><a name="no_prev_page_text">
 no_prev_page_text</a></strong>
 </dt>
@@ -3385,6 +3454,75 @@
 <hr>
 <dl>
 <dt>
+ <strong><a name="page_number_text">
+ page_number_text</a></strong>
+ </dt>
+ <dd>
+ <dl>
+ <dt>
+ <em>type:</em>
+ </dt>
+ <dd>
+ quoted string list
+ </dd>
+ <dt>
+ <em>used by:</em>
+ </dt>
+ <dd>
+ <a href="htsearch.html" target="_top">htsearch</a>
+ </dd>
+ <dt>
+ <em>default:</em>
+ </dt>
+ <dd>
+ <em>&lt;empty&gt;</em>
+ </dd>
+ <dt>
+ <em>description:</em>
+ </dt>
+ <dd>
+ The text strings in this list will be used when putting
+ together the PAGELIST variable, for use in templates or
+ the <a href="#search_results_footer">search_results_footer</a>
+ file, when search results fit on more than page. The PAGELIST
+ is the list of links at the bottom of the search results page.
+ There should be as many strings in the list as there are
+ pages allowed by the <a href="#maximum_pages">maximum_pages</a>
+ attribute. If there are not enough, or the list is empty,
+ the page numbers alone will be used as the text for the links.
+ Entries from this list are used for the links to other pages,
+ while an entry from the <a href="#no_page_number_text">
+ no_page_number_text</a> list is used for the current page, as the
+ current page is shown in the page list without a hypertext link.
+ The text strings can contain HTML tags to highlight page numbers
+ or embed images. The strings need to be quoted if they contain
+ spaces.
+ </dd>
+ <dt>
+ <em>example:</em>
+ </dt>
+ <dd>
+ <table border="0">
+ <tr>
+ <td valign="top">
+ page_number_text:
+ </td>
+ <td>
+ &lt;em&gt;1&lt;/em&gt; &lt;em&gt;2&lt;/em&gt; \<br>
+ &lt;em&gt;3&lt;/em&gt; &lt;em&gt;4&lt;/em&gt; \<br>
+ &lt;em&gt;5&lt;/em&gt; &lt;em&gt;6&lt;/em&gt; \<br>
+ &lt;em&gt;7&lt;/em&gt; &lt;em&gt;8&lt;/em&gt; \<br>
+ &lt;em&gt;9&lt;/em&gt; &lt;em&gt;10&lt;/em&gt;
+ </td>
+ </tr>
+ </table>
+ </dd>
+ </dl>
+ </dd>
+ </dl>
+ <hr>
+ <dl>
+ <dt>
 <strong><a name="pdf_parser">
 pdf_parser</a></strong>
 </dt>
@@ -3814,7 +3952,7 @@
 between.
 </dd>
 <dt>
- <b>PAGEHEADER</b>
+ PAGEHEADER
 </dt>
 <dd>
 This expands to either the value of the <a href=
@@ -4243,7 +4381,7 @@
 <em>type:</em>
 </dt>
 <dd>
- string list
+ quoted string list
 </dd>
 <dt>
 <em>used by:</em>
--- ./htdoc/hts_templates.html.docfix1 Tue Jan 26 15:45:20 1999
+++ ./htdoc/hts_templates.html Wed Jan 27 15:45:38 1999
@@ -121,7 +121,7 @@
 <dd>
 A list of URL text descriptions for the matched document. The
 entries in the list are separated by &lt;br&gt;. These are the
- text used between <a href="..."></a> tags.
+ text used between the &lt;a href...&gt; and &lt;/a&gt;tags.
 </dd>
 <dt>
 <b>DOCID</b>
--- ./htdoc/cf_byname.html.docfix1 Wed Jan 20 21:28:26 1999
+++ ./htdoc/cf_byname.html Wed Jan 27 16:36:01 1999
@@ -185,6 +185,8 @@
 <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#no_page_list_header">no_page_list_header</a><br>
 <img src="dot.gif" alt="*"> <a target="body" href=
+ "attrs.html#no_page_number_text">no_page_number_text</a><br>
+ <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#no_prev_page_text">no_prev_page_text</a><br>
 <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#nothing_found_file">nothing_found_file</a><br>
@@ -192,6 +194,8 @@
 <b>P</b> <font face="helvetica,arial" size="2"><br>
 <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#page_list_header">page_list_header</a><br>
+ <img src="dot.gif" alt="*"> <a target="body" href=
+ "attrs.html#page_number_text">page_number_text</a><br>
 <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#pdf_parser">pdf_parser</a><br>
 <img src="dot.gif" alt="*"> <a target="body" href=
--- ./htdoc/cf_byprog.html.docfix1 Wed Jan 20 21:28:34 1999
+++ ./htdoc/cf_byprog.html Wed Jan 27 16:36:33 1999
@@ -239,11 +239,15 @@
 <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#no_page_list_header">no_page_list_header</a><br>
 <img src="dot.gif" alt="*"> <a target="body" href=
+ "attrs.html#no_page_number_text">no_page_number_text</a><br>
+ <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#no_prev_page_text">no_prev_page_text</a><br>
 <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#nothing_found_file">nothing_found_file</a><br>
 <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#page_list_header">page_list_header</a><br>
+ <img src="dot.gif" alt="*"> <a target="body" href=
+ "attrs.html#page_number_text">page_number_text</a><br>
 <img src="dot.gif" alt="*"> <a target="body" href=
 "attrs.html#prefix_match_character">prefix_match_character</a><br>
 <img src="dot.gif" alt="*"> <a target="body" href=

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id PAA24832
	for <andrew@contigo.com>; Wed, 27 Jan 1999 15:12:48 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id PAA14311;
	Wed, 27 Jan 1999 15:13:18 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AF9D8F.BeroList-2.5.9@sob.htdig.org>
Date: Wed, 27 Jan 1999 17:12:05 -0600 (CST)
In-Reply-To: <36AF9A17.BeroList-2.5.9@sob.htdig.org> from "Gilles Detillieux" at Jan 27, 99 04:57:13 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] minor fix to star_patterns handling

While looking over how various string lists are handled in the code, I stumbled onto what looks like a bug in htsearch/Display.cc, in its handling of the star_patterns attribute. It's not advancing tokens in pairs correctly, so instead of turning a list of "1 2 3 4" into the mappings 1->2, 3->4, it seems instead it would generate 1->2, 2->3, 3->4. It probably never showed up, because the even entries in the list are .gif file names, which would never match URLs that the search found, but there's no point in generating the superfluous mappings. Here's a trivial fix:

--- ./htsearch/Display.cc.starfix Tue Jan 26 15:45:38 1999
+++ ./htsearch/Display.cc Wed Jan 27 17:01:08 1999
@@ -775,6 +775,8 @@
         //
         token = strtok(0, " \t\r\n");
         URLimageList.Add(new String(token));
+        if (token)
+            token = strtok(0, " \t\r\n");
     }
     pattern.chop(1);
     URLimage.Pattern(pattern);
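
To make the pairing explicit, here is a minimal standalone sketch of walking a whitespace-separated list two tokens at a time. It is not the Display.cc code: `pair_tokens` is a hypothetical helper, and it uses `std::string` and `std::vector` rather than the ht://Dig `String` and `List` classes. The second `strtok` call inside the loop plays the role of the extra advance that the patch adds.

```cpp
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Display.cc: split a whitespace-separated
// list into (pattern, image) pairs, consuming two tokens per iteration.
std::vector<std::pair<std::string, std::string>> pair_tokens(std::string list)
{
    std::vector<std::pair<std::string, std::string>> pairs;
    const char *sep = " \t\r\n";

    // strtok modifies its argument, so we work on our own copy of the list.
    char *token = std::strtok(&list[0], sep);
    while (token)
    {
        std::string pattern = token;
        token = std::strtok(nullptr, sep);      // step to the image name
        if (!token)
            break;                              // odd-length list: drop the unpaired entry
        pairs.emplace_back(pattern, token);
        token = std::strtok(nullptr, sep);      // step past the image to the next pattern
    }
    return pairs;
}
```

Without the last `strtok` call, the image name of one pair would be reused as the pattern of the next, which is the 1->2, 2->3, 3->4 behaviour described above.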

Geoff, once all the recent patches are in the source tree, could you please make another snapshot, so I can do further testing of the latest changes? Thanks.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id QAA03398
	for <andrew@contigo.com>; Wed, 27 Jan 1999 16:47:32 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id QAA14653;
	Wed, 27 Jan 1999 16:48:01 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AFB3C4.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36AF8F72.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Date: Wed, 27 Jan 1999 19:42:01 -0400
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sob.htdig.org id QAA14649
Subject: [htdig3-dev] Copyright (was uhm...  We're in 1999 now...)

>I just noticed that all the docs have
>
>"ht://Dig © 1995-1998 Andrew Scherpbier"
>
>Any suggestions on what to do about this?
>It would be ok with me if the copyright stuff was removed from the pages. (I
>still get lots of email directly because of that! :-))

I've updated the README and whatever source files I've touched. If anyone wants to crunch sed on the htdocs folder, I'd appreciate it.

But this brings up a good point... I'm no legal expert, but can we copyright something like: Copyright (c) 1995-1999 The ht://Dig Project

This would be in the same style as Apache. I think we need something like this on all of the source files too, with some mention of GPL and being a part of the ht://Dig source. But I'd also prefer to avoid FSF-style copyright assignments.

Does anyone know if the Apache Group requires copyright assignments? Does anyone know if we can just copyright as I illustrated, or do we need some sort of legal agency?

-Geoff

From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.1a/8.8.8/Debian/GNU) with ESMTP id RAA04813
	for <andrew@contigo.com>; Wed, 27 Jan 1999 17:23:37 -0800
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id RAA14803;
	Wed, 27 Jan 1999 17:24:09 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AFBC3C.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36AF9D8F.BeroList-2.5.9@sob.htdig.org>
References: <36AF9A17.BeroList-2.5.9@sob.htdig.org> from "Gilles Detillieux" at Jan 27, 99 04:57:13 pm
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 27 Jan 1999 20:16:46 -0400
Subject: [htdig3-dev] New snapshot

>Geoff, once all the recent patches are in the source tree, could you please
>make another snapshot, so I can do further testing of the latest changes?
>Thanks.

Yup. It should be on the website and the ftp server within an hour.

I'll note that your static variable patch will probably help larger digs even more than your site. My initial dig this morning took 6 hours, a full 3 times slower than 3.1.0b4. :-( I expect some of that is due to the zlib code. On the plus side, update digs are about 2 times faster. :-)

Let's give this one a good solid thrashing. I'd like to shake out as many bugs as possible.

-Geoff

From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id SAA09896
	for <andrew@contigo.com>; Wed, 27 Jan 1999 18:41:48 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id SAA15193;
	Wed, 27 Jan 1999 18:42:16 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36AFCE91.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Wed, 27 Jan 1999 21:39:47 -0400
Subject: [htdig3-dev] Debugged excerpt/valid_punctuation

OK, I poked through the code and worked out the problem with finding words with punctuation in the excerpt.

Basically, htsearch takes the user input and puts it into $WORDS. It then does some parsing (applying fuzz and checking for boolean syntax) and puts the result in $LOGICAL_WORDS. When it does this, it generates a StringMatch with the parsed $LOGICAL_WORDS in it. This makes sure fuzzy matches are included in the StringMatch, but it's already stripped out valid_punctuation. :-(

So here's my proposed fix. In addition to the logicalWords currently placed in searchWordsPattern in htsearch.cc, we should ALSO add the user's original input. This should include the punctuation and ensure that these words are considered when looking up the excerpt and doing highlighting.

Does this make sense?

-Geoff

From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id HAA32750
	for <andrew@contigo.com>; Thu, 28 Jan 1999 07:31:36 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id HAA18046;
	Thu, 28 Jan 1999 07:31:13 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B082D8.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 09:29:35 -0600 (CST)
In-Reply-To: <36AFCE91.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 27, 99 09:39:47 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Debugged excerpt/valid_punctuation

According to Geoff Hutchison:
> OK, I poked through the code and worked out the problem with finding words
> with punctuation in the excerpt.
>
> Basically, htsearch takes the user input and puts it into $WORDS. It then
> does some parsing (applying fuzz and checking for boolean syntax) and puts
> the result in $LOGICAL_WORDS. When it does this, it generates a StringMatch
> with the parsed $LOGICAL_WORDS in it. This makes sure fuzzy matches are
> included in the StringMatch, but it's already stripped out
> valid_punctuation. :-(
>
> So here's my proposed fix. In addition to the logicalWords currently placed
> in searchWordsPattern in htsearch.cc, we should ALSO add the user's
> original input. This should include the punctuation and ensure that these
> words are considered when looking up the excerpt and doing highlighting.
>
> Does this make sense?

I think so. The problem is you'd have to do some reparsing of the original input words before adding them to the StringMatch pattern, i.e. breaking up the string into a string list, stripping out boolean operators if necessary. I've peeked at the parsing code a bit, but I'm afraid I don't understand its workings enough to suggest exactly how the input string would need to be reparsed to do this correctly. If you want to give it a shot, I'd be grateful.
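
As a rough illustration of that reparsing step, here is a standalone sketch, under several assumptions, of turning a raw query string into a '|'-separated word list with punctuation left intact. `words_to_pattern` is a hypothetical name, not a function in htsearch. The operator set (and, or, not, plus parentheses) and the '|' separator are assumptions about what the real parser and StringMatch expect, and the sketch uses `std::string` rather than the ht://Dig classes.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper, not part of htsearch: keep the user's words (with
// their punctuation), drop assumed boolean operators and parentheses, and
// join what is left with '|'.
std::string words_to_pattern(const std::string &input)
{
    std::istringstream in(input);
    std::string word, pattern;
    while (in >> word)
    {
        // Remove grouping parentheses attached to a word, e.g. "(foo" or "bar)".
        word.erase(std::remove_if(word.begin(), word.end(),
                                  [](char c) { return c == '(' || c == ')'; }),
                   word.end());

        // Compare case-insensitively against the assumed operator names.
        std::string lower = word;
        std::transform(lower.begin(), lower.end(), lower.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (word.empty() || lower == "and" || lower == "or" || lower == "not")
            continue;

        if (!pattern.empty())
            pattern += '|';
        pattern += word;
    }
    return pattern;
}
```

For the input `foo-bar AND (baz or qux)` this yields `foo-bar|baz|qux`. A real version would need to strip operators only when the search is actually using boolean syntax, since "and" is an ordinary word otherwise.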

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id IAA02867
	for <andrew@contigo.com>; Thu, 28 Jan 1999 08:15:57 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id IAA18302;
	Thu, 28 Jan 1999 08:16:30 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B08D61.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 10:14:57 -0600 (CST)
In-Reply-To: <36AFBC3C.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 27, 99 08:16:46 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] another documentation update

Hi again. If my remove_default_doc patch is included, here's the documentation for it, in the patch below.

I've also been thinking that the following two items should be added to the TODO list for future releases:
- better internationalization
- use mime.types to determine content-type of files from the local filesystem

I've been toying with the idea of tackling the 2nd one myself, but as much as I'd enjoy the challenge, I'm finding it hard to justify the time it would take to do. Plus, I'm now WAY past the feature freeze date. :)

--- ./htdoc/cf_byname.html.docfix2 Wed Jan 27 16:36:01 1999
+++ ./htdoc/cf_byname.html Thu Jan 28 09:50:17 1999
@@ -207,6 +207,8 @@
  <img src="dot.gif" alt="*"> <a target="body" href=
  "attrs.html#remove_bad_urls">remove_bad_urls</a><br>
  <img src="dot.gif" alt="*"> <a target="body" href=
+ "attrs.html#remove_default_doc">remove_default_doc</a><br>
+ <img src="dot.gif" alt="*"> <a target="body" href=
  "attrs.html#robotstxt_name">robotstxt_name</a><br>
  </font> <br>
  <b>S</b> <font face="helvetica,arial" size="2"><br>
--- ./htdoc/cf_byprog.html.docfix2 Wed Jan 27 16:36:33 1999
+++ ./htdoc/cf_byprog.html Thu Jan 28 09:51:07 1999
@@ -108,6 +108,8 @@
  <img src="dot.gif" alt="*"> <a target="body" href=
  "attrs.html#pdf_parser">pdf_parser</a><br>
  <img src="dot.gif" alt="*"> <a target="body" href=
+ "attrs.html#remove_default_doc">remove_default_doc</a><br>
+ <img src="dot.gif" alt="*"> <a target="body" href=
  "attrs.html#robotstxt_name">robotstxt_name</a><br>
  <img src="dot.gif" alt="*"> <a target="body" href=
  "attrs.html#server_aliases">server_aliases</a><br>
--- ./htdoc/attrs.html.docfix2 Wed Jan 27 16:33:40 1999
+++ ./htdoc/attrs.html Thu Jan 28 10:13:20 1999
@@ -3706,6 +3706,57 @@
  <hr>
  <dl>
  <dt>
+ <strong><a name="remove_default_doc">remove_default_doc</a></strong>
+ </dt>
+ <dd>
+ <dl>
+ <dt>
+ <em>type:</em>
+ </dt>
+ <dd>
+ string list
+ </dd>
+ <dt>
+ <em>used by:</em>
+ </dt>
+ <dd>
+ <a href="htdig.html">htdig</a>
+ </dd>
+ <dt>
+ <em>default:</em>
+ </dt>
+ <dd>
+ index.html
+ </dd>
+ <dt>
+ <em>description:</em>
+ </dt>
+ <dd>
+ Set this to the default documents in a directory used by the
+ servers you are indexing. These document names will be stripped
+ off of URLs when they are normalized, if one of these names appears
+ after the final slash, to translate URLs like
+ http://foo.com/index.html into http://foo.com/<br>
+ Note that you can disable stripping of these names during
+ normalization by setting the list to an empty string.
+ The list should only contain names that all servers you index
+ recognize as default documents for directory URLs, as defined
+ by the DirectoryIndex setting in Apache's srm.conf, for example.
+ </dd>
+ <dt>
+ <em>example:</em>
+ </dt>
+ <dd>
+ remove_default_doc: default.html default.htm index.html index.htm
+ <br><em>or</em><br>
+ remove_default_doc:
+ </dd>
+ </dl>
+ </dd>
+ </dl>
+ <hr>
+ <dl>
+ <dt>
  <strong><a name="robotstxt_name">
  robotstxt_name</a></strong>
  </dt>
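
For anyone wondering what the stripping described in that attribute text amounts to, here is a minimal standalone sketch. It is not the code from the remove_default_doc patch: `strip_default_doc` is a hypothetical helper using `std::string` and `std::vector`, and it ignores query strings and fragments for simplicity.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not the actual htdig code: if the part of the URL
// after the final slash matches one of the default document names, drop it
// so that the URL ends at the slash.
std::string strip_default_doc(const std::string &url,
                              const std::vector<std::string> &defaults)
{
    std::string::size_type slash = url.rfind('/');
    if (slash == std::string::npos)
        return url;                              // no slash, nothing to strip

    std::string last = url.substr(slash + 1);
    for (const std::string &name : defaults)
        if (!name.empty() && last == name)
            return url.substr(0, slash + 1);     // keep everything up to the slash

    return url;                                  // an empty list leaves URLs untouched
}
```

With the default list of `index.html`, `http://foo.com/index.html` becomes `http://foo.com/`, and an empty list disables the stripping, matching the two cases in the description.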

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id IAA04168
	for <andrew@contigo.com>; Thu, 28 Jan 1999 08:30:44 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id IAA18358;
	Thu, 28 Jan 1999 08:31:37 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B090EB.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 11:30:05 -0500 (EST)
In-Reply-To: <36B08D61.BeroList-2.5.9@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Re: [htdig3-dev] another documentation update

On Thu, 28 Jan 1999, Gilles Detillieux wrote:

> I've also been thinking that the following two items should be added to the
> TODO list for future releases:
> - better internationalization
> - use mime.types to determine content-type of files from the
>   local filesystem

TODO basically needs to be rewritten from scratch. Many of the items are finished and we have some new priorities. It looks like a rewrite into Java is on hold for now.

For better internationalization, I'd include a "change to ASCII" subpoint. Maybe I'll submit a patch for my suggestions to TODO.html tonight.

> I've been toying with the idea of tackling the 2nd one myself, but as much
> as I'd enjoy the challenge, I'm finding it hard to justify the time it would
> take to do. Plus, I'm now WAY past the feature freeze date. :)

It's definitely on the table for 3.2. :-)

-Geoff

From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id KAA10146
	for <andrew@contigo.com>; Thu, 28 Jan 1999 10:12:01 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id KAA18696;
	Thu, 28 Jan 1999 10:12:43 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B0A8A7.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 28 Jan 1999 13:10:52 -0400
Subject: [htdig3-dev] TODO update

OK, here's an update to the TODO file. I expect others may have additional suggestions. I'll briefly mention things I'd like to see in 3.2. First off, I'd like to see multiple transport protocols and this will most likely require the mime.types stuff Gilles was talking about (and not just for local files!).

I'd also like to see us redesign the database backend. This will completely break compatibility, but we really have to do it sometime. There are several features that I mention in the TODO that people really want that require changes to our database code, including phrase searching, parallel digging, field searches, etc.

I don't want to talk about 3.2 too much since we still have to clean up 3.1.0 and get it out. But my initial focus post-3.1.0 will be in configure and Makefiles and the htlib. So if anyone has suggestions for them, please let me know.

-Geoff

Index: TODO.html
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdoc/TODO.html,v
retrieving revision 1.3
diff -c -3 -r1.3 TODO.html
*** TODO.html 1998/12/08 02:51:25 1.3
--- TODO.html 1999/01/28 18:06:30
***************
*** 29,87 ****
  href="mailto:htdig3-bugs@htdig.org">&lt;htdig3-bugs@htdig.org&gt;</a>
  </p>
  <ul>
- <li type="disk">
- Start htdig with multiple start documents.
- </li>
- <li type="disk">
- Allow attribute references in the values of other
- attributes
- </li>
- <li type="disk">
- Abstract the database so that other database backends can
- be used. Currently only GDBM can be used.
- <ul>
- <li type="disk">
- Create a database class that uses GDBM
- </li>
- <li type="disk">
- Add support for Berkeley DB
- </li>
- <li type="square">
- Add support for Oracle
- </li>
- <li type="square">
- Add support for SQL
- </li>
- </ul>
- </li>
  <li type="square">
! Merge multiple htdig databases together
  </li>
! <li type="square">
! Add support for BSDI make program
  </li>
  <li type="square">
! Better examples of configuration stuff
! </li>
! <li type="disk">
! Complete automatic installation
! </li>
! <li type="square">
! Rewrite everything in Java
! </li>
! <li type="disk">
! Add more document parsers
! <ul>
! <li type="disk">
! External document parser support
! </li>
! <li type="disk">
! PostScript
! </li>
! <li type="disk">
! PDF
! </li>
! </ul>
  </li>
  <li type="square">
  Add support for different transport protocols
--- 29,56 ----
  href="mailto:htdig3-bugs@htdig.org">&lt;htdig3-bugs@htdig.org&gt;</a>
  </p>
  <ul>
  <li type="square">
! Redesign the database backend to support additional enhancements:
! <ul>
! <li type="square">
! Phrase searching
! </li>
! <li type="square">
! Field-based searching>
! </li>
! <li type="square">
! "Collections" of multiple databases
! </li>
! <li type="square">
! Continual indexing
! </li>
! <li type="square">
! Parallel indexing and searching
  </li>
! </ul>
  </li>
  <li type="square">
! Add support for BSDI make program
  </li>
  <li type="square">
  Add support for different transport protocols
***************
*** 107,131 ****
  </ul>
  </li>
  <li type="square">
! Parallel indexing
! </li>
! <li type="disk">
! Allow for external document parsing programs
! </li>
! <li type="disk">
! Add logging to htsearch
! </li>
! <li type="disk">
! Add support for non ASCII characters (translate them)
! </li>
! <li type="circle">
! Add a web-based (CGI) URL/server registration system
  </li>
  <li type="square">
! Include several examples of result templates
! </li>
! <li type="circle">
! Binary release
  </li>
  <li type="square">
  Eliminate or detect duplicate documents
--- 76,92 ----
  </ul>
  </li>
  <li type="square">
! Better Internationalization
! <ul>
! <li type="square">
! Support for UTF-8
! </li>
! <li type="square">
! Allow character translation (e.g. remove accents)
! </li>
  </li>
  <li type="square">
! Better examples of configuration and result templates
  </li>
  <li type="square">
  Eliminate or detect duplicate documents
***************
*** 171,177 ****
  &lt;andrew@contigo.com&gt;</a>
  </address>
  <!-- hhmts start -->
! Last modified: Mon Nov 23 13:17:41 EST 1998
  <!-- hhmts end -->
  </body>
  </html>
--- 132,138 ----
  &lt;andrew@contigo.com&gt;</a>
  </address>
  <!-- hhmts start -->
! Last modified: Thu Jan 28 13:04:41 EST 1999
  <!-- hhmts end -->
  </body>
  </html>

From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id LAA17783
	for <andrew@contigo.com>; Thu, 28 Jan 1999 11:17:49 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id LAA18917;
	Thu, 28 Jan 1999 11:18:45 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B0B816.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 13:17:07 -0600 (CST)
In-Reply-To: <36B0A8A7.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 28, 99 01:10:52 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] new rundig script

Hi Geoff. I've just made some changes to rundig that I thought would be worth including in 3.1.0. The main fix is that there is a portable test to see if the synonyms or word2root DBs need updating. I've also improved the argument handling, and added support for complete rebuilds using -a. The reason for the DBDIR, COMMONDIR & BINDIR shell variables is to make it easier to customize after it's installed.

It might need more testing on other systems, to make sure it is indeed portable, but so far so good here.

--- rundig.geoff Wed Jan 6 21:17:15 1999
+++ rundig Thu Jan 28 12:58:32 1999
@@ -3,13 +3,23 @@
 #
 # rundig
 #
-# $Id: rundig,v 1.5 1999/01/07 03:17:15 ghutchis Exp $
+# $Id: rundig,v 1.6 1999/01/28 12:14:15 ghutchis Exp $
 #
 # This is a sample script to create a search database for ht://Dig.
 #
-if [ "$1" = "-v" ]; then
-    verbose=-v
-fi
+DBDIR=@DATABASE_DIR@
+COMMONDIR=@COMMON_DIR@
+BINDIR=@BIN_DIR@
+
+stats= opts= alt=
+for arg
+do
+    case "$arg" in
+    -a) alt="$arg" ;;
+    -s) stats="$arg" ;;
+    *)  opts="$opts $arg" ;;    # e.g. -v or -c config
+    esac
+done
 
 #
 # Set the TMPDIR variable if you want htmerge to put files in a location
@@ -18,25 +28,36 @@
 # on some systems, /tmp is a memory mapped filesystem that takes away
 # from virtual memory.
 #
-TMPDIR=@DATABASE_DIR@
+TMPDIR=$DBDIR
 export TMPDIR
 
-@BIN_DIR@/htdig -i $verbose -s
-@BIN_DIR@/htmerge $verbose -s
-@BIN_DIR@/htnotify $verbose
+$BINDIR/htdig -i $opts $stats $alt
+$BINDIR/htmerge $opts $stats $alt
+case "$alt" in
+-a)
+    ( cd $DBDIR && test -f db.docdb.work &&
+    for f in *.work
+    do
+        mv -f $f `basename $f .work`
+    done ) ;;
+esac
+$BINDIR/htnotify $opts
+$BINDIR/htfuzzy $opts soundex metaphone
 
 #
 # Create the endings and synonym databases if they don't exist
-# or if they're older than the files they're generated from!
+# or if they're older than the files they're generated from.
+# These databases are semi-static, so even if pages change,
+# these databases will not need to be rebuilt.
 #
-
-# Do they exist?
-if [ ! -f @COMMON_DIR@/word2root.db ]
+if [ "`ls -t $COMMONDIR/english.0 $COMMONDIR/word2root.db 2>/dev/null`" = \
+    "$COMMONDIR/english.0" ]
 then
-    @BIN_DIR@/htfuzzy $verbose endings
+    $BINDIR/htfuzzy $opts endings
 fi
 
-if [ ! -f @COMMON_DIR@/synonyms.db ]
+if [ "`ls -t $COMMONDIR/synonyms $COMMONDIR/synonyms.db 2>/dev/null`" = \
+    "$COMMONDIR/synonyms" ]
 then
-    @BIN_DIR@/htfuzzy $verbose synonyms
+    $BINDIR/htfuzzy $opts synonyms
 fi

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:16 1999
Return-Path: <csf@moscow.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id LAA17563
	for <andrew@contigo.com>; Thu, 28 Jan 1999 11:13:11 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id LAA18902;
	Thu, 28 Jan 1999 11:13:57 -0800 (PST)
From: csf@moscow.com (M. Yount)
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B0B6FC.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: Your message of "Wed, 27 Jan 1999 19:42:01 -0400."
             <36AFB3C4.BeroList-2.5.9@sob.htdig.org> 
Date: Thu, 28 Jan 1999 11:18:33 -0800
Sender: csf@moscow.com
Subject: Re: [htdig3-dev] Copyright and Content-Encoding 

Geoff,

The majordomo 2 license is based upon the Apache Group's license, with the copyright...

Copyright (c) 1997, 1998 Jason Tibbitts for The Majordomo Development Group. All rights reserved.

Jason is amiable and would probably be willing to answer your questions if this is the sort of arrangement you, Andrew, & Co. would like. His address (omitted here for UBE reasons) appears at the bottom of

http://www.hpc.uh.edu/majordomo/

On another issue, I'm wondering if you've considered adding hooks for content-encodings in version 4. At first glance, it would be fairly easy to make an ExternalDecoder class in the style of ExternalParser, but IIRC you're planning to rewrite ExternalParser after 3.1 is released.

Thanks,

Michael csf@moscow.com

>>
> >I just noticed that all the docs have
> >
> >"ht://Dig © 1995-1998 Andrew Scherpbier"
> >
> >Any suggestions on what to do about this?
> >It would be ok with me if the copyright stuff was removed from the pages. (I
> >still get lots of email directly because of that! :-))
>
> I've updated the README and whatever source files I've touched. If anyone
> wants to crunch sed on the htdocs folder, I'd appreciate it.
>
> But this brings up a good point... I'm no legal expert, but can we
> copyright something like:
> Copyright (c) 1995-1999 The ht://Dig Project
>
> This would be in the same style as Apache. I think we need something like
> this on all of the source files too, with some mention of GPL and being a
> part of the ht://Dig source. But I'd also prefer to avoid FSF-style
> copyright assignments.
>
> Does anyone know if the Apache Group requires copyright assignments? Does
> anyone know if we can just copyright as I illustrated, or do we need some
> sort of legal agency?
>
> -Geoff
>

From - Thu Feb  4 22:09:16 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id LAA18768
	for <andrew@contigo.com>; Thu, 28 Jan 1999 11:38:20 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id LAA19037;
	Thu, 28 Jan 1999 11:39:17 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B0BCE6.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 14:37:41 -0500 (EST)
In-Reply-To: <36B0B6FC.BeroList-2.5.9@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Re: [htdig3-dev] Copyright and Content-Encoding

On Thu, 28 Jan 1999, M. Yount wrote:

> Jason is amiable and would probably be willing to answer your questions if
> this is the sort of arrangement you, Andrew, & Co. would like. His address
> (omitted here for UBE reasons) appears at the bottom of

Wonderful, I'll write him a message now.

> On another issue, I'm wondering if you've considered adding hooks for
> content-encodings in version 4. At first glance, it would be fairly easy
> to make an ExternalDecoder class in the style of ExternalParser, but
> IIRC you're planning to rewrite ExternalParser after 3.1 is released.

I would like to add hooks for Content-encodings in 3.2. I think your idea for an ExternalDecoder class is excellent. One obvious decoder is for gzip, compress, and the like. Since I'm promising myself to have general zlib support, this particular decoder should be pretty easy.

-Geoff

From - Thu Feb  4 22:09:16 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id MAA20948
	for <andrew@contigo.com>; Thu, 28 Jan 1999 12:23:53 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA19218;
	Thu, 28 Jan 1999 12:24:37 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B0C795.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 14:22:54 -0600 (CST)
In-Reply-To: <36B0BCE6.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 28, 99 02:37:41 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] Makefile.in patch

Oops!...

--- ./Makefile.in.wrapfix Wed Jan 13 20:30:50 1999
+++ ./Makefile.in Thu Jan 28 14:16:17 1999
@@ -90,7 +90,7 @@
 	@if [ ! -f $(SEARCH_DIR)/$(SEARCH_FORM) ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/search.html >$(SEARCH_DIR)/$(SEARCH_FORM); echo $(SEARCH_DIR)/$(SEARCH_FORM);fi
 	@if [ ! -f $(COMMON_DIR)/footer.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/footer.html >$(COMMON_DIR)/footer.html; echo $(COMMON_DIR)/footer.html;fi
 	@if [ ! -f $(COMMON_DIR)/header.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/header.html >$(COMMON_DIR)/header.html; echo $(COMMON_DIR)/header.html;fi
-	@if [ ! -f $(COMMON_DIR)/wrapper.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/wrapper.html >$(COMMON_DIR)/header.html; echo $(COMMON_DIR)/wrapper.html;fi
+	@if [ ! -f $(COMMON_DIR)/wrapper.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/wrapper.html >$(COMMON_DIR)/wrapper.html; echo $(COMMON_DIR)/wrapper.html;fi
 	@if [ ! -f $(COMMON_DIR)/nomatch.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/nomatch.html >$(COMMON_DIR)/nomatch.html; echo $(COMMON_DIR)/nomatch.html;fi
 	@if [ ! -f $(COMMON_DIR)/syntax.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/syntax.html >$(COMMON_DIR)/syntax.html; echo $(COMMON_DIR)/syntax.html;fi
 	@if [ ! -f $(COMMON_DIR)/english.0 ]; then $(INSTALL) $(top_srcdir)/installdir/english.0 $(COMMON_DIR); echo $(COMMON_DIR)/english.0;fi

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id MAA21242
	for <andrew@contigo.com>; Thu, 28 Jan 1999 12:30:30 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA19253;
	Thu, 28 Jan 1999 12:31:27 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B0C920.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 15:29:50 -0500 (EST)
In-Reply-To: <36B0C795.BeroList-2.5.9@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Re: [htdig3-dev] Makefile.in patch

> Oops!...

Indeed! Sorry about that.

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id OAA30032
	for <andrew@contigo.com>; Thu, 28 Jan 1999 14:48:38 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id OAA19647;
	Thu, 28 Jan 1999 14:49:25 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B0E981.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 16:47:41 -0600 (CST)
In-Reply-To: <36B0C795.BeroList-2.5.9@sob.htdig.org> from "Gilles Detillieux" at Jan 28, 99 02:22:54 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] another Makefile.in patch

Here's another patch to Makefile.in, so the dictionary files in common/ don't get installed with execute permissions turned on.

--- Makefile.in.makefix2 Thu Jan 28 14:16:17 1999
+++ Makefile.in Thu Jan 28 15:23:41 1999
@@ -86,16 +86,16 @@
 	@echo ""
 	@echo "Installing default configuration files..."
 	@if [ ! -f $(CONFIG_DIR)/htdig.conf ]; then sed -e s%@DATABASE_DIR@%$(DATABASE_DIR)% -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/htdig.conf >$(CONFIG_DIR)/htdig.conf; echo $(CONFIG_DIR)/htdig.conf;fi
-	@if [ ! -f $(COMMON_DIR)/bad_words ]; then $(INSTALL) $(top_srcdir)/installdir/bad_words $(COMMON_DIR); echo $(COMMON_DIR)/bad_words; fi
+	@if [ ! -f $(COMMON_DIR)/bad_words ]; then $(INSTALL) -m 0664 $(top_srcdir)/installdir/bad_words $(COMMON_DIR); echo $(COMMON_DIR)/bad_words; fi
 	@if [ ! -f $(SEARCH_DIR)/$(SEARCH_FORM) ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/search.html >$(SEARCH_DIR)/$(SEARCH_FORM); echo $(SEARCH_DIR)/$(SEARCH_FORM);fi
 	@if [ ! -f $(COMMON_DIR)/footer.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/footer.html >$(COMMON_DIR)/footer.html; echo $(COMMON_DIR)/footer.html;fi
 	@if [ ! -f $(COMMON_DIR)/header.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/header.html >$(COMMON_DIR)/header.html; echo $(COMMON_DIR)/header.html;fi
 	@if [ ! -f $(COMMON_DIR)/wrapper.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/wrapper.html >$(COMMON_DIR)/wrapper.html; echo $(COMMON_DIR)/wrapper.html;fi
 	@if [ ! -f $(COMMON_DIR)/nomatch.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/nomatch.html >$(COMMON_DIR)/nomatch.html; echo $(COMMON_DIR)/nomatch.html;fi
 	@if [ ! -f $(COMMON_DIR)/syntax.html ]; then sed -e s%@IMAGEDIR@%$(IMAGE_URL_PREFIX)% $(top_srcdir)/installdir/syntax.html >$(COMMON_DIR)/syntax.html; echo $(COMMON_DIR)/syntax.html;fi
-	@if [ ! -f $(COMMON_DIR)/english.0 ]; then $(INSTALL) $(top_srcdir)/installdir/english.0 $(COMMON_DIR); echo $(COMMON_DIR)/english.0;fi
-	@if [ ! -f $(COMMON_DIR)/english.aff ]; then $(INSTALL) $(top_srcdir)/installdir/english.aff $(COMMON_DIR); echo $(COMMON_DIR)/english.aff;fi
-	@if [ ! -f $(COMMON_DIR)/synonyms ]; then $(INSTALL) $(top_srcdir)/installdir/synonyms $(COMMON_DIR); echo $(COMMON_DIR)/synonyms;fi
+	@if [ ! -f $(COMMON_DIR)/english.0 ]; then $(INSTALL) -m 0664 $(top_srcdir)/installdir/english.0 $(COMMON_DIR); echo $(COMMON_DIR)/english.0;fi
+	@if [ ! -f $(COMMON_DIR)/english.aff ]; then $(INSTALL) -m 0664 $(top_srcdir)/installdir/english.aff $(COMMON_DIR); echo $(COMMON_DIR)/english.aff;fi
+	@if [ ! -f $(COMMON_DIR)/synonyms ]; then $(INSTALL) -m 0664 $(top_srcdir)/installdir/synonyms $(COMMON_DIR); echo $(COMMON_DIR)/synonyms;fi
 	@echo "Installing images..."
 	@for i in $(IMAGES); do \
 		if [ ! -f $(IMAGE_DIR)/$$i ]; then $(INSTALL) -m 0664 $(top_srcdir)/installdir/$$i $(IMAGE_DIR)/$$i; echo $(IMAGE_DIR)/$$i;fi; \

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id QAA01129
	for <andrew@contigo.com>; Thu, 28 Jan 1999 16:02:09 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id QAA19837;
	Thu, 28 Jan 1999 16:03:09 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B0FAC0.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B0B6FC.BeroList-2.5.9@sob.htdig.org>
References: Your message of "Wed, 27 Jan 1999 19:42:01 -0400."            
 <36AFB3C4.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 28 Jan 1999 18:56:38 -0400
Subject: Re: [htdig3-dev] Copyright and Content-Encoding

> Copyright (c) 1997, 1998 Jason Tibbitts for
> The Majordomo Development Group.
> All rights reserved.

Well I talked to Jason. He said he got all his info from the Apache people, but he didn't think we needed to do anything to do group copyrights. He prefers having people put their names in when they've contributed major code (i.e. a whole file).

So Andrew... Do you want your name *completely* removed?

Beyond that, I guess we choose:

Copyright (c) 1995-1999 The ht://Dig Project
[Or whatever we want to call the group]

or

Copyright (c) 1995-1999 [Developers] for The ht://Dig Project

I don't really care, but if Andrew wants his name off the web pages, we should do that before we release (should be a simple job for sed).

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id PAA32759
	for <andrew@contigo.com>; Thu, 28 Jan 1999 15:43:41 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id PAA19785;
	Thu, 28 Jan 1999 15:44:13 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B0F66B.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 17:42:25 -0600 (CST)
In-Reply-To: <36B0C795.BeroList-2.5.9@sob.htdig.org> from "Gilles Detillieux" at Jan 28, 99 02:22:54 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] bug in sort code in htsearch

Geoff and I have been discussing a strange bit of behaviour in htsearch, when you run the new version on an older DB. htsearch was dropping results from the search results pages, and when sorting on something other than score, it would sometimes die altogether.

The missing results, as far as I can tell, are because the new DBs don't map URLs to lower case, so the new htsearch can't find the DocumentRef for URLs with upper case letters, when searching the old DBs.

The missing DocumentRefs caused problems with title and date sorts, which are addressed by this patch:

--- ./htsearch/Display.cc.sortbug Wed Jan 27 18:49:23 1999
+++ ./htsearch/Display.cc Thu Jan 28 17:40:35 1999
@@ -877,6 +877,7 @@
 	thisMatch = new ResultMatch();
 	thisMatch->setURL(url);
+	thisMatch->setRef(NULL);
 	//
 	// Get the actual document record into the current ResultMatch
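
The idea behind the one-line fix can be sketched outside htsearch as follows: once the reference pointer is reliably NULL when no document record was found, a title sort can test for it instead of dereferencing garbage. The types DocRef and Match and the functions byTitle() and sortedUrls() are stand-ins invented for this illustration, and sorting unmatched entries last is an assumption, not necessarily what the real sort code in Display.cc does.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Stand-in for DocumentRef; only the field needed for a title sort.
struct DocRef {
    std::string title;
};

// Stand-in for ResultMatch. The constructor initializes ref to a null
// pointer, which is what the setRef(NULL) call in the patch guarantees.
struct Match {
    std::string url;
    DocRef *ref;
    explicit Match(const std::string &u) : url(u), ref(nullptr) {}
};

// Order by title. Matches without a document record compare as equal to
// each other and sort after every match that has one.
bool byTitle(const Match &a, const Match &b) {
    if (a.ref == nullptr || b.ref == nullptr)
        return a.ref != nullptr && b.ref == nullptr;
    return a.ref->title < b.ref->title;
}

// Sort a copy of the matches and return the URLs in sorted order.
std::vector<std::string> sortedUrls(std::vector<Match> matches) {
    std::sort(matches.begin(), matches.end(), byTitle);
    std::vector<std::string> urls;
    for (const Match &m : matches)
        urls.push_back(m.url);
    return urls;
}
```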

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <andrews@contigo.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id RAA05362
	for <andrew@contigo.com>; Thu, 28 Jan 1999 17:43:38 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id RAA20269;
	Thu, 28 Jan 1999 17:44:34 -0800 (PST)
From: Andrew Scherpbier <andrews@contigo.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1128C.BeroList-2.5.9@sob.htdig.org>
Sender: turtle@contigo.com
Date: Thu, 28 Jan 1999 17:42:51 -0800
Organization: Contigo Software
X-Mailer: Mozilla 4.5 [en] (X11; I; Linux 2.2.0 i686)
X-Accept-Language: en
MIME-Version: 1.0
References: Your message of "Wed, 27 Jan 1999 19:42:01 -0400."            
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Copyright and Content-Encoding

Geoff Hutchison wrote:
>
> > Copyright (c) 1997, 1998 Jason Tibbitts for
> > The Majordomo Development Group.
> > All rights reserved.
>
> Well I talked to Jason. He said he got all his info from the Apache people,
> but he didn't think we needed to do anything to do group copyrights. He
> prefers having people put their names in when they've contributed major
> code (i.e. a whole file).
>
> So Andrew... Do you want your name *completely* removed?

I'd like my name removed from the top of all the pages...

> Beyond that, I guess we choose:
>
> Copyright (c) 1995-1999 The ht://Dig Project
> [Or whatever we want to call the group]
>
> or
>
> Copyright (c) 1995-1999 [Developers] for The ht://Dig Project
>
> I don't really care, but if Andrew wants his name off the web pages, we
> should do that before we release (should be a simple job for sed).

I don't care either way.

-- 
Andrew Scherpbier <andrews@contigo.com>
Contigo Software <http://www.contigo.com/>
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id SAA07056
	for <andrew@contigo.com>; Thu, 28 Jan 1999 18:27:38 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id SAA20558;
	Thu, 28 Jan 1999 18:28:44 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B11CDD.BeroList-2.5.9@sob.htdig.org>
Date: Thu, 28 Jan 1999 21:26:58 -0500 (EST)
In-Reply-To: <36B1128C.BeroList-2.5.9@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Re: [htdig3-dev] Copyright

On Thu, 28 Jan 1999, Andrew Scherpbier wrote:

> I'd like my name removed from the top of all the pages...

Ok, I just ran the pages through sed. I'll commit them in a second. Right now they say: ht://Dig Copyright &copy; 1995-1999 The ht://Dig Group

It's pretty easy to change it later if people prefer something else.

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id VAA04736
	for <andrew@contigo.com>; Thu, 28 Jan 1999 21:41:12 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id VAA21550;
	Thu, 28 Jan 1999 21:41:46 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B14A41.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 00:39:41 -0400
Subject: [htdig3-dev] Excerpts and Punctuation

OK, I promised a patch for the valid_punctuation problem. Here's a patch that adds the original user input, with punctuation, to the StringMatch used for excerpts.

However, I just noticed excerpt highlighting seems broken on my system. So I can't test it out. :-( I did put in debugging output, so I know it's setting the StringMatch correctly.

If someone could test this, I'd appreciate it. If someone can figure out why my excerpt highlighting isn't working, I'd be very, very happy.

-Geoff

Index: htsearch.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htsearch/htsearch.cc,v
retrieving revision 1.22
diff -c -3 -r1.22 htsearch.cc
*** htsearch.cc 1999/01/21 13:41:24 1.22
--- htsearch.cc 1999/01/29 05:30:40
***************
*** 106,113 ****

  ResultList *htsearch(char *, List &, Parser *);

! void setupWords(char *, List &, int, Parser *);
! void createLogicalWords(List &, String &, StringMatch &);
  void reportError(char *);
  void convertToBoolean(List &words);
  void doFuzzy(WeightWord *, List &, List &);
--- 35,42 ----

  ResultList *htsearch(char *, List &, Parser *);

! void setupWords(char *, List &, int, Parser *, String &);
! void createLogicalWords(List &, String &, String &);
  void reportError(char *);
  void convertToBoolean(List &words);
  void doFuzzy(WeightWord *, List &, List &);
***************
*** 133,138 ****
--- 62,69 ----
      StringMatch limit_to;
      StringMatch exclude_these;
      String logicalWords;
+     String origPattern;
+     String logicalPattern;
      StringMatch searchWordsPattern;
      StringList requiredWords;
      int i;
***************
*** 266,280 ****
      originalWords.chop(" \t\r\n");
      setupWords(originalWords, searchWords,
                 strcmp(config["match_method"], "boolean") == 0,
!                parser);

      //
      // Convert the list of WeightWord objects to a pattern string
      // that we can compile.
      //
!     createLogicalWords(searchWords, logicalWords, searchWordsPattern);

      //
      // If required keywords were given in the search form, we will
      // modify the current searchWords list to include the required
      // words.
--- 197,220 ----
      originalWords.chop(" \t\r\n");
      setupWords(originalWords, searchWords,
                 strcmp(config["match_method"], "boolean") == 0,
!                parser, origPattern);

      //
      // Convert the list of WeightWord objects to a pattern string
      // that we can compile.
      //
!     createLogicalWords(searchWords, logicalWords, logicalPattern);

+     //
+     // Assemble the full pattern for excerpt matching and highlighting
      //
+     origPattern += logicalPattern;
+     searchWordsPattern.Pattern(origPattern);
+     searchWordsPattern.IgnoreCase();
+     if (debug)
+         cout << "Excerpt pattern: " << origPattern << "\n";
+
+     //
      // If required keywords were given in the search form, we will
      // modify the current searchWords list to include the required
      // words.
***************
*** 336,342 ****

  //*****************************************************************************
  void
! createLogicalWords(List &searchWords, String &logicalWords, StringMatch &wm)
  {
      String pattern;
      int i;
--- 276,282 ----

  //*****************************************************************************
  void
! createLogicalWords(List &searchWords, String &logicalWords, String &wm)
  {
      String pattern;
      int i;
***************
*** 368,375 ****
              pattern << ww->word;
          }
      }
!     wm.IgnoreCase();
!     wm.Pattern(pattern);

      if (debug)
      {
--- 308,314 ----
              pattern << ww->word;
          }
      }
!     wm = pattern;

      if (debug)
      {
***************
*** 395,404 ****

  //*****************************************************************************
  // void setupWords(char *allWords, List &searchWords,
! //                 int boolean, Parser *parser)
  //
  void
! setupWords(char *allWords, List &searchWords, int boolean, Parser *parser)
  {
      List tempWords;
      int i;
--- 334,344 ----

  //*****************************************************************************
  // void setupWords(char *allWords, List &searchWords,
! //                 int boolean, Parser *parser, String &originalPattern)
  //
  void
! setupWords(char *allWords, List &searchWords, int boolean, Parser *parser,
!            String &originalPattern)
  {
      List tempWords;
      int i;
***************
*** 456,463 ****
                      word << (char) t;
                      t = *pos++;
                  }
!                 word.remove(valid_punctuation);
!                 pos--;
                  if (boolean && mystrcasecmp(word.get(), "and") == 0)
                  {
                      tempWords.Add(new WeightWord("&", -1.0));
--- 396,402 ----
                      word << (char) t;
                      t = *pos++;
                  }
!
                  if (boolean && mystrcasecmp(word.get(), "and") == 0)
                  {
                      tempWords.Add(new WeightWord("&", -1.0));
***************
*** 472,477 ****
--- 411,419 ----
                  }
                  else
                  {
+                     // Add word to excerpt matching list
+                     originalPattern << word << "|";
+                     word.remove(valid_punctuation);
                      WeightWord *ww = new WeightWord(word, 1.0);
                      if (!badWords.IsValid(word) ||
                          word.length() < minimum_word_length)
***************
*** 484,489 ****
--- 426,432 ----
                          tempWords.Add(ww);
                      }
                  }
+                 pos--;
                  break;
              }
          }
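
As a rough, standalone illustration of what the patch assembles, the sketch below builds the same kind of alternation: each word as the user typed it followed by "|", then the punctuation-stripped words. The functions stripPunctuation() and excerptPattern() are hypothetical helpers using std::string rather than ht://Dig's String and StringMatch, and joining the stripped words with "|" is an assumption about what createLogicalWords produces.

```cpp
#include <string>
#include <vector>

// Remove every character found in `punct` from `word`
// (a stand-in for word.remove(valid_punctuation) in the patch).
std::string stripPunctuation(const std::string &word, const std::string &punct) {
    std::string out;
    for (char c : word)
        if (punct.find(c) == std::string::npos)
            out += c;
    return out;
}

// Build "orig1|orig2|...|stripped1|stripped2|...": the original words with
// punctuation first, then the stripped forms used for the actual search.
std::string excerptPattern(const std::vector<std::string> &words,
                           const std::string &punct) {
    std::string original;  // plays the role of origPattern
    std::string logical;   // plays the role of logicalPattern
    for (const std::string &w : words) {
        original += w + "|";
        if (!logical.empty())
            logical += "|";
        logical += stripPunctuation(w, punct);
    }
    return original + logical;
}
```

For the words "post-modern" and "web" with "-" treated as valid punctuation, this yields post-modern|web|postmodern|web, so an excerpt containing either spelling can be matched.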

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id VAA04743
	for <andrew@contigo.com>; Thu, 28 Jan 1999 21:41:25 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id VAA21556;
	Thu, 28 Jan 1999 21:42:09 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B14A41.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 00:39:57 -0400
Subject: [htdig3-dev] Odd comment...

OK, I was going through htmerge/docs.cc:

./htmerge/docs.cc: // moet eigenlijk wat tussen, maar heb ik niet gedaan...

Anyone care to share? Dutch? German? It's been a long night.

Goodnight,
-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <rd@ndt.net>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id WAA05516
	for <andrew@contigo.com>; Thu, 28 Jan 1999 22:13:36 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id WAA21635;
	Thu, 28 Jan 1999 22:14:47 -0800 (PST)
From: Rolf Diederichs <rd@ndt.net>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B151D9.BeroList-2.5.9@sob.htdig.org>
	by donar.teuto.de with SMTP; 29 Jan 1999 06:12:50 -0000
X-Sender: rd@pop3.teuto.net
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.1
Date: Fri, 29 Jan 1999 07:13:08 +0100
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Subject: [htdig3-dev] WWW Library catalog with htdig

We use HTDIG for remote pages of a WWW Library. HTDIG indexed about 3000 remote pages. A full-text search is often too broad and not well structured. With remote pages we are also not able to control the Meta information.

We want to build categories of these pages, perhaps we'll use a SQL database. However it is a disadvantage that we cannot use the search and the SQL catalog together. Is there any way to create a category search with htdig?

Thanks in advance.

Rolf Diederichs

----------------------------------------------------------------------------
NDT.net The e-Journal of Nondestructive Testing & Ultrasonics
Plus NDT online Exhibition
* NDTnet - http://www.ndt.net *
-----------------------------------------------------------------------------
NDT Internet Publishing       Tel:   +49(0)5221-769314
Rolf Diederichs               FAX:   +49(0)5221-769731
Tacheniusweg 8                Email: rd@ndt.net
D-32052 Herford

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <rd@ndt.net>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id WAA05526
	for <andrew@contigo.com>; Thu, 28 Jan 1999 22:13:47 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id WAA21644;
	Thu, 28 Jan 1999 22:14:59 -0800 (PST)
From: Rolf Diederichs <rd@ndt.net>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B151E5.BeroList-2.5.9@sob.htdig.org>
	by donar.teuto.de with SMTP; 29 Jan 1999 06:13:07 -0000
X-Sender: rd@pop3.teuto.net
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.1
Date: Fri, 29 Jan 1999 07:13:25 +0100
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Subject: [htdig3-dev] How to get a list of URLs together with the Title

We use HTDIG for remote pages of a WWW Library. HTDIG indexed about 3000 remote pages.

We want to build categories of these pages, perhaps we'll use a SQL database. As a first step we need all URLs together with their titles. How is that possible without visiting all pages manually? Is there any htdig log file that can be used or generated?

Also it would be excellent if we could get the first sentences of each page.

Actually the htdig search result (long-format) could be useful for export, unfortunately it is necessary to enter a search term. Is there any query to allow a list of all pages?

Thanks in advance.

Rolf Diederichs

----------------------------------------------------------------------------
NDT.net The e-Journal of Nondestructive Testing & Ultrasonics
Plus NDT online Exhibition
* NDTnet - http://www.ndt.net *
-----------------------------------------------------------------------------
NDT Internet Publishing       Tel:   +49(0)5221-769314
Rolf Diederichs               FAX:   +49(0)5221-769731
Tacheniusweg 8                Email: rd@ndt.net
D-32052 Herford

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <andrew@contigo.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id WAA05929
	for <andrew@contigo.com>; Thu, 28 Jan 1999 22:29:25 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id WAA21740;
	Thu, 28 Jan 1999 22:30:34 -0800 (PST)
From: Andrew Scherpbier <andrew@contigo.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1558C.BeroList-2.5.9@sob.htdig.org>
Sender: turtle@contigo.com
Date: Thu, 28 Jan 1999 22:28:17 -0800
Organization: Contigo Software
X-Mailer: Mozilla 4.5 [en] (X11; I; Linux 2.2.0-pre6 i686)
X-Accept-Language: en
MIME-Version: 1.0
References: <36B14A41.BeroList-2.5.9@sob.htdig.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Odd comment...

Geoff Hutchison wrote:
>
> OK, I was going through htmerge/docs.cc:
>
> ./htmerge/docs.cc: // moet eigenlijk wat tussen, maar heb ik niet
> gedaan...
>
> Anyone care to share? Dutch? German?
> It's been a long night.

:-) That's Dutch. The translation is:

"needs something in between, but I didn't do that..."

I know for sure I didn't write that!!!

-- 
Andrew Scherpbier <andrews@contigo.com>
Contigo Software <http://www.contigo.com/>
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <webmaster@javawoman.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id AAA09444
	for <andrew@contigo.com>; Fri, 29 Jan 1999 00:44:32 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id AAA22657;
	Fri, 29 Jan 1999 00:45:42 -0800 (PST)
From: Marjolein Katsma <webmaster@javawoman.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1753B.BeroList-2.5.9@sob.htdig.org>
X-Sender: javawoma@pop.javawoman.com
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.1 
Date: Fri, 29 Jan 1999 09:43:12 +0100
In-Reply-To: <36B14A41.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Subject: Re: [htdig3-dev] Odd comment...

Geoff,

At 00:39 1999-01-29 -0400, you wrote:
>
>OK, I was going through htmerge/docs.cc:
>
>./htmerge/docs.cc: // moet eigenlijk wat tussen, maar heb ik niet
>gedaan...

Nothing odd about it - just plain Dutch ;-)

It translates roughly to: "something should be inserted here but I didn't do that" (no reason given, but maybe the context can give you a clue).

>
>Anyone care to share? Dutch? German?
>It's been a long night.
>
>Goodnight,
>-Geoff
>
>
>------------------------------------
>To unsubscribe from the htdig3-dev mailing list, send a message to
>htdig3-dev@htdig.org containing the single word "unsubscribe" in
>the SUBJECT of the message.
>

Cheers,

Marjolein Katsma
webmaster@javawoman.com
Java Woman - http://javawoman.com/
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <tlm@po-net.prato.it>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id DAA14433
	for <andrew@contigo.com>; Fri, 29 Jan 1999 03:43:38 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id DAA23194;
	Fri, 29 Jan 1999 03:44:50 -0800 (PST)
From: "U.O. Telematica Municipale - Comune di Prato" <tlm@po-net.prato.it>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B19F3A.BeroList-2.5.9@sob.htdig.org>
X-Sender: c.giorge@mbox.comune.prato.it
X-Mailer: Windows Eudora Pro Version 3.0.1 (32) [I]
Date: Fri, 29 Jan 1999 12:44:49 +0100
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Subject: [htdig3-dev] Parsing Ms Word

Hi people!!! I tried to use the external parser htparsedoc from the contrib dir: I compiled catdoc.c and all went OK. But when I try to run htdig, it dumps core. Is there another external parser available for MS Word documents? If not, can you tell me how to configure it?

This is what I've done with my htdig configuration.

I added this line to htdig.conf:

external_parsers: application/msword /usr1/htdig/bin/htparsedoc

When htdig finds a document with that MIME type, it launches htparsedoc. But at the end of the indexing process I found a core file in the bin directory.

Ah, I run htdig on Linux Slackware 2.0.35 (Pentium Celeron 266 MHz, 64 MB RAM).

Thanks a lot
Ciao
Gabriele

----------------------------------------------------------

U.O. Rete Civica - Comune di Prato
Via Ricasoli, 4 - 59100 Prato PO Italia
Tel. +39 0574616342 Fax +39 0574616003

http://www.comune.prato.it E-Mail: tlm@mbox.comune.prato.it

----------------------------------------------------------
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:17 1999
Return-Path: <tlm@po-net.prato.it>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id DAA14793
	for <andrew@contigo.com>; Fri, 29 Jan 1999 03:56:44 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id DAA23236;
	Fri, 29 Jan 1999 03:57:59 -0800 (PST)
From: "U.O. Telematica Municipale - Comune di Prato" <tlm@po-net.prato.it>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1A24C.BeroList-2.5.9@sob.htdig.org>
X-Sender: c.giorge@mbox.comune.prato.it
X-Mailer: Windows Eudora Pro Version 3.0.1 (32) [I]
Date: Fri, 29 Jan 1999 12:58:05 +0100
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Subject: [htdig3-dev] Updating only a part of the database

Hi folks!!!

My htdig indexes all the servers into one database: I run the indexing script nightly via cron, with the options -i -a for htdig. So it wipes the database and creates a whole new one with the new links and words.

But every morning, a service located at a URL matching the pattern /tlm/concorsi/ is updated. So I have 2 possibilities:

- Reindex the whole database
- Index only the URLs containing the pattern /tlm/concorsi/

Well, I think the first option is not too bad, because it takes only 15 minutes to do the work. But I don't think it is the most elegant. I tried the second, but I'm not sure how well it really works. I ran htdig with the option -a (to create .work files) and set the start URL to the home of the service and the limit_urls_to directive to /tlm/concorsi/.

It indexes the right documents, but it also keeps the old files in the database. Is there a way to erase from the db all the documents matching the pattern specified in limit_urls_to or similar, so that a real update becomes possible?

I think it could be very useful.

Thanks and Ciao
Gabriele

----------------------------------------------------------

U.O. Rete Civica - Comune di Prato
Via Ricasoli, 4 - 59100 Prato PO Italia
Tel. +39 0574616342 Fax +39 0574616003

http://www.comune.prato.it
E-Mail: tlm@mbox.comune.prato.it

----------------------------------------------------------
From - Thu Feb  4 22:09:17 1999
Return-Path: <MSQL_User@st.hhs.nl>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id EAA15449
	for <andrew@contigo.com>; Fri, 29 Jan 1999 04:24:41 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id EAA23311;
	Fri, 29 Jan 1999 04:26:00 -0800 (PST)
From: "J. op den Brouw" <MSQL_User@st.hhs.nl>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1A8DA.BeroList-2.5.9@sob.htdig.org>
	with SMTP (XT-PP) with ESMTP; Fri, 29 Jan 1999 13:16:58 +0100
Date: Fri, 29 Jan 1999 13:24:11 +0100
X-Mailer: Mozilla 4.03 [en] (Win95; I)
MIME-Version: 1.0
References: <36B19F3A.BeroList-2.5.9@sob.htdig.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Parsing Ms Word

First of all, take the latest version of catdoc. Something like 0.90 or so.

Second, there is another script around; see: http://www.st.hhs.nl/htdig/parse_word_doc.pl

Third, there is mswordview, which translates Word 97 files into HTML, but I don't know if anyone uses that option.

Fourth, catdoc sometimes fails dramatically when a non-Word file ends with .doc and gets parsed by catdoc. It crashed htdig at my place...

U.O. Telematica Municipale - Comune di Prato wrote:
>
> Hi people !!! I tried to use the external parse htparsedoc from the contrib
> dir: I compiled the catdoc.c and all went OK. But when I try to run htdig,
> a core dumps. Is there another external parser available for MS Word
> documents? If not, can you tell me how to configure it?
>
> This is what I've done with my htdig configuration.
>
> I added this line to htdig.conf:
>
> external_parsers: application/msword /usr1/htdig/bin/htparsedoc
>
> When htdig founds a document with that MIME type, it launches htparsedoc.
> But at the end of the indexing process I found a core in the directory bin.
>
> Ah, I run htdig on a Linux slakware 2.0.35 (Pentium Celeron 266 Mhx 64MB Ram).
>
> Thanks a lot
> Ciao
> Gabriele
>
> ----------------------------------------------------------
>
> U.O. Rete Civica - Comune di Prato
> Via Ricasoli, 4 - 59100 Prato PO Italia
> Tel. +39 0574616342 Fax +39 0574616003
>
> http://www.comune.prato.it
> E-Mail: tlm@mbox.comune.prato.it
>
> ----------------------------------------------------------
From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id FAA17971
	for <andrew@contigo.com>; Fri, 29 Jan 1999 05:56:43 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id FAA23672;
	Fri, 29 Jan 1999 05:58:05 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1BE6E.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B1753B.BeroList-2.5.9@sob.htdig.org>
References: <36B14A41.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 08:54:47 -0400
Subject: Re: [htdig3-dev] Odd comment...

>Nothing odd about it - just plain Dutch ;-)
>
>It translates roughly to: " something should be inserted here but I didn't
>do that" (no reason given but maybe the context can give you a clue)

OK, I feel ashamed. I *thought* I had picked up some Dutch from last summer.

The context didn't tell me much. Now I'll need to figure out what "something" was. :-)

-Geoff

From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id GAA18732
	for <andrew@contigo.com>; Fri, 29 Jan 1999 06:25:44 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id GAA23820;
	Fri, 29 Jan 1999 06:27:00 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1C535.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B1BA59.BeroList-2.5.9@sob.htdig.org>
	(Netscape Messaging Server 3.5) with ESMTP id 118
	for <htdig3-dev@htdig.org>; Fri, 29 Jan 1999 14:40:45 +0100
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 09:18:46 -0400
Subject: Re: [htdig3-dev] Buildroot + solaris 2.6 patch

>Here's a patch against htdig-3.1.0b4. It adds a the possibility to fake
>the install into another directory.
>Nice for building rpm packages and the likes.

I'll take a look at this.

>Furthermore a patch to the configure.in for solaris 2.6 (i386) where
>GETPEERNAME_LENGTH_T is of type int.
>Don't know if it's the right way how I solved it. (I practically don't
>know anything about autoconf).

If you grab the latest CVS snapshot or the CVS tree, you can see that I've fixed this using a pretty clean autoconf construct. Basically I check to see what type we can use as that parameter and set GETPEERNAME_LENGTH_T accordingly.
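To illustrate why the type has to be detected at configure time, here is a minimal C++ sketch of a call site that takes the length type from the macro. It is not the ht://Dig source: the helper name peer_address() is made up for this example, and the fallback to socklen_t is an assumption for systems where no configure script has defined GETPEERNAME_LENGTH_T.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/socket.h>

// Assumption: a configure script normally defines GETPEERNAME_LENGTH_T
// (int, unsigned int, size_t, ...) after probing the system headers.
// The socklen_t fallback below is only for this standalone sketch.
#ifndef GETPEERNAME_LENGTH_T
#define GETPEERNAME_LENGTH_T socklen_t
#endif

// Hypothetical helper: fills 'out' with the peer address of socket 'fd'.
// Returns true on success, false if getpeername() fails.
bool peer_address(int fd, struct sockaddr_storage &out)
{
    // The third argument must be a pointer to exactly the type the
    // platform declares; a mismatch is a compile error in C++.
    GETPEERNAME_LENGTH_T length = sizeof(out);
    return getpeername(fd, reinterpret_cast<struct sockaddr *>(&out),
                       &length) == 0;
}
```

Because the call site only ever names the macro, the code itself does not change between platforms; only the configure result does.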

I haven't tried development versions of egcs for stability reasons. I have enough to do hunting down the bugs in ht://Dig :-). I think someone else is looking at that.

-Geoff

From - Thu Feb  4 22:09:17 1999
Return-Path: <klaren@telin.nl>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id FAA17496
	for <andrew@contigo.com>; Fri, 29 Jan 1999 05:39:22 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id FAA23585;
	Fri, 29 Jan 1999 05:40:37 -0800 (PST)
From: Ric Klaren <klaren@telin.nl>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1BA59.BeroList-2.5.9@sob.htdig.org>
	(Netscape Messaging Server 3.5) with ESMTP id 118
	for <htdig3-dev@htdig.org>; Fri, 29 Jan 1999 14:40:45 +0100
Sender: "Ric Klaren" <klaren@telin.nl>
Date: Fri, 29 Jan 1999 14:41:31 +0000
Organization: Telematica Instituut
X-Mailer: Mozilla 4.06 [en] (X11; I; SunOS 5.6 i86pc)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------B611F6224B3D8A09E6BC59D5"
Subject: [htdig3-dev] Buildroot + solaris 2.6 patch

This is a multi-part message in MIME format.
--------------B611F6224B3D8A09E6BC59D5
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Hi,

Here's a patch against htdig-3.1.0b4. It adds the possibility to fake the install into another directory. Nice for building RPM packages and the like.

Furthermore, a patch to configure.in for Solaris 2.6 (i386), where GETPEERNAME_LENGTH_T is of type int. I don't know if I solved it the right way. (I know practically nothing about autoconf.)

I'm working on some more patches, but I first want to verify them before sending them in... Is 3.1.0b4 known to work when compiled with one of the latest egcs releases (egcs-2.92.29)? (I have now got rid of most core dumps, but htdig still refuses to make a good db.)

3.1.0b1 and b2 worked without any problems on this configuration.

Regards,

Ric

--------------B611F6224B3D8A09E6BC59D5
Content-Type: text/plain; charset=us-ascii; name="htdig.patch"
Content-Disposition: inline; filename="htdig.patch"
Content-Transfer-Encoding: 7bit

--- htdig-3.1.0b4/htdig/Makefile.in.rkorig Thu Jan 28 16:14:22 1999
+++ htdig-3.1.0b4/htdig/Makefile.in Thu Jan 28 16:14:48 1999
@@ -13,7 +13,7 @@
 	$(CXX) -o $(TARGET) $(LDFLAGS) $(OBJS) $(LIBS)
 
 install: $(TARGET)
-	$(INSTALL) $(TARGET) $(BIN_DIR)
+	$(INSTALL) $(TARGET) $(INSTALL_ROOT)$(BIN_DIR)
 
 clean:
 	rm -f $(TARGET) $(OBJS) *~ *.bak *% a.out *.orig core
--- htdig-3.1.0b4/htfuzzy/Makefile.in.rkorig Thu Jan 28 16:14:22 1999
+++ htdig-3.1.0b4/htfuzzy/Makefile.in Thu Jan 28 16:15:14 1999
@@ -22,7 +22,7 @@
 	$(RANLIB) $(LIBTARGET)
 
 install: $(TARGET)
-	$(INSTALL) $(TARGET) $(BIN_DIR)
+	$(INSTALL) $(TARGET) $(INSTALL_ROOT)$(BIN_DIR)
 
 clean:
 	rm -f $(TARGET) $(LIBTARGET) $(OBJS) *~ *.bak *% a.out *.orig core
--- htdig-3.1.0b4/htmerge/Makefile.in.rkorig Thu Jan 28 16:14:22 1999
+++ htdig-3.1.0b4/htmerge/Makefile.in Thu Jan 28 16:15:53 1999
@@ -11,7 +11,7 @@
 	$(CXX) -o $(TARGET) $(LDFLAGS) $(OBJS) $(LIBS)
 
 install: $(TARGET)
-	$(INSTALL) $(TARGET) $(BIN_DIR)
+	$(INSTALL) $(TARGET) $(INSTALL_ROOT)$(BIN_DIR)
 
 clean:
 	rm -f $(OBJS) $(TARGET) *~ *% *.bak core a.out *.orig
--- htdig-3.1.0b4/htnotify/Makefile.in.rkorig Thu Jan 28 16:14:22 1999
+++ htdig-3.1.0b4/htnotify/Makefile.in Thu Jan 28 16:17:29 1999
@@ -12,7 +12,7 @@
 	$(CXX) -o $(TARGET) $(LDFLAGS) $(OBJS) $(LIBS)
 
 install: $(TARGET)
-	$(INSTALL) $(TARGET) $(BIN_DIR)
+	$(INSTALL) $(TARGET) $(INSTALL_ROOT)$(BIN_DIR)
 
 clean:
 	rm -f $(TARGET) $(OBJS) *~ *.bak *% a.out *.orig core
--- htdig-3.1.0b4/htsearch/Makefile.in.rkorig Thu Jan 28 16:14:22 1999
+++ htdig-3.1.0b4/htsearch/Makefile.in Thu Jan 28 16:17:09 1999
@@ -14,7 +14,7 @@
 	$(CXX) -o $(TARGET) $(LDFLAGS) $(OBJS) $(FOBJS) $(LIBS)
 
 install: all
-	$(INSTALL) $(TARGET) $(CGIBIN_DIR)/$(TARGET)
+	$(INSTALL) $(TARGET) $(INSTALL_ROOT)$(CGIBIN_DIR)/$(TARGET)
 
 clean:
 	rm -f $(OBJS) $(TARGET) *~ *.bak *% core *.orig a.out
--- htdig-3.1.0b4/configure.in.rkorig Thu Jan 28 14:30:09 1999
+++ htdig-3.1.0b4/configure.in Thu Jan 28 14:39:25 1999
@@ -113,6 +113,19 @@
 	[AC_MSG_RESULT(no);AC_DEFINE(GETPEERNAME_LENGTH_T, unsigned int)])
 AC_LANG_C
 
+AC_LANG_CPLUSPLUS
+AC_MSG_CHECKING(whether the third argument of getpeername can be a unsigned int?)
+AC_TRY_COMPILE([#include <sys/types.h>
+#include <sys/socket.h>],
+[ int socket;
+  struct sockaddr server;
+  unsigned int length;
+  getpeername(socket, &server, &length);],
+	[AC_MSG_RESULT(yes);AC_DEFINE(GETPEERNAME_LENGTH_T, unsigned int)],
+	[AC_MSG_RESULT(no);AC_DEFINE(GETPEERNAME_LENGTH_T, )])
+AC_LANG_C
+
+
 AC_PATH_PROG(TSORT, tsort)
 if test -z "$TSORT"; then
 	AC_MSG_ERROR([GNU rx configuration needs tsort in path!])
--- htdig-3.1.0b4/Makefile.in.rkorig Thu Jan 28 16:14:21 1999
+++ htdig-3.1.0b4/Makefile.in Thu Jan 28 16:54:03 1999
@@ -92,10 +92,10 @@
 	@if [ ! -f $(COMMON_DIR)/synonyms ]; then $(INSTALL) installdir/synonyms $(INSTALL_ROOT)$(COMMON_DIR); echo $(COMMON_DIR)/synonyms;fi
 	@echo "Installing images..."
 	@for i in $(IMAGES); do \
-		if [ ! -f $(IMAGE_DIR)/$$i ]; then $(INSTALL) -m 0664 installdir/$$i $(INSTALL_ROOT)$(IMAGE_DIR)/$$i; echo $(IMAGE_DIR)/$$i;fi; \
+		if [ ! -f $(INSTALL_ROOT)$(IMAGE_DIR)/$$i ]; then $(INSTALL) -m 0664 installdir/$$i $(INSTALL_ROOT)$(IMAGE_DIR)/$$i; echo $(IMAGE_DIR)/$$i;fi; \
 	done && test -z "$$fail"
 	@echo "Creating rundig script..."
-	@if [ ! -f $(BIN_DIR)/rundig ]; then \
+	@if [ ! -f $(INSTALL_ROOT)$(BIN_DIR)/rundig ]; then \
 		sed -e s%@BIN_DIR@%$(BIN_DIR)% -e s%@COMMON_DIR@%$(COMMON_DIR)% -e s%@DATABASE_DIR@%$(DATABASE_DIR)% installdir/rundig >$(INSTALL_ROOT)$(BIN_DIR)/rundig; \
 		chmod 755 $(INSTALL_ROOT)$(BIN_DIR)/rundig; \
 	fi

--------------B611F6224B3D8A09E6BC59D5--

From - Thu Feb  4 22:09:17 1999
Return-Path: <s.budd@ic.ac.uk>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id HAA22310
	for <andrew@contigo.com>; Fri, 29 Jan 1999 07:54:32 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id HAA24135;
	Fri, 29 Jan 1999 07:55:04 -0800 (PST)
From: "Budd, S." <s.budd@ic.ac.uk>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1D9E8.BeroList-2.5.9@sob.htdig.org>
Date: Fri, 29 Jan 1999 15:51:16 -0000
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2232.9)
Content-Type: text/plain
Subject: [htdig3-dev] htsearch as a variable?

Is it reasonable to add a config variable to describe the name of the "htsearch" program?

When a new release comes out, I install all of it into a new directory, which works well, but the htsearch overwrites the old one in my CGI_BIN. If I could configure the name of the "htsearch", such as "htsearch-htdig.3.0.1b4", the install would be very clean.

I can of course have a different CGI_BIN and edit the web server configuration file, or change the name of the htsearch and the search page prior to installing, but perhaps it would be nice to have a variable in the Configure file?

CGIBIN_DIR= /home1/www-data/cgi-bin
CGIBIN_PROG= htsearch-htdig.3.0.1b4

Regards
S.Budd@ic.ac.uk
From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id IAA24072
	for <andrew@contigo.com>; Fri, 29 Jan 1999 08:35:07 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id IAA24273;
	Fri, 29 Jan 1999 08:36:28 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1E38E.BeroList-2.5.9@sob.htdig.org>
Date: Fri, 29 Jan 1999 11:34:25 -0500 (EST)
In-Reply-To: <36B1D9E8.BeroList-2.5.9@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Re: [htdig3-dev] htsearch as a variable?

On Fri, 29 Jan 1999, Budd, S. wrote:

> a new directory which works well but the htsearch
> overwrites my old CGI_BIN. If I could configure
> the name of the "htsearch" such as "htsearch-htdig.3.0.1b4"
> the install would be very clean.

I actually just copy the programs in by hand for new releases. This is more a job for autoconf's spiffy name-mangling features than a separate option. After all, it's what they're supposed to do.

I'll take a look after I figure out what happened to my databases last night. :-(

-Geoff

From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id LAA01263
	for <andrew@contigo.com>; Fri, 29 Jan 1999 11:08:33 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id KAA24697;
	Fri, 29 Jan 1999 10:46:47 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B202F0.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B1FBFD.BeroList-2.5.9@sob.htdig.org>
References: <36B14A41.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 29, 99 00:39:41 am
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 13:36:30 -0400
Subject: Re: [htdig3-dev] Excerpts and Punctuation

>around. Oddly enough, it seems to work either way, even though Pattern()
>does make use of the existing trans table.

That is odd, but I agree with the fix.

>In the process of poking
>around in IgnoreCase(), I think I uncovered a memory leak. It's probably
>small and inconsequential, but it seems IgnoreCase() should do this:
>
>    if (local_alloc)
>        delete [] trans;

This is a leak. I just fixed it. Even small leaks should be fixed, especially in something as widely-used as StringMatch.
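For readers following along, here is a minimal C++ sketch of the pattern under discussion: free a table the object allocated itself before allocating a replacement. This is not the StringMatch source. The class name TableOwner, the counter live_tables, and the map() accessor are made up for illustration; only the names trans, local_alloc and IgnoreCase() come from the thread.

```cpp
#include <cassert>
#include <cctype>

// Counts tables currently allocated, so the leak (or its absence)
// can be observed. Purely illustrative.
static int live_tables = 0;

// Hypothetical stand-in for a matcher that owns a 256-entry
// character translation table.
class TableOwner {
public:
    TableOwner() : trans(nullptr), local_alloc(false) {}
    ~TableOwner() { release(); }
    TableOwner(const TableOwner &) = delete;
    TableOwner &operator=(const TableOwner &) = delete;

    // Builds a case-folding table. Without the release() call, a second
    // IgnoreCase() would overwrite 'trans' and leak the first table.
    void IgnoreCase()
    {
        release();
        trans = new unsigned char[256];
        local_alloc = true;
        ++live_tables;
        for (int i = 0; i < 256; ++i)
            trans[i] = static_cast<unsigned char>(std::tolower(i));
    }

    // Translates one character through the table, if there is one.
    unsigned char map(unsigned char c) const { return trans ? trans[c] : c; }

private:
    // Frees the table only if this object allocated it.
    void release()
    {
        if (local_alloc) {
            delete [] trans;
            --live_tables;
        }
        trans = nullptr;
        local_alloc = false;
    }

    unsigned char *trans;
    bool local_alloc;
};
```

Calling IgnoreCase() twice on the same object leaves exactly one live table, which is the behavior the fix is after.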

>before allocating a new table. The other change in my patch is to move the
>pos--; back to where it was, just after the loop that overincremented it.

I didn't think it really mattered, but I guess this is a more appropriate place since it's easier to understand why you need to do "pos--;"

-Geoff

From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id LAA02746
	for <andrew@contigo.com>; Fri, 29 Jan 1999 11:36:10 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id LAA24892;
	Fri, 29 Jan 1999 11:17:33 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B20B05.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 14:09:20 -0400
Subject: [htdig3-dev] Moving towards release

On a related note to my last message...

This weekend I'm going to start in on the documentation updates. Much of it has been taken care of by Gilles and Hans-Peter and all. But this means I probably won't get a chance to move the compression stuff in DocumentRef.cc to only act on the DocHead methods.

So... Does anyone (Randy perhaps?) want to submit a patch to move compression out of the Serialize/Deserialize methods and move them to DocHead? ;-) It should be pretty simple and the move should ensure we only compress/decompress that field when necessary. The speedup should be significant and the hit on compression ratios should be pretty small.

Thanks in advance,
-Geoff

From - Thu Feb  4 22:09:17 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id KAA00611
	for <andrew@contigo.com>; Fri, 29 Jan 1999 10:54:52 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id KAA24551;
	Fri, 29 Jan 1999 10:16:03 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B1FBFD.BeroList-2.5.9@sob.htdig.org>
Date: Fri, 29 Jan 1999 12:12:57 -0600 (CST)
In-Reply-To: <36B14A41.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 29, 99 00:39:41 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Excerpts and Punctuation

According to Geoff Hutchison:
> OK, I promised a patch for the valid_punctuation problem. Here's a patch
> that adds the original user input, with punctuation, to the StringMatch
> used for excerpts.
>
> However, I just noticed excerpt hilighting seems broken on my system. So I
> can't test it out. :-( I did put in debugging output, so I know it's
> setting the StringMatch correctly.
>
> If someone could test this, I'd appreciate it. If someone can figure out
> why my excerpt hilighting isn't working, I'd be very, very happy.

It's working fine on my system, with the new databases. Your patch seems to work fine as well. I've made a couple small changes, in the patch below. When I saw the IgnoreCase() after Pattern(), instead of before, it just looked wrong to me, because everywhere else it's the other way around. Oddly enough, it seems to work either way, even though Pattern() does make use of the existing trans table. In the process of poking around in IgnoreCase(), I think I uncovered a memory leak. It's probably small and inconsequential, but it seems IgnoreCase() should do this:

    if (local_alloc)
        delete [] trans;

before allocating a new table. The other change in my patch is to move the pos--; back to where it was, just after the loop that overincremented it.

--- htsearch/htsearch.cc.geoff Fri Jan 29 10:01:00 1999
+++ htsearch/htsearch.cc Fri Jan 29 10:29:10 1999
@@ -280,8 +280,8 @@
 // Assemble the full pattern for excerpt matching and highlighting
 //
 origPattern += logicalPattern;
-searchWordsPattern.Pattern(origPattern);
 searchWordsPattern.IgnoreCase();
+searchWordsPattern.Pattern(origPattern);
 if (debug)
 cout << "Excerpt pattern: " << origPattern << "\n";
 //
@@ -466,6 +466,7 @@
 word << (char) t;
 t = *pos++;
 }
+pos--;
 if (boolean && mystrcasecmp(word.get(), "and") == 0)
 {
 tempWords.Add(new WeightWord("&", -1.0));
@@ -495,7 +496,6 @@
 tempWords.Add(ww);
 }
 }
-pos--;
 break;
 }
 }
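
The reason for the pos--; right after the loop can be shown with a small standalone C++ sketch. This is not the htsearch code: the function read_word() and the use of std::isalnum() as the word test are assumptions made for illustration. The point is only that a loop of the form t = *pos++ always consumes one character more than the word it collects.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical word reader: collects alphanumeric characters starting
// at *pos and leaves pos pointing at the first character after the word.
std::string read_word(const char *&pos)
{
    std::string word;
    unsigned char t = static_cast<unsigned char>(*pos++);
    while (t && std::isalnum(t)) {
        word += static_cast<char>(t);
        // Post-increment: pos is already one past the character in t.
        t = static_cast<unsigned char>(*pos++);
    }
    // The loop consumed the terminating character too, so step back
    // once to leave it for the caller.
    pos--;
    return word;
}
```

Doing the decrement immediately after the loop, rather than several lines later, keeps the correction next to the code that makes it necessary.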

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:17 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id LAA02587
	for <andrew@contigo.com>; Fri, 29 Jan 1999 11:35:13 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id KAA24626;
	Fri, 29 Jan 1999 10:35:07 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B20075.BeroList-2.5.9@sob.htdig.org>
Date: Fri, 29 Jan 1999 12:32:50 -0600 (CST)
In-Reply-To: <36B1BE6E.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 29, 99 08:54:47 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Odd comment...

According to Geoff Hutchison:
> >
> >Nothing odd about it - just plain Dutch ;-)
> >
> >It translates roughly to: " something should be inserted here but I didn't
> >do that" (no reason given but maybe the context can give you a clue)
>
> OK, I feel ashamed. I *thought* I had picked up some Dutch from last summer.
>
> The context didn't tell me much. Now I'll need to figure out what
> "something" was. :-)

I was wondering what that comment meant! It seems it first appeared in 3.1.0b1. A lot of the revisions then were put in by "turtle", including this one:

// Revision 1.4 1998/06/21 23:20:09 turtle
// patches by Esa and Jesse to add BerkeleyDB and Prefix searching

So maybe turtle, Esa or Jesse can explain?

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:17 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id KAA00687
	for <andrew@contigo.com>; Fri, 29 Jan 1999 10:56:06 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id KAA24651;
	Fri, 29 Jan 1999 10:43:44 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B20285.BeroList-2.5.9@sob.htdig.org>
Date: Fri, 29 Jan 1999 12:40:25 -0600 (CST)
In-Reply-To: <36B1E38E.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 29, 99 11:34:25 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] htsearch as a variable?

According to Geoff Hutchison:
> I actually just copy the programs in by hand for new releases. This is
> more a job for autoconf's spiffy name-mangling features than a separate
> option. After all, it's what they're supposed to do.
>
> I'll take a look after I figure out what happened to my databases last
> night. :-(

Is it possible that when copying the programs by hand, you missed one? If htdig and htmerge are not from the same build, it seems possible they'd mess up the DBs if there are incompatibilities between them.

If all the programs you're running are the most recent revisions, it might be useful to compare your whole current source tree with the 012799 snapshot (diff -r). I'm using the 012799 snapshot with the few patches I sent you yesterday & today, plus your punctuation patch, and it's working fine here. Mind you, I'm not using most of the new bleeding-edge options like compression!

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:17 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id LAA01986
	for <andrew@contigo.com>; Fri, 29 Jan 1999 11:21:24 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id LAA24866;
	Fri, 29 Jan 1999 11:06:44 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B20819.BeroList-2.5.9@sob.htdig.org>
Date: Fri, 29 Jan 1999 14:04:18 -0500 (EST)
In-Reply-To: <36B20285.BeroList-2.5.9@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: [htdig3-dev] woes

> Is it possible that when copying the programs by hand, you missed one?
> If htdig and htmerge are not from the same build, it seems possible
> they'd mess up the DBs if there are incompatibilities between them.

No, I have a script that copies htdig/htdig htmerge/htmerge htfuzzy/htfuzzy and htnotify/htnotify into place. It then copies htsearch/htsearch into cgi-bin.

> and it's working fine here. Mind you, I'm not using most of the new
> bleeding-edge options like compression!

I just pulled out the zlib stuff, so I'm going to see if that's a problem. Everything got worse overnight as the databases seem corrupted! Running htmerge tells me I only have 1612 documents (vs. 58,000+)! I'm rebuilding them from scratch to see if that helps.

Sigh.

On a brighter note, I think we're down to a few last bugs before 3.1.0. We should really hammer over the weekend to squeeze out every last drop.

-Geoff

From - Thu Feb  4 22:09:18 1999
Return-Path: <andrews@contigo.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id MAA05271
	for <andrew@contigo.com>; Fri, 29 Jan 1999 12:25:49 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA25021;
	Fri, 29 Jan 1999 12:20:16 -0800 (PST)
From: Andrew Scherpbier <andrews@contigo.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B218B5.BeroList-2.5.9@sob.htdig.org>
Sender: turtle@contigo.com
Date: Fri, 29 Jan 1999 12:18:09 -0800
Organization: Contigo Software
X-Mailer: Mozilla 4.5 [en] (X11; I; Linux 2.2.0 i686)
X-Accept-Language: en
MIME-Version: 1.0
References: <36B20075.BeroList-2.5.9@sob.htdig.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Odd comment...

Gilles Detillieux wrote:
>
> According to Geoff Hutchison:
> > >
> > >Nothing odd about it - just plain Dutch ;-)
> > >
> > >It translates roughly to: " something should be inserted here but I didn't
> > >do that" (no reason given but maybe the context can give you a clue)
> >
> > OK, I feel ashamed. I *thought* I had picked up some Dutch from last summer.
> >
> > The context didn't tell me much. Now I'll need to figure out what
> > "something" was. :-)
>
> I was wondering what that comment meant! It seems it first appeared in
> 3.1.0b1. A lot of the revisions then were put in by "turtle", including
> this one:
>
> // Revision 1.4 1998/06/21 23:20:09 turtle
> // patches by Esa and Jesse to add BerkeleyDB and Prefix searching
>
> So maybe turtle, Esa or Jesse can explain?
>

turtle == Andrew Scherpbier == me...

I remember putting in the patches that Esa and Jesse sent. Since Jesse is Dutch, I'd "blame" him! :-)

-- 
Andrew Scherpbier <andrews@contigo.com>
Contigo Software <http://www.contigo.com/>
From - Thu Feb  4 22:09:18 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id MAA07023
	for <andrew@contigo.com>; Fri, 29 Jan 1999 12:54:40 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA25094;
	Fri, 29 Jan 1999 12:45:06 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B21E87.BeroList-2.5.9@sob.htdig.org>
Date: Fri, 29 Jan 1999 14:42:44 -0600 (CST)
In-Reply-To: <36B1BA59.BeroList-2.5.9@sob.htdig.org> from "Ric Klaren" at Jan 29, 99 02:41:31 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Buildroot + solaris 2.6 patch

According to Ric Klaren:
> Here's a patch against htdig-3.1.0b4. It adds a the possibility to fake
> the install into another directory.
> Nice for building rpm packages and the likes.

Neat idea. It seems, though, that you should also change the call to mkinstalldirs in Makefile.in, as well as all the lines that install stuff in COMMON_DIR and SEARCH_DIR, to use INSTALL_ROOT there too.

For the RPMs I put together, I used a different approach, which I borrowed from Mihai Ibanescu, who put together an RPM for ht://Dig 3.0.8b2. This involves patching CONFIG.in to use all the directories I want, and prefixing them with $(ROOT). Then, in the spec file, I can configure things like so:

CFLAGS="$RPM_OPT_FLAGS" ./configure --prefix=/usr \
	--bindir=/usr/sbin --libexec=/usr/lib --libdir=/usr/lib \
	--mandir=/usr/man --sysconfdir=/etc/htdig

and install (after making the directories I want), like so:

make ROOT=$RPM_BUILD_ROOT install

The only problem is that after this, the rundig script and htdig.conf are installed with the wrong directory names inside them. They contain the BuildRoot directory names in the various directory names they use. I just replace these two files with the correct ones, which are included as source files in the RPMs, but I could just as easily use sed to strip out the BuildRoot directory names. I replace them because I end up configuring them a little differently anyway.
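
A minimal sketch of the sed alternative, assuming the BuildRoot path contains no "|" characters and no regular-expression metacharacters that matter (a "." in the path would match any character). The function name strip_buildroot and the paths in the usage line are hypothetical, not taken from the spec files:

```shell
#!/bin/sh
# Hypothetical helper: remove a BuildRoot prefix from every path in a file.
# Usage: strip_buildroot BUILDROOT FILE   (result goes to stdout)
strip_buildroot() {
    root=$1
    file=$2
    # '|' is used as the sed delimiter so the slashes in the path need
    # no escaping. Assumes the BuildRoot itself contains no '|'.
    sed "s|$root||g" "$file"
}
```

For example, "strip_buildroot /var/tmp/htdig-root rundig > rundig.fixed" would turn /var/tmp/htdig-root/var/lib/htdig back into /var/lib/htdig wherever it appears in the script.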

Your approach would make things a little bit cleaner, though I'd still end up patching CONFIG.in to get FSSTND compliant installation directories for Linux.

If you want to see how I've set up my rpms, you can see them, and the individual spec and source files, on my web site at:

http://www.scrc.umanitoba.ca/htdig/rpms/

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:18 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id NAA07595
	for <andrew@contigo.com>; Fri, 29 Jan 1999 13:07:41 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id MAA25124;
	Fri, 29 Jan 1999 12:57:39 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B221BA.BeroList-2.5.9@sob.htdig.org>
Date: Fri, 29 Jan 1999 14:55:21 -0600 (CST)
In-Reply-To: <36B20B05.BeroList-2.5.9@sob.htdig.org> from "Geoff Hutchison" at Jan 29, 99 02:09:20 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Moving towards release

According to Geoff Hutchison:
> This weekend I'm going to start in on the documentation updates. Much of it
> has been taken care of by Gilles and Hans-Peter and all. But this means I
> probably won't get a chance to move the compression stuff in DocumentRef.cc
> to only act on the DocHead methods.
>
> So... Does anyone (Randy perhaps?) want to submit a patch to move
> compression out of the Serialize/Deserialize methods and move them to
> DocHead? ;-) It should be pretty simple and the move should ensure we only
> compress/decompress that field when necessary. The speedup should be
> significant and the hit on compression ratios should be pretty small.
>
> Thanks in advance,

Another small change I'd suggest to the compression stuff would be not to statically allocate the c_buffer, but instead dynamically allocate it the first time you need to compress or uncompress. 60K is a fair bit of static data to sit around if you're not using it.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:18 1999
Return-Path: <webmaster@javawoman.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id NAA09723
	for <andrew@contigo.com>; Fri, 29 Jan 1999 13:54:24 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA25176;
	Fri, 29 Jan 1999 13:20:07 -0800 (PST)
From: Marjolein Katsma <webmaster@javawoman.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B22627.BeroList-2.5.9@sob.htdig.org>
X-Sender: javawoma@pop.javawoman.com
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.1 
Date: Fri, 29 Jan 1999 21:58:26 +0100
In-Reply-To: <36B1BE6E.BeroList-2.5.9@sob.htdig.org>
References: <36B1753B.BeroList-2.5.9@sob.htdig.org>
 <36B14A41.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Subject: Re: [htdig3-dev] Odd comment...

Geoff,

At 08:54 1999-01-29 -0400, you wrote:

>OK, I feel ashamed. I *thought* I had picked up some Dutch from last summer.

No need.

>
>The context didn't tell me much. Now I'll need to figure out what
>"something" was. :-)

I was thinking of Jesse, too. Why don't you ask him?

Cheers,

Marjolein Katsma
webmaster@javawoman.com
Java Woman - http://javawoman.com/
From - Thu Feb  4 22:09:18 1999
Return-Path: <gumby@cafes.net>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id NAA08462
	for <andrew@contigo.com>; Fri, 29 Jan 1999 13:29:25 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id NAA25207;
	Fri, 29 Jan 1999 13:29:27 -0800 (PST)
From: Randy Winch <gumby@cafes.net>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B22860.BeroList-2.5.9@sob.htdig.org>
Sender: randy@mail.cafes.net
Date: Fri, 29 Jan 1999 15:30:04 -0600
X-Mailer: Mozilla 4.08 [en] (X11; I; Linux 2.0.35 i686)
MIME-Version: 1.0
References: <36B20B05.BeroList-2.5.9@sob.htdig.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] Moving towards release

Geoff Hutchison wrote:
> So... Does anyone (Randy perhaps?) want to submit a patch to move
> compression out of the Serialize/Deserialize methods and move them to
> DocHead? ;-) It should be pretty simple and the move should ensure we only
> compress/decompress that field when necessary. The speedup should be
> significant and the hit on compression ratios should be pretty small.

Will do, off to gather the latest snapshot...

Randy
From - Thu Feb  4 22:09:18 1999
Return-Path: <grdetil@scrc.umanitoba.ca>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id SAA23242
	for <andrew@contigo.com>; Fri, 29 Jan 1999 18:13:24 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id SAA03110;
	Fri, 29 Jan 1999 18:12:39 -0800 (PST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B26AA6.BeroList-2.5.9@sob.htdig.org>
Date: Fri, 29 Jan 1999 16:41:49 -0600 (CST)
In-Reply-To: <36B0B816.BeroList-2.5.9@sob.htdig.org> from "Gilles Detillieux" at Jan 28, 99 01:17:07 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [htdig3-dev] new rundig script

D'oh! I just found a bug in the patch for the new rundig script I posted yesterday. I forgot to use a "head -1" or "sed 1q" to grab only the first line from the ls -t. Sorry 'bout that! Here's the corrected patch:

--- rundig.geoff	Wed Jan  6 21:17:15 1999
+++ rundig	Fri Jan 29 16:37:43 1999
@@ -3,13 +3,23 @@
 #
 # rundig
 #
-# $Id: rundig,v 1.5 1999/01/07 03:17:15 ghutchis Exp $
+# $Id: rundig,v 1.6 1999/01/29 16:37:35 ghutchis Exp $
 #
 # This is a sample script to create a search database for ht://Dig.
 #
-if [ "$1" = "-v" ]; then
-    verbose=-v
-fi
+DBDIR=@DATABASE_DIR@
+COMMONDIR=@COMMON_DIR@
+BINDIR=@BIN_DIR@
+
+stats= opts= alt=
+for arg
+do
+    case "$arg" in
+    -a)	alt="$arg" ;;
+    -s)	stats="$arg" ;;
+    *)	opts="$opts $arg" ;;	# e.g. -v or -c config
+    esac
+done
 
 #
 # Set the TMPDIR variable if you want htmerge to put files in a location
@@ -18,25 +28,36 @@
 # on some systems, /tmp is a memory mapped filesystem that takes away
 # from virtual memory.
 #
-TMPDIR=@DATABASE_DIR@
+TMPDIR=$DBDIR
 export TMPDIR
 
-@BIN_DIR@/htdig -i $verbose -s
-@BIN_DIR@/htmerge $verbose -s
-@BIN_DIR@/htnotify $verbose
+$BINDIR/htdig -i $opts $stats $alt
+$BINDIR/htmerge $opts $stats $alt
+case "$alt" in
+-a)
+  ( cd $DBDIR && test -f db.docdb.work &&
+    for f in *.work
+    do
+	mv -f $f `basename $f .work`
+    done ) ;;
+esac
+$BINDIR/htnotify $opts
+$BINDIR/htfuzzy $opts soundex metaphone
 
 #
 # Create the endings and synonym databases if they don't exist
-# or if they're older than the files they're generated from!
+# or if they're older than the files they're generated from.
+# These databases are semi-static, so even if pages change,
+# these databases will not need to be rebuilt.
 #
-
-# Do they exist?
-if [ ! -f @COMMON_DIR@/word2root.db ]
+if [ "`ls -t $COMMONDIR/english.0 $COMMONDIR/word2root.db 2>/dev/null | sed 1q`" = \
+     "$COMMONDIR/english.0" ]
 then
-    @BIN_DIR@/htfuzzy $verbose endings
+    $BINDIR/htfuzzy $opts endings
 fi
 
-if [ ! -f @COMMON_DIR@/synonyms.db ]
+if [ "`ls -t $COMMONDIR/synonyms $COMMONDIR/synonyms.db 2>/dev/null | sed 1q`" = \
+     "$COMMONDIR/synonyms" ]
 then
-    @BIN_DIR@/htfuzzy $verbose synonyms
+    $BINDIR/htfuzzy $opts synonyms
 fi
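
The "ls -t ... | sed 1q" test used in the patch can be isolated as a small function, which makes the fixed bug easier to see: without the "sed 1q", the backquotes return both file names and the comparison never matches. This is only a sketch; the name needs_rebuild is hypothetical and is not part of rundig:

```shell
#!/bin/sh
# Succeeds (exit status 0) when SRC is newer than DB, or when DB is missing.
# Usage: needs_rebuild SRC DB
needs_rebuild() {
    src=$1
    db=$2
    # ls -t lists the newest file first; sed 1q keeps only that first line.
    # If DB does not exist, ls complains on stderr (discarded) and prints
    # only SRC, so a missing database also triggers a rebuild.
    newest=`ls -t "$src" "$db" 2>/dev/null | sed 1q`
    [ "$newest" = "$src" ]
}
```

With it, the endings check would read: needs_rebuild $COMMONDIR/english.0 $COMMONDIR/word2root.db && $BINDIR/htfuzzy $opts endings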

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
From - Thu Feb  4 22:09:18 1999
Return-Path: <andrews@contigo.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id RAA22431
	for <andrew@contigo.com>; Fri, 29 Jan 1999 17:50:08 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id RAA00212;
	Fri, 29 Jan 1999 17:46:04 -0800 (PST)
From: Andrew Scherpbier <andrews@contigo.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B264BA.BeroList-2.5.9@sob.htdig.org>
Sender: turtle@contigo.com
Date: Fri, 29 Jan 1999 17:45:28 -0800
Organization: Contigo Software
X-Mailer: Mozilla 4.5 [en] (X11; I; Linux 2.2.0 i686)
X-Accept-Language: en
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: [htdig3-dev] new hardware

Don't know if any of you noticed, but sob.htdig.org was down for a couple of hours. It had a problem with its ethernet card.

Well, now it is a Celeron 333 with 64M. Paid $281 at the store down the street for: motherboard, celeron 333, 64MB dimm, sales tax, 8MB AGP graphics, 16bit sound (those last two are on the motherboard). I reused the case and harddrive from the old machine. Gotta love computer prices!

-- 
Andrew Scherpbier <andrews@contigo.com>
Contigo Software <http://www.contigo.com/>
From - Thu Feb  4 22:09:18 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id UAA27744
	for <andrew@contigo.com>; Fri, 29 Jan 1999 20:48:18 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id UAA03517;
	Fri, 29 Jan 1999 20:48:16 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B28F13.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B151D9.BeroList-2.5.9@sob.htdig.org>
  by donar.teuto.de with SMTP; 29 Jan 1999 06: 12:50 -0000
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 23:30:54 -0400
Subject: Re: [htdig3-dev] WWW Library catalog with htdig

>Is there any way to create a category search with htdig?

Actually I consulted for some folks looking for a similar thing. I suggested setting up multiple config files and databases for each category. The shell scripts in contrib/multidig/ were the result (they need some minor cleanups for 3.1.0).
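
As a rough sketch of the one-config-per-category idea (this is not the contrib/multidig code), the loop below digs and merges each category in turn. The directory layout, the BINDIR and CONFDIR defaults and the RUN dry-run variable are all assumptions; each config file is assumed to set its own database_dir so the categories do not overwrite each other:

```shell
#!/bin/sh
# Hypothetical layout: one config file per category, e.g.
# $CONFDIR/books.conf, $CONFDIR/journals.conf, ...
BINDIR=${BINDIR:-/usr/local/bin}
CONFDIR=${CONFDIR:-/etc/htdig/categories}
RUN=${RUN:-}            # set RUN=echo to print the commands instead

dig_categories() {
    for conf in "$CONFDIR"/*.conf
    do
        [ -f "$conf" ] || continue      # glob matched nothing
        # -c selects the config file; -i asks htdig for an initial dig.
        $RUN "$BINDIR/htdig" -i -c "$conf"
        $RUN "$BINDIR/htmerge" -c "$conf"
    done
}
```

The search form would then pick a category by passing the matching config name to htsearch.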

This is becoming a FAQ, so perhaps we want to improve support for this in the next release.

Cheers,
-Geoff

From - Thu Feb  4 22:09:18 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id UAA27759
	for <andrew@contigo.com>; Fri, 29 Jan 1999 20:48:20 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id UAA03523;
	Fri, 29 Jan 1999 20:48:18 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B28F13.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B151E5.BeroList-2.5.9@sob.htdig.org>
  by donar.teuto.de with SMTP; 29 Jan 1999 06: 13:07 -0000
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 23:32:20 -0400
Subject: Re: [htdig3-dev] How to get a list of URLs together with the Title

>How is that possible without visiting all pages manually?
>Any htdig Log file that can be used or generated?

You can have ht://Dig dump an ASCII version of the database, including URL, Title, Head (excerpt), etc. Another program can easily pick the fields you want from the output.

Cheers,
-Geoff

From - Thu Feb  4 22:09:18 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id UAA27745
	for <andrew@contigo.com>; Fri, 29 Jan 1999 20:48:19 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id UAA03533;
	Fri, 29 Jan 1999 20:48:19 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B28F14.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B1A8DA.BeroList-2.5.9@sob.htdig.org>
  with SMTP (XT-PP) with ESMTP; Fri, 29 Jan 1999 13: 16:58 +0100
References: <36B19F3A.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 23:33:59 -0400
Subject: Re: [htdig3-dev] Parsing Ms Word

>Fourth, catdoc sometimes fails dramaticly when a non-Word
>file end with .doc and gets parsed by catdoc. It crashed
>htdig at my place...

Hmm. So the file was sent with the incorrect mime-type? Is there a way we can detect this easily?
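
One possible check, offered as a sketch and not as anything ht://Dig does: binary Word files of this era are OLE2 compound documents, which start with the eight bytes D0 CF 11 E0 A1 B1 1A E1, so a wrapper around the external parser could look at those bytes before calling catdoc. The function name looks_like_word is hypothetical. The test is not specific to Word (other OLE2 files such as Excel sheets pass too) and older pre-OLE Word formats would be rejected:

```shell
#!/bin/sh
# Succeeds when FILE begins with the OLE2 compound-document signature.
# Usage: looks_like_word FILE
looks_like_word() {
    # od prints the first 8 bytes as hex with no address column;
    # tr strips the spaces and the newline so the bytes form one word.
    magic=`od -An -tx1 -N8 "$1" 2>/dev/null | tr -d ' \n'`
    [ "$magic" = "d0cf11e0a1b11ae1" ]
}
```

A parser wrapper could then run catdoc only when looks_like_word "$file" succeeds and skip the document otherwise.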

-Geoff

From - Thu Feb  4 22:09:18 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id UAA27753
	for <andrew@contigo.com>; Fri, 29 Jan 1999 20:48:19 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id UAA03541;
	Fri, 29 Jan 1999 20:48:21 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B28F15.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B1A24C.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Fri, 29 Jan 1999 23:35:34 -0400
Subject: Re: [htdig3-dev] Updating only a part of the database

>It indexes the right documents, but then it keeps in the database the old
>files too. Is there a way to erase from the db all the documents with
>pattern specified in the limits_urls_to or similar, by making possibile the
>real updating?

What version are you using? This could be the bug we just fixed that leaves old files in the document database. Could you try the latest snapshot?

-Geoff

From - Thu Feb  4 22:09:18 1999
Return-Path: <gumby@cafes.net>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id VAA28253
	for <andrew@contigo.com>; Fri, 29 Jan 1999 21:08:42 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id VAA03585;
	Fri, 29 Jan 1999 21:08:43 -0800 (PST)
From: Randy Winch <gumby@cafes.net>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B293DC.BeroList-2.5.9@sob.htdig.org>
Sender: randy@mail.cafes.net
Date: Fri, 29 Jan 1999 23:11:26 -0600
X-Mailer: Mozilla 4.08 [en] (X11; I; Linux 2.0.35 i686)
MIME-Version: 1.0
References: <36B20B05.BeroList-2.5.9@sob.htdig.org>
Content-Type: multipart/mixed;
 boundary="------------93B73B272DFB9EDE5BAD0F92"
Subject: Re: [htdig3-dev] Moving towards release

This is a multi-part message in MIME format.
--------------93B73B272DFB9EDE5BAD0F92
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Geoff Hutchison wrote:
> So... Does anyone (Randy perhaps?) want to submit a patch to move
> compression out of the Serialize/Deserialize methods and move them to
> DocHead? ;-) It should be pretty simple and the move should ensure we only
> compress/decompress that field when necessary. The speedup should be
> significant and the hit on compression ratios should be pretty small.

Here is my first attempt... It seems to work, with a minor increase in the document db size.

Randy
--------------93B73B272DFB9EDE5BAD0F92
Content-Type: application/octet-stream; name="DocumentRef.h.diff"
Content-Disposition: attachment; filename="DocumentRef.h.diff"
Content-Transfer-Encoding: base64

KioqIERvY3VtZW50UmVmLmgub2xkCUZyaSBKYW4gMjkgMTU6NTI6MzcgMTk5OQotLS0gRG9j dW1lbnRSZWYuaAlGcmkgSmFuIDI5IDIyOjI4OjA5IDE5OTkKKioqKioqKioqKioqKioqIGVu dW0gUmVmZXJlbmNlU3RhdGUKKioqIDE5LDI0ICoqKioKLS0tIDE5LDMzIC0tLS0KICAgICAg UmVmZXJlbmNlX25vaW5kZXgKICB9OwogIAorICNpZmRlZiBIQVZFX0xJQloKKyBlbnVtIEhl YWRTdGF0ZQorIHsKKyAgICAgRW1wdHksCisgICAgIENvbXByZXNzZWQsCisgICAgIFVuY29t cHJlc3NlZAorIH07CisgI2VuZGlmCisgCiAgY2xhc3MgRG9jdW1lbnRSZWYgOiBwdWJsaWMg T2JqZWN0CiAgewogICAgICBwdWJsaWM6CioqKioqKioqKioqKioqKiBjbGFzcyBEb2N1bWVu dFJlZiA6IHB1YmxpYyBPYmplY3QKKioqIDQyLDQ4ICoqKioKICAgICAgY2hhcgkJKkRvY1VS TCgpCQkJe3JldHVybiBkb2NVUkw7fQogICAgICB0aW1lX3QJCURvY1RpbWUoKQkJCXtyZXR1 cm4gZG9jVGltZTt9CiAgICAgIGNoYXIJCSpEb2NUaXRsZSgpCQkJe3JldHVybiBkb2NUaXRs ZTt9CiEgICAgIGNoYXIJCSpEb2NIZWFkKCkJCQl7cmV0dXJuIGRvY0hlYWQ7fQogICAgICBj aGFyICAgICAgICAgICAgICAgICpEb2NNZXRhRHNjKCkgICAgICAgICAgICAgICAgICAge3Jl dHVybiBkb2NNZXRhRHNjO30KICAgICAgdGltZV90CQlEb2NBY2Nlc3NlZCgpCQkJe3JldHVy biBkb2NBY2Nlc3NlZDt9CiAgICAgIGludAkJCURvY0xpbmtzKCkJCQl7cmV0dXJuIGRvY0xp bmtzO30KLS0tIDUxLDU3IC0tLS0KICAgICAgY2hhcgkJKkRvY1VSTCgpCQkJe3JldHVybiBk b2NVUkw7fQogICAgICB0aW1lX3QJCURvY1RpbWUoKQkJCXtyZXR1cm4gZG9jVGltZTt9CiAg ICAgIGNoYXIJCSpEb2NUaXRsZSgpCQkJe3JldHVybiBkb2NUaXRsZTt9CiEgICAgIGNoYXIJ CSpEb2NIZWFkKCk7CiAgICAgIGNoYXIgICAgICAgICAgICAgICAgKkRvY01ldGFEc2MoKSAg ICAgICAgICAgICAgICAgICB7cmV0dXJuIGRvY01ldGFEc2M7fQogICAgICB0aW1lX3QJCURv Y0FjY2Vzc2VkKCkJCQl7cmV0dXJuIGRvY0FjY2Vzc2VkO30KICAgICAgaW50CQkJRG9jTGlu a3MoKQkJCXtyZXR1cm4gZG9jTGlua3M7fQoqKioqKioqKioqKioqKiogY2xhc3MgRG9jdW1l bnRSZWYgOiBwdWJsaWMgT2JqZWN0CioqKiA2NCw3MCAqKioqCiAgICAgIHZvaWQJCURvY1VS TChjaGFyICp1KQkJCXtkb2NVUkwgPSB1O30KICAgICAgdm9pZAkJRG9jVGltZSh0aW1lX3Qg dCkJCXtkb2NUaW1lID0gdDt9CiAgICAgIHZvaWQJCURvY1RpdGxlKGNoYXIgKnQpCQl7ZG9j VGl0bGUgPSB0O30KISAgICAgdm9pZAkJRG9jSGVhZChjaGFyICpoKQkJe2RvY0hlYWQgPSBo O30KICAgICAgdm9pZCAgICAgICAgICAgICAgICBEb2NNZXRhRHNjKGNoYXIgKm1kKSAgICAg ICAgICAgIHtkb2NNZXRhRHNjID0gbWQ7fQogICAgICB2b2lkCQlEb2NBY2Nlc3NlZCh0aW1l 
X3QgdCkJCXtkb2NBY2Nlc3NlZCA9IHQ7fQogICAgICB2b2lkCQlEb2NMaW5rcyhpbnQgbCkJ CXtkb2NMaW5rcyA9IGw7fQotLS0gNzMsNzkgLS0tLQogICAgICB2b2lkCQlEb2NVUkwoY2hh ciAqdSkJCQl7ZG9jVVJMID0gdTt9CiAgICAgIHZvaWQJCURvY1RpbWUodGltZV90IHQpCQl7 ZG9jVGltZSA9IHQ7fQogICAgICB2b2lkCQlEb2NUaXRsZShjaGFyICp0KQkJe2RvY1RpdGxl ID0gdDt9CiEgICAgIHZvaWQJCURvY0hlYWQoY2hhciAqaCk7CiAgICAgIHZvaWQgICAgICAg ICAgICAgICAgRG9jTWV0YURzYyhjaGFyICptZCkgICAgICAgICAgICB7ZG9jTWV0YURzYyA9 IG1kO30KICAgICAgdm9pZAkJRG9jQWNjZXNzZWQodGltZV90IHQpCQl7ZG9jQWNjZXNzZWQg PSB0O30KICAgICAgdm9pZAkJRG9jTGlua3MoaW50IGwpCQl7ZG9jTGlua3MgPSBsO30KKioq KioqKioqKioqKioqIGNsYXNzIERvY3VtZW50UmVmIDogcHVibGljIE9iamVjdAoqKiogMTQ3 LDE1NSAqKioqCiAgICAgIGludAkJCWRvY1Njb3JlOwogICAgICAvLyBUaGlzIGlzIHRoZSBu ZWFyZXN0IGFuY2hvciBmb3IgdGhlIHNlYXJjaCB3b3JkLgogICAgICBpbnQJCQlkb2NBbmNo b3I7CiEgICAgIC8vIFN0YXRpYyBtZW1iZXIgdmFyaWFibGUgc28gd2UgZ2V0IG9ubHkgb25l IGNvcHkKISAgICAgLy8gVXNlZCB0byBidWZmZXIgemxpYiBjb21wcmVzc2lvbgohICAgICBz dGF0aWMgdW5zaWduZWQgY2hhciBjX2J1ZmZlcls2MDAwMF07CiAgfTsKICAKICAjZW5kaWYK LS0tIDE1NiwxNzAgLS0tLQogICAgICBpbnQJCQlkb2NTY29yZTsKICAgICAgLy8gVGhpcyBp cyB0aGUgbmVhcmVzdCBhbmNob3IgZm9yIHRoZSBzZWFyY2ggd29yZC4KICAgICAgaW50CQkJ ZG9jQW5jaG9yOwohICNpZmRlZiBIQVZFX0xJQloKISAgICAgLy8KISAgICAgLy8gQ29tcHJl c3Npb24gZnVuY3Rpb25zCiEgICAgIC8vCiEgICAgIC8vc3RhdGljIHVuc2lnbmVkIGNoYXIg Y19idWZmZXJbMzIwMDBdOwohICAgICBpbnQgQ29tcHJlc3MoU3RyaW5nJiBzKTsKISAgICAg aW50IERlY29tcHJlc3MoU3RyaW5nICZzKTsKISAgICAgSGVhZFN0YXRlIGRvY0hlYWRTdGF0 ZTsKISAjZW5kaWYKICB9OwogIAogICNlbmRpZgo= --------------93B73B272DFB9EDE5BAD0F92 Content-Type: application/octet-stream; name="DocumentRef.cc.diff" Content-Disposition: attachment; filename="DocumentRef.cc.diff" Content-Transfer-Encoding: base64

KioqIERvY3VtZW50UmVmLmNjLm9sZAlGcmkgSmFuIDI5IDE1OjUyOjI2IDE5OTkKLS0tIERv Y3VtZW50UmVmLmNjCUZyaSBKYW4gMjkgMjM6MDE6NDQgMTk5OQoqKioqKioqKioqKioqKioK KioqIDE4LDMwICoqKioKICAKICAjaWZkZWYgSEFWRV9MSUJaCiAgI2luY2x1ZGUgPHpsaWIu aD4KLSAjZW5kaWYKICAKICBleHRlcm4gQ29uZmlndXJhdGlvbiBjb25maWc7CiAgCiEgLy8g U3RhdGljIG1lbWJlciB2YXJpYWJsZSBzbyB3ZSBnZXQgb25seSBhIHNpbmdsZSBjb3B5CiEg Ly8gVXNlZCB0byBidWZmZXIgdGhlIHpsaWIgY29tcHJlc3Npb24KISBzdGF0aWMgdW5zaWdu ZWQgY2hhciBEb2N1bWVudFJlZjo6Y19idWZmZXJbNjAwMDBdOwogIAogIC8vKioqKioqKioq KioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioq KioqKioqKioqKioqKioKICAvLyBEb2N1bWVudFJlZjo6RG9jdW1lbnRSZWYoKQotLS0gMTgs MTI5IC0tLS0KICAKICAjaWZkZWYgSEFWRV9MSUJaCiAgI2luY2x1ZGUgPHpsaWIuaD4KICAK ICBleHRlcm4gQ29uZmlndXJhdGlvbiBjb25maWc7CiAgCiEgLy91bnNpZ25lZCBjaGFyIERv Y3VtZW50UmVmOjpjX2J1ZmZlclszMjAwMF07CiEgLy8KISAvLyBDb21wcmVzcyBGdW5jdGlv bgohIC8vCiEgaW50IERvY3VtZW50UmVmOjpDb21wcmVzcyhTdHJpbmcgJnMpIHsKISAgIHN0 YXRpYyBpbnQgY2Y9Y29uZmlnLlZhbHVlKCJjb21wcmVzc2lvbl9sZXZlbCIsMCk7ICAgIAoh ICAgaWYgKGNmKSB7CiEgICAgIC8vCiEgICAgIC8vIE5vdyBjb21wcmVzcyBzIGludG8gY19z CiEgICAgIC8vCiEgICAgIHVuc2lnbmVkIGNoYXIgY19idWZmZXJbMTYzODRdOwohICAgICBT dHJpbmcgY19zOwohICAgICB6X3N0cmVhbSBjX3N0cmVhbTsgLyogY29tcHJlc3Npb24gc3Ry ZWFtICovCiEgICAgIGNfc3RyZWFtLnphbGxvYz0oYWxsb2NfZnVuYykwOwohICAgICBjX3N0 cmVhbS56ZnJlZT0oZnJlZV9mdW5jKTA7CiEgICAgIGNfc3RyZWFtLm9wYXF1ZT0odm9pZHBm KTA7CiEgICAgIC8vIEdldCBjb21wcmVzc2lvbiBmYWN0b3IsIGRlZmF1bHQgdG8gYmVzdAoh ICAgICBpZiAoY2Y8LTEpIGNmPS0xOyBlbHNlIGlmIChjZj45KSBjZj05OwohICAgICBpbnQg ZXJyPWRlZmxhdGVJbml0KCZjX3N0cmVhbSxjZik7CiEgICAgIGlmIChlcnIhPVpfT0spIHJl dHVybiAwOwohICAgICBpbnQgbGVuPXMubGVuZ3RoKCk7CiEgICAgIGNfc3RyZWFtLm5leHRf aW49KEJ5dGVmKikoY2hhciAqKXM7CiEgICAgIGNfc3RyZWFtLmF2YWlsX2luPWxlbjsKISAg ICAgd2hpbGUgKGVycj09Wl9PSyAmJiBjX3N0cmVhbS50b3RhbF9pbiE9KHVMb25nKWxlbikg ewohICAgICAgIGNfc3RyZWFtLm5leHRfb3V0PWNfYnVmZmVyOwohICAgICAgIGNfc3RyZWFt LmF2YWlsX291dD1zaXplb2YoY19idWZmZXIpOwohICAgICAgIGVycj1kZWZsYXRlKCZjX3N0 
cmVhbSxaX05PX0ZMVVNIKTsKISAgICAgICBjX3MuYXBwZW5kKChjaGFyICopY19idWZmZXIs Y19zdHJlYW0ubmV4dF9vdXQtY19idWZmZXIpOwohICAgICB9CiEgICAgIC8vIEZpbmlzaCB0 aGUgc3RyZWFtCiEgICAgIGZvciAoOzspIHsKISAgICAgICBjX3N0cmVhbS5uZXh0X291dD1j X2J1ZmZlcjsKISAgICAgICBjX3N0cmVhbS5hdmFpbF9vdXQ9c2l6ZW9mKGNfYnVmZmVyKTsK ISAgICAgICBlcnI9ZGVmbGF0ZSgmY19zdHJlYW0sWl9GSU5JU0gpOwohICAgICAgIGNfcy5h cHBlbmQoKGNoYXIgKiljX2J1ZmZlcixjX3N0cmVhbS5uZXh0X291dC1jX2J1ZmZlcik7CiEg ICAgICAgaWYgKGVycj09Wl9TVFJFQU1fRU5EKSBicmVhazsKISAgICAgICAvL0NIRUNLX0VS UihlcnIsICJkZWZsYXRlIik7CiEgICAgIH0KISAgICAgZXJyPWRlZmxhdGVFbmQoJmNfc3Ry ZWFtKTsgCiEgICAgIHM9Y19zOwohICAgfQohICAgcmV0dXJuIDE7CiEgfQohIAohIC8vCiEg Ly8gRGVjb21wcmVzcyByb3V0aW5lIHJldHVybnMgMCBpZiBkZWNvbXByZXNzZWQgMSBpZiBj b21wcmVzc2VkCiEgLy8KISBpbnQgRG9jdW1lbnRSZWY6OkRlY29tcHJlc3MoU3RyaW5nICZz KSB7CiEgICBzdGF0aWMgaW50IGNmPWNvbmZpZy5WYWx1ZSgiY29tcHJlc3Npb25fbGV2ZWwi LDApOyAgICAKISAgIGlmIChjZikgewohICAgICBTdHJpbmcgY19zOwohICAgICAvLyBEZWNv bXByZXNzIHN0cmVhbQohICAgICB1bnNpZ25lZCBjaGFyIGNfYnVmZmVyWzE2Mzg0XTsKISAg ICAgel9zdHJlYW0gZF9zdHJlYW07CiEgICAgIGRfc3RyZWFtLnphbGxvYz0oYWxsb2NfZnVu YykwOwohICAgICBkX3N0cmVhbS56ZnJlZT0oZnJlZV9mdW5jKTA7CiEgICAgIGRfc3RyZWFt Lm9wYXF1ZT0odm9pZHBmKTA7CiEgCiEgICAgIGludCBsZW49cy5sZW5ndGgoKTsKISAgICAg ZF9zdHJlYW0ubmV4dF9pbj0oQnl0ZWYqKShjaGFyICopczsKISAgICAgZF9zdHJlYW0uYXZh aWxfaW49bGVuOwohIAohICAgICBpbnQgZXJyPWluZmxhdGVJbml0KCZkX3N0cmVhbSk7CiEg ICAgIGlmIChlcnIhPVpfT0spIHJldHVybiAxOwohIAohICAgICB3aGlsZSAoZXJyPT1aX09L ICYmIGRfc3RyZWFtLnRvdGFsX2luPGxlbikgewohICAgICAgIGRfc3RyZWFtLm5leHRfb3V0 PWNfYnVmZmVyOwohICAgICAgIGRfc3RyZWFtLmF2YWlsX291dD1zaXplb2YoY19idWZmZXIp OwohICAgICAgIGVycj1pbmZsYXRlKCZkX3N0cmVhbSxaX05PX0ZMVVNIKTsKISAgICAgICBj X3MuYXBwZW5kKChjaGFyICopY19idWZmZXIsZF9zdHJlYW0ubmV4dF9vdXQtY19idWZmZXIp OwohICAgICAgIGlmIChlcnI9PVpfU1RSRUFNX0VORCkgYnJlYWs7CiEgICAgIH0KISAKISAg ICAgZXJyPWluZmxhdGVFbmQoJmRfc3RyZWFtKTsKISAgICAgcz1jX3M7CiEgICB9CiEgICBy ZXR1cm4gMDsKISB9CiEgCiEgY2hhciAqRG9jdW1lbnRSZWY6OkRvY0hlYWQoKSB7CiEgICBp 
ZiAoZG9jSGVhZFN0YXRlPT1Db21wcmVzc2VkKSB7CiEgICAgIERlY29tcHJlc3MoZG9jSGVh ZCk7CiEgICAgIGRvY0hlYWRTdGF0ZT1VbmNvbXByZXNzZWQ7CiEgICB9CiEgICByZXR1cm4g ZG9jSGVhZDsKISB9CiEgCiEgdm9pZCBEb2N1bWVudFJlZjo6RG9jSGVhZChjaGFyICpoKSB7 CiEgICBkb2NIZWFkPWg7CiEgICBkb2NIZWFkU3RhdGU9ZG9jSGVhZC5sZW5ndGgoKT09MD9F bXB0eTpVbmNvbXByZXNzZWQ7CiEgfQohIAohICNlbHNlCiEgZXh0ZXJuIENvbmZpZ3VyYXRp b24gY29uZmlnOwohIAohIGlubGluZSBjaGFyICpEb2N1bWVudFJlZjo6RG9jSGVhZCgpIHsK ISAgIHJldHVybiBkb2NIZWFkOwohIH0KISAKISBpbmxpbmUgdm9pZCBEb2N1bWVudFJlZjo6 RG9jSGVhZChjaGFyICpoKSB7CiEgICBkb2NIZWFkPWg7CiEgfQohICNlbmRpZgogIAogIC8v KioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioq KioqKioqKioqKioqKioqKioqKioqKioKICAvLyBEb2N1bWVudFJlZjo6RG9jdW1lbnRSZWYo KQoqKioqKioqKioqKioqKiogdm9pZCBEb2N1bWVudFJlZjo6Q2xlYXIoKQoqKiogNjMsNjgg KioqKgotLS0gMTYyLDE3MCAtLS0tCiAgICAgIGRvY0FuY2hvcnMuRGVzdHJveSgpOwogICAg ICBkb2NIb3BDb3VudCA9IC0xOwogICAgICBkb2NCYWNrTGlua3MgPSAwOworICNpZmRlZiBI QVZFX0xJQloKKyAgICAgZG9jSGVhZFN0YXRlPUVtcHR5OworICNlbmRpZgogIH0KICAKICAK KioqKioqKioqKioqKioqIHZvaWQgRG9jdW1lbnRSZWY6OlNlcmlhbGl6ZShTdHJpbmcgJnMp CioqKiAxMDQsMTA5ICoqKioKLS0tIDIwNiwyMTcgLS0tLQogICAgICBpbnQJCWxlbmd0aDsK ICAgICAgU3RyaW5nCSpzdHI7CiAgCisgI2lmZGVmIEhBVkVfTElCWgorICAgICBpZiAoZG9j SGVhZFN0YXRlPT1VbmNvbXByZXNzZWQpIHsKKyAgICAgICBDb21wcmVzcyhkb2NIZWFkKTsK KyAgICAgICBkb2NIZWFkU3RhdGU9Q29tcHJlc3NlZDsKKyAgICAgfQorICNlbmRpZgogIC8v CiAgLy8gVGhlIGZvbGxvd2luZyBtYWNyb3MgbWFrZSB0aGUgc2VyaWFsaXphdGlvbiBwcm9j ZXNzIGEgbGl0dGxlIGVhc2llcgogIC8vIHRvIGZvbGxvdy4gIE5vdGUgdGhhdCBpZiBhbiBv YmplY3QgdG8gYmUgc2VyaWFsaXplZCBoYXMgdGhlIGRlZmF1bHQKKioqKioqKioqKioqKioq IHZvaWQgRG9jdW1lbnRSZWY6OlNlcmlhbGl6ZShTdHJpbmcgJnMpCioqKiAyMzksMjgxICoq KioKICAgICAgYWRkc3RyaW5nKERPQ19FTUFJTCwgcywgZG9jRW1haWwpOwogICAgICBhZGRz dHJpbmcoRE9DX05PVElGSUNBVElPTiwgcywgZG9jTm90aWZpY2F0aW9uKTsKICAgICAgYWRk c3RyaW5nKERPQ19TVUJKRUNULCBzLCBkb2NTdWJqZWN0KTsKLSAjaWZkZWYgSEFWRV9MSUJa Ci0gICAgIHN0YXRpYyBpbnQgY2Y9Y29uZmlnLlZhbHVlKCJjb21wcmVzc2lvbl9sZXZlbCIs 
MCk7ICAgIAotICAgICBpZiAoY2YpIHsKLSAgICAgICAvLwotICAgICAgIC8vIE5vdyBjb21w cmVzcyBzIGludG8gY19zCi0gICAgICAgLy8KLSAgICAgICBTdHJpbmcgY19zOwotICAgICAg IHpfc3RyZWFtIGNfc3RyZWFtOyAvKiBjb21wcmVzc2lvbiBzdHJlYW0gKi8KLSAgICAgICBj X3N0cmVhbS56YWxsb2M9KGFsbG9jX2Z1bmMpMDsKLSAgICAgICBjX3N0cmVhbS56ZnJlZT0o ZnJlZV9mdW5jKTA7Ci0gICAgICAgY19zdHJlYW0ub3BhcXVlPSh2b2lkcGYpMDsKLSAgICAg ICAvLyBHZXQgY29tcHJlc3Npb24gZmFjdG9yLCBkZWZhdWx0IHRvIGJlc3QKLSAgICAgICBp ZiAoY2Y8LTEpIGNmPS0xOyBlbHNlIGlmIChjZj45KSBjZj05OwotICAgICAgIGludCBlcnI9 ZGVmbGF0ZUluaXQoJmNfc3RyZWFtLGNmKTsKLSAgICAgICBpZiAoZXJyIT1aX09LKSByZXR1 cm47Ci0gICAgICAgaW50IGxlbj1zLmxlbmd0aCgpOwotICAgICAgIGNfc3RyZWFtLm5leHRf aW49KEJ5dGVmKikoY2hhciAqKXM7Ci0gICAgICAgY19zdHJlYW0uYXZhaWxfaW49bGVuOwot ICAgICAgIHdoaWxlIChlcnI9PVpfT0sgJiYgY19zdHJlYW0udG90YWxfaW4hPSh1TG9uZyls ZW4pIHsKLSAgICAgICAgIGNfc3RyZWFtLm5leHRfb3V0PWNfYnVmZmVyOwotICAgICAgICAg Y19zdHJlYW0uYXZhaWxfb3V0PXNpemVvZihjX2J1ZmZlcik7Ci0gICAgICAgICBlcnI9ZGVm bGF0ZSgmY19zdHJlYW0sWl9OT19GTFVTSCk7Ci0gICAgICAgICBjX3MuYXBwZW5kKChjaGFy ICopY19idWZmZXIsY19zdHJlYW0ubmV4dF9vdXQtY19idWZmZXIpOwotICAgICAgIH0KLSAg ICAgICAvLyBGaW5pc2ggdGhlIHN0cmVhbQotICAgICAgIGZvciAoOzspIHsKLSAgICAgICAg IGNfc3RyZWFtLm5leHRfb3V0PWNfYnVmZmVyOwotICAgICAgICAgY19zdHJlYW0uYXZhaWxf b3V0PXNpemVvZihjX2J1ZmZlcik7Ci0gICAgICAgICBlcnI9ZGVmbGF0ZSgmY19zdHJlYW0s Wl9GSU5JU0gpOwotICAgICAgICAgY19zLmFwcGVuZCgoY2hhciAqKWNfYnVmZmVyLGNfc3Ry ZWFtLm5leHRfb3V0LWNfYnVmZmVyKTsKLSAgICAgICAgIGlmIChlcnI9PVpfU1RSRUFNX0VO RCkgYnJlYWs7Ci0gICAgICAgICAvL0NIRUNLX0VSUihlcnIsICJkZWZsYXRlIik7Ci0gICAg ICAgfQotICAgICAgIGVyciA9IGRlZmxhdGVFbmQoJmNfc3RyZWFtKTsgCi0gICAgICAgcz1j X3M7Ci0gICAgIH0KLSAjZW5kaWYKICB9CiAgCiAgCi0tLSAzNDcsMzUyIC0tLS0KKioqKioq KioqKioqKioqIHZvaWQgRG9jdW1lbnRSZWY6OlNlcmlhbGl6ZShTdHJpbmcgJnMpCioqKiAy ODgsMzMzICoqKioKICB2b2lkIERvY3VtZW50UmVmOjpEZXNlcmlhbGl6ZShTdHJpbmcgJnN0 cmVhbSkKICB7CiAgICAgIENsZWFyKCk7Ci0gI2lmZGVmIEhBVkVfTElCWgotICAgICBjaGFy CSpzOwotICAgICBjaGFyCSplbmQ7Ci0gICAgIFN0cmluZyBjX3M7Ci0gICAgIHN0YXRpYyBp 
bnQgY2Y9Y29uZmlnLlZhbHVlKCJjb21wcmVzc2lvbl9sZXZlbCIsMCk7ICAgIAotICAgICBp ZiAoY2YpIHsKLSAgICAgICAvLyBEZWNvbXByZXNzIHN0cmVhbQotICAgICAgIHpfc3RyZWFt IGRfc3RyZWFtOyAvKiBkZWNvbXByZXNzaW9uIHN0cmVhbSAqLwotIAotICAgICAgIGRfc3Ry ZWFtLnphbGxvYyA9IChhbGxvY19mdW5jKTA7Ci0gICAgICAgZF9zdHJlYW0uemZyZWUgPSAo ZnJlZV9mdW5jKTA7Ci0gICAgICAgZF9zdHJlYW0ub3BhcXVlID0gKHZvaWRwZikwOwotIAot ICAgICAgIGRfc3RyZWFtLm5leHRfaW4gID0gKEJ5dGVmKikoY2hhciAqKXN0cmVhbTsKLSAg ICAgICBkX3N0cmVhbS5hdmFpbF9pbiA9IDA7Ci0gCi0gICAgICAgaW50IGVyciA9IGluZmxh dGVJbml0KCZkX3N0cmVhbSk7Ci0gICAgICAgaWYgKGVyciE9Wl9PSykgcmV0dXJuOwotIAot ICAgICAgIGludCBsZW49c3RyZWFtLmxlbmd0aCgpOwotICAgICAgIGRfc3RyZWFtLmF2YWls X2luPWxlbjsKLSAgICAgICB3aGlsZSAoZXJyPT1aX09LICYmIGRfc3RyZWFtLnRvdGFsX2lu PGxlbikgewotICAgICAgICAgZF9zdHJlYW0ubmV4dF9vdXQ9Y19idWZmZXI7Ci0gICAgICAg ICBkX3N0cmVhbS5hdmFpbF9vdXQ9c2l6ZW9mKGNfYnVmZmVyKTsKLSAgICAgICAgIGVycj1p bmZsYXRlKCZkX3N0cmVhbSxaX05PX0ZMVVNIKTsKLSAgICAgICAgIGNfcy5hcHBlbmQoKGNo YXIgKiljX2J1ZmZlcixkX3N0cmVhbS5uZXh0X291dC1jX2J1ZmZlcik7Ci0gICAgICAgICBp ZiAoZXJyID09IFpfU1RSRUFNX0VORCkgYnJlYWs7Ci0gICAgICAgfQotIAotICAgICAgIGVy ciA9IGluZmxhdGVFbmQoJmRfc3RyZWFtKTsKLSAgICAgICBzID0gY19zLmdldCgpOwotICAg ICAgIGVuZCA9IHMgKyBjX3MubGVuZ3RoKCk7Ci0gICAgIH0gZWxzZSB7Ci0gICAgICAgcyA9 IHN0cmVhbS5nZXQoKTsKLSAgICAgICBlbmQgPSBzICsgc3RyZWFtLmxlbmd0aCgpOwotICAg ICB9Ci0gI2Vsc2UKICAgICAgY2hhcgkqcyA9IHN0cmVhbS5nZXQoKTsKICAgICAgY2hhcgkq ZW5kID0gcyArIHN0cmVhbS5sZW5ndGgoKTsKLSAjZW5kaWYKICAgICAgaW50CQlsZW5ndGg7 CiAgICAgIGludAkJY291bnQ7CiAgICAgIGludAkJaTsKLS0tIDM1OSwzNjYgLS0tLQoqKioq KioqKioqKioqKiogdm9pZCBEb2N1bWVudFJlZjo6RGVzZXJpYWxpemUoU3RyaW5nICZzdAoq KiogNDYwLDQ2NSAqKioqCi0tLSA0OTMsNTAxIC0tLS0KICAJICAgIGJyZWFrOwogICAgICAg ICAgY2FzZSBET0NfSEVBRDoKICAgICAgICAgICAgICBnZXRzdHJpbmcoeCwgcywgZG9jSGVh ZCk7CisgI2lmZGVmIEhBVkVfTElCWgorICAgICAgICAgICAgIGRvY0hlYWRTdGF0ZT1kb2NI ZWFkLmxlbmd0aCgpPT0wP0VtcHR5OkNvbXByZXNzZWQ7CisgI2VuZGlmCiAgICAgICAgICAg ICAgYnJlYWs7CiAgCWNhc2UgRE9DX01FVEFEU0M6CiAgCSAgICBnZXRzdHJpbmcoeCwgcywg ZG9jTWV0YURzYyk7Cg== 
--------------93B73B272DFB9EDE5BAD0F92--

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:18 1999
Return-Path: <hans-peter.nilsson@axis.com>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id LAA19800
	for <andrew@contigo.com>; Sat, 30 Jan 1999 11:41:53 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id LAA05661;
	Sat, 30 Jan 1999 11:40:46 -0800 (PST)
From: Hans-Peter Nilsson <hans-peter.nilsson@axis.com>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B3607A.BeroList-2.5.9@sob.htdig.org>
Date: Sat, 30 Jan 1999 20:39:57 +0100
Subject: [htdig3-dev] Patch for halving the size of db.words.db (not
 necessarily for release purposes...)

I intended to sit on this patch until after the release, but I don't see the point in keeping this to myself until then, at least for purposes of review. And just because I post it here and now does not make it necessary to put it into the release.

Using straightforward approaches, it makes db.words.db slightly less than half the size (compared to a previous build that just defined NO_WORD_COUNT). By the way, it probably matters that my example database has fewer than 64K documents - you may see smaller savings, down to (wildly guessing) maybe 20%, with significantly larger databases and the default #undef of NO_WORD_COUNT. I did not test this with #undef NO_WORD_COUNT, but will do so ASAP.
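To illustrate why small values (such as document IDs below 64K) make the records shrink, here is a minimal sketch of width-tagged integer packing. It is not the htPack() from the patch, whose encoding and "i4"/"i5" format strings are not shown in this message; the names packInts and unpackInts, and the 2-bit width codes, are assumptions made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Bytes stored for each 2-bit width code (assumed scheme, not HtPack's).
static const int kWidths[4] = {0, 1, 2, 4};

// Hypothetical helper: packs up to four 32-bit values behind one header
// byte that holds a 2-bit width code per value.
std::string packInts(const std::vector<std::uint32_t>& values)
{
    assert(values.size() <= 4);                 // one header byte covers four fields
    std::string out(1, '\0');                   // header placeholder, filled in below
    unsigned char header = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        std::uint32_t v = values[i];
        int code = v == 0 ? 0 : v < 0x100 ? 1 : v < 0x10000 ? 2 : 3;
        header |= static_cast<unsigned char>(code << (2 * i));
        for (int b = 0; b < kWidths[code]; ++b) // little-endian payload bytes
            out.push_back(static_cast<char>((v >> (8 * b)) & 0xFF));
    }
    out[0] = static_cast<char>(header);
    return out;
}

// Hypothetical inverse: reads `count` values back. Assumes well-formed
// input produced by packInts(); no bounds checking is done.
std::vector<std::uint32_t> unpackInts(const std::string& in, std::size_t count)
{
    std::vector<std::uint32_t> values;
    unsigned char header = static_cast<unsigned char>(in[0]);
    std::size_t pos = 1;
    for (std::size_t i = 0; i < count && i < 4; ++i) {
        int code = (header >> (2 * i)) & 3;
        std::uint32_t v = 0;
        for (int b = 0; b < kWidths[code]; ++b)
            v |= static_cast<std::uint32_t>(
                     static_cast<unsigned char>(in[pos++])) << (8 * b);
        values.push_back(v);
    }
    return values;
}
```

With this scheme, packInts({12, 3, 40000, 0}) takes 5 bytes instead of the 16 bytes of four raw 32-bit ints, while four values of 64K or more take 17 bytes, one more than the raw form. That is the kind of effect that would make the savings depend on how large the stored numbers get.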

There is also a backward-compatibility problem; you cannot drop in a new htsearch with an old db.words.db without doing a htmerge using the new htmerge. On the other hand, other recent changes such as not lowercasing the urls in the database also lose backward compatibility, so that's moot by now. (I'm not saying that change wasn't the Right Thing; I think it was).

If Geoff so pleases, I can commit this. Or he can just sneeze at this clumsy attempt to avert priorities for the release. ;-) Or whatever.

Sat Jan 30 16:40:38 1999 Hans-Peter Nilsson <hp@axis.se>

	* htmerge/words.cc (mergeWords): Pack WordRecords in db.
	* htsearch/parser.cc (perform_push): Unpack WordRecords from db.

	* htlib/HtPack.cc: New file.
	* htlib/HtPack.h: New file.
	* htlib/Makefile.in (OBJS): Add corresponding *.o files.

Index: htlib/Makefile.in
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htlib/Makefile.in,v
retrieving revision 1.12
diff -p -c -c -3 -p -b -r1.12 Makefile.in
*** htlib/Makefile.in 1999/01/21 13:42:32 1.12
--- htlib/Makefile.in 1999/01/30 19:05:33
*************** OBJS= Configuration.o Connection.o Datab
*** 15,21 ****
  	URL.o URLTrans.o cgi.o \
  	good_strtok.o io.o strcasecmp.o \
  	strptime.o mytimegm.o HtCodec.o HtWordCodec.o \
! 	HtURLCodec.o
  TARGET= libht.a
--- 15,21 ----
  	URL.o URLTrans.o cgi.o \
  	good_strtok.o io.o strcasecmp.o \
  	strptime.o mytimegm.o HtCodec.o HtWordCodec.o \
! 	HtURLCodec.o HtPack.o
  TARGET= libht.a
Index: htmerge/words.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htmerge/words.cc,v
retrieving revision 1.10
diff -p -c -c -3 -p -b -r1.10 words.cc
*** htmerge/words.cc 1999/01/25 04:55:54 1.10
--- htmerge/words.cc 1999/01/30 19:06:16
*************** static char RCSid[] = "$Id: words.cc,v 1
*** 43,48 ****
--- 43,49 ----
  #endif
  #include "htmerge.h"
+ #include "HtPack.h"
  //*****************************************************************************
*************** mergeWords(char *wordtmp, char *wordfile
*** 63,68 ****
--- 64,70 ----
      int word_count = 0;
      WordRecord wr, last_wr;
      String last_word;
+     String compressed_data;
      //
      // Check for file access errors
*************** mergeWords(char *wordtmp, char *wordfile
*** 239,251 ****
      // going to use (shorts and ints)
      //
      if (currentWord.length() == 0)
      {
          //
          // First word. Special case.
          //
          out = 0;
!         out.append((char *) &last_wr, sizeof(last_wr));
          currentWord = last_word;
      }
      else if (strcmp(last_word, currentWord) == 0)
--- 241,261 ----
      // going to use (shorts and ints)
      //
+     // Or rather, a compressed form thereof.
+     compressed_data = htPack(
+ #ifdef NO_WORD_COUNT
+         "i4"
+ #else
+         "i5"
+ #endif
+         , (char *) &last_wr);
      if (currentWord.length() == 0)
      {
          //
          // First word. Special case.
          //
          out = 0;
!         out.append(compressed_data);
          currentWord = last_word;
      }
      else if (strcmp(last_word, currentWord) == 0)
*************** mergeWords(char *wordtmp, char *wordfile
*** 253,259 ****
          //
          // Add to current record
          //
!         out.append((char *) &last_wr, sizeof(last_wr));
      }
      else
      {
--- 263,269 ----
          //
          // Add to current record
          //
!         out.append(compressed_data);
      }
      else
      {
*************** mergeWords(char *wordtmp, char *wordfile
*** 265,271 ****
          currentWord = last_word;
          out = 0;
!         out.append((char *) &last_wr, sizeof(last_wr));
          word_count++;
          if (verbose && word_count == 1)
          {
--- 275,281 ----
          currentWord = last_word;
          out = 0;
!         out.append(compressed_data);
          word_count++;
          if (verbose && word_count == 1)
          {
*************** mergeWords(char *wordtmp, char *wordfile
*** 315,327 ****
      }
      putc('\n', wordlist);
      if (currentWord.length() == 0)
      {
          //
          // First word. Special case.
          //
          out = 0;
!         out.append((char *) &last_wr, sizeof(last_wr));
          currentWord = last_word;
      }
      else if (strcmp(last_word, currentWord) == 0)
--- 325,344 ----
      }
      putc('\n', wordlist);
+     compressed_data = htPack(
+ #ifdef NO_WORD_COUNT
+         "i4"
+ #else
+         "i5"
+ #endif
+         , (char *) &last_wr);
      if (currentWord.length() == 0)
      {
          //
          // First word. Special case.
          //
          out = 0;
!         out.append(compressed_data);
          currentWord = last_word;
      }
      else if (strcmp(last_word, currentWord) == 0)
*************** mergeWords(char *wordtmp, char *wordfile
*** 329,335 ****
          //
          // Add to current record
          //
!         out.append((char *) &last_wr, sizeof(last_wr));
      }
      else
      {
--- 346,352 ----
          //
          // Add to current record
          //
!         out.append(compressed_data);
      }
      else
      {
*************** mergeWords(char *wordtmp, char *wordfile
*** 341,347 ****
          currentWord = last_word;
          out = 0;
!         out.append((char *) &last_wr, sizeof(last_wr));
          word_count++;
          if (verbose && word_count == 1)
          {
--- 358,364 ----
          currentWord = last_word;
          out = 0;
!         out.append(compressed_data);
          word_count++;
          if (verbose && word_count == 1)
          {
Index: htsearch/parser.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htsearch/parser.cc,v
retrieving revision 1.6
diff -p -c -c -3 -p -b -r1.6 parser.cc
*** htsearch/parser.cc 1998/12/06 18:45:10 1.6
--- htsearch/parser.cc 1999/01/30 19:07:06
*************** static char RCSid[] = "$Id: parser.cc,v
*** 32,37 ****
--- 32,38 ----
  #endif
  #include "parser.h"
+ #include "HtPack.h"
  #define WORD 1000
  #define DONE 1001
*************** Parser::perform_push()
*** 192,197 ****
--- 193,199 ----
  {
      String temp = current->word.get();
      String data;
+     String decompressed;
      char *p;
      ResultList *list = new ResultList;
      WordRecord wr;
*************** Parser::perform_push()
*** 213,223 ****
      if (dbf->Get(p, data) == OK)
      {
          p = data.get();
!         for (unsigned int i = 0; i < data.length() / sizeof(WordRecord); i++)
          {
!             p = data.get() + i * sizeof(WordRecord);
!             memcpy((char *) &wr, p, sizeof(WordRecord));
              //
              // ******* Compute the score for the document
              //
--- 215,230 ----
      if (dbf->Get(p, data) == OK)
      {
          p = data.get();
!         char *p_end = p + data.length();
!         while (p < p_end)
          {
!             decompressed = htUnpack(
! #ifdef NO_WORD_COUNT
!                 "i4"
! #else
!                 "i5"
! #endif
!                 , p);
              //
              // ******* Compute the score for the document

brgds, H-P

-- 
Hans-Peter Nilsson, Axis Communications AB, S - 223 70 LUND, SWEDEN
Hans-Peter.Nilsson@axis.se | Tel +46 462701867,2701800
Fax +46 46136130 | RFC 1855 compliance implemented; report loss of brain.
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:18 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id SAA30715
	for <andrew@contigo.com>; Sat, 30 Jan 1999 18:06:46 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id RAA06436;
	Sat, 30 Jan 1999 17:54:11 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B3B967.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B3607A.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 30 Jan 1999 20:51:59 -0400
Subject: Re: [htdig3-dev] Patch for halving the size of db.words.db (not
 necessarily for release purposes...)

>If Geoff so pleases, I can commit this. Or he can just sneeze
>at this clumsy attempt to avert priorities for the release. ;-)
>Or whatever.

Fortunately I seem to be getting over my cold. ;-) Since I think we've stomped out all or almost all of the bugs, I'm not too thrilled about another major change. I was holding out hope for a release within the next week.

However... I'm always open to opinions on anything. If there's an overwhelming consensus that this should go in, great.

My vote: Geoff -1 (Looks good, but a bit risky so soon before release)

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:18 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id SAA31097
	for <andrew@contigo.com>; Sat, 30 Jan 1999 18:22:54 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id SAA06525;
	Sat, 30 Jan 1999 18:21:30 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B3BE48.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B293DC.BeroList-2.5.9@sob.htdig.org>
References: <36B20B05.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 30 Jan 1999 21:11:18 -0400
Subject: Re: [htdig3-dev] Moving towards release

>Here is my first attempt...It seems to work with a minor increase in the
>document db size.

I don't see any obvious problems. I'm building it right now and I'll test it overnight to see how it goes. If someone could get some gprof results, I'd be interested, but I think this should speed things up a bit.

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:18 1999
Return-Path: <ghutchis@wso.williams.edu>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id KAA03145
	for <andrew@contigo.com>; Sun, 31 Jan 1999 10:37:23 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id UAA07065;
	Sat, 30 Jan 1999 20:21:49 -0800 (PST)
From: Geoff Hutchison <ghutchis@wso.williams.edu>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B3DBA7.BeroList-2.5.9@sob.htdig.org>
In-Reply-To: <36B1D9E8.BeroList-2.5.9@sob.htdig.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 30 Jan 1999 22:34:37 -0400
Subject: Re: [htdig3-dev] htsearch as a variable?

>is it reasonable to add a config variable to
>describe the name of the "htsearch" program

OK, I took a look at the requirements for putting this into the Makefiles and configure scripts. I just added it, but I haven't had a chance to try it out.

Basically, the configure script now supports:

  configure --program-prefix=PREFIX
  configure --program-suffix=SUFFIX
  configure --program-transform=REGEX

Cool, huh?

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.
From - Thu Feb  4 22:09:18 1999
Return-Path: <jjah@cloud.ccsf.cc.ca.us>
Received: from sob.htdig.org (htdig.org [209.75.193.22])
	by rodan.contigo.com (8.9.2/8.8.8/Debian/GNU) with ESMTP id SAA31813
	for <andrew@contigo.com>; Sat, 30 Jan 1999 18:48:20 -0800 (PST)
Received: from sob.htdig.org (localhost [127.0.0.1])
	by sob.htdig.org (8.9.2/8.9.1/Debian/GNU) with SMTP id SAA06696;
	Sat, 30 Jan 1999 18:45:54 -0800 (PST)
From: "Joe R. Jah" <jjah@cloud.ccsf.cc.ca.us>
Reply-To: htdig3-dev@htdig.org
Errors-To: htdig3-dev@htdig.org
To: htdig3-dev@htdig.org
Message-ID: <36B3C401.BeroList-2.5.9@sob.htdig.org>
Date: Sat, 30 Jan 1999 18:44:56 -0800 (PST)
In-Reply-To: <36B3B967.BeroList-2.5.9@sob.htdig.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Re: [htdig3-dev] Patch for halving the size of db.words.db (not

On Sat, 30 Jan 1999, Geoff Hutchison wrote:

> Date: Sat, 30 Jan 1999 20:51:59 -0400
> From: Geoff Hutchison <ghutchis@wso.williams.edu>
> Reply-To: htdig3-dev@htdig.org
> To: htdig3-dev@htdig.org
> Subject: Re: [htdig3-dev] Patch for halving the size of db.words.db (not
>  necessarily for release purposes...)
>
> However... I'm always open to opinions on anything. If there's an
> overwhelming consensus that this should go in, great.
>
> My vote: Geoff -1 (Looks good, but a bit risky so soon before release)

Joe +1

_/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jjah@cloud.ccsf.cc.ca.us

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev@htdig.org containing the single word "unsubscribe" in
the SUBJECT of the message.



This archive was generated by hypermail 2.0b3 on Thu Feb 04 1999 - 22:13:09 PST