[htdig] honking big patch file collection for 3.1.2


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Fri, 6 Aug 1999 17:23:00 -0500 (CDT)


Hi, folks. Over the past week, I've put together a big collection of
patch files for htdig-3.1.2, to fix many of the bugs that have been
reported over the past three and a half months, since the last release.

Some of these were contributed by others. Many were backported from the
3.2 development code, and several were put together by me in the past
week. Next week, I'll make sure that any of these that haven't made it
into 3.2 yet will. In the meantime, I'd appreciate any feedback from
all of you as to whether these patches really do fix the problems they
claim to, or if they introduce other problems. Each patch is preceded
by a brief description, so you can pick them out and apply them one by
one if you want, but I had no problem applying the whole collection at
once with "patch -p1" on my Red Hat Linux box.

Here's a summary of the changes:
    - PR#339 fixed - URL encodes all non-ASCII characters in URIs
    - PR#560 fixed - prevent inappropriate suffix stripping in endings fuzzy
    - PR#542 fixed - URL passed to external parser now quoted
    - PR#541 fixed - ANCHOR variable now set properly
    - PR#535 & PR#557 fixed - HTTP header parsing now more robust
    - username/password now blotted out from command arguments
    - adds support for <embed>, <object> and <link> tags
    - PR#554 fixed - locale now affects default date format in htsearch
    - fixes the bug in the handling of modification_time_is_now
    - PR#578 fixed - multiple directives in <meta> robots tag now work
    - now gives an error message for unknown hosts
    - empty or null strings won't cause htfuzzy to core dump
    - PDF parser now clears title string properly when done with it
    - PR#543 & PR#585 fixed - names like left_index.html no longer stripped
    - fixes server_alias entries so port defaults to 80 if omitted
    - decodes SGML entities inside tag attributes
    - PR#566 fixed - urls like 'http:/dir/file.ext' resolved properly
    - $(VAR) at end of template string now being expanded properly
    - PR#595 fixed - corrected address for FSF
    - maximum word length now a config attribute, not compile-time option
    - PR#81 & PR#472 fixed - htdig -vvv shouldn't crash in strftime()
    - PR#348 fixed - missing or invalid port number will get set correctly
    - PR#493 fixed - valid URL with ".." within a file name not rejected
    - PR#572 fixed - htsearch won't crash if CONTENT_LENGTH not set
    - PR#545 fixed - configure tests for presence of alloca.h for regex.c
    - documentation updates, including PR#558 & PR#626.

-------- 8< -------- snip -------- 8< --------
This patch should fix PR#545, to test for presence of alloca.h

--- htdig-3.1.2.bak/configure.in Wed Apr 21 21:47:53 1999
+++ htdig-3.1.2/configure.in Wed Aug 4 16:17:57 1999
@@ -13,7 +13,7 @@
 #
 # You should have received a copy of the GNU General Public License
 # along with this program; if not, write to the Free Software
-# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 #
 
 AC_INIT(htcommon/DocumentDB.cc)
@@ -79,7 +79,7 @@
 
 dnl More header checks--here use C++
 AC_LANG_CPLUSPLUS
-AC_CHECK_HEADERS(fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h)
+AC_CHECK_HEADERS(fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h alloca.h)
 AC_CHECK_HEADER(fstream.h,nofstream=0,nofstream=1)
 if test "x$nofstream" = "x1" ; then
 AC_MSG_ERROR([To compile ht://Dig, you will need a C++ library. Try installing libstdc++.])
--- htdig-3.1.2.bak/configure Wed Apr 21 21:47:53 1999
+++ htdig-3.1.2/configure Wed Aug 4 16:17:57 1999
@@ -2010,7 +2010,7 @@
 CXXCPP="$ac_cv_prog_CXXCPP"
 echo "$ac_t""$CXXCPP" 1>&6
 
-for ac_hdr in fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h
+for ac_hdr in fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h alloca.h
 do
 ac_safe=`echo "$ac_hdr" | sed 'y%./+-%__p_%'`
 echo $ac_n "checking for $ac_hdr""... $ac_c" 1>&6
--- htdig-3.1.2.bak/include/htconfig.h.in Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/include/htconfig.h.in Wed Aug 4 16:30:10 1999
@@ -55,6 +55,9 @@
 
 /* Define if you have the <zlib.h> header file. */
 #undef HAVE_ZLIB_H
+
+/* Define if you have the <alloca.h> header file. */
+#undef HAVE_ALLOCA_H
 
 /* Define if you have the <sys/file.h> header file. */
 #undef HAVE_SYS_FILE_H
--- htdig-3.1.2.bak/htlib/regex.c Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/regex.c Wed Aug 4 16:20:48 1999
@@ -27,6 +27,7 @@
 #undef _GNU_SOURCE
 #define _GNU_SOURCE
 
+#include <htconfig.h>
 #ifdef HAVE_CONFIG_H
 # include <config.h>
 #endif

This adds descriptions for attributes that were missing, adds a few
clarifications, and corrects a few defaults and typos. Covers PR#558,
PR#626, and then some.

--- htdig-3.1.2.bak/htdoc/attrs.html Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdoc/attrs.html Fri Aug 6 14:00:28 1999
@@ -413,6 +413,57 @@
         <hr>
         <dl>
           <dt>
+ <strong><a name="bin_dir">bin_dir</a></strong>
+ </dt>
+ <dd>
+ <dl>
+ <dt>
+ <em>type:</em>
+ </dt>
+ <dd>
+ string
+ </dd>
+ <dt>
+ <em>used by:</em>
+ </dt>
+ <dd>
+ <a href="htdig.html">htdig</a>,
+ <a href="htnotify.html">htnotify</a>,
+ <a href="htfuzzy.html">htfuzzy</a>,
+ <a href="htmerge.html">htmerge</a> and
+ <a href="htsearch.html" target="_top">htsearch</a>
+ </dd>
+ <dt>
+ <em>default:</em>
+ </dt>
+ <dd>
+ BIN_DIR
+ </dd>
+ <dt>
+ <em>description:</em>
+ </dt>
+ <dd>
+ This is the directory in which the executables
+ related to ht://Dig are installed. It is never used
+ directly by any of the programs, but other attributes
+ can be defined in terms of this one.
+ <p>
+ The default value of this attribute is determined at
+ compile time.
+ </p>
+ </dd>
+ <dt>
+ <em>example:</em>
+ </dt>
+ <dd>
+ bin_dir: /usr/local/bin
+ </dd>
+ </dl>
+ </dd>
+ </dl>
+ <hr>
+ <dl>
+ <dt>
                 <strong><a name="case_sensitive">case_sensitive</a></strong>
           </dt>
           <dd>
@@ -595,7 +646,8 @@
                   <dd>
                         If specified and the <a
                         href="http://www.cdrom.com/pub/infozip/zlib/">zlib</a>
- compression library was available when compiledi controls
+ compression library was available when compiled,
+ this attribute controls
                         the amount of compression used in the <a
                         href="#doc_db">doc_db</a> file. Defaults to zero to
                         provide backward compatility with old databases.
@@ -612,6 +664,58 @@
         <hr>
         <dl>
           <dt>
+ <strong><a name="config_dir">config_dir</a></strong>
+ </dt>
+ <dd>
+ <dl>
+ <dt>
+ <em>type:</em>
+ </dt>
+ <dd>
+ string
+ </dd>
+ <dt>
+ <em>used by:</em>
+ </dt>
+ <dd>
+ <a href="htdig.html">htdig</a>,
+ <a href="htnotify.html">htnotify</a>,
+ <a href="htfuzzy.html">htfuzzy</a>,
+ <a href="htmerge.html">htmerge</a> and
+ <a href="htsearch.html" target="_top">htsearch</a>
+ </dd>
+ <dt>
+ <em>default:</em>
+ </dt>
+ <dd>
+ CONFIG_DIR
+ </dd>
+ <dt>
+ <em>description:</em>
+ </dt>
+ <dd>
+ This is the directory which contains all configuration
+ files related to ht://Dig. It is never used
+ directly by any of the programs, but other attributes
+ or the <a href="#include">include</a> directive
+ can be defined in terms of this one.
+ <p>
+ The default value of this attribute is determined at
+ compile time.
+ </p>
+ </dd>
+ <dt>
+ <em>example:</em>
+ </dt>
+ <dd>
+ config_dir: /var/htdig/conf
+ </dd>
+ </dl>
+ </dd>
+ </dl>
+ <hr>
+ <dl>
+ <dt>
                 <strong><a name="create_image_list">
                 create_image_list</a></strong>
           </dt>
@@ -1459,7 +1563,7 @@
                         <em>default:</em>
                   </dt>
                   <dd>
- cgi-bin .cgi
+ /cgi-bin/ .cgi
                   </dd>
                   <dt>
                         <em>description:</em>
@@ -2136,6 +2240,103 @@
         <hr>
         <dl>
           <dt>
+ <strong><a name="image_url_prefix">image_url_prefix</a></strong>
+ </dt>
+ <dd>
+ <dl>
+ <dt>
+ <em>type:</em>
+ </dt>
+ <dd>
+ string
+ </dd>
+ <dt>
+ <em>used by:</em>
+ </dt>
+ <dd>
+ <a href="htsearch.html" target="_top">htsearch</a>
+ </dd>
+ <dt>
+ <em>default:</em>
+ </dt>
+ <dd>
+ IMAGE_URL_PREFIX
+ </dd>
+ <dt>
+ <em>description:</em>
+ </dt>
+ <dd>
+ This specifies the directory portion of the URL used
+ to display star images. This attribute isn't directly
+ used by htsearch, but is used in the default URL for
+ the <a href="#star_image">star_image</a> and
+ <a href="#star_blank">star_blank</a> attributes, and
+ other attributes may be defined in terms of this one.
+ <p>
+ The default value of this attribute is determined at
+ compile time.
+ </p>
+ </dd>
+ <dt>
+ <em>example:</em>
+ </dt>
+ <dd>
+ image_url_prefix: /images/htdig
+ </dd>
+ </dl>
+ </dd>
+ </dl>
+ <hr>
+ <dl>
+ <dt>
+ <strong><a name="include">include</a></strong>
+ </dt>
+ <dd>
+ <dl>
+ <dt>
+ <em>type:</em>
+ </dt>
+ <dd>
+ string
+ </dd>
+ <dt>
+ <em>used by:</em>
+ </dt>
+ <dd>
+ <a href="htdig.html">htdig</a>,
+ <a href="htnotify.html">htnotify</a>,
+ <a href="htfuzzy.html">htfuzzy</a>,
+ <a href="htmerge.html">htmerge</a> and
+ <a href="htsearch.html" target="_top">htsearch</a>
+ </dd>
+ <dt>
+ <em>description:</em>
+ </dt>
+ <dd>
+ This is not quite a configuration attribute, but
+ rather a directive. It can be used within one
+ configuration file to include the definitions of
+ another file. The last definition of an attribute
+ is the one that applies, so after including a file,
+ any of its definitions can be overridden with
+ subsequent definitions. This can be useful when
+ setting up many configurations that are mostly the
+ same, so all the common attributes can be maintained
+ in a single configuration file. The include directives
+ can be nested, but watch out for nesting loops.
+ </dd>
+ <dt>
+ <em>example:</em>
+ </dt>
+ <dd>
+ include: ${config_dir}/htdig.conf
+ </dd>
+ </dl>
+ </dd>
+ </dl>
+ <hr>
+ <dl>
+ <dt>
                 <strong><a name="iso_8601">iso_8601</a></strong>
           </dt>
           <dd>
@@ -4045,6 +4246,11 @@
                       that is part of the <a
                       href="http://www.foolabs.com/xpdf/">xpdf</a>
                       0.80 package have been tested as pdf_parsers.
+ <p>
+ The default value of this attribute is determined at
+ compile time, to include the path to the acroread
+ executable.
+ </p>
                   </dd>
                   <dt>
                         <em>example:</em>
@@ -4521,6 +4727,10 @@
                         if no matches were found. In this case the
                         <a href="#nothing_found_file">nothing_found_file</a>
                         attribute is used instead.
+ Also, this file will not be output if it is
+ overridden by defining the
+ <a href="#search_results_wrapper">search_results_wrapper</a>
+ attribute.
                   </dd>
                   <dt>
                         <em>example:</em>
@@ -4633,6 +4843,10 @@
                         if no matches were found. In this case the
                         <a href="#nothing_found_file">nothing_found_file</a>
                         attribute is used instead.
+ Also, this file will not be output if it is
+ overridden by defining the
+ <a href="#search_results_wrapper">search_results_wrapper</a>
+ attribute.
                   </dd>
                   <dt>
                         <em>example:</em>
@@ -6256,7 +6470,7 @@
                         <em>default:</em>
                   </dt>
                   <dd>
- .-_/!#$%^&amp;*'
+ .-_/!#$%^&amp;'
                   </dd>
                   <dt>
                         <em>description:</em>
@@ -6285,6 +6499,50 @@
         <hr>
         <dl>
           <dt>
+ <strong><a name="version">version</a></strong>
+ </dt>
+ <dd>
+ <dl>
+ <dt>
+ <em>type:</em>
+ </dt>
+ <dd>
+ string
+ </dd>
+ <dt>
+ <em>used by:</em>
+ </dt>
+ <dd>
+ <a href="htsearch.html" target="_top">htsearch</a>
+ </dd>
+ <dt>
+ <em>default:</em>
+ </dt>
+ <dd>
+ VERSION
+ </dd>
+ <dt>
+ <em>description:</em>
+ </dt>
+ <dd>
+ This specifies the value of the VERSION
+ variable which can be used in search templates.
+ The default value of this attribute is determined
+ at compile time, and will not normally be set
+ in configuration files.
+ </dd>
+ <dt>
+ <em>example:</em>
+ </dt>
+ <dd>
+ version: 3.1.2PL1
+ </dd>
+ </dl>
+ </dd>
+ </dl>
+ <hr>
+ <dl>
+ <dt>
                 <strong><a name="word_db">word_db</a></strong>
           </dt>
           <dd>
@@ -6385,7 +6643,7 @@
           <a href="author.html">Andrew Scherpbier &lt;andrew@contigo.com&gt;</a>
         </address>
 <!-- hhmts start -->
-Last modified: Sun Feb 14 21:51:44 EST 1999
+Last modified: Fri Aug 6 15:00:15 EDT 1999
 <!-- hhmts end -->
   </body>
 </html>
--- htdig-3.1.2.bak/htdoc/cf_byname.html Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdoc/cf_byname.html Fri Aug 6 14:16:41 1999
@@ -24,12 +24,14 @@
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#bad_extensions">bad_extensions</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#bad_querystr">bad_querystr</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#bad_word_list">bad_word_list</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#bin_dir">bin_dir</a><br>
         </font> <br>
         <b>C</b> <font face="helvetica,arial" size="2"><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#case_sensitive">case_sensitive</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#common_dir">common_dir</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#common_url_parts">common_url_parts</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#compression_level">compression_level</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#config_dir">config_dir</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#create_image_list">create_image_list</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#create_url_list">create_url_list</a><br>
         </font> <br>
@@ -68,6 +70,8 @@
         </font> <br>
         <b>I</b> <font face="helvetica,arial" size="2"><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#image_list">image_list</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#image_url_prefix">image_url_prefix</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#include">include</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#iso_8601">iso_8601</a><br>
         </font> <br>
         <b>K</b> <font face="helvetica,arial" size="2"><br>
@@ -170,6 +174,7 @@
         </font> <br>
         <b>V</b> <font face="helvetica,arial" size="2"><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#valid_punctuation">valid_punctuation</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#version">version</a><br>
         </font> <br>
         <b>W</b> <font face="helvetica,arial" size="2"><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#word_db">word_db</a><br>
--- htdig-3.1.2.bak/htdoc/cf_byprog.html Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdoc/cf_byprog.html Fri Aug 6 14:19:45 1999
@@ -168,6 +168,7 @@
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#use_meta_description">use_meta_description</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#use_star_image">use_star_image</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#valid_punctuation">valid_punctuation</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#version">version</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#word_db">word_db</a><br>
         </font>
           <form action="http://www.htdig.org/cgi-bin/htsearch" target=body>

We uncovered a bug back on May 20, in the encodeURL() function. This
function should encode all non-ASCII characters, but right now it doesn't.
I think this is what PR#339 was all about. Here's the fix:

--- htdig-3.1.2/htlib/URLTrans.cc.orig Tue Feb 16 23:03:56 1999
+++ htdig-3.1.2/htlib/URLTrans.cc Wed Jun 2 08:29:05 1999
@@ -75,7 +75,7 @@ void encodeURL(String &str, char *valid)
 
     for (p = str; p && *p; p++)
     {
- if (isdigit(*p) || isalpha(*p) || strchr(valid, *p))
+ if (isascii(*p) && (isdigit(*p) || isalpha(*p) || strchr(valid, *p)))
             temp << *p;
         else
         {
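
To illustrate why the isascii() test matters, here is a minimal standalone
sketch of the same idea. It uses std::string rather than ht://Dig's String
class, and the name encode_url_sketch and the lowercase hex digits are
assumptions made for the example, not taken from URLTrans.cc. Without the
ASCII check, isalpha() can report true for bytes above 0x7F in some locales,
so those bytes would pass through unencoded.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical stand-in for encodeURL(): percent-encode every byte that is
// not an ASCII letter, an ASCII digit, or listed in `valid`.
std::string encode_url_sketch(const std::string &in, const char *valid)
{
    assert(valid != nullptr);
    std::string out;
    for (unsigned char c : in)
    {
        bool ascii = c < 0x80;   // same intent as isascii()
        // The c != 0 guard is needed because strchr() also matches the
        // terminating NUL of `valid`.
        if (ascii && (std::isalnum(c) || (c != 0 && std::strchr(valid, c))))
            out += static_cast<char>(c);
        else
        {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02x", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

With this sketch, a Latin-1 e-acute (byte 0xE9) is always emitted as an
escape, whatever the current locale says about it.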

Suffix-handling improvement (PR#560), to prevent inappropriate suffix
stripping in endings fuzzy matches.

> From: Steve Arlow <yorick@ClarkHill.com>
> Subject: Suffix-handling improvement
> To: htdig3-bugs@htdig.org
> Date: Tue, 8 Jun 1999 19:57:54 -0400 (EDT)
> Cc: yorick@yorick.com
>
> Hello,
>
> I do consulting for a number of law firms, and quickly discovered a
> problem with htfuzzy matching on the word "witness". (There are
> three root words in the distribution dictionary that end in "-ness"
> and also certainly exhibit this problem; the other two are
> "highness" and "likeness". Other words can also be argued about.)
>
> The fix (which does not appear to break anything else AFAICT, but
> may have a small effect on performance) is to add a preliminary check
> on root2word before trying word2root. The code is below (from the
> file htdig-3.1.2/htfuzzy/Endings.cc), optimize it to your taste.

Follow-up example:
> Words of the form XXXness which are not a form of the word XXX. If I
> enter "witness" into htdig with matching for alternate endings enabled,
> it will look for "wit", "wits", or "witness". What it should really be
> looking for is "witness", "witnessed", "witnessing", or "witnesses".
>
> A similar problem might occur with other suffixes, but I can't think of
> an example off the top of my head.
>
> The fix is to try to interpret each term as a root word before trying
> to interpret it as an alternate form.

--- htdig-3.1.2/htfuzzy/Endings.cc.endingsbug Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htfuzzy/Endings.cc Fri Jul 30 14:43:57 1999
@@ -68,22 +68,6 @@ Endings::getWords(char *w, List &words)
     String word = w;
     word.lowercase();
         
- if (word2root->Get(word, data) == OK)
- {
- //
- // Found the root of the word. We'll add it to the list already
- //
- word = data;
- words.Add(new String(word));
- }
- else
- {
- //
- // The root wasn't found. This could mean that the word
- // is already the root.
- //
- }
-
     if (root2word->Get(word, data) == OK)
     {
         //
@@ -97,6 +81,40 @@ Endings::getWords(char *w, List &words)
                 words.Add(new String(token));
             }
             token = strtok(0, " ");
+ }
+ }
+ else
+ {
+ if (word2root->Get(word, data) == OK)
+ {
+ //
+ // Found the root of the word. We'll add it to the list already
+ //
+ word = data;
+ words.Add(new String(word));
+ }
+ else
+ {
+ //
+ // The root wasn't found. This could mean that the word
+ // is already the root.
+ //
+ }
+
+ if (root2word->Get(word, data) == OK)
+ {
+ //
+ // Found the root's permutations
+ //
+ char *token = strtok(data.get(), " ");
+ while (token)
+ {
+ if (mystrcasecmp(token, w) != 0)
+ {
+ words.Add(new String(token));
+ }
+ token = strtok(0, " ");
+ }
         }
     }
 }
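
The reordered lookup can be modelled outside ht://Dig roughly as follows.
This is only a sketch: the two databases are replaced by std::map, the
lowercasing step is left out, the comparison is case-sensitive, and the
name get_words_sketch and the sample data used to exercise it are
hypothetical.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Table = std::map<std::string, std::string>;

// Hypothetical model of the patched Endings::getWords(): treat the term as
// a root first, and only fall back to word -> root when it is not one.
std::vector<std::string> get_words_sketch(const Table &root2word,
                                          const Table &word2root,
                                          const std::string &w)
{
    assert(!w.empty());
    std::vector<std::string> words;
    std::string word = w;

    auto forms = root2word.find(word);
    if (forms == root2word.end())
    {
        // Not a root itself: look up its root, as the old code always did.
        auto root = word2root.find(word);
        if (root != word2root.end())
        {
            word = root->second;
            words.push_back(word);
        }
        forms = root2word.find(word);
    }
    if (forms != root2word.end())
    {
        // Add every listed form except the term the user typed.
        std::istringstream tokens(forms->second);
        std::string token;
        while (tokens >> token)
            if (token != w)
                words.push_back(token);
    }
    return words;
}
```

If "witness" is present as a root, its own forms are returned and the
word -> root table is never consulted, so "wit" and "wits" no longer appear.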

Quote the URL before passing it on the command line to the external
parser, to prevent shell escapes. Fixes PR#542. Also make the error
messages more useful by including the URL.

--- htdig-3.1.2/htdig/ExternalParser.cc.old Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/ExternalParser.cc Fri Jul 30 15:08:57 1999
@@ -133,8 +133,8 @@ ExternalParser::parse(Retriever &retriev
     // Now start the external parser.
     //
     String command = currentParser;
- command << ' ' << path << ' ' << contentType << ' ' << base.get() <<
- ' ' << configFile;
+ command << ' ' << path << ' ' << contentType << " \"" << base.get() <<
+ "\" " << configFile;
 
     FILE *input = popen(command, "r");
     if (!input)
@@ -170,7 +170,7 @@ ExternalParser::parse(Retriever &retriev
                         (hd = atoi(token3)) >= 0 && hd < 12)
                   retriever.got_word(token1, loc, hd);
                 else
- cerr<< "External parser error in line:"<<line<<"\n";
+ cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
                 break;
                 
             case 'u': // href
@@ -183,7 +183,7 @@ ExternalParser::parse(Retriever &retriev
                   retriever.got_href(url, token2);
                 }
                 else
- cerr<< "External parser error in line:"<<line<<"\n";
+ cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
                 break;
                 
             case 't': // title
@@ -191,7 +191,7 @@ ExternalParser::parse(Retriever &retriev
                 if (token1 != NULL)
                   retriever.got_title(token1);
                 else
- cerr<< "External parser error in line:"<<line<<"\n";
+ cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
                 break;
                 
             case 'h': // head
@@ -199,7 +199,7 @@ ExternalParser::parse(Retriever &retriev
                 if (token1 != NULL)
                   retriever.got_head(token1);
                 else
- cerr<< "External parser error in line:"<<line<<"\n";
+ cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
                 break;
                 
             case 'a': // anchor
@@ -207,7 +207,7 @@ ExternalParser::parse(Retriever &retriev
                 if (token1 != NULL)
                   retriever.got_anchor(token1);
                 else
- cerr<< "External parser error in line:"<<line<<"\n";
+ cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
                 break;
                 
             case 'i': // image url
@@ -215,7 +215,7 @@ ExternalParser::parse(Retriever &retriev
                 if (token1 != NULL)
                   retriever.got_image(token1);
                 else
- cerr<< "External parser error in line:"<<line<<"\n";
+ cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
                 break;
 
             case 'm': // meta
@@ -329,12 +329,12 @@ ExternalParser::parse(Retriever &retriev
                   }
                 }
                 else
- cerr<< "External parser error in line:"<<line<<"\n";
+ cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
                 break;
               }
 
             default:
- cerr<< "External parser error in line:"<<line<<"\n";
+ cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
                 break;
         }
     }
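
The patch above wraps the URL in double quotes. Inside double quotes a POSIX
shell still interprets $, backquotes, backslashes and embedded double quotes,
so a stricter variant is sketched below using single quotes. This is a
different technique from the one in the patch, and both helper names are
hypothetical rather than part of ExternalParser.cc.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: wrap `arg` in single quotes for a POSIX shell,
// rewriting each embedded ' as '\'' (close quote, escaped quote, reopen).
std::string shell_quote_sketch(const std::string &arg)
{
    std::string out = "'";
    for (char c : arg)
    {
        if (c == '\'')
            out += "'\\''";
        else
            out += c;
    }
    out += "'";
    return out;
}

// Hypothetical helper: build the parser command line in the same argument
// order as the patched code, quoting only the URL.
std::string build_command_sketch(const std::string &parser,
                                 const std::string &path,
                                 const std::string &contentType,
                                 const std::string &url,
                                 const std::string &configFile)
{
    assert(!parser.empty());
    return parser + ' ' + path + ' ' + contentType + ' ' +
           shell_quote_sketch(url) + ' ' + configFile;
}
```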

Fix the declaration of excerpt() so that "first" is passed by reference,
which ensures ANCHOR is properly set. Fixes PR#541 as suggested by
<pmb1@york.ac.uk>.

--- htdig-3.1.2.bak/htsearch/Display.h Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htsearch/Display.h Fri Jul 30 14:23:56 1999
@@ -151,7 +151,7 @@ protected:
     String *readFile(char *);
     void expandVariables(char *);
     void outputVariable(char *);
- String *excerpt(DocumentRef *ref, String urlanchor, int fanchor, int first);
+ String *excerpt(DocumentRef *ref, String urlanchor, int fanchor, int &first);
     char *hilight(char *str, String urlanchor, int fanchor);
     void setupImages();
     String *generateStars(DocumentRef *, int);
--- htdig-3.1.2.bak/htsearch/Display.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htsearch/Display.cc Fri Jul 30 14:24:05 1999
@@ -959,7 +959,7 @@ Display::buildMatchList()
 
 //*****************************************************************************
 String *
-Display::excerpt(DocumentRef *ref, String urlanchor, int fanchor, int first)
+Display::excerpt(DocumentRef *ref, String urlanchor, int fanchor, int &first)
 {
     char *head;
     int use_meta_description = 0;
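
The effect of the one-character change can be seen in a reduced example.
The sketch assumes that "first" is an output the caller reads after the
call (here, the offset of the first match); the function names and the
bodies are hypothetical and much simpler than Display::excerpt().

```cpp
#include <cassert>
#include <string>

// Old-style signature: `first` is a copy, so the caller never sees the update.
std::string excerpt_by_value_sketch(const std::string &head,
                                    const std::string &word, int first)
{
    assert(!word.empty());
    std::size_t pos = head.find(word);
    first = (pos == std::string::npos) ? -1 : static_cast<int>(pos);
    return (first < 0) ? head : head.substr(pos);
}

// Patched-style signature: `first` refers to the caller's variable.
std::string excerpt_by_reference_sketch(const std::string &head,
                                        const std::string &word, int &first)
{
    assert(!word.empty());
    std::size_t pos = head.find(word);
    first = (pos == std::string::npos) ? -1 : static_cast<int>(pos);
    return (first < 0) ? head : head.substr(pos);
}
```

Both functions return the same excerpt. Only the second one leaves the
match position in the caller's variable.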

This patch fixes PR#348, to make sure a missing or invalid port number will
get set correctly.

--- htdig-3.1.2.bak/htlib/URL.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/URL.cc Wed Aug 4 13:09:01 1999
@@ -282,6 +282,8 @@ void URL::parse(char *u)
         p = strtok(0, "/");
         if (p)
             _port = atoi(p);
+ if (!p || _port <= 0)
+ _port = 80;
     }
     else
     {
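
In isolation, the added check behaves like the following sketch. The helper
name is hypothetical; in URL.cc the logic is inline in URL::parse() and
assigns to _port directly.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper mirroring the patched logic: `p` is the port token,
// or null when the URL had nothing after the colon.
int port_or_default_sketch(const char *p)
{
    int port = p ? std::atoi(p) : 0;
    if (!p || port <= 0)
        port = 80;       // missing, non-numeric, zero or negative
    assert(port > 0);
    return port;
}
```

Note that atoi() returns 0 for a non-numeric token, so "http://host:abc/"
ends up on port 80 as well.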

This should fix PR#493, to avoid rejecting a valid URL with ".." in it.

--- htdig-3.1.2.bak/htdig/Retriever.cc Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Retriever.cc Wed Aug 4 15:51:44 1999
@@ -625,7 +625,7 @@ Retriever::IsValidURL(char *u)
     // Currently, we only deal with HTTP URLs. Gopher and ftp will
     // come later... ***FIX***
     //
- if (strstr(u, "..") || strncmp(u, "http://", 7) != 0)
+ if (strstr(u, "/../") || strncmp(u, "http://", 7) != 0)
       {
         if (debug > 2)
           cout << endl <<" Rejected: Not an http or relative link!";
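
The difference between the two tests is easy to check on its own. The
sketch below copies the two conditions into standalone predicates with
hypothetical names. Only the string tests are taken from the patch.

```cpp
#include <cassert>
#include <cstring>

// Condition before the patch: any ".." anywhere in the URL is rejected.
bool rejected_before_sketch(const char *u)
{
    assert(u != nullptr);
    return std::strstr(u, "..") || std::strncmp(u, "http://", 7) != 0;
}

// Condition after the patch: only a "/../" path component is rejected.
bool rejected_after_sketch(const char *u)
{
    assert(u != nullptr);
    return std::strstr(u, "/../") || std::strncmp(u, "http://", 7) != 0;
}
```

A name such as "file..name.html" is now accepted, while "/a/../b" is still
rejected. As written, a URL that ends in "/.." with no trailing slash also
passes the new test.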

This updates the FSF address in COPYING & Makefile.in. PR#595.
The address is still old in configure.in, but we won't touch it
here so that we don't need to run autoconf.

--- htdig3.1.2.bak/COPYING Tue Feb 16 23:03:53 1999
+++ htdig3.1.2/COPYING Wed Aug 4 07:40:22 1999
@@ -2,7 +2,7 @@
                        Version 2, June 1991
 
  Copyright (C) 1989, 1991 Free Software Foundation, Inc.
- 675 Mass Ave, Cambridge, MA 02139, USA
+ 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  Everyone is permitted to copy and distribute verbatim copies
  of this license document, but changing it is not allowed.
 
@@ -305,7 +305,8 @@
 
     You should have received a copy of the GNU General Public License
     along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+
 
 Also add information on how to contact you by electronic and paper mail.
 
--- htdig3.1.2.bak/htdoc/COPYING Tue Feb 16 23:03:53 1999
+++ htdig3.1.2/htdoc/COPYING Wed Aug 4 07:40:22 1999
@@ -2,7 +2,7 @@
                        Version 2, June 1991
 
  Copyright (C) 1989, 1991 Free Software Foundation, Inc.
- 675 Mass Ave, Cambridge, MA 02139, USA
+ 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  Everyone is permitted to copy and distribute verbatim copies
  of this license document, but changing it is not allowed.
 
@@ -305,7 +305,8 @@
 
     You should have received a copy of the GNU General Public License
     along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+
 
 Also add information on how to contact you by electronic and paper mail.
 
--- htdig-3.1.2.bak/Makefile.in Wed Apr 21 21:47:53 1999
+++ htdig-3.1.2/Makefile.in Wed Aug 4 10:10:54 1999
@@ -13,7 +13,7 @@
    
 # You should have received a copy of the GNU General Public License
 # along with this program; if not, write to the Free Software
-# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 
 top_srcdir= @top_srcdir@
 srcdir= @srcdir@

This should help with PR#81 & PR#472, where strftime() would crash on
some systems. Idea submitted by benoit.sibaud@cnet.francetelecom.fr.

--- htdig-3.1.2.bak/htdig/Document.cc Wed Aug 4 12:43:27 1999
+++ htdig-3.1.2/htdig/Document.cc Wed Aug 4 13:37:43 1999
@@ -215,6 +215,8 @@ Document::getdate(char *datestring)
         // correct for mystrptime, if %Y format saw only a 2 digit year
         if (tm.tm_year < 0)
           tm.tm_year += 1900;
+ tm.tm_yday = 0; // clear these to prevent problems in strftime()
+ tm.tm_wday = 0;
         
         if (debug > 2)
           {

This patch fixes a few problems with header parsing, including PR#535 & PR#557.

--- htdig-3.1.2/htdig/Document.cc.hdrparsebug Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Document.cc Fri Jul 30 14:15:10 1999
@@ -478,14 +478,18 @@ Document::readHeader(Connection &c)
             inHeader = 0;
         else
         {
+ char *token = line.get();
+ while (*token && !isspace(*token))
+ token++;
+ while (*token && isspace(*token))
+ token++;
             if (strncmp(line, "HTTP/", 5) == 0)
             {
                 //
                 // Found the status line. This will determine if we
                 // continue or not
                 //
- strtok(line, " ");
- char *status = strtok(0, " ");
+ char *status = strtok(token, " ");
                 if (status && strcmp(status, "200") == 0)
                 {
                     returnStatus = Header_ok;
@@ -508,22 +512,19 @@ Document::readHeader(Connection &c)
                     returnStatus = Header_not_authorized;
                 }
             }
- else if (modtime == 0
+ else if (modtime == 0 && *token
                      && mystrncasecmp(line, "last-modified:", 14) == 0)
             {
- strtok(line, " \t");
- modtime = getdate(strtok(0, "\n\t"));
+ modtime = getdate(strtok(token, "\n\t"));
             }
- else if (contentLength == -1
+ else if (contentLength == -1 && *token
                      && mystrncasecmp(line, "content-length:", 15) == 0)
             {
- strtok(line, " \t");
- contentLength = atoi(strtok(0, "\n\t"));
+ contentLength = atoi(strtok(token, "\n\t"));
             }
- else if (mystrncasecmp(line, "content-type:", 13) == 0)
+ else if (*token && mystrncasecmp(line, "content-type:", 13) == 0)
             {
- strtok(line, " \t");
- char *token = strtok(0, "\n\t");
+ token = strtok(token, "\n\t");
                                 
                 if ((returnStatus == Header_not_found ||
                         returnStatus == Header_ok) &&
@@ -537,8 +538,7 @@ Document::readHeader(Connection &c)
             }
             else if (mystrncasecmp(line, "location:", 9) == 0)
             {
- strtok(line, " \t");
- redirected_to = strtok(0, "\r\n \t");
+ redirected_to = strtok(token, "\r\n \t");
             }
         }
     }
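
The token-skipping loop at the top of the patched block can be sketched on
its own as below. The helper name is hypothetical, it works on std::string
instead of the in-place strtok() calls, and it trims at the first CR or LF,
which is a simplification of the delimiters used in Document.cc. The point
is that a header with no value yields an empty string, where the old
strtok(0, ...) call would presumably have returned a null pointer.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: skip the field name (first run of non-space
// characters), skip the whitespace after it, and return what follows up to
// the end of the line.
std::string header_value_sketch(const std::string &line)
{
    std::size_t i = 0;
    while (i < line.size() && !std::isspace(static_cast<unsigned char>(line[i])))
        ++i;
    while (i < line.size() && std::isspace(static_cast<unsigned char>(line[i])))
        ++i;
    assert(i <= line.size());
    std::size_t end = line.find_first_of("\r\n", i);
    return line.substr(i, end == std::string::npos ? end : end - i);
}
```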

This is Geoff's patch to hide the username/password in the command line
arguments.

--- htdig-3.1.2/htdig/htdig.cc.orig Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/htdig.cc Fri Jul 30 17:24:32 1999
@@ -79,6 +79,8 @@ main(int ac, char **av)
                 break;
             case 'u':
                 credentials = optarg;
+ for (int pos = 0; pos < strlen(optarg); pos++)
+ optarg[pos] = '*';
                 break;
             case 'a':
                 alt_work_area++;
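
The same idea in a standalone form is sketched below. It assumes the
credentials are copied before the argument is overwritten, and the helper
name is hypothetical. Whether the overwritten argument is what tools like
ps display depends on the platform.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: keep a private copy of the argument, then overwrite
// the original argv-style buffer with '*' characters of the same length.
std::string take_and_blot_sketch(char *arg)
{
    assert(arg != nullptr);
    std::string saved(arg);              // copy first, blot second
    std::size_t len = std::strlen(arg);  // computed once, outside the loop
    for (std::size_t pos = 0; pos < len; pos++)
        arg[pos] = '*';
    return saved;
}
```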

This patch adds support for <embed>, <object> and <link> tags.
(Don't you wish all additions could be this easy?)

--- htdig-3.1.2/htdig/HTML.cc.old Fri Jul 30 12:24:14 1999
+++ htdig-3.1.2/htdig/HTML.cc Fri Jul 30 13:16:55 1999
@@ -63,7 +63,7 @@ HTML::HTML()
     // the attrs Match object is used to match names of tag parameters.
     //
     tags.IgnoreCase();
- tags.Pattern("title|/title|a|/a|h1|h2|h3|h4|h5|h6|/h1|/h2|/h3|/h4|/h5|/h6|noindex|/noindex|img|li|meta|frame|area|base");
+ tags.Pattern("title|/title|a|/a|h1|h2|h3|h4|h5|h6|/h1|/h2|/h3|/h4|/h5|/h6|noindex|/noindex|img|li|meta|frame|area|base|embed|object|link");
 
     attrs.IgnoreCase();
     attrs.Pattern("src|href|name");
@@ -894,6 +894,8 @@ HTML::do_tag(Retriever &retriever, Strin
         }
 
         case 21: // frame
+ case 24: // embed
+ case 25: // object
         {
             which = -1;
             int pos = srcMatch.FindFirstWord(position, which, length);
@@ -963,6 +965,7 @@ HTML::do_tag(Retriever &retriever, Strin
         }
         
         case 22: // area
+ case 26: // link
         {
             which = -1;
             int pos = hrefMatch.FindFirstWord(position, which, length);
@@ -972,7 +975,7 @@ HTML::do_tag(Retriever &retriever, Strin
                 case 0: // "href"
                 {
                     //
- // src seen
+ // href seen
                     //
                     while (*position && *position != '=')
                         position++;

Torsten Neuer's <tneuer@inwise.de> fix for PR#554.

--- htdig-3.1.2.bak/htsearch/Display.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htsearch/Display.cc Tue Aug 3 14:46:30 1999
@@ -20,6 +20,7 @@ static char RCSid[] = "$Id: Display.cc,v
 #include <stdio.h>
 #include <ctype.h>
 #include <syslog.h>
+#include <locale.h>
 #include "HtURLCodec.h"
 #include "HtWordType.h"
 
@@ -318,6 +319,7 @@ Display::displayMatch(ResultMatch *match
         {
             struct tm *tm = localtime(&t);
             char *datefmt = config["date_format"];
+ char *locale = config["locale"];
             if (!datefmt || !*datefmt)
               {
                 if (config.Boolean("iso_8601"))
@@ -325,6 +327,10 @@ Display::displayMatch(ResultMatch *match
                 else
                     datefmt = "%x";
               }
+ if ( locale && *locale )
+ {
+ setlocale(LC_TIME,locale);
+ }
             strftime(buffer, sizeof(buffer), datefmt, tm);
             *str << buffer;
         }
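
To see the effect of the locale on the "%x" default, here is a small standalone sketch. The function format_date is hypothetical and takes the locale name as a parameter instead of reading it from the config, which is an assumption made to keep the example self-contained.

```cpp
#include <cassert>
#include <clocale>
#include <ctime>
#include <string>

// Hypothetical helper (not htdig code): format a broken-down time using the
// LC_TIME rules of the named locale. An empty name leaves the locale alone.
std::string format_date(const std::tm &when, const char *fmt, const char *locale_name)
{
    assert(fmt != nullptr);
    if (locale_name && *locale_name)
        std::setlocale(LC_TIME, locale_name);   // result ignored: unknown names change nothing
    char buffer[100];
    std::size_t n = std::strftime(buffer, sizeof(buffer), fmt, &when);
    return std::string(buffer, n);
}
```

With the "C" locale, "%x" is the same as "%m/%d/%y". Other locales reorder or rename the fields, which is what the patch relies on.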

This patch turns the maximum word length into a run-time option, rather
than compile-time.

--- htdig-3.1.2.bak/include/htconfig.h.in Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/include/htconfig.h.in Wed Aug 4 10:43:33 1999
@@ -5,7 +5,6 @@
 #define _config_h_
 
 #define VERSION 1
-#define MAX_WORD_LENGTH 12
 
 /* Define if on AIX 3.
    System headers sometimes define this.
--- htdig-3.1.2.bak/htcommon/WordReference.h Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htcommon/WordReference.h Wed Aug 4 10:44:12 1999
@@ -25,7 +25,7 @@ public:
                                         WordReference() {}
                                         ~WordReference() {}
 
- char Word[MAX_WORD_LENGTH + 1];
+ String Word;
         int WordCount;
         int Weight;
         int Location;
--- htdig-3.1.2.bak/htcommon/WordList.cc Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htcommon/WordList.cc Wed Aug 4 12:22:31 1999
@@ -46,11 +46,12 @@ void WordList::Word(char *word, int loca
   if (weight_factor == 0.0) // Why should we add words with no weight?
       return;
     String shortword = word;
+ static int maximum_word_length = config.Value("maximum_word_length", 12);
 
     shortword.lowercase();
     word = shortword.get();
- if (shortword.length() > MAX_WORD_LENGTH)
- word[MAX_WORD_LENGTH] = '\0';
+ if (shortword.length() > maximum_word_length)
+ word[maximum_word_length] = '\0';
 
     if (!valid_word(word))
         return;
@@ -80,7 +81,7 @@ void WordList::Word(char *word, int loca
         wordRef->DocumentID = docID;
         wordRef->Weight = int((1000 - location) * weight_factor);
         wordRef->Anchor = anchor_number;
- strcpy(wordRef->Word, word);
+ wordRef->Word = word;
         words->Add(word, wordRef);
     }
 }
@@ -145,7 +146,7 @@ void WordList::Flush()
     while ((wordRef = (WordReference *) words->Get_NextElement()))
     {
 
- fprintf(fl, "%s",wordRef->Word);
+ fprintf(fl, "%s",wordRef->Word.get());
         fprintf(fl, "\ti:%d\tl:%d\tw:%d",
                 wordRef->DocumentID,
                 wordRef->Location,
@@ -220,15 +221,16 @@ void WordList::BadWordFile(char *filenam
     char buffer[1000];
     char *word;
     String new_word;
- int minimum_word_length = config.Value("minimum_word_length", 3);
+ static int minimum_word_length = config.Value("minimum_word_length", 3);
+ static int maximum_word_length = config.Value("maximum_word_length", 12);
 
     while (fl && fgets(buffer, sizeof(buffer), fl))
     {
         word = strtok(buffer, "\r\n \t");
         if (word && *word)
           {
- if (strlen(word) > MAX_WORD_LENGTH)
- word[MAX_WORD_LENGTH] = '\0';
+ if (strlen(word) > maximum_word_length)
+ word[maximum_word_length] = '\0';
             new_word = word; // We need to clean it up before we add it
             new_word.lowercase(); // Just in case someone enters an odd one
             HtStripPunctuation(new_word);
--- htdig-3.1.2.bak/htcommon/DocumentRef.cc Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htcommon/DocumentRef.cc Wed Aug 4 10:45:30 1999
@@ -571,8 +571,7 @@ void DocumentRef::AddDescription(char *d
     static double description_factor = config.Double("description_factor");
     static int max_descriptions = config.Value("max_descriptions", 5);
 
- // Not restricted to this size, just used as a hint.
- String word(MAX_WORD_LENGTH);
+ String word;
 
     while (*p)
     {
--- htdig-3.1.2.bak/htcommon/defaults.cc Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htcommon/defaults.cc Wed Aug 4 10:47:44 1999
@@ -89,6 +89,7 @@ ConfigDefaults defaults[] =
     {"max_prefix_matches", "1000"},
     {"max_stars", "4"},
     {"maximum_pages", "10"},
+ {"maximum_word_length", "12"},
     {"metaphone_db", "${database_base}.metaphone.db"},
     {"meta_description_factor", "50"},
     {"method_names", "and All or Any boolean Boolean"},
--- htdig-3.1.2.bak/htsearch/parser.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htsearch/parser.cc Wed Aug 4 10:50:41 1999
@@ -202,6 +202,7 @@ Parser::setError(char *expected)
 void
 Parser::perform_push()
 {
+ static int maximum_word_length = config.Value("maximum_word_length", 12);
     String temp = current->word.get();
     String data;
     char *p;
@@ -220,8 +221,8 @@ Parser::perform_push()
     }
     temp.lowercase();
     p = temp.get();
- if (temp.length() > MAX_WORD_LENGTH)
- p[MAX_WORD_LENGTH] = '\0';
+ if (temp.length() > maximum_word_length)
+ p[maximum_word_length] = '\0';
     if (dbf->Get(p, data) == OK)
     {
         p = data.get();
--- htdig-3.1.2.bak/htdoc/attrs.html Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdoc/attrs.html Wed Aug 4 10:58:59 1999
@@ -3124,6 +3124,51 @@
         <hr>
         <dl>
           <dt>
+ <strong><a name="maximum_word_length">
+ maximum_word_length</a></strong>
+ </dt>
+ <dd>
+ <dl>
+ <dt>
+ <em>type:</em>
+ </dt>
+ <dd>
+ number
+ </dd>
+ <dt>
+ <em>used by:</em>
+ </dt>
+ <dd>
+ <a href="htdig.html">htdig</a> and
+ <a href="htsearch.html" target="_top">htsearch</a>
+ </dd>
+ <dt>
+ <em>default:</em>
+ </dt>
+ <dd>
+ 12
+ </dd>
+ <dt>
+ <em>description:</em>
+ </dt>
+ <dd>
+ This sets the maximum length of words that will be
+ indexed. Words longer than this value will be silently
+ truncated when put into the index, or searched in the
+ index.
+ </dd>
+ <dt>
+ <em>example:</em>
+ </dt>
+ <dd>
+ maximum_word_length: 15
+ </dd>
+ </dl>
+ </dd>
+ </dl>
+ <hr>
+ <dl>
+ <dt>
                 <strong><a name="meta_description_factor">
                 meta_description_factor</a></strong>
           </dt>
--- htdig-3.1.2.bak/htdoc/cf_byname.html Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdoc/cf_byname.html Wed Aug 4 10:59:30 1999
@@ -96,6 +96,7 @@
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_prefix_matches">max_prefix_matches</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_stars">max_stars</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_pages">maximum_pages</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_word_length">maximum_word_length</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#meta_description_factor">meta_description_factor</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#metaphone_db">metaphone_db</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#method_names">method_names</a><br>
--- htdig-3.1.2.bak/htdoc/cf_byprog.html Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdoc/cf_byprog.html Wed Aug 4 11:00:31 1999
@@ -54,6 +54,7 @@
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_head_length">max_head_length</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_hop_count">max_hop_count</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_meta_description_length">max_meta_description_length</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_word_length">maximum_word_length</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#meta_description_factor">meta_description_factor</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#minimum_word_length">minimum_word_length</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#modification_time_is_now">modification_time_is_now</a><br>
@@ -132,6 +133,7 @@
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_prefix_matches">max_prefix_matches</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_stars">max_stars</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_pages">maximum_pages</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_word_length">maximum_word_length</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#method_names">method_names</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#minimum_prefix_length">minimum_prefix_length</a><br>
           <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#minimum_word_length">minimum_word_length</a><br>
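
As a rough illustration of the lowercase-then-truncate step that the patch now controls at run time, here is a sketch in standard C++. The name normalize_word is hypothetical, std::string stands in for htdig's String class, and max_len plays the role of the maximum_word_length attribute.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-in (not htdig code) for the lowercase-then-truncate
// step; max_len mirrors maximum_word_length, whose default is 12.
std::string normalize_word(std::string word, std::size_t max_len = 12)
{
    assert(max_len > 0);
    for (char &c : word)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    if (word.size() > max_len)
        word.resize(max_len);   // silent truncation, as the attribute docs describe
    return word;
}
```

Since the same truncation is applied when indexing and when searching, it seems safest to rebuild the index after changing the value, so that stored words and query words are cut at the same length.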

I think this patch will fix PR#514 in the bug database. It's Geoff's
first patch, with a minor correction, plus an added test in the vscode
macro, which is where the problem seemed to be happening. The author
of the metaphone code likely assumed that isalpha() meant [A-Za-z],
and forgot about 8-bit characters in the upper half of the character set,
which then index outside the 26-entry table. This won't do anything to map
accented vowels to their unaccented counterparts, but it should hopefully
put an end to the segmentation faults.

--- htdig-3.1.2.bak/htfuzzy/Fuzzy.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htfuzzy/Fuzzy.cc Fri Jul 30 16:37:42 1999
@@ -55,6 +55,8 @@ Fuzzy::getWords(char *word, List &words)
 {
     if (!index)
         return;
+ if (!word || !*word)
+ return;
 
     //
     // Convert the word to a fuzzy key
--- htdig-3.1.2.bak/htfuzzy/Metaphone.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htfuzzy/Metaphone.cc Tue Aug 3 14:50:06 1999
@@ -51,7 +51,7 @@ static char vsvfn[26] = {
         /* N O P Q R S T U V W X Y Z */
 
 /* Macros to access character coding array */
-#define vscode(x) (vsvfn[(x) - 'A'])
+#define vscode(x) ((x) >= 'A' && (x) <= 'Z' ? vsvfn[(x) - 'A'] : 0)
 #define vowel(x) ((x) != '\0' && vscode(x) & 1) /* AEIOU */
 #define same(x) ((x) != '\0' && vscode(x) & 2) /* FJLMNR */
 #define varson(x) ((x) != '\0' && vscode(x) & 4) /* CGPST */
@@ -63,6 +63,9 @@ static char vsvfn[26] = {
 void
 Metaphone::generateKey(char *word, String &key)
 {
+ if (!word || !*word)
+ return;
+
     char *n;
     String ntrans;
         
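
The out-of-range lookup is easy to reproduce in isolation. The sketch below uses an illustrative table in which only the vowel bit is filled in. These are not the real values from Metaphone.cc, and the function names are hypothetical.

```cpp
#include <cassert>

// Illustrative flag table indexed by letter - 'A'. Only the vowel bit (1) is
// set; these are NOT the real values from Metaphone.cc.
static const char demo_flags[26] = {
    1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,   // A..M
    0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0    // N..Z
};

// Guarded lookup in the spirit of the patched vscode macro: anything outside
// 'A'..'Z' yields 0 instead of indexing outside the table.
inline int guarded_code(char c)
{
    return (c >= 'A' && c <= 'Z') ? demo_flags[c - 'A'] : 0;
}

inline bool is_vowel(char c)
{
    return c != '\0' && (guarded_code(c) & 1);
}
```

Without the range check, an accented letter such as 0xE9 gives an index far outside 0..25 (negative where char is signed), so the lookup reads memory that does not belong to the table.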

This patch fixes the bug in the handling of modification_time_is_now in
the readHeader() function.

--- htdig-3.1.2/htdig/Document.cc.modnowbug Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Document.cc Fri Jul 30 13:39:18 1999
@@ -96,10 +96,7 @@ Document::Reset()
       delete url;
     url = 0;
     referer = 0;
- if(config.Boolean("modification_time_is_now"))
- modtime = time(NULL);
- else
- modtime = 0;
+ modtime = 0;
 
     contents = 0;
     document_length = 0;
@@ -463,10 +460,7 @@ Document::readHeader(Connection &c)
     int inHeader = 1;
     int returnStatus = Header_not_found;
 
- if (config.Boolean("modification_time_is_now"))
- modtime = time(NULL);
- else
- modtime = 0;
+ modtime = 0;
 
     while (inHeader)
     {
@@ -542,6 +536,11 @@ Document::readHeader(Connection &c)
             }
         }
     }
+ static int modification_time_is_now =
+ config.Boolean("modification_time_is_now");
+ if (modtime == 0 && modification_time_is_now)
+ modtime = time(NULL);
+
     if (debug > 2)
         cout << "returnStatus = " << returnStatus << endl;
     return returnStatus;
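
The patched logic boils down to a fallback applied after the headers have been read. A minimal sketch, with a hypothetical function name and the current time passed in as a parameter so the behaviour is easy to check:

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper (not htdig code): a modification time found in the
// headers wins; "now" is only a fallback when none was found (0).
std::time_t resolve_modtime(std::time_t from_header, bool time_is_now, std::time_t now)
{
    if (from_header == 0 && time_is_now)
        return now;
    return from_header;
}
```

In the patch itself the flag is read once into a static variable, so the config lookup is not repeated for every document.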

This patch fixes <meta> robots parsing to allow multiple directives
to work correctly. Fixes PR#578, as provided by Chris Liddiard
<c.h.liddiard@qmw.ac.uk>.

--- htdig-3.1.2/htdig/HTML.cc.robotbug Fri Jul 30 12:24:14 1999
+++ htdig-3.1.2/htdig/HTML.cc Fri Jul 30 13:28:35 1999
@@ -873,9 +873,9 @@ HTML::do_tag(Retriever &retriever, Strin
                         doindex = 0;
                         retriever.got_noindex();
                       }
- else if (content_cache.indexOf("nofollow") != -1)
+ if (content_cache.indexOf("nofollow") != -1)
                       dofollow = 0;
- else if (content_cache.indexOf("none") != -1)
+ if (content_cache.indexOf("none") != -1)
                       {
                         doindex = 0;
                         dofollow = 0;
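
The difference between the else-if chain and independent tests shows up as soon as the content attribute holds two directives. Here is a sketch of the patched behaviour, with hypothetical names and std::string::find standing in for htdig's indexOf:

```cpp
#include <cassert>
#include <string>

struct RobotsFlags {
    bool doindex;
    bool dofollow;
};

// Hypothetical helper (not htdig code): test every directive independently,
// as the patched code does, so "noindex,nofollow" clears both flags.
RobotsFlags parse_robots(const std::string &content)
{
    RobotsFlags flags = { true, true };
    if (content.find("noindex") != std::string::npos)
        flags.doindex = false;
    if (content.find("nofollow") != std::string::npos)
        flags.dofollow = false;
    if (content.find("none") != std::string::npos)
        flags.doindex = flags.dofollow = false;
    return flags;
}
```

With the old else-if chain, "noindex,nofollow" stopped at the first match and links were still followed.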

This patch fixes PR#572, where htsearch crashed if CONTENT_LENGTH was not set
but REQUEST_METHOD was.

--- htdig-3.1.2.bak/htlib/cgi.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/cgi.cc Wed Aug 4 16:51:49 1999
@@ -67,7 +67,9 @@
                 int n;
                 char *buf;
                 
- n = atoi(getenv("CONTENT_LENGTH"));
+ buf = getenv("CONTENT_LENGTH");
+ if (!buf || !*buf || (n = atoi(buf)) <= 0)
+ return; // null query
                 buf = new char[n + 1];
                 read(0, buf, n);
                 buf[n] = '\0';
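
The crash came from passing a null pointer straight to atoi(), which is undefined behaviour. The same guard can be written as a small standalone function. The name parse_content_length is hypothetical, and returning 0 for "no body" is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not htdig code): turn the raw CONTENT_LENGTH value,
// as returned by getenv and possibly NULL, into a byte count. 0 means "no body".
int parse_content_length(const char *value)
{
    if (!value || !*value)
        return 0;
    int n = std::atoi(value);
    return n > 0 ? n : 0;
}
```

It would be called as parse_content_length(std::getenv("CONTENT_LENGTH")), with the caller treating 0 as a null query.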

This patch adds error messages for unknown hosts.

--- htdig-3.1.2/htdig/Document.cc.nohostmsg Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Document.cc Fri Jul 30 13:48:03 1999
@@ -301,14 +301,22 @@ Document::RetrieveHTTP(time_t date)
         if (c.assign_port(proxy->port()) == NOTOK)
             return Document_not_found;
         if (c.assign_server(proxy->host()) == NOTOK)
+ {
+ if (debug)
+ cout << "Unknown proxy host: " << proxy->host() << endl;
             return Document_no_host;
+ }
     }
     else
     {
         if (c.assign_port(url->port()) == NOTOK)
             return Document_not_found;
         if (c.assign_server(url->host()) == NOTOK)
+ {
+ if (debug)
+ cout << "Unknown host: " << url->host() << endl;
             return Document_no_host;
+ }
     }
         
     if (c.connect(1) == NOTOK)

This patch fixes a bug in the PDF parser. When the Title header was just
the temporary file name, it wouldn't be used, but it also wouldn't be cleared
from the _parsedString variable, so it ended up polluting the document
excerpt.

--- htdig-3.1.2/htdig/PDF.cc.orig Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/PDF.cc Tue May 25 12:01:43 1999
@@ -290,8 +290,8 @@ void PDF::parseNonTextLine(String &line)
                         _parsedString.get());
 
                 _retriever->got_title(_parsedString);
- _parsedString = 0;
             }
+ _parsedString = 0;
         }
         
    }

This fixes the infamous problem with files like left_index.html not getting
indexed. PR#543 & PR#585.

--- htdig-3.1.2/htlib/URL.cc.orig Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/URL.cc Fri Jun 11 12:24:40 1999
@@ -440,7 +440,7 @@ void URL::removeIndex(String &path)
       l.Release();
     }
     if (defaultdoc->hasPattern() &&
- defaultdoc->FindFirstWord(path.sub(filename)) >= 0)
+ defaultdoc->CompareWord(path.sub(filename)))
         path.chop(path.length() - filename);
 }
 
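
The gist of the fix is a substring match versus a match on the whole file name. The sketch below models both with standard C++ strings. It is not the StringMatch API, whose exact CompareWord semantics are assumed here, and both function names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Models the old behaviour: true if any default document name occurs
// anywhere inside the file name.
bool contains_default(const std::string &filename, const std::vector<std::string> &defaults)
{
    for (const std::string &d : defaults)
        if (filename.find(d) != std::string::npos)
            return true;
    return false;
}

// Models the patched behaviour (assumed): true only if the whole file name
// is one of the default document names.
bool is_default(const std::string &filename, const std::vector<std::string> &defaults)
{
    for (const std::string &d : defaults)
        if (filename == d)
            return true;
    return false;
}
```

With "index.html" as the only default document, the first test wrongly accepts left_index.html, while the second accepts only index.html itself.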

Fix server_aliases entries so the port defaults to 80 if omitted.

--- htdig-3.1.2/htlib/URL.cc.old Fri Jul 30 14:51:32 1999
+++ htdig-3.1.2/htlib/URL.cc Fri Jul 30 16:57:35 1999
@@ -540,6 +540,11 @@ char *URL::signature()
 }
 
 
+//*****************************************************************************
+// void URL::ServerAlias()
+// Takes care of the server aliases, which attempt to simplify virtual
+// host problems
+//
 void URL::ServerAlias()
 {
   static Dictionary *serveraliases= 0;
@@ -547,6 +552,7 @@ void URL::ServerAlias()
   if (! serveraliases)
     {
       String l= config["server_aliases"];
+ String from, *to;
       serveraliases = new Dictionary();
       char *p = strtok(l, " \t");
       char *salias= NULL;
@@ -556,7 +562,13 @@ void URL::ServerAlias()
           if (! salias)
             continue;
           *salias++= '\0';
- serveraliases->Add(p, new String(salias));
+ from = p;
+ if (from.indexOf(':') == -1)
+ from.append(":80");
+ to= new String(salias);
+ if (to->indexOf(':') == -1)
+ to->append(":80");
+ serveraliases->Add(from.get(), to);
           // cout << "Alias: " << p << "->" << salias << "\n";
           // printf ("Alias: %s->%s\n", p, salias);
           p = strtok(0, " \t");
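
The normalization applied to both sides of each alias can be sketched as follows. The helper name is hypothetical and std::string stands in for htdig's String class.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not htdig code): normalize "host" to "host:80" and
// leave an explicit port alone, so both sides of an alias share one key form.
std::string with_default_port(std::string host)
{
    assert(!host.empty());
    if (host.find(':') == std::string::npos)
        host += ":80";
    return host;
}
```

An entry written without ports then ends up stored in the same host:port form as one written with them.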

This patch fixes the HTML parser to decode SGML entities within tag attributes.

--- htdig-3.1.2.bak/htdig/HTML.h Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/HTML.h Fri Jul 30 12:23:25 1999
@@ -72,6 +72,7 @@ private:
     // Helper functions
     //
     void do_tag(Retriever &, String &);
+ char *transSGML(char *);
 };
 
 #endif
--- htdig-3.1.2.bak/htdig/HTML.cc Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/HTML.cc Fri Jul 30 16:22:55 1999
@@ -544,7 +544,7 @@ HTML::do_tag(Retriever &retriever, Strin
                             in_ref = 0;
                         }
                         delete href;
- href = new URL(position, *base);
+ href = new URL(transSGML(position), *base);
                         in_ref = 1;
                         description = 0;
                         position = q + 1;
@@ -595,7 +595,7 @@ HTML::do_tag(Retriever &retriever, Strin
                                 q++;
                         *q = '\0';
                         }
- retriever.got_anchor(position);
+ retriever.got_anchor(transSGML(position));
                         position = q + 1;
                         break;
                     }
@@ -704,7 +704,7 @@ HTML::do_tag(Retriever &retriever, Strin
                     q++;
             *q = '\0';
             }
- retriever.got_image(position);
+ retriever.got_image(transSGML(position));
             break;
         }
 
@@ -736,15 +736,15 @@ HTML::do_tag(Retriever &retriever, Strin
               }
             if (conf["htdig-email"])
             {
- retriever.got_meta_email(conf["htdig-email"]);
+ retriever.got_meta_email(transSGML(conf["htdig-email"]));
             }
             if (conf["htdig-notification-date"])
             {
- retriever.got_meta_notification(conf["htdig-notification-date"]);
+ retriever.got_meta_notification(transSGML(conf["htdig-notification-date"]));
             }
             if (conf["htdig-email-subject"])
             {
- retriever.got_meta_subject(conf["htdig-email-subject"]);
+ retriever.got_meta_subject(transSGML(conf["htdig-email-subject"]));
             }
             if (conf["htdig-keywords"] || conf["keywords"])
             {
@@ -757,7 +757,7 @@ HTML::do_tag(Retriever &retriever, Strin
                 char *keywords = conf["htdig-keywords"];
                 if (!keywords)
                     keywords = conf["keywords"];
- char *w = strtok(keywords, " ,\t\r\n");
+ char *w = strtok(transSGML(keywords), " ,\t\r\n");
                 while (w)
                 {
                     if (strlen(w) >= minimumWordLength)
@@ -783,7 +783,7 @@ HTML::do_tag(Retriever &retriever, Strin
                         while (*qq && (*qq != ';') && (*qq != '"') &&
                                !isspace(*qq))qq++;
                         *qq = 0;
- URL *href = new URL(q, *base);
+ URL *href = new URL(transSGML(q), *base);
                         // I don't know why anyone would do this, but hey...
                         if (dofollow)
                           retriever.got_href(*href, "");
@@ -811,7 +811,7 @@ HTML::do_tag(Retriever &retriever, Strin
                     //
                     // We need to do two things. First grab the description
                     //
- meta_dsc = conf["content"];
+ meta_dsc = transSGML(conf["content"]);
                    if (meta_dsc.length() > max_meta_description_length)
                      meta_dsc = meta_dsc.sub(0, max_meta_description_length).get();
                    if (debug > 1)
@@ -824,7 +824,7 @@ HTML::do_tag(Retriever &retriever, Strin
                    // (slot 11 is the new slot for this)
                    //
 
- char *w = strtok(conf["content"], " \t\r\n");
+ char *w = strtok(transSGML(conf["content"]), " \t\r\n");
                    while (w)
                      {
                         if (strlen(w) >= minimumWordLength)
@@ -836,7 +836,7 @@ HTML::do_tag(Retriever &retriever, Strin
 
                 if (keywordsMatch.CompareWord(cache))
                 {
- char *w = strtok(conf["content"], " ,\t\r\n");
+ char *w = strtok(transSGML(conf["content"]), " ,\t\r\n");
                     while (w)
                     {
                         if (strlen(w) >= minimumWordLength)
@@ -847,15 +847,15 @@ HTML::do_tag(Retriever &retriever, Strin
                 }
                 else if (mystrcasecmp(cache, "htdig-email") == 0)
                 {
- retriever.got_meta_email(conf["content"]);
+ retriever.got_meta_email(transSGML(conf["content"]));
                 }
                 else if (mystrcasecmp(cache, "htdig-notification-date") == 0)
                 {
- retriever.got_meta_notification(conf["content"]);
+ retriever.got_meta_notification(transSGML(conf["content"]));
                 }
                 else if (mystrcasecmp(cache, "htdig-email-subject") == 0)
                 {
- retriever.got_meta_subject(conf["content"]);
+ retriever.got_meta_subject(transSGML(conf["content"]));
                 }
                 else if (mystrcasecmp(cache, "htdig-noindex") == 0)
                   {
@@ -948,7 +948,7 @@ HTML::do_tag(Retriever &retriever, Strin
                         *q = '\0';
                     }
                     delete href;
- href = new URL(position, *base);
+ href = new URL(transSGML(position), *base);
                     if (dofollow)
                     {
                         description = 0;
@@ -1016,7 +1016,7 @@ HTML::do_tag(Retriever &retriever, Strin
                         *q = '\0';
                     }
                     delete href;
- href = new URL(position, *base);
+ href = new URL(transSGML(position), *base);
                     if (dofollow)
                     {
                         description = 0;
@@ -1085,7 +1085,7 @@ HTML::do_tag(Retriever &retriever, Strin
                             q++;
                     *q = '\0';
                     }
- URL tempBase(position, *base);
+ URL tempBase(transSGML(position), *base);
                     *base = tempBase;
                 }
             }
@@ -1095,4 +1095,25 @@ HTML::do_tag(Retriever &retriever, Strin
         default:
             return; // Nothing...
     }
+}
+
+
+//*****************************************************************************
+// char * HTML::transSGML(char *text)
+//
+char *
+HTML::transSGML(char *str)
+{
+ static String convert;
+ unsigned char *text = (unsigned char *)str;
+
+ convert = 0;
+ while (*text)
+ {
+ if (*text == '&')
+ convert << SGMLEntities::translateAndUpdate(text);
+ else
+ convert << *text++;
+ }
+ return convert.get();
 }
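
For readers who want to try the decoding step on its own, here is a reduced sketch. It handles only four named entities through a local table, whereas the patch delegates to SGMLEntities::translateAndUpdate. The function name decode_entities is hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Minimal entity decoder covering four named entities only (an assumption for
// this sketch). Unknown or unterminated entities are copied through unchanged.
std::string decode_entities(const std::string &text)
{
    static const struct { const char *name; char value; } table[] = {
        { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' }, { "&quot;", '"' }
    };
    std::string out;
    std::size_t i = 0;
    while (i < text.size())
    {
        bool matched = false;
        if (text[i] == '&')
        {
            for (const auto &e : table)
            {
                std::size_t len = std::strlen(e.name);
                if (text.compare(i, len, e.name) == 0)
                {
                    out += e.value;     // emit the decoded character
                    i += len;           // skip the whole entity
                    matched = true;
                    break;
                }
            }
        }
        if (!matched)
            out += text[i++];           // ordinary character, copy as-is
    }
    return out;
}
```

Unlike transSGML in the patch, which returns a pointer into a static String that the next call overwrites, this sketch returns by value, so two results can be held at the same time.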

Fix PR#566 by setting the correct length of the string being
matched. 'http://' is 7 characters. Submitted by
<wolfgang.pichler@creditanstalt.co.at>.

--- htdig-3.1.2.bak/htlib/URL.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/URL.cc Fri Jul 30 14:51:32 1999
@@ -130,7 +130,7 @@ URL::URL(char *ref, URL &parent)
     while (isalpha(*p))
         p++;
     int hasService = (*p == ':');
- if (hasService && ((strncmp(ref, "http://", 6) == 0) ||
+ if (hasService && ((strncmp(ref, "http://", 7) == 0) ||
                        (strncmp(ref, "http:", 5) != 0)))
     {
         //
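
The off-by-one is easy to demonstrate: with a length of 6, only "http:/" is compared, so a reference with a single slash passes the test. A small sketch with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not htdig code): does ref start with the first n
// characters of "http://"?
bool matches_http_prefix(const char *ref, std::size_t n)
{
    assert(ref != nullptr);
    return std::strncmp(ref, "http://", n) == 0;
}
```

With n = 6, "http:/relative/path" is accepted as if it were an absolute URL. With n = 7 it is rejected, and "http://www.example.com/" still matches.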

Fixes problem with $(VAR) at end of template string not being expanded.

--- htdig-3.1.2/htsearch/Display.cc.varstatebug Fri Jul 30 14:24:05 1999
+++ htdig-3.1.2/htsearch/Display.cc Fri Jul 30 15:25:09 1999
@@ -822,7 +822,7 @@ Display::expandVariables(char *str)
         }
         str++;
     }
- if (state == 5)
+ if (state == 2 || state == 5)
     {
         //
         // The end of string was reached, but we are still trying to

-------- 8< -------- snip -------- 8< --------

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



