Re: [htdig] honking big patch file collection for 3.1.2


Frank Guangxin Liu (frank@ctcqnx4.ctc.cummins.com)
Mon, 09 Aug 1999 10:51:56 -0500 (EST)


On Fri, 6 Aug 1999, Gilles Detillieux wrote:
> - username/password now blotted out from command arguments

Is the multi username/password support we discussed earlier in there?
Thanks!
Frank

> - adds support for <embed>, <object> and <link> tags
> - PR#554 fixed - locale now affects default date format in htsearch
> - fixes the bug in the handling of modification_time_is_now
> - PR#578 fixed - multiple directives in <meta> robots tag now work
> - now gives an error message for unknown hosts
> - empty or null strings won't cause htfuzzy to core dump
> - PDF parser now clears title string properly when done with it
> - PR#543 & PR#585 fixed - names like left_index.html no longer stripped
> - fixes server_alias entries so port defaults to 80 if omitted
> - decodes SGML entities inside tag attributes
> - PR#566 fixed - urls like 'http:/dir/file.ext' resolved properly
> - $(VAR) at end of template string now being expanded properly
> - PR#595 fixed - corrected address for FSF
> - maximum word length now a config attribute, not compile-time option
> - PR#81 & PR#472 fixed - htdig -vvv shouldn't crash in strftime()
> - PR#348 fixed - missing or invalid port number will get set correctly
> - PR#493 fixed - valid URL with ".." within a file name not rejected
> - PR#572 fixed - htsearch won't crash if CONTENT_LENGTH not set
> - PR#545 fixed - configure tests for presence of alloca.h for regex.c
> - documentation updates, including PR#558 & PR#626.
>
>
> -------- 8< -------- snip -------- 8< --------
> This patch should fix PR#545 by testing for the presence of alloca.h.
>
> --- htdig-3.1.2.bak/configure.in Wed Apr 21 21:47:53 1999
> +++ htdig-3.1.2/configure.in Wed Aug 4 16:17:57 1999
> @@ -13,7 +13,7 @@
> #
> # You should have received a copy of the GNU General Public License
> # along with this program; if not, write to the Free Software
> -# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> #
>
> AC_INIT(htcommon/DocumentDB.cc)
> @@ -79,7 +79,7 @@
>
> dnl More header checks--here use C++
> AC_LANG_CPLUSPLUS
> -AC_CHECK_HEADERS(fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h)
> +AC_CHECK_HEADERS(fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h alloca.h)
> AC_CHECK_HEADER(fstream.h,nofstream=0,nofstream=1)
> if test "x$nofstream" = "x1" ; then
> AC_MSG_ERROR([To compile ht://Dig, you will need a C++ library. Try installing libstdc++.])
> --- htdig-3.1.2.bak/configure Wed Apr 21 21:47:53 1999
> +++ htdig-3.1.2/configure Wed Aug 4 16:17:57 1999
> @@ -2010,7 +2010,7 @@
> CXXCPP="$ac_cv_prog_CXXCPP"
> echo "$ac_t""$CXXCPP" 1>&6
>
> -for ac_hdr in fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h
> +for ac_hdr in fcntl.h limits.h malloc.h sys/file.h sys/ioctl.h sys/time.h unistd.h getopt.h strings.h zlib.h alloca.h
> do
> ac_safe=`echo "$ac_hdr" | sed 'y%./+-%__p_%'`
> echo $ac_n "checking for $ac_hdr""... $ac_c" 1>&6
> --- htdig-3.1.2.bak/include/htconfig.h.in Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/include/htconfig.h.in Wed Aug 4 16:30:10 1999
> @@ -55,6 +55,9 @@
>
> /* Define if you have the <zlib.h> header file. */
> #undef HAVE_ZLIB_H
> +
> +/* Define if you have the <alloca.h> header file. */
> +#undef HAVE_ALLOCA_H
>
> /* Define if you have the <sys/file.h> header file. */
> #undef HAVE_SYS_FILE_H
> --- htdig-3.1.2.bak/htlib/regex.c Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htlib/regex.c Wed Aug 4 16:20:48 1999
> @@ -27,6 +27,7 @@
> #undef _GNU_SOURCE
> #define _GNU_SOURCE
>
> +#include <htconfig.h>
> #ifdef HAVE_CONFIG_H
> # include <config.h>
> #endif
>
> This adds descriptions for attributes that were missing, adds a few
> clarifications, and corrects a few defaults and typos. Covers PR#558,
> PR#626, and then some.
>
> --- htdig-3.1.2.bak/htdoc/attrs.html Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdoc/attrs.html Fri Aug 6 14:00:28 1999
> @@ -413,6 +413,57 @@
> <hr>
> <dl>
> <dt>
> + <strong><a name="bin_dir">bin_dir</a></strong>
> + </dt>
> + <dd>
> + <dl>
> + <dt>
> + <em>type:</em>
> + </dt>
> + <dd>
> + string
> + </dd>
> + <dt>
> + <em>used by:</em>
> + </dt>
> + <dd>
> + <a href="htdig.html">htdig</a>,
> + <a href="htnotify.html">htnotify</a>,
> + <a href="htfuzzy.html">htfuzzy</a>,
> + <a href="htmerge.html">htmerge</a> and
> + <a href="htsearch.html" target="_top">htsearch</a>
> + </dd>
> + <dt>
> + <em>default:</em>
> + </dt>
> + <dd>
> + BIN_DIR
> + </dd>
> + <dt>
> + <em>description:</em>
> + </dt>
> + <dd>
> + This is the directory in which the executables
> + related to ht://Dig are installed. It is never used
> + directly by any of the programs, but other attributes
> + can be defined in terms of this one.
> + <p>
> + The default value of this attribute is determined at
> + compile time.
> + </p>
> + </dd>
> + <dt>
> + <em>example:</em>
> + </dt>
> + <dd>
> + bin_dir: /usr/local/bin
> + </dd>
> + </dl>
> + </dd>
> + </dl>
> + <hr>
> + <dl>
> + <dt>
> <strong><a name="case_sensitive">case_sensitive</a></strong>
> </dt>
> <dd>
> @@ -595,7 +646,8 @@
> <dd>
> If specified and the <a
> href="http://www.cdrom.com/pub/infozip/zlib/">zlib</a>
> - compression library was available when compiledi controls
> + compression library was available when compiled,
> + this attribute controls
> the amount of compression used in the <a
> href="#doc_db">doc_db</a> file. Defaults to zero to
> provide backward compatility with old databases.
> @@ -612,6 +664,58 @@
> <hr>
> <dl>
> <dt>
> + <strong><a name="config_dir">config_dir</a></strong>
> + </dt>
> + <dd>
> + <dl>
> + <dt>
> + <em>type:</em>
> + </dt>
> + <dd>
> + string
> + </dd>
> + <dt>
> + <em>used by:</em>
> + </dt>
> + <dd>
> + <a href="htdig.html">htdig</a>,
> + <a href="htnotify.html">htnotify</a>,
> + <a href="htfuzzy.html">htfuzzy</a>,
> + <a href="htmerge.html">htmerge</a> and
> + <a href="htsearch.html" target="_top">htsearch</a>
> + </dd>
> + <dt>
> + <em>default:</em>
> + </dt>
> + <dd>
> + CONFIG_DIR
> + </dd>
> + <dt>
> + <em>description:</em>
> + </dt>
> + <dd>
> + This is the directory which contains all configuration
> + files related to ht://Dig. It is never used
> + directly by any of the programs, but other attributes
> + or the <a href="#include">include</a> directive
> + can be defined in terms of this one.
> + <p>
> + The default value of this attribute is determined at
> + compile time.
> + </p>
> + </dd>
> + <dt>
> + <em>example:</em>
> + </dt>
> + <dd>
> + config_dir: /var/htdig/conf
> + </dd>
> + </dl>
> + </dd>
> + </dl>
> + <hr>
> + <dl>
> + <dt>
> <strong><a name="create_image_list">
> create_image_list</a></strong>
> </dt>
> @@ -1459,7 +1563,7 @@
> <em>default:</em>
> </dt>
> <dd>
> - cgi-bin .cgi
> + /cgi-bin/ .cgi
> </dd>
> <dt>
> <em>description:</em>
> @@ -2136,6 +2240,103 @@
> <hr>
> <dl>
> <dt>
> + <strong><a name="image_url_prefix">image_url_prefix</a></strong>
> + </dt>
> + <dd>
> + <dl>
> + <dt>
> + <em>type:</em>
> + </dt>
> + <dd>
> + string
> + </dd>
> + <dt>
> + <em>used by:</em>
> + </dt>
> + <dd>
> + <a href="htsearch.html" target="_top">htsearch</a>
> + </dd>
> + <dt>
> + <em>default:</em>
> + </dt>
> + <dd>
> + IMAGE_URL_PREFIX
> + </dd>
> + <dt>
> + <em>description:</em>
> + </dt>
> + <dd>
> + This specifies the directory portion of the URL used
> + to display star images. This attribute isn't directly
> + used by htsearch, but is used in the default URL for
> + the <a href="#star_image">star_image</a> and
> + <a href="#star_blank">star_blank</a> attributes, and
> + other attributes may be defined in terms of this one.
> + <p>
> + The default value of this attribute is determined at
> + compile time.
> + </p>
> + </dd>
> + <dt>
> + <em>example:</em>
> + </dt>
> + <dd>
> + image_url_prefix: /images/htdig
> + </dd>
> + </dl>
> + </dd>
> + </dl>
> + <hr>
> + <dl>
> + <dt>
> + <strong><a name="include">include</a></strong>
> + </dt>
> + <dd>
> + <dl>
> + <dt>
> + <em>type:</em>
> + </dt>
> + <dd>
> + string
> + </dd>
> + <dt>
> + <em>used by:</em>
> + </dt>
> + <dd>
> + <a href="htdig.html">htdig</a>,
> + <a href="htnotify.html">htnotify</a>,
> + <a href="htfuzzy.html">htfuzzy</a>,
> + <a href="htmerge.html">htmerge</a> and
> + <a href="htsearch.html" target="_top">htsearch</a>
> + </dd>
> + <dt>
> + <em>description:</em>
> + </dt>
> + <dd>
> + This is not quite a configuration attribute, but
> + rather a directive. It can be used within one
> + configuration file to include the definitions of
> + another file. The last definition of an attribute
> + is the one that applies, so after including a file,
> + any of its definitions can be overridden with
> + subsequent definitions. This can be useful when
> + setting up many configurations that are mostly the
> + same, so all the common attributes can be maintained
> + in a single configuration file. The include directives
> + can be nested, but watch out for nesting loops.
> + </dd>
> + <dt>
> + <em>example:</em>
> + </dt>
> + <dd>
> + include: ${config_dir}/htdig.conf
> + </dd>
> + </dl>
> + </dd>
> + </dl>
> + <hr>
> + <dl>
> + <dt>
> <strong><a name="iso_8601">iso_8601</a></strong>
> </dt>
> <dd>
> @@ -4045,6 +4246,11 @@
> that is part of the <a
> href="http://www.foolabs.com/xpdf/">xpdf</a>
> 0.80 package have been tested as pdf_parsers.
> + <p>
> + The default value of this attribute is determined at
> + compile time, to include the path to the acroread
> + executable.
> + </p>
> </dd>
> <dt>
> <em>example:</em>
> @@ -4521,6 +4727,10 @@
> if no matches were found. In this case the
> <a href="#nothing_found_file">nothing_found_file</a>
> attribute is used instead.
> + Also, this file will not be output if it is
> + overridden by defining the
> + <a href="#search_results_wrapper">search_results_wrapper</a>
> + attribute.
> </dd>
> <dt>
> <em>example:</em>
> @@ -4633,6 +4843,10 @@
> if no matches were found. In this case the
> <a href="#nothing_found_file">nothing_found_file</a>
> attribute is used instead.
> + Also, this file will not be output if it is
> + overridden by defining the
> + <a href="#search_results_wrapper">search_results_wrapper</a>
> + attribute.
> </dd>
> <dt>
> <em>example:</em>
> @@ -6256,7 +6470,7 @@
> <em>default:</em>
> </dt>
> <dd>
> - .-_/!#$%^&amp;*'
> + .-_/!#$%^&amp;'
> </dd>
> <dt>
> <em>description:</em>
> @@ -6285,6 +6499,50 @@
> <hr>
> <dl>
> <dt>
> + <strong><a name="version">version</a></strong>
> + </dt>
> + <dd>
> + <dl>
> + <dt>
> + <em>type:</em>
> + </dt>
> + <dd>
> + string
> + </dd>
> + <dt>
> + <em>used by:</em>
> + </dt>
> + <dd>
> + <a href="htsearch.html" target="_top">htsearch</a>
> + </dd>
> + <dt>
> + <em>default:</em>
> + </dt>
> + <dd>
> + VERSION
> + </dd>
> + <dt>
> + <em>description:</em>
> + </dt>
> + <dd>
> + This specifies the value of the VERSION
> + variable which can be used in search templates.
> + The default value of this attribute is determined
> + at compile time, and will not normally be set
> + in configuration files.
> + </dd>
> + <dt>
> + <em>example:</em>
> + </dt>
> + <dd>
> + version: 3.1.2PL1
> + </dd>
> + </dl>
> + </dd>
> + </dl>
> + <hr>
> + <dl>
> + <dt>
> <strong><a name="word_db">word_db</a></strong>
> </dt>
> <dd>
> @@ -6385,7 +6643,7 @@
> <a href="author.html">Andrew Scherpbier &lt;andrew@contigo.com&gt;</a>
> </address>
> <!-- hhmts start -->
> -Last modified: Sun Feb 14 21:51:44 EST 1999
> +Last modified: Fri Aug 6 15:00:15 EDT 1999
> <!-- hhmts end -->
> </body>
> </html>
> --- htdig-3.1.2.bak/htdoc/cf_byname.html Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdoc/cf_byname.html Fri Aug 6 14:16:41 1999
> @@ -24,12 +24,14 @@
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#bad_extensions">bad_extensions</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#bad_querystr">bad_querystr</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#bad_word_list">bad_word_list</a><br>
> + <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#bin_dir">bin_dir</a><br>
> </font> <br>
> <b>C</b> <font face="helvetica,arial" size="2"><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#case_sensitive">case_sensitive</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#common_dir">common_dir</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#common_url_parts">common_url_parts</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#compression_level">compression_level</a><br>
> + <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#config_dir">config_dir</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#create_image_list">create_image_list</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#create_url_list">create_url_list</a><br>
> </font> <br>
> @@ -68,6 +70,8 @@
> </font> <br>
> <b>I</b> <font face="helvetica,arial" size="2"><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#image_list">image_list</a><br>
> + <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#image_url_prefix">image_url_prefix</a><br>
> + <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#include">include</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#iso_8601">iso_8601</a><br>
> </font> <br>
> <b>K</b> <font face="helvetica,arial" size="2"><br>
> @@ -170,6 +174,7 @@
> </font> <br>
> <b>V</b> <font face="helvetica,arial" size="2"><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#valid_punctuation">valid_punctuation</a><br>
> + <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#version">version</a><br>
> </font> <br>
> <b>W</b> <font face="helvetica,arial" size="2"><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#word_db">word_db</a><br>
> --- htdig-3.1.2.bak/htdoc/cf_byprog.html Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdoc/cf_byprog.html Fri Aug 6 14:19:45 1999
> @@ -168,6 +168,7 @@
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#use_meta_description">use_meta_description</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#use_star_image">use_star_image</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#valid_punctuation">valid_punctuation</a><br>
> + <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#version">version</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#word_db">word_db</a><br>
> </font>
> <form action="http://www.htdig.org/cgi-bin/htsearch" target=body>
>
> We uncovered a bug in the encodeURL() function back on May 20. This
> function should encode all non-ASCII characters, but currently it does not.
> I think this is what PR#339 was all about. Here's the fix:
>
> --- htdig-3.1.2/htlib/URLTrans.cc.orig Tue Feb 16 23:03:56 1999
> +++ htdig-3.1.2/htlib/URLTrans.cc Wed Jun 2 08:29:05 1999
> @@ -75,7 +75,7 @@ void encodeURL(String &str, char *valid)
>
> for (p = str; p && *p; p++)
> {
> - if (isdigit(*p) || isalpha(*p) || strchr(valid, *p))
> + if (isascii(*p) && (isdigit(*p) || isalpha(*p) || strchr(valid, *p)))
> temp << *p;
> else
> {
>
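
A note on the hunk above: the isascii() guard stops bytes with the high bit set from reaching isdigit()/isalpha(), so they fall through to the escaping branch. Below is a minimal standalone sketch of the same idea. It uses std::string rather than ht://Dig's String class, and the name encode_url and the lowercase hex digits are assumptions for illustration, not taken from URLTrans.cc.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical stand-in for encodeURL(): percent-encode every byte that is
// not an ASCII letter, an ASCII digit, or listed in `valid`.
std::string encode_url(const std::string &in, const char *valid)
{
    assert(valid != nullptr);
    static const char hex[] = "0123456789abcdef";   // digit case is an assumption
    std::string out;
    for (char ch : in)
    {
        // Work on the unsigned value so bytes >= 0x80 are never handed to
        // the <cctype> functions as negative numbers.
        unsigned char c = static_cast<unsigned char>(ch);
        bool ascii = c < 0x80;
        if (ascii && c != 0 && (std::isalnum(c) || std::strchr(valid, c)))
        {
            out += ch;              // safe character, copy through
        }
        else
        {
            out += '%';             // everything else becomes %xx
            out += hex[c >> 4];
            out += hex[c & 0x0f];
        }
    }
    return out;
}
```

With this sketch, a space and the two UTF-8 bytes of an accented letter are both escaped, while anything listed in `valid` passes through unchanged.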
> Suffix-handling improvement (PR#560), to prevent inappropriate suffix
> stripping in the endings fuzzy match.
>
> > From: Steve Arlow <yorick@ClarkHill.com>
> > Subject: Suffix-handling improvement
> > To: htdig3-bugs@htdig.org
> > Date: Tue, 8 Jun 1999 19:57:54 -0400 (EDT)
> > Cc: yorick@yorick.com
> >
> > Hello,
> >
> > I do consulting for a number of law firms, and quickly discovered a
> > problem with htfuzzy matching on the word "witness". (There are
> > three root words in the distribution dictionary that end in "-ness"
> > and also certainly exhibit this problem; the other two are
> > "highness" and "likeness". Other words can also be argued about.)
> >
> > The fix (which does not appear to break anything else AFAICT, but
> > may have a small effect on performance) is to add a preliminary check
> > on root2word before trying word2root. The code is below (from the
> > file htdig-3.1.2/htfuzzy/Endings.cc), optimize it to your taste.
>
> Follow-up example:
> > Words of the form XXXness which are not a form of the word XXX. If I
> > enter "witness" into htdig with matching for alternate endings enabled,
> > it will look for "wit", "wits", or "witness". What it should really be
> > looking for is "witness", "witnessed", "witnessing", or "witnesses".
> >
> > A similar problem might occur with other suffixes, but I can't think of
> > an example off the top of my head.
> >
> > The fix is to try to interpret each term as a root word before trying
> > to interpret it as an alternate form.
>
> --- htdig-3.1.2/htfuzzy/Endings.cc.endingsbug Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htfuzzy/Endings.cc Fri Jul 30 14:43:57 1999
> @@ -68,22 +68,6 @@ Endings::getWords(char *w, List &words)
> String word = w;
> word.lowercase();
>
> - if (word2root->Get(word, data) == OK)
> - {
> - //
> - // Found the root of the word. We'll add it to the list already
> - //
> - word = data;
> - words.Add(new String(word));
> - }
> - else
> - {
> - //
> - // The root wasn't found. This could mean that the word
> - // is already the root.
> - //
> - }
> -
> if (root2word->Get(word, data) == OK)
> {
> //
> @@ -97,6 +81,40 @@ Endings::getWords(char *w, List &words)
> words.Add(new String(token));
> }
> token = strtok(0, " ");
> + }
> + }
> + else
> + {
> + if (word2root->Get(word, data) == OK)
> + {
> + //
> + // Found the root of the word. We'll add it to the list already
> + //
> + word = data;
> + words.Add(new String(word));
> + }
> + else
> + {
> + //
> + // The root wasn't found. This could mean that the word
> + // is already the root.
> + //
> + }
> +
> + if (root2word->Get(word, data) == OK)
> + {
> + //
> + // Found the root's permutations
> + //
> + char *token = strtok(data.get(), " ");
> + while (token)
> + {
> + if (mystrcasecmp(token, w) != 0)
> + {
> + words.Add(new String(token));
> + }
> + token = strtok(0, " ");
> + }
> }
> }
> }
>
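
The reordering in the Endings.cc hunk can be hard to follow in diff form, so here is a rough sketch of the lookup order as I read it: try the term as a root first, and only fall back to word2root when that fails. The std::map dictionaries, the names get_words and add_forms, and the sample entries in the usage note are all assumptions for illustration; the real code uses ht://Dig's database and String classes.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Dict = std::map<std::string, std::string>;

// Lowercase a copy of the string (ASCII only).
static std::string lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Append every space-separated form listed for `root`, except `skip`.
static void add_forms(const Dict &root2word, const std::string &root,
                      const std::string &skip, std::vector<std::string> &out)
{
    auto it = root2word.find(root);
    if (it == root2word.end())
        return;
    std::istringstream forms(it->second);
    std::string form;
    while (forms >> form)
        if (lower(form) != skip)
            out.push_back(form);
}

// Hypothetical equivalent of the patched Endings::getWords().
std::vector<std::string> get_words(const std::string &w,
                                   const Dict &word2root,
                                   const Dict &root2word)
{
    std::vector<std::string> out;
    std::string word = lower(w);

    // New first step: if the term is itself a root, use only its own forms.
    if (root2word.count(word))
    {
        add_forms(root2word, word, word, out);
        return out;
    }

    // Otherwise behave as before: map the word to its root, then expand.
    auto it = word2root.find(word);
    std::string root = (it != word2root.end()) ? it->second : word;
    if (it != word2root.end())
        out.push_back(root);
    add_forms(root2word, root, word, out);
    return out;
}
```

With made-up entries where "witness" is listed both as a form of "wit" and as a root of its own, a query for "witness" yields only witnessed/witnessing/witnesses, while a query for "wits" still expands through "wit".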
> Quote the filename before passing it to the command line to prevent
> shell escapes. Fixes PR#542. Also make error messages more useful.
>
> --- htdig-3.1.2/htdig/ExternalParser.cc.old Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdig/ExternalParser.cc Fri Jul 30 15:08:57 1999
> @@ -133,8 +133,8 @@ ExternalParser::parse(Retriever &retriev
> // Now start the external parser.
> //
> String command = currentParser;
> - command << ' ' << path << ' ' << contentType << ' ' << base.get() <<
> - ' ' << configFile;
> + command << ' ' << path << ' ' << contentType << " \"" << base.get() <<
> + "\" " << configFile;
>
> FILE *input = popen(command, "r");
> if (!input)
> @@ -170,7 +170,7 @@ ExternalParser::parse(Retriever &retriev
> (hd = atoi(token3)) >= 0 && hd < 12)
> retriever.got_word(token1, loc, hd);
> else
> - cerr<< "External parser error in line:"<<line<<"\n";
> + cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
> break;
>
> case 'u': // href
> @@ -183,7 +183,7 @@ ExternalParser::parse(Retriever &retriev
> retriever.got_href(url, token2);
> }
> else
> - cerr<< "External parser error in line:"<<line<<"\n";
> + cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
> break;
>
> case 't': // title
> @@ -191,7 +191,7 @@ ExternalParser::parse(Retriever &retriev
> if (token1 != NULL)
> retriever.got_title(token1);
> else
> - cerr<< "External parser error in line:"<<line<<"\n";
> + cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
> break;
>
> case 'h': // head
> @@ -199,7 +199,7 @@ ExternalParser::parse(Retriever &retriev
> if (token1 != NULL)
> retriever.got_head(token1);
> else
> - cerr<< "External parser error in line:"<<line<<"\n";
> + cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
> break;
>
> case 'a': // anchor
> @@ -207,7 +207,7 @@ ExternalParser::parse(Retriever &retriev
> if (token1 != NULL)
> retriever.got_anchor(token1);
> else
> - cerr<< "External parser error in line:"<<line<<"\n";
> + cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
> break;
>
> case 'i': // image url
> @@ -215,7 +215,7 @@ ExternalParser::parse(Retriever &retriev
> if (token1 != NULL)
> retriever.got_image(token1);
> else
> - cerr<< "External parser error in line:"<<line<<"\n";
> + cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
> break;
>
> case 'm': // meta
> @@ -329,12 +329,12 @@ ExternalParser::parse(Retriever &retriev
> }
> }
> else
> - cerr<< "External parser error in line:"<<line<<"\n";
> + cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
> break;
> }
>
> default:
> - cerr<< "External parser error in line:"<<line<<"\n";
> + cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
> break;
> }
> }
>
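
One caveat on the quoting hunk above: inside double quotes a POSIX shell still expands `$` and backquotes, and an embedded double quote ends the quoted string. A stricter variant, which is not what the patch does, is to wrap the argument in single quotes. The helper name shell_quote below is hypothetical; this is only a sketch of that well-known technique.

```cpp
#include <string>

// Hypothetical helper: wrap `arg` in single quotes for a POSIX shell.
// Each embedded ' becomes '\'' (close the quote, emit an escaped quote,
// reopen the quote), so nothing inside is interpreted by the shell.
std::string shell_quote(const std::string &arg)
{
    std::string out = "'";
    for (char c : arg)
    {
        if (c == '\'')
            out += "'\\''";
        else
            out += c;
    }
    out += '\'';
    return out;
}
```

A command string could then be assembled as `parser + " " + shell_quote(path) + " " + shell_quote(url)` before being handed to popen().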
> Fix the declaration to take "first" by reference, which ensures ANCHOR is
> properly set. Fixes PR#541 as suggested by <pmb1@york.ac.uk>.
>
> --- htdig-3.1.2.bak/htsearch/Display.h Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htsearch/Display.h Fri Jul 30 14:23:56 1999
> @@ -151,7 +151,7 @@ protected:
> String *readFile(char *);
> void expandVariables(char *);
> void outputVariable(char *);
> - String *excerpt(DocumentRef *ref, String urlanchor, int fanchor, int first);
> + String *excerpt(DocumentRef *ref, String urlanchor, int fanchor, int &first);
> char *hilight(char *str, String urlanchor, int fanchor);
> void setupImages();
> String *generateStars(DocumentRef *, int);
> --- htdig-3.1.2.bak/htsearch/Display.cc Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htsearch/Display.cc Fri Jul 30 14:24:05 1999
> @@ -959,7 +959,7 @@ Display::buildMatchList()
>
> //*****************************************************************************
> String *
> -Display::excerpt(DocumentRef *ref, String urlanchor, int fanchor, int first)
> +Display::excerpt(DocumentRef *ref, String urlanchor, int fanchor, int &first)
> {
> char *head;
> int use_meta_description = 0;
>
> This patch fixes PR#348, to make sure a missing or invalid port number will
> get set correctly.
>
> --- htdig-3.1.2.bak/htlib/URL.cc Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htlib/URL.cc Wed Aug 4 13:09:01 1999
> @@ -282,6 +282,8 @@ void URL::parse(char *u)
> p = strtok(0, "/");
> if (p)
> _port = atoi(p);
> + if (!p || _port <= 0)
> + _port = 80;
> }
> else
> {
>
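
The two added lines in URL.cc amount to "fall back to 80 unless a positive number was parsed". A tiny sketch of that rule in isolation, with a hypothetical helper name:

```cpp
#include <cstdlib>

// Hypothetical helper mirroring the patched logic: `p` is the text that
// follows the ':' in host:port, or a null pointer when nothing follows it.
int port_or_default(const char *p)
{
    int port = p ? std::atoi(p) : 0;
    if (port <= 0)
        port = 80;      // missing, non-numeric, or non-positive port
    return port;
}
```

Because atoi() returns 0 for text that does not start with a number, a port such as "abc" ends up as 80 here, just like a missing one.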
> This should fix PR#493, to avoid rejecting a valid URL with ".." in it.
>
> --- htdig-3.1.2.bak/htdig/Retriever.cc Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdig/Retriever.cc Wed Aug 4 15:51:44 1999
> @@ -625,7 +625,7 @@ Retriever::IsValidURL(char *u)
> // Currently, we only deal with HTTP URLs. Gopher and ftp will
> // come later... ***FIX***
> //
> - if (strstr(u, "..") || strncmp(u, "http://", 7) != 0)
> + if (strstr(u, "/../") || strncmp(u, "http://", 7) != 0)
> {
> if (debug > 2)
> cout << endl <<" Rejected: Not an http or relative link!";
>
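
To see what the narrower "/../" test changes, here is the condition from the hunk above pulled out into a standalone function. The function name is made up, and the real IsValidURL() goes on to apply further checks that are not shown here.

```cpp
#include <cstring>

// Hypothetical extract of the first test in Retriever::IsValidURL():
// accept only http:// URLs that contain no "/../" path component.
bool is_http_without_dotdot(const char *u)
{
    if (std::strstr(u, "/../"))
        return false;
    return std::strncmp(u, "http://", 7) == 0;
}
```

A file name such as a..b.html now passes, whereas the old strstr(u, "..") test rejected it.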
> This updates the FSF address in COPYING & Makefile.in. PR#595.
> The address is still old in configure.in, but we won't touch it
> here so that we don't need to run autoconf.
>
> --- htdig3.1.2.bak/COPYING Tue Feb 16 23:03:53 1999
> +++ htdig3.1.2/COPYING Wed Aug 4 07:40:22 1999
> @@ -2,7 +2,7 @@
> Version 2, June 1991
>
> Copyright (C) 1989, 1991 Free Software Foundation, Inc.
> - 675 Mass Ave, Cambridge, MA 02139, USA
> + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> Everyone is permitted to copy and distribute verbatim copies
> of this license document, but changing it is not allowed.
>
> @@ -305,7 +305,8 @@
>
> You should have received a copy of the GNU General Public License
> along with this program; if not, write to the Free Software
> - Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> +
>
> Also add information on how to contact you by electronic and paper mail.
>
> --- htdig3.1.2.bak/htdoc/COPYING Tue Feb 16 23:03:53 1999
> +++ htdig3.1.2/htdoc/COPYING Wed Aug 4 07:40:22 1999
> @@ -2,7 +2,7 @@
> Version 2, June 1991
>
> Copyright (C) 1989, 1991 Free Software Foundation, Inc.
> - 675 Mass Ave, Cambridge, MA 02139, USA
> + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> Everyone is permitted to copy and distribute verbatim copies
> of this license document, but changing it is not allowed.
>
> @@ -305,7 +305,8 @@
>
> You should have received a copy of the GNU General Public License
> along with this program; if not, write to the Free Software
> - Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> +
>
> Also add information on how to contact you by electronic and paper mail.
>
> --- htdig-3.1.2.bak/Makefile.in Wed Apr 21 21:47:53 1999
> +++ htdig-3.1.2/Makefile.in Wed Aug 4 10:10:54 1999
> @@ -13,7 +13,7 @@
>
> # You should have received a copy of the GNU General Public License
> # along with this program; if not, write to the Free Software
> -# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
>
> top_srcdir= @top_srcdir@
> srcdir= @srcdir@
>
> This should help with PR#81 & PR#472, where strftime() would crash on
> some systems. Idea submitted by benoit.sibaud@cnet.francetelecom.fr.
>
> --- htdig-3.1.2.bak/htdig/Document.cc Wed Aug 4 12:43:27 1999
> +++ htdig-3.1.2/htdig/Document.cc Wed Aug 4 13:37:43 1999
> @@ -215,6 +215,8 @@ Document::getdate(char *datestring)
> // correct for mystrptime, if %Y format saw only a 2 digit year
> if (tm.tm_year < 0)
> tm.tm_year += 1900;
> + tm.tm_yday = 0; // clear these to prevent problems in strftime()
> + tm.tm_wday = 0;
>
> if (debug > 2)
> {
>
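
Some background on the strftime() hunk, as far as I understand it: a struct tm that is filled in field by field can leave tm_wday and tm_yday holding garbage, and strftime() may read them for day-name or day-of-year conversions. The patch clears just those two fields. The sketch below takes the more general route of zero-initialising the whole structure first; the function name and the fixed format string are assumptions for illustration.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: build a struct tm by hand and format it safely.
std::string format_ymd(int year, int month, int day)
{
    std::tm tm = {};            // value-initialise: every field starts at 0
    tm.tm_year = year - 1900;   // struct tm counts years from 1900
    tm.tm_mon  = month - 1;     // and months from 0
    tm.tm_mday = day;

    char buf[32];
    std::size_t n = std::strftime(buf, sizeof buf, "%Y-%m-%d", &tm);
    return std::string(buf, n);
}
```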
> This patch fixes a few problems with header parsing, including PR#535 & PR#557.
>
> --- htdig-3.1.2/htdig/Document.cc.hdrparsebug Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdig/Document.cc Fri Jul 30 14:15:10 1999
> @@ -478,14 +478,18 @@ Document::readHeader(Connection &c)
> inHeader = 0;
> else
> {
> + char *token = line.get();
> + while (*token && !isspace(*token))
> + token++;
> + while (*token && isspace(*token))
> + token++;
> if (strncmp(line, "HTTP/", 5) == 0)
> {
> //
> // Found the status line. This will determine if we
> // continue or not
> //
> - strtok(line, " ");
> - char *status = strtok(0, " ");
> + char *status = strtok(token, " ");
> if (status && strcmp(status, "200") == 0)
> {
> returnStatus = Header_ok;
> @@ -508,22 +512,19 @@ Document::readHeader(Connection &c)
> returnStatus = Header_not_authorized;
> }
> }
> - else if (modtime == 0
> + else if (modtime == 0 && *token
> && mystrncasecmp(line, "last-modified:", 14) == 0)
> {
> - strtok(line, " \t");
> - modtime = getdate(strtok(0, "\n\t"));
> + modtime = getdate(strtok(token, "\n\t"));
> }
> - else if (contentLength == -1
> + else if (contentLength == -1 && *token
> && mystrncasecmp(line, "content-length:", 15) == 0)
> {
> - strtok(line, " \t");
> - contentLength = atoi(strtok(0, "\n\t"));
> + contentLength = atoi(strtok(token, "\n\t"));
> }
> - else if (mystrncasecmp(line, "content-type:", 13) == 0)
> + else if (*token && mystrncasecmp(line, "content-type:", 13) == 0)
> {
> - strtok(line, " \t");
> - char *token = strtok(0, "\n\t");
> + token = strtok(token, "\n\t");
>
> if ((returnStatus == Header_not_found ||
> returnStatus == Header_ok) &&
> @@ -537,8 +538,7 @@ Document::readHeader(Connection &c)
> }
> else if (mystrncasecmp(line, "location:", 9) == 0)
> {
> - strtok(line, " \t");
> - redirected_to = strtok(0, "\r\n \t");
> + redirected_to = strtok(token, "\r\n \t");
> }
> }
> }
>
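
As I read the header-parsing hunk, the old code passed the result of a second strtok() call straight to atoi() or getdate(), and that result is a null pointer when a header has a name but no value. The patch instead finds the start of the value once and checks that it is non-empty. Here is a standalone sketch of that "skip the name, skip the whitespace" step using std::string; the helper name is hypothetical, and trimming at CR/LF is my simplification of the delimiters used in the patch.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: return the value part of "Name: value",
// or an empty string when the header carries no value.
std::string header_value(const std::string &line)
{
    std::size_t i = 0;
    // Skip the field name (everything up to the first whitespace).
    while (i < line.size() && !std::isspace(static_cast<unsigned char>(line[i])))
        ++i;
    // Skip the whitespace that separates the name from the value.
    while (i < line.size() && std::isspace(static_cast<unsigned char>(line[i])))
        ++i;
    // Stop at the line terminator, if any.
    std::size_t end = line.find_first_of("\r\n", i);
    return line.substr(i, end == std::string::npos ? std::string::npos : end - i);
}
```

A caller can then test for an empty result before converting it, which corresponds to the `*token` checks added in the patch.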
> This is Geoff's patch to hide the username/password in the command line
> arguments.
>
> --- htdig-3.1.2/htdig/htdig.cc.orig Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdig/htdig.cc Fri Jul 30 17:24:32 1999
> @@ -79,6 +79,8 @@ main(int ac, char **av)
> break;
> case 'u':
> credentials = optarg;
> + for (int pos = 0; pos < strlen(optarg); pos++)
> + optarg[pos] = '*';
> break;
> case 'a':
> alt_work_area++;
>
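
The idea in the htdig.cc hunk is to keep a private copy of the credentials and then overwrite the argv slot in place. Overwriting argv changes what tools like ps report on many Unix systems, though that behaviour is platform-dependent and there is still a short window before the overwrite happens. A minimal sketch, with a hypothetical helper name and std::string standing in for whatever type `credentials` really has:

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: copy a secret out of an argv slot, then blot the
// slot with '*' characters of the same length.
std::string take_and_blot(char *arg)
{
    assert(arg != nullptr);
    std::string copy(arg);                      // keep our own copy first
    std::memset(arg, '*', std::strlen(arg));    // then overwrite in place
    return copy;
}
```

Computing the length once, as memset() does here, also avoids calling strlen() on every loop iteration.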
> This patch adds support for <embed>, <object> and <link> tags.
> (Don't you wish all additions could be this easy?)
>
> --- htdig-3.1.2/htdig/HTML.cc.old Fri Jul 30 12:24:14 1999
> +++ htdig-3.1.2/htdig/HTML.cc Fri Jul 30 13:16:55 1999
> @@ -63,7 +63,7 @@ HTML::HTML()
> // the attrs Match object is used to match names of tag parameters.
> //
> tags.IgnoreCase();
> - tags.Pattern("title|/title|a|/a|h1|h2|h3|h4|h5|h6|/h1|/h2|/h3|/h4|/h5|/h6|noindex|/noindex|img|li|meta|frame|area|base");
> + tags.Pattern("title|/title|a|/a|h1|h2|h3|h4|h5|h6|/h1|/h2|/h3|/h4|/h5|/h6|noindex|/noindex|img|li|meta|frame|area|base|embed|object|link");
>
> attrs.IgnoreCase();
> attrs.Pattern("src|href|name");
> @@ -894,6 +894,8 @@ HTML::do_tag(Retriever &retriever, Strin
> }
>
> case 21: // frame
> + case 24: // embed
> + case 25: // object
> {
> which = -1;
> int pos = srcMatch.FindFirstWord(position, which, length);
> @@ -963,6 +965,7 @@ HTML::do_tag(Retriever &retriever, Strin
> }
>
> case 22: // area
> + case 26: // link
> {
> which = -1;
> int pos = hrefMatch.FindFirstWord(position, which, length);
> @@ -972,7 +975,7 @@ HTML::do_tag(Retriever &retriever, Strin
> case 0: // "href"
> {
> //
> - // src seen
> + // href seen
> //
> while (*position && *position != '=')
> position++;
>
> Torsten Neuer's <tneuer@inwise.de> fix for PR#554.
>
> --- htdig-3.1.2.bak/htsearch/Display.cc Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htsearch/Display.cc Tue Aug 3 14:46:30 1999
> @@ -20,6 +20,7 @@ static char RCSid[] = "$Id: Display.cc,v
> #include <stdio.h>
> #include <ctype.h>
> #include <syslog.h>
> +#include <locale.h>
> #include "HtURLCodec.h"
> #include "HtWordType.h"
>
> @@ -318,6 +319,7 @@ Display::displayMatch(ResultMatch *match
> {
> struct tm *tm = localtime(&t);
> char *datefmt = config["date_format"];
> + char *locale = config["locale"];
> if (!datefmt || !*datefmt)
> {
> if (config.Boolean("iso_8601"))
> @@ -325,6 +327,10 @@ Display::displayMatch(ResultMatch *match
> else
> datefmt = "%x";
> }
> + if ( locale && *locale )
> + {
> + setlocale(LC_TIME,locale);
> + }
> strftime(buffer, sizeof(buffer), datefmt, tm);
> *str << buffer;
> }
>
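
For anyone wondering why the setlocale() call matters: the "%x" conversion in strftime() produces the date representation of the current LC_TIME locale, so the default date format only follows the configured locale once LC_TIME has been set. A small sketch under the assumption that the locale name comes in as a plain C string; the function name is made up, and note that setlocale() changes process-wide state.

```cpp
#include <clocale>
#include <ctime>
#include <string>

// Hypothetical helper: format a date with "%x" under the given locale.
std::string format_date_x(const char *locale, int year, int month, int day)
{
    std::tm tm = {};
    tm.tm_year = year - 1900;
    tm.tm_mon  = month - 1;
    tm.tm_mday = day;

    // setlocale() returns a null pointer and changes nothing if the
    // requested locale is not available on this system.
    if (locale && *locale)
        std::setlocale(LC_TIME, locale);

    char buf[64];
    std::size_t n = std::strftime(buf, sizeof buf, "%x", &tm);
    return std::string(buf, n);
}
```

In the "C" locale "%x" is defined as month/day/two-digit year; other locales give whatever their date convention is, provided they are installed.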
> This patch turns the maximum word length into a run-time option, rather
> than compile-time.
>
> --- htdig-3.1.2.bak/include/htconfig.h.in Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/include/htconfig.h.in Wed Aug 4 10:43:33 1999
> @@ -5,7 +5,6 @@
> #define _config_h_
>
> #define VERSION 1
> -#define MAX_WORD_LENGTH 12
>
> /* Define if on AIX 3.
> System headers sometimes define this.
> --- htdig-3.1.2.bak/htcommon/WordReference.h Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htcommon/WordReference.h Wed Aug 4 10:44:12 1999
> @@ -25,7 +25,7 @@ public:
> WordReference() {}
> ~WordReference() {}
>
> - char Word[MAX_WORD_LENGTH + 1];
> + String Word;
> int WordCount;
> int Weight;
> int Location;
> --- htdig-3.1.2.bak/htcommon/WordList.cc Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htcommon/WordList.cc Wed Aug 4 12:22:31 1999
> @@ -46,11 +46,12 @@ void WordList::Word(char *word, int loca
> if (weight_factor == 0.0) // Why should we add words with no weight?
> return;
> String shortword = word;
> + static int maximum_word_length = config.Value("maximum_word_length", 12);
>
> shortword.lowercase();
> word = shortword.get();
> - if (shortword.length() > MAX_WORD_LENGTH)
> - word[MAX_WORD_LENGTH] = '\0';
> + if (shortword.length() > maximum_word_length)
> + word[maximum_word_length] = '\0';
>
> if (!valid_word(word))
> return;
> @@ -80,7 +81,7 @@ void WordList::Word(char *word, int loca
> wordRef->DocumentID = docID;
> wordRef->Weight = int((1000 - location) * weight_factor);
> wordRef->Anchor = anchor_number;
> - strcpy(wordRef->Word, word);
> + wordRef->Word = word;
> words->Add(word, wordRef);
> }
> }
> @@ -145,7 +146,7 @@ void WordList::Flush()
> while ((wordRef = (WordReference *) words->Get_NextElement()))
> {
>
> - fprintf(fl, "%s",wordRef->Word);
> + fprintf(fl, "%s",wordRef->Word.get());
> fprintf(fl, "\ti:%d\tl:%d\tw:%d",
> wordRef->DocumentID,
> wordRef->Location,
> @@ -220,15 +221,16 @@ void WordList::BadWordFile(char *filenam
> char buffer[1000];
> char *word;
> String new_word;
> - int minimum_word_length = config.Value("minimum_word_length", 3);
> + static int minimum_word_length = config.Value("minimum_word_length", 3);
> + static int maximum_word_length = config.Value("maximum_word_length", 12);
>
> while (fl && fgets(buffer, sizeof(buffer), fl))
> {
> word = strtok(buffer, "\r\n \t");
> if (word && *word)
> {
> - if (strlen(word) > MAX_WORD_LENGTH)
> - word[MAX_WORD_LENGTH] = '\0';
> + if (strlen(word) > maximum_word_length)
> + word[maximum_word_length] = '\0';
> new_word = word; // We need to clean it up before we add it
> new_word.lowercase(); // Just in case someone enters an odd one
> HtStripPunctuation(new_word);
> --- htdig-3.1.2.bak/htcommon/DocumentRef.cc Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htcommon/DocumentRef.cc Wed Aug 4 10:45:30 1999
> @@ -571,8 +571,7 @@ void DocumentRef::AddDescription(char *d
> static double description_factor = config.Double("description_factor");
> static int max_descriptions = config.Value("max_descriptions", 5);
>
> - // Not restricted to this size, just used as a hint.
> - String word(MAX_WORD_LENGTH);
> + String word;
>
> while (*p)
> {
> --- htdig-3.1.2.bak/htcommon/defaults.cc Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htcommon/defaults.cc Wed Aug 4 10:47:44 1999
> @@ -89,6 +89,7 @@ ConfigDefaults defaults[] =
> {"max_prefix_matches", "1000"},
> {"max_stars", "4"},
> {"maximum_pages", "10"},
> + {"maximum_word_length", "12"},
> {"metaphone_db", "${database_base}.metaphone.db"},
> {"meta_description_factor", "50"},
> {"method_names", "and All or Any boolean Boolean"},
> --- htdig-3.1.2.bak/htsearch/parser.cc Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htsearch/parser.cc Wed Aug 4 10:50:41 1999
> @@ -202,6 +202,7 @@ Parser::setError(char *expected)
> void
> Parser::perform_push()
> {
> + static int maximum_word_length = config.Value("maximum_word_length", 12);
> String temp = current->word.get();
> String data;
> char *p;
> @@ -220,8 +221,8 @@ Parser::perform_push()
> }
> temp.lowercase();
> p = temp.get();
> - if (temp.length() > MAX_WORD_LENGTH)
> - p[MAX_WORD_LENGTH] = '\0';
> + if (temp.length() > maximum_word_length)
> + p[maximum_word_length] = '\0';
> if (dbf->Get(p, data) == OK)
> {
> p = data.get();
> --- htdig-3.1.2.bak/htdoc/attrs.html Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdoc/attrs.html Wed Aug 4 10:58:59 1999
> @@ -3124,6 +3124,51 @@
> <hr>
> <dl>
> <dt>
> + <strong><a name="maximum_word_length">
> + maximum_word_length</a></strong>
> + </dt>
> + <dd>
> + <dl>
> + <dt>
> + <em>type:</em>
> + </dt>
> + <dd>
> + number
> + </dd>
> + <dt>
> + <em>used by:</em>
> + </dt>
> + <dd>
> + <a href="htdig.html">htdig</a> and
> + <a href="htsearch.html" target="_top">htsearch</a>
> + </dd>
> + <dt>
> + <em>default:</em>
> + </dt>
> + <dd>
> + 12
> + </dd>
> + <dt>
> + <em>description:</em>
> + </dt>
> + <dd>
> + This sets the maximum length of words that will be
> + indexed. Words longer than this value will be silently
> + truncated when put into the index, or searched in the
> + index.
> + </dd>
> + <dt>
> + <em>example:</em>
> + </dt>
> + <dd>
> + maximum_word_length: 15
> + </dd>
> + </dl>
> + </dd>
> + </dl>
> + <hr>
> + <dl>
> + <dt>
> <strong><a name="meta_description_factor">
> meta_description_factor</a></strong>
> </dt>
> --- htdig-3.1.2.bak/htdoc/cf_byname.html Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdoc/cf_byname.html Wed Aug 4 10:59:30 1999
> @@ -96,6 +96,7 @@
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_prefix_matches">max_prefix_matches</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_stars">max_stars</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_pages">maximum_pages</a><br>
> + <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_word_length">maximum_word_length</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#meta_description_factor">meta_description_factor</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#metaphone_db">metaphone_db</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#method_names">method_names</a><br>
> --- htdig-3.1.2.bak/htdoc/cf_byprog.html Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdoc/cf_byprog.html Wed Aug 4 11:00:31 1999
> @@ -54,6 +54,7 @@
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_head_length">max_head_length</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_hop_count">max_hop_count</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_meta_description_length">max_meta_description_length</a><br>
> + <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_word_length">maximum_word_length</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#meta_description_factor">meta_description_factor</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#minimum_word_length">minimum_word_length</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#modification_time_is_now">modification_time_is_now</a><br>
> @@ -132,6 +133,7 @@
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_prefix_matches">max_prefix_matches</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#max_stars">max_stars</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_pages">maximum_pages</a><br>
> + <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#maximum_word_length">maximum_word_length</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#method_names">method_names</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#minimum_prefix_length">minimum_prefix_length</a><br>
> <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#minimum_word_length">minimum_word_length</a><br>
>
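To illustrate the run-time truncation this patch introduces, here is a minimal sketch. The helper name truncate_word and the use of std::string are assumptions for the example only; the patched code works on ht://Dig's own String class and reads the limit with config.Value("maximum_word_length", 12).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ht://Dig): cut a word down to at most
// max_len characters, mirroring the truncation the patch performs with a
// limit read at run time instead of the old MAX_WORD_LENGTH constant.
std::string truncate_word(const std::string &word, std::size_t max_len)
{
    if (word.length() > max_len)
        return word.substr(0, max_len);
    return word;
}
```

Because the same limit applies when indexing and when searching, a long query word is cut to the same prefix that was stored in the index.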
> I think this patch will fix PR#514 in the bug database. It's Geoff's
> first patch, with a minor correction, plus an added test in the vscode
> macro, which is where the problem seemed to be happening. The author
> of the metaphone code likely assumed that isalpha() meant [A-Za-z],
> and forgot about upper-half (8-bit) characters. This won't do anything
> to map accented vowels to their unaccented counterparts, but it should
> put an end to the segmentation faults.
>
> --- htdig-3.1.2.bak/htfuzzy/Fuzzy.cc Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htfuzzy/Fuzzy.cc Fri Jul 30 16:37:42 1999
> @@ -55,6 +55,8 @@ Fuzzy::getWords(char *word, List &words)
> {
> if (!index)
> return;
> + if (!word || !*word)
> + return;
>
> //
> // Convert the word to a fuzzy key
> --- htdig-3.1.2.bak/htfuzzy/Metaphone.cc Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htfuzzy/Metaphone.cc Tue Aug 3 14:50:06 1999
> @@ -51,7 +51,7 @@ static char vsvfn[26] = {
> /* N O P Q R S T U V W X Y Z */
>
> /* Macros to access character coding array */
> -#define vscode(x) (vsvfn[(x) - 'A'])
> +#define vscode(x) ((x) >= 'A' && (x) <= 'Z' ? vsvfn[(x) - 'A'] : 0)
> #define vowel(x) ((x) != '\0' && vscode(x) & 1) /* AEIOU */
> #define same(x) ((x) != '\0' && vscode(x) & 2) /* FJLMNR */
> #define varson(x) ((x) != '\0' && vscode(x) & 4) /* CGPST */
> @@ -63,6 +63,9 @@ static char vsvfn[26] = {
> void
> Metaphone::generateKey(char *word, String &key)
> {
> + if (!word || !*word)
> + return;
> +
> char *n;
> String ntrans;
>
>
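The range check added to vscode() can be sketched on its own as follows. The table contents here are placeholders (only the vowels are flagged) and the names demo_codes and safe_code are made up for the example; they are not the real vsvfn[] values.

```cpp
// Placeholder 26-entry table: 1 marks a vowel, 0 anything else.
// These are NOT the real vsvfn[] contents from Metaphone.cc.
static const char demo_codes[26] = {
    1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,   // A..M
    0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0    // N..Z
};

// Hypothetical function form of the patched macro: only index the table
// for 'A'..'Z', and return 0 for anything else (digits, lower case,
// 8-bit characters) instead of reading outside the array.
int safe_code(int c)
{
    return (c >= 'A' && c <= 'Z') ? demo_codes[c - 'A'] : 0;
}
```

A character such as 0xE9 yields an index far outside 0..25 whether char is signed or unsigned, which is presumably how the unguarded macro ended up reading invalid memory.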
> This patch fixes the bug in the handling of modification_time_is_now in
> the readHeader() function.
>
> --- htdig-3.1.2/htdig/Document.cc.modnowbug Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdig/Document.cc Fri Jul 30 13:39:18 1999
> @@ -96,10 +96,7 @@ Document::Reset()
> delete url;
> url = 0;
> referer = 0;
> - if(config.Boolean("modification_time_is_now"))
> - modtime = time(NULL);
> - else
> - modtime = 0;
> + modtime = 0;
>
> contents = 0;
> document_length = 0;
> @@ -463,10 +460,7 @@ Document::readHeader(Connection &c)
> int inHeader = 1;
> int returnStatus = Header_not_found;
>
> - if (config.Boolean("modification_time_is_now"))
> - modtime = time(NULL);
> - else
> - modtime = 0;
> + modtime = 0;
>
> while (inHeader)
> {
> @@ -542,6 +536,11 @@ Document::readHeader(Connection &c)
> }
> }
> }
> + static int modification_time_is_now =
> + config.Boolean("modification_time_is_now");
> + if (modtime == 0 && modification_time_is_now)
> + modtime = time(NULL);
> +
> if (debug > 2)
> cout << "returnStatus = " << returnStatus << endl;
> return returnStatus;
>
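The intent of the change, as far as it can be read from the diff, is that the current time is only used as a fallback when the headers supplied no modification time. A small sketch of that rule, with a hypothetical function name and the clock passed in as a parameter so it is easy to check:

```cpp
#include <ctime>

// Hypothetical helper (not in ht://Dig): decide the final modification
// time once header parsing is done. header_modtime is 0 when no usable
// Last-Modified header was seen.
std::time_t resolve_modtime(std::time_t header_modtime,
                            bool modification_time_is_now,
                            std::time_t now)
{
    if (header_modtime == 0 && modification_time_is_now)
        return now;            // nothing from the server, fall back to "now"
    return header_modtime;     // keep what the server said (or 0)
}
```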
> This patch fixes <meta> robots parsing to allow multiple directives
> to work correctly. Fixes PR#578, as provided by Chris Liddiard
> <c.h.liddiard@qmw.ac.uk>.
>
> --- htdig-3.1.2/htdig/HTML.cc.robotbug Fri Jul 30 12:24:14 1999
> +++ htdig-3.1.2/htdig/HTML.cc Fri Jul 30 13:28:35 1999
> @@ -873,9 +873,9 @@ HTML::do_tag(Retriever &retriever, Strin
> doindex = 0;
> retriever.got_noindex();
> }
> - else if (content_cache.indexOf("nofollow") != -1)
> + if (content_cache.indexOf("nofollow") != -1)
> dofollow = 0;
> - else if (content_cache.indexOf("none") != -1)
> + if (content_cache.indexOf("none") != -1)
> {
> doindex = 0;
> dofollow = 0;
>
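The effect of replacing the "else if" chain with independent tests can be seen in a stand-alone sketch. The struct and function names are invented for the example, std::string stands in for ht://Dig's String class, and the input is assumed to be lower-cased already.

```cpp
#include <string>

// Hypothetical result type for the example.
struct RobotsFlags
{
    bool doindex = true;
    bool dofollow = true;
};

// Each directive is tested on its own, so "noindex,nofollow" clears both
// flags. With an else-if chain, the first match would hide the rest.
RobotsFlags parse_robots(const std::string &content)
{
    RobotsFlags flags;
    if (content.find("noindex") != std::string::npos)
        flags.doindex = false;
    if (content.find("nofollow") != std::string::npos)
        flags.dofollow = false;
    if (content.find("none") != std::string::npos)
    {
        flags.doindex = false;
        flags.dofollow = false;
    }
    return flags;
}
```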
> This patch fixes PR#572, where htsearch crashed if CONTENT_LENGTH was not set
> but REQUEST_METHOD was.
>
> --- htdig-3.1.2.bak/htlib/cgi.cc Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htlib/cgi.cc Wed Aug 4 16:51:49 1999
> @@ -67,7 +67,9 @@
> int n;
> char *buf;
>
> - n = atoi(getenv("CONTENT_LENGTH"));
> + buf = getenv("CONTENT_LENGTH");
> + if (!buf || !*buf || (n = atoi(buf)) <= 0)
> + return; // null query
> buf = new char[n + 1];
> read(0, buf, n);
> buf[n] = '\0';
>
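The crash came from passing a null pointer from getenv() straight to atoi(). A minimal sketch of the guarded version, written as a hypothetical helper that takes the raw value so it can be exercised without touching the environment:

```cpp
#include <cstdlib>

// Hypothetical helper (not in ht://Dig): turn the raw CONTENT_LENGTH value
// into a positive byte count, or -1 when it is missing, empty, or not a
// positive number. Callers would treat -1 as a null query.
int parse_content_length(const char *value)
{
    if (!value || !*value)
        return -1;
    int n = std::atoi(value);
    return n > 0 ? n : -1;
}
```

It would be called as parse_content_length(getenv("CONTENT_LENGTH")), which is safe even when the variable is unset.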
> This patch adds error messages for unknown hosts.
>
> --- htdig-3.1.2/htdig/Document.cc.nohostmsg Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdig/Document.cc Fri Jul 30 13:48:03 1999
> @@ -301,14 +301,22 @@ Document::RetrieveHTTP(time_t date)
> if (c.assign_port(proxy->port()) == NOTOK)
> return Document_not_found;
> if (c.assign_server(proxy->host()) == NOTOK)
> + {
> + if (debug)
> + cout << "Unknown proxy host: " << proxy->host() << endl;
> return Document_no_host;
> + }
> }
> else
> {
> if (c.assign_port(url->port()) == NOTOK)
> return Document_not_found;
> if (c.assign_server(url->host()) == NOTOK)
> + {
> + if (debug)
> + cout << "Unknown host: " << url->host() << endl;
> return Document_no_host;
> + }
> }
>
> if (c.connect(1) == NOTOK)
>
> This patch fixes a bug in the PDF parser. When the Title header was just
> the temporary file name, it wouldn't be used, but it also wouldn't be cleared
> from the _parsedString variable, so it ended up polluting the document
> excerpt.
>
> --- htdig-3.1.2/htdig/PDF.cc.orig Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdig/PDF.cc Tue May 25 12:01:43 1999
> @@ -290,8 +290,8 @@ void PDF::parseNonTextLine(String &line)
> _parsedString.get());
>
> _retriever->got_title(_parsedString);
> - _parsedString = 0;
> }
> + _parsedString = 0;
> }
>
> }
>
> This fixes the infamous problem with files like left_index.html not getting
> indexed. PR#543 & PR#585.
>
> --- htdig-3.1.2/htlib/URL.cc.orig Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htlib/URL.cc Fri Jun 11 12:24:40 1999
> @@ -440,7 +440,7 @@ void URL::removeIndex(String &path)
> l.Release();
> }
> if (defaultdoc->hasPattern() &&
> - defaultdoc->FindFirstWord(path.sub(filename)) >= 0)
> + defaultdoc->CompareWord(path.sub(filename)))
> path.chop(path.length() - filename);
> }
>
>
> This patch fixes server_aliases entries so the port defaults to 80 if omitted.
>
> --- htdig-3.1.2/htlib/URL.cc.old Fri Jul 30 14:51:32 1999
> +++ htdig-3.1.2/htlib/URL.cc Fri Jul 30 16:57:35 1999
> @@ -540,6 +540,11 @@ char *URL::signature()
> }
>
>
> +//*****************************************************************************
> +// void URL::ServerAlias()
> +// Takes care of the server aliases, which attempt to simplify virtual
> +// host problems
> +//
> void URL::ServerAlias()
> {
> static Dictionary *serveraliases= 0;
> @@ -547,6 +552,7 @@ void URL::ServerAlias()
> if (! serveraliases)
> {
> String l= config["server_aliases"];
> + String from, *to;
> serveraliases = new Dictionary();
> char *p = strtok(l, " \t");
> char *salias= NULL;
> @@ -556,7 +562,13 @@ void URL::ServerAlias()
> if (! salias)
> continue;
> *salias++= '\0';
> - serveraliases->Add(p, new String(salias));
> + from = p;
> + if (from.indexOf(':') == -1)
> + from.append(":80");
> + to= new String(salias);
> + if (to->indexOf(':') == -1)
> + to->append(":80");
> + serveraliases->Add(from.get(), to);
> // cout << "Alias: " << p << "->" << salias << "\n";
> // printf ("Alias: %s->%s\n", p, salias);
> p = strtok(0, " \t");
>
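The normalisation applied to both sides of each alias amounts to "append :80 when no port is given". A sketch with a made-up helper name, using std::string rather than ht://Dig's String:

```cpp
#include <string>

// Hypothetical helper (not in ht://Dig): make sure a host[:port] string
// carries an explicit port, defaulting to 80 as the patch does.
std::string with_default_port(const std::string &host)
{
    if (host.find(':') == std::string::npos)
        return host + ":80";
    return host;
}
```

With both the alias and its target normalised this way, an entry written without ports should compare equal to one written with :80.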
> This patch fixes the HTML parser to decode SGML entities within tag attributes.
>
> --- htdig-3.1.2.bak/htdig/HTML.h Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdig/HTML.h Fri Jul 30 12:23:25 1999
> @@ -72,6 +72,7 @@ private:
> // Helper functions
> //
> void do_tag(Retriever &, String &);
> + char *transSGML(char *);
> };
>
> #endif
> --- htdig-3.1.2.bak/htdig/HTML.cc Wed Apr 21 21:47:57 1999
> +++ htdig-3.1.2/htdig/HTML.cc Fri Jul 30 16:22:55 1999
> @@ -544,7 +544,7 @@ HTML::do_tag(Retriever &retriever, Strin
> in_ref = 0;
> }
> delete href;
> - href = new URL(position, *base);
> + href = new URL(transSGML(position), *base);
> in_ref = 1;
> description = 0;
> position = q + 1;
> @@ -595,7 +595,7 @@ HTML::do_tag(Retriever &retriever, Strin
> q++;
> *q = '\0';
> }
> - retriever.got_anchor(position);
> + retriever.got_anchor(transSGML(position));
> position = q + 1;
> break;
> }
> @@ -704,7 +704,7 @@ HTML::do_tag(Retriever &retriever, Strin
> q++;
> *q = '\0';
> }
> - retriever.got_image(position);
> + retriever.got_image(transSGML(position));
> break;
> }
>
> @@ -736,15 +736,15 @@ HTML::do_tag(Retriever &retriever, Strin
> }
> if (conf["htdig-email"])
> {
> - retriever.got_meta_email(conf["htdig-email"]);
> + retriever.got_meta_email(transSGML(conf["htdig-email"]));
> }
> if (conf["htdig-notification-date"])
> {
> - retriever.got_meta_notification(conf["htdig-notification-date"]);
> + retriever.got_meta_notification(transSGML(conf["htdig-notification-date"]));
> }
> if (conf["htdig-email-subject"])
> {
> - retriever.got_meta_subject(conf["htdig-email-subject"]);
> + retriever.got_meta_subject(transSGML(conf["htdig-email-subject"]));
> }
> if (conf["htdig-keywords"] || conf["keywords"])
> {
> @@ -757,7 +757,7 @@ HTML::do_tag(Retriever &retriever, Strin
> char *keywords = conf["htdig-keywords"];
> if (!keywords)
> keywords = conf["keywords"];
> - char *w = strtok(keywords, " ,\t\r\n");
> + char *w = strtok(transSGML(keywords), " ,\t\r\n");
> while (w)
> {
> if (strlen(w) >= minimumWordLength)
> @@ -783,7 +783,7 @@ HTML::do_tag(Retriever &retriever, Strin
> while (*qq && (*qq != ';') && (*qq != '"') &&
> !isspace(*qq))qq++;
> *qq = 0;
> - URL *href = new URL(q, *base);
> + URL *href = new URL(transSGML(q), *base);
> // I don't know why anyone would do this, but hey...
> if (dofollow)
> retriever.got_href(*href, "");
> @@ -811,7 +811,7 @@ HTML::do_tag(Retriever &retriever, Strin
> //
> // We need to do two things. First grab the description
> //
> - meta_dsc = conf["content"];
> + meta_dsc = transSGML(conf["content"]);
> if (meta_dsc.length() > max_meta_description_length)
> meta_dsc = meta_dsc.sub(0, max_meta_description_length).get();
> if (debug > 1)
> @@ -824,7 +824,7 @@ HTML::do_tag(Retriever &retriever, Strin
> // (slot 11 is the new slot for this)
> //
>
> - char *w = strtok(conf["content"], " \t\r\n");
> + char *w = strtok(transSGML(conf["content"]), " \t\r\n");
> while (w)
> {
> if (strlen(w) >= minimumWordLength)
> @@ -836,7 +836,7 @@ HTML::do_tag(Retriever &retriever, Strin
>
> if (keywordsMatch.CompareWord(cache))
> {
> - char *w = strtok(conf["content"], " ,\t\r\n");
> + char *w = strtok(transSGML(conf["content"]), " ,\t\r\n");
> while (w)
> {
> if (strlen(w) >= minimumWordLength)
> @@ -847,15 +847,15 @@ HTML::do_tag(Retriever &retriever, Strin
> }
> else if (mystrcasecmp(cache, "htdig-email") == 0)
> {
> - retriever.got_meta_email(conf["content"]);
> + retriever.got_meta_email(transSGML(conf["content"]));
> }
> else if (mystrcasecmp(cache, "htdig-notification-date") == 0)
> {
> - retriever.got_meta_notification(conf["content"]);
> + retriever.got_meta_notification(transSGML(conf["content"]));
> }
> else if (mystrcasecmp(cache, "htdig-email-subject") == 0)
> {
> - retriever.got_meta_subject(conf["content"]);
> + retriever.got_meta_subject(transSGML(conf["content"]));
> }
> else if (mystrcasecmp(cache, "htdig-noindex") == 0)
> {
> @@ -948,7 +948,7 @@ HTML::do_tag(Retriever &retriever, Strin
> *q = '\0';
> }
> delete href;
> - href = new URL(position, *base);
> + href = new URL(transSGML(position), *base);
> if (dofollow)
> {
> description = 0;
> @@ -1016,7 +1016,7 @@ HTML::do_tag(Retriever &retriever, Strin
> *q = '\0';
> }
> delete href;
> - href = new URL(position, *base);
> + href = new URL(transSGML(position), *base);
> if (dofollow)
> {
> description = 0;
> @@ -1085,7 +1085,7 @@ HTML::do_tag(Retriever &retriever, Strin
> q++;
> *q = '\0';
> }
> - URL tempBase(position, *base);
> + URL tempBase(transSGML(position), *base);
> *base = tempBase;
> }
> }
> @@ -1095,4 +1095,25 @@ HTML::do_tag(Retriever &retriever, Strin
> default:
> return; // Nothing...
> }
> +}
> +
> +
> +//*****************************************************************************
> +// char * HTML::transSGML(char *text)
> +//
> +char *
> +HTML::transSGML(char *str)
> +{
> + static String convert;
> + unsigned char *text = (unsigned char *)str;
> +
> + convert = 0;
> + while (*text)
> + {
> + if (*text == '&')
> + convert << SGMLEntities::translateAndUpdate(text);
> + else
> + convert << *text++;
> + }
> + return convert.get();
> }
>
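To show what decoding entities in an attribute value means in practice, here is a self-contained sketch. It does not use SGMLEntities::translateAndUpdate(); the function name decode_entities and the four-entry table are assumptions for the example, and the real entity list is much longer.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for transSGML(): replace a few well-known entities
// and copy everything else through unchanged.
std::string decode_entities(const std::string &text)
{
    static const struct { const char *name; char value; } known[] = {
        {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'}, {"&quot;", '"'}
    };

    std::string out;
    std::size_t i = 0;
    while (i < text.size())
    {
        bool matched = false;
        if (text[i] == '&')
        {
            for (const auto &e : known)
            {
                std::size_t len = std::strlen(e.name);
                if (text.compare(i, len, e.name) == 0)
                {
                    out += e.value;   // emit the decoded character
                    i += len;         // skip past the whole entity
                    matched = true;
                    break;
                }
            }
        }
        if (!matched)
            out += text[i++];         // ordinary character or unknown entity
    }
    return out;
}
```

For example, an href of a.cgi?x=1&amp;y=2 becomes a.cgi?x=1&y=2 before it is turned into a URL. Unlike the patch, which returns a pointer into a single static String, this sketch returns a fresh string on each call.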
> Fix PR#566 by setting the correct length of the string being
> matched. 'http://' is 7 characters. Submitted by
> <wolfgang.pichler@creditanstalt.co.at>.
>
> --- htdig-3.1.2.bak/htlib/URL.cc Wed Apr 21 21:47:58 1999
> +++ htdig-3.1.2/htlib/URL.cc Fri Jul 30 14:51:32 1999
> @@ -130,7 +130,7 @@ URL::URL(char *ref, URL &parent)
> while (isalpha(*p))
> p++;
> int hasService = (*p == ':');
> - if (hasService && ((strncmp(ref, "http://", 6) == 0) ||
> + if (hasService && ((strncmp(ref, "http://", 7) == 0) ||
> (strncmp(ref, "http:", 5) != 0)))
> {
> //
>
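The off-by-one is easy to reproduce in isolation. In this sketch the function name is invented and the comparison length is a parameter purely so both the old and new values can be tried:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: does ref start with the first n characters of
// "http://"? The patch changes n from 6 to 7.
bool starts_with_http_prefix(const char *ref, std::size_t n)
{
    return std::strncmp(ref, "http://", n) == 0;
}
```

With n = 6 only "http:/" is compared, so 'http:/dir/file.ext' is mistaken for a URL with a host part; with n = 7 it no longer matches and falls through to the relative-URL handling.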
> Fixes problem with $(VAR) at end of template string not being expanded.
>
> --- htdig-3.1.2/htsearch/Display.cc.varstatebug Fri Jul 30 14:24:05 1999
> +++ htdig-3.1.2/htsearch/Display.cc Fri Jul 30 15:25:09 1999
> @@ -822,7 +822,7 @@ Display::expandVariables(char *str)
> }
> str++;
> }
> - if (state == 5)
> + if (state == 2 || state == 5)
> {
> //
> // The end of string was reached, but we are still trying to
>
> -------- 8< -------- snip -------- 8< --------
>
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig@htdig.org containing the single word unsubscribe in
> the SUBJECT of the message.
>




This archive was generated by hypermail 2.0b3 on Mon Aug 09 1999 - 08:53:04 PDT