[htdig] Re: extra_word_characters (PR#952)


Subject: [htdig] Re: extra_word_characters (PR#952)
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Nov 24 2000 - 08:10:41 PST


According to Tomas Frydrych (tomas@frydrych.freeserve.co.uk):
> Version: 3.1.5
>
> I need to add '+' to the list of valid word characters; after doing so htdig
> will index all words that contain '+' inside, but refuses to index words that
> start with '+' (and I suspect also words that end with it).

OK, I was able to reproduce the problem after all. I had limited my
earlier tests to htdig only, but the problem is in htmerge. It gives
special meaning to lines in the db.wordlist file that begin with "+",
"-", and "!", to mark document IDs that are unchanged, discarded or
superseded. The trouble is that htmerge reads the wordlist assuming a
valid word would never begin with one of these, so its test for them is
too liberal: it looks only at the first character of the line. Here's a
patch to correct the problem. It splits each line into its tab-separated
fields first, and checks for a marker only when the line lacks a word
followed by another field. With it, you can add any of these three
special characters to extra_word_characters and allow words that begin
with one of them. Apply it in the htdig-3.1.5 main source directory using
"patch -p0 < this-message-file".
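
To show the idea in isolation, here is a minimal standalone sketch of the
check the patched loop performs. It is not htmerge code: LineKind and
classifyLine are hypothetical names made up for this illustration, it uses
std::string::find where htmerge uses good_strtok, and it assumes the
trailing newline has already been stripped from the line.

```cpp
#include <string>

// Hypothetical classification for this sketch; htmerge has no such enum.
enum class LineKind { Word, Unchanged, Discarded, Superseded, Unknown };

// Classify one wordlist line. A line with a non-empty word followed by a
// non-empty tab-separated field is a word record, whatever its first
// character is. Only lines without that structure are checked for a
// leading "+", "-" or "!" marker.
LineKind classifyLine(const std::string &line)
{
    const std::string::size_type tab = line.find('\t');
    if (tab != std::string::npos && tab > 0)
    {
        // The second field runs from just after the first tab up to the
        // next tab, or to the end of the line if there is no other tab.
        const std::string::size_type next = line.find('\t', tab + 1);
        const std::string pair =
            (next == std::string::npos)
                ? line.substr(tab + 1)
                : line.substr(tab + 1, next - tab - 1);
        if (!pair.empty())
            return LineKind::Word;
    }

    if (line.empty())
        return LineKind::Unknown;

    switch (line[0])
    {
        case '+': return LineKind::Unchanged;   // document hasn't changed
        case '-': return LineKind::Discarded;   // candidate for removal
        case '!': return LineKind::Superseded;  // replaced by a newer doc
        default:  return LineKind::Unknown;
    }
}
```

With this ordering, a line such as "+plus", a tab, and one more field is
treated as a word record, while a bare "+42" with no tab is still treated
as an "unchanged document" marker. The field contents in such examples
are invented and are not meant to reflect the real db.wordlist layout.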

--- htmerge/words.cc.wordbug Thu Feb 24 20:29:11 2000
+++ htmerge/words.cc Fri Nov 24 09:54:27 2000
@@ -74,37 +74,40 @@ mergeWords(char *wordtmp, char *wordfile
     //
     while (fgets(buffer, sizeof(buffer), sorted))
     {
- if (*buffer == '+')
+ //
+ // Split the line up into the word, count, location, and
+ // document id.
+ //
+ word = good_strtok(buffer, '\t');
+ pair = good_strtok(NULL, '\t');
+ if (!word || !*word || !pair || !*pair)
         {
+ if (*buffer == '+')
+ {
             //
             // This tells us that the document hasn't changed and we
             // are to reuse the old words
             //
- }
- else if (*buffer == '-')
- {
+ }
+ else if (*buffer == '-')
+ {
              if (removeBadUrls)
             {
                 discard_list.Add(strtok(buffer + 1, "\n"), 0);
                 if (verbose)
                     cout << "htmerge: Removing doc #" << buffer + 1 << endl;
             }
- }
- else if (*buffer == '!')
- {
+ }
+ else if (*buffer == '!')
+ {
             discard_list.Add(strtok(buffer + 1, "\n"), 0);
             if (verbose)
                 cout << "htmerge: doc #" << buffer + 1 <<
                     " has been superceeded." << endl;
+ }
         }
         else
         {
- //
- // Split the line up into the word, count, location, and
- // document id.
- //
- word = good_strtok(buffer, '\t');
- pair = good_strtok(NULL, '\t');
             wr.Clear(); // Reset count to 1, anchor to 0, and all that
             sid = "-";
             while (pair && *pair)

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



