[htdig] patch for htsearch excerpt highlighting


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Mon, 23 Aug 1999 17:16:27 -0500 (CDT)


Hi, folks. A frequent complaint is that htsearch doesn't always find
matched words in the document excerpts, so they don't get highlighted.
Also, in this case if you don't have no_excerpt_show_top set to true,
you get the infamous message:

  (None of the search words were found in the top of this document.)

I've found that a reasonably common cause of this is that, before words
are put into the database, all punctuation characters in the
valid_punctuation list are stripped from them, so a search for the word
will match whether the word has the punctuation or not. Unfortunately,
the excerpt highlighting does not similarly strip or ignore punctuation
in the excerpt, so it misses matches that are in the word database.
Hence the apparent lack of matches that a lot of users have complained
about.

E.g., if you search for "email", htsearch will find all documents that
contain "email" or "e-mail", but only the string "email" will be matched
and highlighted in the document excerpt.
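
To make the mismatch concrete, here is a minimal standalone sketch. It is
not ht://Dig code: kValidPunctuation, normalizeForIndex() and
naiveExcerptMatch() are hypothetical names, and the punctuation list is
only an illustrative value standing in for valid_punctuation.

```cpp
#include <cctype>
#include <string>

// Illustrative stand-in for the valid_punctuation attribute (assumed value).
static const std::string kValidPunctuation = ".-_/!#$%^&'";

// Index-time normalization: drop valid punctuation and fold to lower case,
// so "E-mail" and "email" end up as the same database word.
std::string normalizeForIndex(const std::string &word)
{
    std::string out;
    for (std::string::size_type i = 0; i < word.size(); i++)
    {
        unsigned char c = (unsigned char)word[i];
        if (kValidPunctuation.find((char)c) == std::string::npos)
            out += (char)std::tolower(c);
    }
    return out;
}

// Excerpt-time search that folds case but does NOT ignore punctuation,
// which is the behaviour that causes the missed highlights.
bool naiveExcerptMatch(const std::string &excerpt, const std::string &word)
{
    std::string lowered;
    for (std::string::size_type i = 0; i < excerpt.size(); i++)
        lowered += (char)std::tolower((unsigned char)excerpt[i]);
    return lowered.find(word) != std::string::npos;
}
```

With these helpers, normalizeForIndex("E-mail") yields "email", so the
document is found, but naiveExcerptMatch("Send e-mail to us", "email")
fails, so nothing in the excerpt gets highlighted.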

Here's a patch that adds a new StringMatch::IgnorePunct() method, and
uses it in htsearch to highlight matches whether or not they contain
punctuation. I must admit that the whole StringMatch class is still a
bit of a mystery to me, so I'd like others to look over and test this
patch, and tell me if they find anything wrong with it, or whether my
approach is less than optimal. I don't think I broke anything with it,
but please let me know if you find otherwise.
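
The idea behind the patch can be sketched outside of StringMatch like
this. It is a simplified linear matcher, not the state machine that
StringMatch actually builds, and TransTable and matchAt() are hypothetical
names used only for illustration. Characters mapped to 0 in the
translation table are skipped in both the pattern and the text.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical 256-entry translation table; starts out as the identity.
struct TransTable
{
    unsigned char map[256];

    TransTable()
    {
        for (int i = 0; i < 256; i++)
            map[i] = (unsigned char)i;
    }

    // Fold upper case to lower case without disturbing other entries.
    void ignoreCase()
    {
        for (int i = 0; i < 256; i++)
            if (std::isupper(i))
                map[i] = (unsigned char)std::tolower(i);
    }

    // Map each given punctuation character to 0, meaning "ignore".
    void ignorePunct(const char *punct)
    {
        for (; *punct; punct++)
            map[(unsigned char)*punct] = 0;
    }
};

// Returns the length of the span of text, beginning at start, that matches
// word once both are passed through the table, or 0 if there is no match.
// A match is not allowed to begin on an ignored character.
std::size_t matchAt(const TransTable &t, const std::string &text,
                    std::size_t start, const std::string &word)
{
    if (start >= text.size() || t.map[(unsigned char)text[start]] == 0)
        return 0;
    std::size_t ti = start, wi = 0;
    while (wi < word.size())
    {
        unsigned char w = t.map[(unsigned char)word[wi]];
        if (w == 0) { wi++; continue; }     // ignored char in the pattern
        if (ti >= text.size())
            return 0;
        unsigned char c = t.map[(unsigned char)text[ti]];
        if (c == 0) { ti++; continue; }     // ignored char in the text
        if (c != w)
            return 0;
        ti++;
        wi++;
    }
    return ti - start;
}
```

With ignoreCase() and ignorePunct("-.") applied, matching "email" at the
"E" of "Send E-mail now" returns a span covering all of "E-mail", which
is the part an excerpt highlighter would mark up.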

I guess the next step will be to put extra entries in the dictionary
for hyphenated words, so that for example a word like part-time can be
matched simply with "part" or "time".
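
As a rough sketch of what that next step might look like (nothing like
this is in the patch below, and dictionaryEntries() is a hypothetical
helper, not an ht://Dig function), a word could be expanded into its
punctuation-free form plus each of its parts:

```cpp
#include <string>
#include <vector>

// For a word such as "part-time", return the joined form followed by the
// individual parts: "parttime", "part", "time". Words without any of the
// given punctuation yield a single entry.
std::vector<std::string> dictionaryEntries(const std::string &word,
                                           const std::string &punct)
{
    std::vector<std::string> parts;
    std::string joined, part;
    for (std::string::size_type i = 0; i < word.size(); i++)
    {
        if (punct.find(word[i]) != std::string::npos)
        {
            // Punctuation ends the current part, if there is one.
            if (!part.empty())
            {
                parts.push_back(part);
                part.clear();
            }
        }
        else
        {
            joined += word[i];
            part += word[i];
        }
    }
    if (!part.empty())
        parts.push_back(part);

    std::vector<std::string> entries;
    if (!joined.empty())
        entries.push_back(joined);
    if (parts.size() > 1)
        entries.insert(entries.end(), parts.begin(), parts.end());
    return entries;
}
```

Whether the parts should carry the same weight as the joined word is a
separate question that this sketch does not address.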

--- htdig-3.1.2.bak/htlib/StringMatch.h Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/StringMatch.h Mon Aug 23 15:38:31 1999
@@ -98,6 +98,12 @@ public:
     void IgnoreCase();
 
     //
+    // Build a local translation table which ignores all given punctuation
+    // characters
+    //
+    void IgnorePunct(char *punct = NULL);
+
+    //
     // Determine if there is a pattern associated with this Match object.
     //
     int hasPattern() {return table[0] != 0;}
--- htdig-3.1.2.bak/htlib/StringMatch.cc Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htlib/StringMatch.cc Mon Aug 23 16:40:14 1999
@@ -90,6 +90,8 @@ StringMatch::Pattern(char *pattern, char
         table[i] = new int[n];
         memset((unsigned char *) table[i], 0, n * sizeof(int));
     }
+    for (i = 0; i < n; i++)
+        table[0][i] = i;    // "no-op" states for null char, to be ignored
 
     //
     // Set up a standard case translation table if needed.
@@ -127,6 +129,11 @@ StringMatch::Pattern(char *pattern, char
 #endif
 
         chr = trans[(unsigned char)*pattern];
+        if (chr == 0)
+        {
+            pattern++;
+            continue;
+        }
         if (chr == sep)
         {
             //
@@ -504,12 +511,39 @@ void StringMatch::TranslationTable(char
 //
 void StringMatch::IgnoreCase()
 {
-    if (local_alloc)
-        delete [] trans;
-    trans = new unsigned char[256];
+    if (!local_alloc || !trans)
+    {
+        trans = new unsigned char[256];
+        for (int i = 0; i < 256; i++)
+            trans[i] = (unsigned char)i;
+        local_alloc = 1;
+    }
     for (int i = 0; i < 256; i++)
-        trans[i] = tolower((unsigned char)i);
-    local_alloc = 1;
+        if (isupper((unsigned char)i))
+            trans[i] = tolower((unsigned char)i);
+}
+
+
+//*****************************************************************************
+// void StringMatch::IgnorePunct(char *punct)
+//   Set up the character translation table to ignore punctuation
+//
+void StringMatch::IgnorePunct(char *punct)
+{
+    if (!local_alloc || !trans)
+    {
+        trans = new unsigned char[256];
+        for (int i = 0; i < 256; i++)
+            trans[i] = (unsigned char)i;
+        local_alloc = 1;
+    }
+    if (punct)
+        for (int i = 0; punct[i]; i++)
+            trans[(unsigned char)punct[i]] = 0;
+    else
+        for (int i = 0; i < 256; i++)
+            if (HtIsWordChar(i) && !HtIsStrictWordChar(i))
+                trans[i] = 0;
 }
 
 
--- htdig-3.1.2.bak/htsearch/htsearch.cc Wed Aug 18 16:40:30 1999
+++ htdig-3.1.2/htsearch/htsearch.cc Mon Aug 23 15:42:44 1999
@@ -222,6 +222,7 @@ main(int ac, char **av)
     //
     origPattern += logicalPattern;
     searchWordsPattern.IgnoreCase();
+    searchWordsPattern.IgnorePunct();
     searchWordsPattern.Pattern(origPattern);
     if (debug > 2)
       cout << "Excerpt pattern: " << origPattern << "\n";

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



