[htdig] patch for improved compound word handling in htdig


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 25 Aug 1999 16:08:55 -0500 (CDT)


Finally, after much anticipation and fanfare (OK, maybe not :), here
is my compound word patch. I'd appreciate any feedback from those who
try it out. I'd recommend applying my previous patches before applying
this one, especially the excerpt highlighting patch I posted yesterday,
which is sort of a companion to this one. Of course, this patch won't
have any effect until you re-index.

This patch improves htdig's handling of compound words such as
"post-doctoral": it adds each individual part, as well as the whole word,
to the word database. Searches for an individual part, like "doctoral",
can then find that part inside hyphenated (or otherwise punctuated)
compound words. It should also fix the problem with "d'" in French text.
The code looks quite convoluted because it's designed to handle all the
combinations of consecutive parts in multi-hyphen-compound-words.
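
To make the idea concrete, here is a minimal standalone sketch of which
entries end up being indexed for a compound word. It is not the patch
code: it uses std::string instead of htdig's String class, std::isalnum
as a stand-in for HtIsStrictWordChar, and it simply drops the separators
where the patch calls HtStripPunctuation. The names splitParts,
compoundParts and minLength are made up for this illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split a word into runs of alphanumeric characters.
// Assumption: std::isalnum stands in for htdig's HtIsStrictWordChar.
std::vector<std::string> splitParts(const std::string &word)
{
    std::vector<std::string> parts;
    std::string current;
    for (unsigned char c : word)
    {
        if (std::isalnum(c))
            current += static_cast<char>(c);
        else if (!current.empty())
        {
            parts.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        parts.push_back(current);
    return parts;
}

// Return every run of 1 to (k - 1) consecutive parts of a k-part word,
// joined without separators, skipping runs shorter than minLength.
// The full word (all k parts) is left out because the caller is assumed
// to have indexed it already. minLength is a hypothetical stand-in for
// htdig's minimumWordLength.
std::vector<std::string> compoundParts(const std::string &word,
                                       std::size_t minLength = 3)
{
    std::vector<std::string> parts = splitParts(word);
    std::vector<std::string> result;
    for (std::size_t n = 1; n < parts.size(); n++)
    {
        for (std::size_t i = 0; i + n <= parts.size(); i++)
        {
            std::string joined;
            for (std::size_t j = i; j < i + n; j++)
                joined += parts[j];
            if (joined.length() >= minLength)
                result.push_back(joined);
        }
    }
    return result;
}
```

With these assumptions, compoundParts("post-doctoral") yields "post" and
"doctoral", and compoundParts("d'accord") yields only "accord", since "d"
falls below the minimum length.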

--- htdig-3.1.2.bak/htdig/Retriever.cc Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Retriever.cc Wed Aug 25 15:36:12 1999
@@ -879,6 +879,56 @@ Retriever::got_word(char *word, int loca
       HtStripPunctuation(w);
       if (w.length() >= minimumWordLength)
         words.Word(w, location, current_anchor_number, factor[heading]);
+      if (strcmp(word, w.get()) != 0) // have punctuation that was stripped
+      {
+        // Check for compound words...
+        String parts = word;
+        int added;
+        int nparts = 1;
+        do
+        {
+          added = 0;
+          char *start = parts.get();
+          char *punctp, *nextp, *p;
+          char punct;
+          int n;
+          while (*start)
+          {
+            p = start;
+            for (n = 0; n < nparts; n++)
+            {
+              while (HtIsStrictWordChar((unsigned char)*p))
+                p++;
+              punctp = p;
+              if (!*punctp && n+1 < nparts)
+                break;
+              while (*p && !HtIsStrictWordChar((unsigned char)*p))
+                p++;
+              if (n == 0)
+                nextp = p;
+            }
+            if (n < nparts)
+              break;
+            punct = *punctp;
+            *punctp = '\0';
+            if (*start && (*p || start > parts.get()))
+            {
+              w = start;
+              HtStripPunctuation(w);
+              if (w.length() >= minimumWordLength)
+              {
+                words.Word(w, location, current_anchor_number, factor[heading]);
+                if (debug > 3)
+                  cout << "word part: " << start << '@' << location << endl;
+              }
+              added++;
+            }
+            start = nextp;
+            *punctp = punct;
+          }
+          nparts++;
+        } while (added > 2);
+      }
     }
 }
 
--- htdig-3.1.2.bak/htdig/PDF.cc Wed Aug 18 16:40:30 1999
+++ htdig-3.1.2/htdig/PDF.cc Wed Aug 25 15:41:01 1999
@@ -525,16 +525,11 @@ void PDF::parseString()
 
             if (word.length() >= minimumWordLength)
             {
-                word.lowercase();
-                HtStripPunctuation(word);
-                if (word.length() >= minimumWordLength)
-                {
-                    _retriever->got_word(word,
-                                         int(_curPage * 1000 / _pages),
-                                         0);
-                    if (debug > 3)
-                        printf("PDF::parseString: got word %s\n", word.get());
-                }
+                _retriever->got_word(word,
+                                     int(_curPage * 1000 / _pages),
+                                     0);
+                if (debug > 3)
+                    printf("PDF::parseString: got word %s\n", word.get());
             }
         }
                 
--- htdig-3.1.2.bak/htdig/Plaintext.cc Wed Apr 21 21:47:57 1999
+++ htdig-3.1.2/htdig/Plaintext.cc Wed Aug 25 15:40:13 1999
@@ -72,14 +72,9 @@ Plaintext::parse(Retriever &retriever, U
 
             if (word.length() >= minimumWordLength)
             {
-                word.lowercase();
-                HtStripPunctuation(word);
-                if (word.length() >= minimumWordLength)
-                {
-                    retriever.got_word(word,
-                                       int(offset * 1000 / contents->length()),
-                                       0);
-                }
+                retriever.got_word(word,
+                                   int(offset * 1000 / contents->length()),
+                                   0);
             }
         }
                 

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2.0b3 on Wed Aug 25 1999 - 14:10:52 PDT