[htdig] Clarification on compound word handling


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 7 Oct 1999 13:21:36 -0500 (CDT)


On Thu, 23 Sep 1999, J. op den Brouw (msql@st.hhs.nl) wrote:
> Some questions
>
> On Wed, 22 Sep 1999, Geoff Hutchison wrote:
>
> > * When indexing, htdig should now attempt to index compound words as
> > separate words in addition to a compound word. For example,
> > "pdf_parser" would also be indexed as "pdf" and "parser."
> > * Once again, thanks to everyone who reported bugs and bug fixes.
>
> How does htdig know that the _ is a word splitter, and is a . (dot) also
> a word splitter.....
>
> The valid_punctuation attribute removes these characters from a word, does it not?

My compound word fix treats any character in valid_punctuation as a word
separator, and it does so before the punctuation is stripped out of the
words. When it encounters a compound word, i.e. parts separated by
valid_punctuation characters, it puts the entire word in the database,
as it did before, but now it also adds every run of adjacent parts. Of
course, it strips all punctuation before adding any word or part
to the database.
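
As a rough illustration of the splitting step described above, here is a
minimal Python sketch. It is not htdig's actual C++ code: the function name
split_compound and the VALID_PUNCTUATION value are assumptions made up for
this example, and the real valid_punctuation setting comes from the htdig
configuration file.

```python
# Illustrative sketch only; not htdig's actual implementation.
# VALID_PUNCTUATION is an assumed example value, not necessarily htdig's default.
VALID_PUNCTUATION = "._-/'"


def split_compound(word, punctuation=VALID_PUNCTUATION):
    """Return (whole, parts): the word with punctuation stripped, and its parts."""
    parts = []
    current = []
    for ch in word:
        if ch in punctuation:
            # A punctuation character ends the current part, if there is one.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    # The whole word is simply the parts joined with the punctuation removed.
    whole = "".join(parts)
    return whole, parts


whole, parts = split_compound("pdf_parser")
print(whole, parts)
```

The key point is the ordering: the punctuation is used to find the part
boundaries first, and only then dropped from what gets stored.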

Here's the write-up I had when I first posted the patch to the list:

   This patch improves htdig's handling of compound words, like post-doctoral
   and such, to add each individual part, as well as the whole, into the word
   database. This allows searches for individual parts, like "doctoral", to
   find those parts in hyphenated (or otherwise punctuated) compound words.
   It should also fix the problem with "d'" in French text. The code seems
   quite convoluted because it's designed to handle all the combinations of
   parts in multi-hyphen-compound-words.

To expand on that last example, here are the words it'll add to the
database (each truncated to maximum_word_length characters):

   multihyphencompoundwords
   multi
   hyphen
   compound
   words
   multihyphen
   hyphencompound
   compoundwords
   multihyphencompound
   hyphencompoundwords
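
The list above can be reproduced with a short Python sketch that joins every
run of adjacent parts and truncates each result. This is only an illustration
under stated assumptions, not the patch itself: the function name
compound_entries is hypothetical, and the value 32 passed for
maximum_word_length is an arbitrary choice, large enough that nothing in this
example gets truncated.

```python
# Illustrative sketch only; not the code from the patch.
def compound_entries(parts, maximum_word_length):
    """Return the whole word, then every shorter run of adjacent parts."""
    n = len(parts)
    # Whole word first, then runs of 1, 2, ..., n-1 adjacent parts.
    run_lengths = [n] + list(range(1, n))
    entries = []
    for run in run_lengths:
        for start in range(n - run + 1):
            # Join the adjacent parts and truncate to the configured length.
            word = "".join(parts[start:start + run])[:maximum_word_length]
            if word:
                entries.append(word)
    return entries


# 32 is an assumed value for this example, not a documented default.
for entry in compound_entries(["multi", "hyphen", "compound", "words"], 32):
    print(entry)
```

With a smaller maximum_word_length, several of the longer entries would be
cut down to the same prefix; this sketch makes no attempt to merge such
collisions.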

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
