Re: AW: [htdig] Valid Punctiation Question


Subject: Re: AW: [htdig] Valid Punctiation Question
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Oct 26 2000 - 09:23:42 PDT


According to Reich, Stefan:
> with extra_word_characters I have a different Problem.
>
> On one hand I want 1998-10-11 to be treated like 19981011. So the document
> contains 1998-10-11 and the search for 19981011 should give me a result too.
> (-> Valid Punctuation)
>
> On the other hand I want a result only if I search for the full string and
> no match for 1998.
>
> My dilemma: Valid Punctuation strips the - but splits the string too
> Extra Word Characters doesn't split the string, but doesn't
> remove the -
>
> So is there an option to have a combination of both ????
>
> I solved the problem in a different way now, but would be good to know if
> there is another option.

What you're running into is the compound word handling feature that
I added in 3.1.3, so that words like post-doctoral get indexed as
"postdoctoral", as before, but now also as "post" and "doctoral",
and "valid_punctuation" is also indexed as "valid" and "punctuation".
There's no way to turn that off right now, other than to patch
Retriever::got_word() in htdig/Retriever.cc, so it doesn't split up
words containing valid_punctuation characters. That may be radical,
as in most cases this feature is quite desirable.
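
To make the effect concrete, here is a minimal standalone sketch of the
behaviour described above. It is not the actual Retriever::got_word()
code; index_forms() and its parameters are hypothetical names used only
for illustration, and it ignores things like minimum word length.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of ht://Dig: lists the forms under which
// a word would be indexed when valid_punctuation characters are both
// stripped (compound form) and treated as separators (component parts).
std::vector<std::string> index_forms(const std::string &word,
                                     const std::string &valid_punct)
{
    std::string stripped;            // word with punctuation removed
    std::vector<std::string> parts;  // pieces between punctuation characters
    std::string current;             // piece currently being collected

    for (char c : word) {
        if (valid_punct.find(c) != std::string::npos) {
            // Punctuation ends the current piece and is dropped
            if (!current.empty())
                parts.push_back(current);
            current.clear();
        } else {
            stripped += c;
            current += c;
        }
    }
    if (!current.empty())
        parts.push_back(current);

    std::vector<std::string> forms;
    if (!stripped.empty())
        forms.push_back(stripped);
    // Add the component parts only if punctuation was actually stripped
    if (stripped != word)
        forms.insert(forms.end(), parts.begin(), parts.end());
    return forms;
}
```

With "-" as the only valid punctuation character, index_forms("1998-10-11", "-")
yields "19981011", "1998", "10" and "11", which is why a search for 1998
alone also matches the document.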

If you want to make an exception for numbers only, you could change this
line:

      if (strcmp(word, w.get()) != 0) // have punctuation that was stripped

to this:

      if (strcmp(word, w.get()) != 0 // have punctuation that was stripped
                && !isdigit((unsigned char) word[0])) // and it doesn't start with a digit

You might need to add a '#include <ctype.h>' to Retriever.cc for this.
If you want the exception to apply to words with digits anywhere, rather
than just at the start, you'd need to set up a little search loop above
this "if" statement to find the first digit.
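
A minimal sketch of such a search loop, written as a small helper rather
than inline. The name has_digit() is hypothetical, and it assumes that
"word" is a NUL-terminated char string at that point in Retriever.cc:

```cpp
#include <ctype.h>

// Hypothetical helper: returns 1 if the word contains a digit anywhere,
// 0 otherwise. The cast to unsigned char keeps isdigit() well defined
// for characters with the high bit set.
static int has_digit(const char *word)
{
    for (const char *p = word; *p != '\0'; p++) {
        if (isdigit((unsigned char) *p))
            return 1;
    }
    return 0;
}
```

The second half of the test would then read "&& !has_digit(word)" instead
of checking only word[0].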

Just out of curiosity, what other way did you use to solve this problem?

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



