Subject: Re: AW: [htdig] Valid Punctiation Question
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Thu Oct 26 2000 - 09:23:42 PDT
According to Reich, Stefan:
> with extra_word_characters I have a different problem.
> On one hand I want 1998-10-11 to be treated like 19981011. So the document
> contains 1998-10-11 and the search for 19981011 should give me a result too.
> (-> Valid Punctuation)
> On the other hand I want a result only if I search for the full string and
> no match for 1998.
> My dilemma: Valid Punctuation strips the - but splits the string too.
> Extra Word Characters doesn't split the string, but doesn't
> remove the -.
> So is there an option to have a combination of both ????
> I solved the problem in a different way now, but would be good to know if
> there is another option.
What you're running into is the compound word handling feature that
I added in 3.1.3, so that words like post-doctoral get indexed as
"postdoctoral", as before, but now also as "post" and "doctoral",
and "valid_punctuation" is also indexed as "valid" and "punctuation".
There's no way to turn that off right now, other than to patch
Retriever::got_word() in htdig/Retriever.cc, so it doesn't split up
words containing valid_punctuation characters. That may be radical,
as in most cases this feature is quite desirable.
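To illustrate the behaviour described above, here is a rough standalone sketch of how one raw word turns into several indexed terms. This is not the code in Retriever::got_word(); index_terms() is a hypothetical helper, it takes the valid_punctuation characters as a plain string, and it ignores other settings such as the minimum word length, which would normally drop short pieces.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, NOT ht://Dig code: returns the terms that would be
// indexed for one raw word, given the valid_punctuation characters.
std::vector<std::string> index_terms(const std::string &raw,
                                     const std::string &valid_punct)
{
    std::string joined;              // the word with punctuation stripped
    std::vector<std::string> parts;  // the pieces between punctuation
    std::string piece;

    for (char c : raw) {
        if (valid_punct.find(c) != std::string::npos) {
            // Punctuation ends the current piece, if there is one.
            if (!piece.empty()) {
                parts.push_back(piece);
                piece.clear();
            }
        } else {
            joined += c;
            piece += c;
        }
    }
    if (!piece.empty())
        parts.push_back(piece);

    std::vector<std::string> terms;
    if (!joined.empty())
        terms.push_back(joined);
    // Add the separate pieces only if punctuation actually split the word.
    if (parts.size() > 1)
        terms.insert(terms.end(), parts.begin(), parts.end());
    return terms;
}
```

With "-" as the only punctuation character, "post-doctoral" yields "postdoctoral", "post" and "doctoral", and "1998-10-11" yields "19981011" plus the pieces "1998", "10" and "11". That extra "1998" term is why a search for 1998 matches the date.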
If you want to make an exception for numbers only, you could change this

    if (strcmp(word, w.get()) != 0) // have punctuation that was stripped

to

    if (strcmp(word, w.get()) != 0 // have punctuation that was stripped
        && !isdigit((unsigned char)*word)) // and it doesn't start with a digit

Note that this tests only the first character of the word.
You might need to add a '#include <ctype.h>' to Retriever.cc for this.
If you want the exception to apply to words with digits anywhere, rather
than just at the start, you'd need to set up a little search loop above
this "if" statement to find the first digit.
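A minimal sketch of such a search loop, written as a small helper rather than inline. The name contains_digit() is made up for this example and does not exist in ht://Dig; it assumes the word is an ordinary NUL-terminated C string, as the strcmp() call above implies.

```cpp
#include <cctype>

// Hypothetical helper: returns true if any character of the
// NUL-terminated string is a decimal digit.
static bool contains_digit(const char *word)
{
    for (const char *p = word; *p != '\0'; ++p) {
        // Cast to unsigned char: passing a negative char to isdigit()
        // is undefined behaviour.
        if (std::isdigit(static_cast<unsigned char>(*p)))
            return true;
    }
    return false;
}
```

The second half of the condition would then read "&& !contains_digit(word)" instead of testing only the first character, so a word like "mp3-player" would also be left unsplit.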
Just out of curiosity, what other way did you use to solve this problem?
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930