Re: htdig: I'll == ill ???


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 3 Dec 1998 12:49:39 -0600 (CST)


According to Geoff Hutchison:
>
> At 7:31 PM -0500 12/2/98, Gilles Detillieux wrote:
> >but when trying to find the word in the document text, for the excerpt,
> >it doesn't. It does a case insensitive search for the first matching
> >word, but when looking for "ill" it doesn't match "I'll" in the text.
> >
> >Ah, well, I don't think it's that big a deal. Not enough to rewrite the
> >way Display::excerpt() searches for the word.
>
> I'd like to see Display::excerpt() make use of the location field for the
> word. Why are we doing a search for the first matching word when we've
> indexed the location of the first occurrence of every word? :-)

Does the location field record the word's location in the original
document, before the HTML tags are stripped out? If so, it wouldn't
help much in finding the word in the bare text of the excerpt.

> The reason for the asymmetry probably has something to do with
> valid_punctuation.

Yes, it would seem so. If I search for "I'll", it still ends up
searching for "(ill or illness or ills)", and Display::excerpt() then
looks for these words in the excerpt. It doesn't find them, presumably
because StringMatch::FindFirstWord() doesn't ignore "valid_punctuation"
characters in the text string it's looking through.
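
To make the asymmetry concrete, here is a minimal sketch, not the actual
ht://Dig code. It assumes the apostrophe is in valid_punctuation and
that indexing simply drops such characters and lowercases the word.
The names normalize_word() and find_word_plain() are made up for
illustration; find_word_plain() only stands in for a whole-word search
that treats punctuation as a word boundary.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercase a word and drop every character listed in
// `punct` (standing in for valid_punctuation), so "I'll" becomes "ill".
std::string normalize_word(const std::string& word, const std::string& punct) {
    std::string out;
    for (char c : word) {
        if (punct.find(c) != std::string::npos) continue;  // skip punctuation
        out += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return out;
}

// Hypothetical helper: case-insensitive whole-word search that treats every
// non-alphanumeric character, apostrophes included, as a word boundary.
// Returns the offset of the first matching word, or -1 if there is none.
int find_word_plain(const std::string& text, const std::string& word) {
    std::size_t i = 0;
    while (i < text.size()) {
        // Skip separators between words.
        while (i < text.size() && !std::isalnum(static_cast<unsigned char>(text[i]))) ++i;
        std::size_t start = i;
        std::string tok;
        // Collect one lowercase token of alphanumeric characters.
        while (i < text.size() && std::isalnum(static_cast<unsigned char>(text[i]))) {
            tok += static_cast<char>(std::tolower(static_cast<unsigned char>(text[i])));
            ++i;
        }
        if (!tok.empty() && tok == word) return static_cast<int>(start);
    }
    return -1;
}
```

Under those assumptions the query side sees "ill", while the text side
sees "I" and "ll" as two separate words, so the lookup comes back empty
even though the document was matched correctly.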

> >In my case, I have max_head_length set to 50000, and the matched documents
> >are all smaller than that, so the word is in the excerpt, but isn't being
> >found.
>
> I think you mean to say displayed? After all, I think you're saying it
> found a document with "I'll" in it, right?

It finds the documents that contain "I'll", as it ought to, but
StringMatch::FindFirstWord() fails to find the "I'll" in the excerpt
because it's looking for ill or illness or ills, so the excerpt isn't
displayed.

Just as StringMatch alters its state tables depending on whether you're
ignoring case, it would need to be patched to give you the option of
ignoring punctuation too, at least for this particular case. I'm just
not familiar enough with how the state tables are set up to add this
myself.
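
As a rough illustration of the behaviour I have in mind, here is a
plain linear scan. It is not a patch to the StringMatch state tables.
The function name find_first_word_ignoring_punct() and its parameters
are hypothetical, and `punct` stands in for valid_punctuation.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical sketch: return the offset of the first word in `text` that
// equals one of `words` once it is lowercased and the characters in `punct`
// are skipped. Returns -1 if no word matches. `words` must already be
// lowercase and free of punctuation.
int find_first_word_ignoring_punct(const std::string& text,
                                   const std::vector<std::string>& words,
                                   const std::string& punct) {
    auto is_punct = [&](char c) { return punct.find(c) != std::string::npos; };
    auto is_word_char = [&](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || is_punct(c);
    };

    std::size_t i = 0;
    while (i < text.size()) {
        // Skip separators between words.
        while (i < text.size() && !is_word_char(text[i])) ++i;
        std::size_t start = i;
        std::string tok;
        // Punctuation stays inside the word but is left out of the token.
        while (i < text.size() && is_word_char(text[i])) {
            if (!is_punct(text[i]))
                tok += static_cast<char>(std::tolower(static_cast<unsigned char>(text[i])));
            ++i;
        }
        if (tok.empty()) continue;
        for (const std::string& w : words)
            if (tok == w) return static_cast<int>(start);
    }
    return -1;
}
```

With `punct` set to "'" the word "I'll" in the text reduces to "ill"
and matches the search word; with an empty `punct` the scan behaves
like the current one and misses it. A real fix would have to do the
equivalent inside the state tables instead.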

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930