Re: [htdig] problems with the "accent" patch


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Mar 02 2000 - 14:28:08 PST


According to Robert Marchand:
> At 22:12 00-03-02 +0100, Eric van der Vlist wrote:
> >I have applied this patch as well and noticed that it's working for most
> >of the words, but not for others...
> >
> >Looking at the output of "htfuzzy -vv accents", I have noticed that all
> >the words are truncated to 12 characters and that the words which are
> >truncated are those for which there is a problem.
> >
> >For instance, searching for "enchere" (not truncated) will return the
> >match for the correctly spelled word (with è), while searching
> >for "specification", truncated to "specificatio", will not match
> >"spécification" with an é.
> >
> >If I search for "specificatio", I do get the match for the
> >accented word...
>
> Yes, I checked this myself with "préférablement" and the accents algorithm
> doesn't work in that case.
>
> This was something I was planning to verify. The default is 12 characters.
> Here, we were going to raise it to 18 or 24, so it was less of a priority for me.
> I will add a correction to have accents keys in sync with the
> maximum_word_length parameter.

Yes, as Joe pointed out, increasing maximum_word_length (and of course
reindexing) would side-step this problem, and for your purposes it
may be the best solution. One of the main reasons I made this a config
attribute, instead of the compile-time constant it used to be, is that
the default is inappropriate for most non-English languages.
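
To illustrate the mismatch, here is a minimal Python sketch. It is not ht://Dig code: strip_accents(), build_accent_index() and lookup() are hypothetical names, and the index is a plain dict standing in for the accents database. It assumes the accent keys are built from words already truncated to maximum_word_length, while the query key is built from the full search word.

```python
import unicodedata


def strip_accents(word):
    """Map accented letters to their unaccented base letters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_accent_index(words, maximum_word_length):
    """Build {accent key: set of stored words} from truncated words."""
    index = {}
    for word in words:
        # Assumption: the word database stores each word truncated.
        stored = unicodedata.normalize("NFC", word)[:maximum_word_length]
        index.setdefault(strip_accents(stored), set()).add(stored)
    return index


def lookup(index, query):
    """Look up a query by its full, untruncated accent key."""
    return index.get(strip_accents(query), set())


words = ["ench\u00e8re", "sp\u00e9cification"]
short_index = build_accent_index(words, maximum_word_length=12)
long_index = build_accent_index(words, maximum_word_length=24)

print(lookup(short_index, "enchere"))        # short word: found
print(lookup(short_index, "specification"))  # 13 letters: key mismatch
print(lookup(short_index, "specificatio"))   # truncated query: found
print(lookup(long_index, "specification"))   # larger limit: found
```

With a limit of 12, the 13-letter word is stored under the key "specificatio", so the full query key "specification" finds nothing. Raising the limit (and reindexing) makes the stored key and the query key agree again.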

I looked into adding this correction to the code myself, but I'm having
a hard time finding the right spot to do it. Technically, this could
be a problem for any fuzzy algorithm that uses the word
database as its source of words, because these words are all truncated.
However, algorithms like synonyms and endings would not use truncated
words, so they probably should be left alone. I don't think they'd
function well if fed truncated words. So, probably the best place to
do this would be in the generateKey() method or getWords() method of
the fuzzy algorithms that use the word database. The problem is that
in 3.1.x, those methods don't have access to the config object, so you'd
probably need to change the code in several spots in order to accommodate
this fix. I think this would be much easier in 3.2, so maybe until then,
the best approach is just to crank up maximum_word_length.
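
As a rough illustration of the fix discussed above, the following Python sketch clips the search word to the same length as the stored words before building the accent key. It is only a sketch under assumptions: generate_key() here is a hypothetical stand-in, not the real generateKey() signature, and the limit is passed in explicitly because, as noted, the 3.1.x methods have no access to the config object.

```python
import unicodedata


def strip_accents(word):
    """Map accented letters to their unaccented base letters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def generate_key(word, maximum_word_length):
    """Accent key built from the word clipped to the database's word length.

    Hypothetical helper: truncating first mirrors what the word database
    does to stored words, so index keys and query keys stay in sync.
    """
    clipped = unicodedata.normalize("NFC", word)[:maximum_word_length]
    return strip_accents(clipped)


MAX_LEN = 12  # assumed value of maximum_word_length

# Key as it would be derived from the (already truncated) stored word.
stored_word = "sp\u00e9cification"[:MAX_LEN]
index_key = generate_key(stored_word, MAX_LEN)

# Key derived from the user's full, unaccented search word.
query_key = generate_key("specification", MAX_LEN)

print(index_key == query_key)
```

Because both sides are clipped to the same length, the keys match without raising maximum_word_length. The trade-off is that two long words sharing their first 12 letters become indistinguishable.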

If we do fix this, even if only in 3.2, it also raises the question of which
algorithms will need to be fixed. Accents and Metaphone are the first two
that come to mind. Soundex is probably not a problem because it uses a
maximum key length of 6 anyway. I wouldn't expect Substring and Prefix
to pose a problem either, because the user is unlikely to specify such a
large key (I could be wrong, though). The regex and speling algorithms
added to 3.2 may also need revisions.
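
The Soundex remark can be checked with a small Python sketch. This is a simplified classic Soundex written for illustration, not the ht://Dig implementation. The key_length parameter and its default of 6 are assumptions taken from the figure mentioned above.

```python
# Consonant classes of classic Soundex; vowels, h, w and y carry no code.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word, key_length=6):
    """Simplified Soundex: first letter plus consonant codes, zero-padded."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            key += code
            if len(key) == key_length:
                break  # key is full; later letters no longer matter
        if ch not in "hw":  # h and w do not separate repeated codes
            previous = code
    return key.ljust(key_length, "0")


full_key = soundex("specification")
truncated_key = soundex("specificatio")
print(full_key, truncated_key)
```

Here the key fills up before the 13th letter is reached, so truncation to 12 characters makes no difference. That is not guaranteed for every word: a long word with many vowels could still need letters beyond the cutoff to fill its key, which fits the "probably" above.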

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
