Re: [htdig] problems with the "accent" patch


Subject: Re: [htdig] problems with the "accent" patch
From: Eric van der Vlist (vdv@dyomedea.com)
Date: Thu Mar 02 2000 - 14:42:58 PST


Gilles Detillieux wrote:
>
> According to Robert Marchand:
> > At 22:12 00-03-02 +0100, Eric van der Vlist wrote:
> > >I have applied this patch as well and noticed that it's working for most
> > >of the words, but not for others...
> > >
> > >Looking at the output of "htfuzzy -vv accents", I have noticed that all
> > >the words are truncated to 12 characters and that the words which are
> > >truncated are those for which there is a problem.
> > >
> > >For instance, searching for "enchere" (not truncated) returns the
> > >matches for the correctly spelled word (with è), while searching
> > >for "specification", which is truncated to "specificatio", does not
> > >match "spécification" with an é.
> > >
> > >If I search for "specificatio", I do get the matches for the
> > >accented word...
> >
> > Yes, I checked this myself with "préférablement" and the accents
> > algorithm doesn't work in that case.
> >
> > This was something I had been meaning to verify. The default is 12
> > characters. Here, we were going to raise it to 18 or 24, so it was less
> > of a priority for me. I will add a correction to keep the accents keys
> > in sync with the maximum_word_length parameter.
>
> Yes, as Joe pointed out, increasing maximum_word_length (and of course
> reindexing) would side-step this problem, and for your purposes it
> may be the best solution. One of the main reasons I made this a config
> attribute, instead of the compile-time constant it used to be, is because
> the default is inappropriate for most non-English languages.

The danger (and the pleasure) of using Open Source is that you often go
to the source code too early! I should have searched the docs more
thoroughly first ;=)
 
> I looked into adding this correction to the code myself, but I'm having
> a hard time finding the right spot to do it. Technically, this would
> potentially be a problem for any fuzzy algorithm that uses the word
> database as its source of words, because these words are all truncated.
> However, algorithms like synonyms and endings would not use truncated
> words, so they probably should be left alone. I don't think they'd
> function well if fed truncated words. So, probably the best place to
> do this would be in the generateKey() method or getWords() method of
> the fuzzy algorithms that use the word database. The problem is that
> in 3.1.x, those methods don't have access to the config object, so you'd
> probably need to change the code in several spots in order to accommodate
> this fix. I think this would be much easier in 3.2, so maybe until then,
> the best approach is just to crank up maximum_word_length.

Yes, it's a good workaround, and I can confirm that it works.
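
For anyone who wants to see the effect in isolation, here is a small model of the behaviour described in this thread. It is only a sketch in Python, not ht://Dig's actual code: the names strip_accents, build_accent_index and lookup are hypothetical, and the index is a plain dictionary rather than the real fuzzy database. It assumes, as reported above, that indexed words are cut to maximum_word_length before the accent keys are built, while the search word is not.

```python
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Return the word with combining accents removed (the fuzzy key)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_accent_index(document_words, max_word_length):
    """Map each unaccented key to the stored (possibly truncated) words."""
    index = defaultdict(set)
    for word in document_words:
        # Assumption: the word database only keeps the first
        # max_word_length characters of every word.
        stored = unicodedata.normalize("NFC", word)[:max_word_length]
        index[strip_accents(stored)].add(stored)
    return index


def lookup(index, query, max_word_length=None):
    """Look up a search word; optionally truncate its key like the index."""
    key = strip_accents(query)
    if max_word_length is not None:
        # Sketch of the proposed fix: keep the key in sync with the limit.
        key = key[:max_word_length]
    return sorted(index.get(key, ()))


words = ["ench\u00e8re", "sp\u00e9cification"]

index12 = build_accent_index(words, 12)   # default limit of 12 characters
index24 = build_accent_index(words, 24)   # workaround: larger limit

print(lookup(index12, "enchere"))            # short word: found
print(lookup(index12, "specification"))      # 13 characters: key mismatch
print(lookup(index12, "specificatio"))       # truncated query: found
print(lookup(index24, "specification"))      # larger limit: found
print(lookup(index12, "specification", 12))  # key truncated too: found
```

The last call models the correction Robert and Gilles discuss (truncating the key to the same length as the stored words); the call before it models the maximum_word_length workaround followed by a reindex.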

> If we do fix this, even if only in 3.2, it also opens the question of which
> algorithms will need to be fixed. Accents and Metaphone are the first two
> that come to mind. Soundex is probably not a problem because it uses a
> maximum key length of 6 anyway. I wouldn't expect Substring and Prefix
> to pose a problem either, because the user is unlikely to specify such a
> large key (I could be wrong, though). The regex and speling algorithms
> added to 3.2 may also need revisions.
>

Thanks (for your answer and above all for this nice product).

Eric

-- 
------------------------------------------------------------------------
Eric van der Vlist                                              Dyomedea

http://www.dyomedea.com                          http://www.ducotede.com
------------------------------------------------------------------------




This archive was generated by hypermail 2b28 : Thu Mar 02 2000 - 14:46:49 PST