Re: [htdig] problems with accents


Geoff Hutchison (ghutchis@wso.williams.edu)
Thu, 20 May 1999 12:23:18 -0400


At 11:59 AM -0400 5/20/99, Gilles Detillieux wrote:
>don't treat accented and unaccented letters as having the same sound, or
>it may lead to too many false matches if the sound matching is too vague.

As I recently revised both algorithms to actually work as promised, I can
say you're going to have very mixed results. For one, Metaphone is not
going to work very well on non-English text since it encodes words using
English pronunciation rules. Since the phonetics of text depend on the language, it's
going to butcher foreign texts. (Think about an American trying to read
your language as English.) Soundex might help with the problem of words
with accented and non-accented characters since it drops vowels after the
first letter.
However, it's also designed for "sound" and may still produce some
unexpected results. It's also simply not as accurate as Metaphone--it was
designed for indexing records by surname, such as the US census!
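
To make the vowel-dropping point concrete, here is a minimal sketch of the
classic American Soundex in Python. It is not the ht://Dig code; the function
name soundex and the decision to treat any letter outside the consonant table
(including accented vowels) like a vowel are assumptions made for illustration.

```python
# Illustrative Soundex sketch (not the ht://Dig implementation).
# Consonant groups 1..6 of the classic American Soundex.
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return a 4-character Soundex key, or "" if word has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = [first]
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two equal codes
        # Assumption: anything not in CODES (vowels, accented letters)
        # is dropped but resets prev, like a plain vowel.
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code
    return ("".join(out) + "000")[:4]


print(soundex("René"), soundex("Rene"))    # accent after the first letter
print(soundex("école"), soundex("ecole"))  # accent on the first letter
```

With this sketch an accented vowel inside a word makes no difference, but an
accented first letter still yields a different key, because the first letter
is kept verbatim.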

Anyone who knows of more general (or more accurate) fuzzy matching
algorithms or ones designed for other languages should contact me. I would
be especially interested in anyone who could help out with an ispell-like
"spelling correction" fuzzy that could deal with missing accents. Since we
already keep the ispell affix files around, this seems like a useful idea,
but I'm not so sure we want to re-implement ispell.

Designing an accents fuzzy wouldn't be a bad idea, though I wonder whether
there's an easy library call to do the transformation or whether we're
better off using a user-designed lookup table. The latter would obviously
solve the language-dependent nature of this, but a library call would
probably be more efficient.
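
The two options can be compared with a rough Python sketch. The "library
call" here is Unicode decomposition from the standard unicodedata module,
which is only one possible stand-in; the function names and the sample German
table are hypothetical and not part of ht://Dig.

```python
import unicodedata


def strip_accents_library(text):
    """Library approach: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# User-designed table approach: each language supplies its own mapping.
# This sample table is an assumption for illustration only.
GERMAN_TABLE = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def strip_accents_table(text, table):
    """Table approach: apply a per-language character mapping."""
    return text.translate(table)


print(strip_accents_library("élève"))               # generic folding
print(strip_accents_library("Müller"))              # generic folding
print(strip_accents_table("Müller", GERMAN_TABLE))  # language-specific folding
```

The generic call folds "Müller" to "Muller", while the table can encode the
German convention "Mueller", which is the language-dependent behaviour a
user-designed table would buy.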

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



