Re: [htdig] problems with accents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 20 May 1999 12:20:42 -0500 (CDT)


Here's an idea I just had. How about writing a quick and dirty perl,
awk or C program that goes through all the words in your dictionary
file (e.g. francais.0) and/or your db.wordlist, and produces a list of
synonyms for accented and unaccented forms. The mappings for accented to
unaccented letters would be in this program, which would be customised
for your character set and language. E.g., when it finds a line like
one of the following:

étude
étude/S
étude i:390 l:713 w:570 c:2 a:16

it spits out

étude etude

This could then be used to generate the synonyms database for the
synonyms fuzzy algorithm. You'd also need to decide whether to handle
multiple combinations of accents. E.g. should "résumé" produce:

résumé resume

or

résumé resumé résume resume

Either way, the program to do this wouldn't be that complicated.
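
As a rough illustration of the idea above, here is a minimal sketch in
Python. The mapping table, the function names and the all_combinations
switch are all assumptions made for the example, not anything that
ships with ht://Dig; it also assumes the words are already lowercase
and that the first whitespace-separated field of each line is the word.

```python
import itertools

# Assumed accent mapping for a French word list; extend it for your
# own character set and language.
ACCENT_MAP = {
    "à": "a", "â": "a", "ä": "a",
    "ç": "c",
    "é": "e", "è": "e", "ê": "e", "ë": "e",
    "î": "i", "ï": "i",
    "ô": "o", "ö": "o",
    "ù": "u", "û": "u", "ü": "u",
}


def extract_word(line):
    """Return the word at the start of a dictionary or wordlist line."""
    fields = line.split()
    if not fields:
        return None
    # Drop an ispell-style affix flag such as "/S".
    return fields[0].split("/", 1)[0]


def variants(word, all_combinations=False):
    """Return the unaccented forms of word (empty if it has no accents)."""
    if not any(ch in ACCENT_MAP for ch in word):
        return []
    if not all_combinations:
        # Only the fully unaccented form.
        return ["".join(ACCENT_MAP.get(ch, ch) for ch in word)]
    # Every mix of accented and unaccented letters, original excluded.
    choices = [(ch, ACCENT_MAP[ch]) if ch in ACCENT_MAP else (ch,)
               for ch in word]
    forms = ["".join(combo) for combo in itertools.product(*choices)]
    return forms[1:]  # the first product is the original word itself


def build_synonyms(lines, all_combinations=False):
    """Return one "accented unaccented..." line per distinct accented word."""
    seen = set()
    output = []
    for line in lines:
        word = extract_word(line)
        if word is None or word in seen:
            continue
        seen.add(word)
        forms = variants(word, all_combinations)
        if forms:
            output.append(" ".join([word] + forms))
    return output
```

You would feed it the lines of the dictionary file or of db.wordlist,
opened with whatever encoding the file really uses (ISO-8859-1 is
likely for a 1999 French dictionary, but that is a guess). Note that
with all_combinations the number of forms doubles for every accented
letter in a word, which bears on the database-size question below.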

If you generate such a list from db.wordlist, you may get words that
aren't in your dictionary, which may be desirable if your dictionary
isn't really complete. However, in this case you would probably need
to regenerate the list every time you index (or every few times) to
incorporate new words.

In any case, the list you generate should probably be sorted and merged
with your existing synonyms file, before you build the database.
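
The merge step could be as simple as the following sketch, again in
Python. The function name merge_synonyms is made up for the example,
and it only removes exact duplicate lines; it does not try to combine
two lines that happen to share a word, which a real merge might need
to do.

```python
def merge_synonyms(existing, generated):
    """Return the sorted, de-duplicated union of two lists of synonym lines."""
    merged = set()
    for line in list(existing) + list(generated):
        # Normalise whitespace so that trivially different lines compare equal.
        cleaned = " ".join(line.split())
        if cleaned:
            merged.add(cleaned)
    # Plain code-point ordering; accented letters sort after "z".
    return sorted(merged)
```

The result is what you would write back out as the synonyms file
before building the database.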

Does this sound like a workable solution, or would it result in a huge,
unwieldy synonyms database that would cause really poor performance?
Geoff, as you're more familiar with the synonyms algorithm than me,
maybe you'd have an idea about this?

According to Geoff Hutchison:
> At 11:59 AM -0400 5/20/99, Gilles Detillieux wrote:
> >don't treat accented and unaccented letters as having the same sound, or
> >it may lead to too many false matches if the sound matching is too vague.
>
> As I recently revised both algorithms to actually work as promised, I can
> say you're going to have very mixed results. For one, Metaphone is not
> going to work very well on non-English text since it's designed to work on
> phonetic sounds. Since the phonetics of text depend on the language, it's
> going to butcher foreign texts. (Think about an American trying to read
> your language as English.) Soundex might help with the problem of words
> with accented and non-accented characters since it drops vowels entirely.
> However, it's also designed for "sound" and may still produce some
> unexpected results. It's also simply not as accurate as Metaphone--it was
> designed by the US Immigration service to help with filing applications by
> Surname!
>
> Anyone who knows of more general (or more accurate) fuzzy matching
> algorithms or ones designed for other languages should contact me. I would
> be especially interested in anyone who could help out with an ispell-like
> "spelling correction" fuzzy that could deal with missing accents. Since we
> already keep the ispell affix files around, this seems like a useful idea,
> but I'm not so sure we want to re-implement ispell.
>
> Designing an accents fuzzy wouldn't be a bad idea, though I wonder whether
> there's an easy library call to do the transformation or whether we're
> better off using a user-designed lookup table. The latter would obviously
> solve the language-dependent nature of this, but a library call would
> probably be more efficient.
>
>
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/
>
>
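
To make the library-call versus lookup-table trade-off concrete, here
is a small Python sketch. It uses Python's standard unicodedata module
as a stand-in for "a library call"; that is an assumption for the sake
of illustration, not something ht://Dig provides. EXTRA_MAP and both
function names are hypothetical.

```python
import unicodedata


def strip_accents_library(word):
    """Strip accents by decomposing letters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical user-designed table for letters that have no canonical
# decomposition, so the library approach leaves them untouched.
EXTRA_MAP = {"ß": "ss", "ø": "o", "œ": "oe"}


def strip_accents_table(word, table=EXTRA_MAP):
    """Apply the user table on top of the library-based stripping."""
    return "".join(table.get(ch, ch) for ch in strip_accents_library(word))
```

The library version needs no configuration but only covers letters
that decompose into a base letter plus an accent; the table covers the
language-specific cases at the cost of having to be maintained by the
user.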

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


