Re: [htdig] Accentuated characters


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 25 Aug 1999 10:16:49 -0500 (CDT)


According to Tomas Garcia Ferrari:
> I'm trying to find a solution for the problem of words with accented
> characters. I'm running a website in German and English, and a lot
> of information is about people, so I have last names with accents,
> umlauts, etc.
>
> How can I set up ht://Dig to match 'garcía' with 'garcia', for example?

This has been discussed a few times before, but a good solution has yet to
be devised. Last time the topic came up, I suggested an "accent" fuzzy
match to handle equivalent vowels like this. Nobody has volunteered to
implement such a strategy, though. Part of the problem is that some of
the equivalences may be language dependent. E.g. ö -> o in French but
ö -> oe in German.
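
To illustrate the language dependence, here is a rough Python sketch (not
part of ht://Dig, and not the proposed "accent" fuzzy algorithm itself).
The function name fold(), the OVERRIDES table and its contents are all
assumptions made up for this example; it only handles lowercase input.

```python
import unicodedata

# Hypothetical per-language overrides. Any character not listed here
# falls back to plain accent stripping.
OVERRIDES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "fr": {},
}

def fold(word, lang):
    """Map a lowercase word to an unaccented form for the given language."""
    table = OVERRIDES.get(lang, {})
    out = []
    # NFC first, so each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", word):
        if ch in table:
            out.append(table[ch])
        else:
            # Decompose the letter and drop the combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

print(fold("möller", "de"))
print(fold("möller", "fr"))
```

The same input folds differently depending on the language tag, which is
why a single global mapping would not suit a mixed German/English site.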

An interim solution would be to use the synonyms fuzzy match to deal
with common equivalences. That would require either manually adding
all variations, or at least the most common ones, to your synonyms file,
and rebuilding the synonyms database. This process could be automated
somewhat by running a dictionary for your language through an accent
stripping filter that would spit out all possible combinations for
accented words. E.g.:

        résumé

would generate

        résumé resumé résume resume

That may end up generating a huge synonyms database, which may adversely
affect search performance (you won't know until you try, I guess).
It would also mean that words not in the dictionary (e.g. proper nouns)
may get missed. This could be made more thorough by periodically
processing all words in your db.wordlist similarly.
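
The accent stripping filter described above could be sketched roughly as
follows in Python. This is only an illustration under some assumptions:
the helper names are invented, the input is any iterable of words (how
you extract them from a dictionary or from db.wordlist is left out), and
the output is one whitespace-separated group per line. Check the htfuzzy
documentation for the exact synonyms file format before using it.

```python
import itertools
import unicodedata

def strip_accent(ch):
    """Return ch with any combining marks removed."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def accent_variants(word):
    """All spellings of word with each accent either kept or stripped."""
    word = unicodedata.normalize("NFC", word)
    choices = []
    for ch in word:
        plain = strip_accent(ch)
        # Unaccented letters offer one choice, accented letters two.
        choices.append((ch,) if plain == ch else (ch, plain))
    return ["".join(combo) for combo in itertools.product(*choices)]

def synonym_lines(words):
    """Yield one line per word that has at least one accented letter."""
    for w in words:
        variants = accent_variants(w)
        if len(variants) > 1:
            yield " ".join(variants)

for line in synonym_lines(["résumé", "garcía", "table"]):
    print(line)
```

Note that a word with n accented letters produces 2^n variants, so the
output grows quickly for heavily accented words. That is the database
size concern mentioned above.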

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



