[htdig] Re: accents mapping


Subject: [htdig] Re: accents mapping
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Feb 23 2000 - 14:35:47 PST


According to Robert Marchand:
> At 11:03 00-02-23 -0600, Gilles Detillieux wrote:
> >The upper to lower case mapping is a little different in that it's
> >handled pretty much the same way for all languages using an ISO-Latin x
> >encoding, and the locale, if properly defined, gives the mapping.
> >Locales don't give any information for mapping accented to unaccented
> >letters, so that information will have to be provided elsewhere - either
> >hardcoded Latin 1 mappings in the code (which would limit its usefulness),
> >or configurable by the user.
> >
> >Also, as we've discussed previously in many threads, it would be much
> >better to implement the accent handling as a fuzzy match, rather than
> >like the case mapping. As you've realised yourself, patching the code
> >elsewhere would require many, many changes in many parts of the code,
> >even for a quick-and-dirty solution, so you'd probably end up doing more
> >work than it would take to write one new fuzzy match algorithm, with
> >less satisfactory results that would be less likely to be incorporated
> >into the distribution source.
> >
>
> Well, I've pretty much decided to replace some of the "lowercase" calls
> in WordList.cc and parser.cc with a similar function that also flattens
> accents. I'll see tomorrow if it does what we want.

I'm pretty sure that you'll eventually realise that's not the best way
to do this, but you're welcome to find out for yourself. I think once
you start digging in the StringMatch code, you may want to turn around,
but perhaps not. I'm ready to be proven wrong.

> Weighting information is lost (that is, the exact match ranks no better
> than the flattened match), but my colleague and I are not sure a fuzzy
> algorithm would be the best approach. Maybe it has already been
> discussed, but consider the word "éphémère" (it means something that
> does not last a long time).
>
> To match it the fuzzy way, I think you would have to generate all the
> possible words, like éphemere, éphèmére, ephèmere, etc. There are 4
> "e"s, each of which can be replaced with 3 other possible characters:
> "é", "è", "ê". That means 4x4x4x4 = 256 possible words. Of course,
> anyone familiar with French would know the last "e" has little chance
> of having an accent, but there are exceptions, as in "résumé". I'm not
> saying it cannot be done, nor that what I described will come up often,
> but it may be sufficient and better for us to use the "lowercase"
> option.
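
To put a number on the brute-force enumeration described in the quoted
text, here is a minimal Python sketch. The helper name accent_variants
and the choice to vary only the four "e" forms are assumptions made for
illustration; nothing like this exists in the htdig source.

```python
import unicodedata
from itertools import product

# The four spellings of "e" considered in the quoted message:
# plain e, e-acute, e-grave, e-circumflex (written as escapes so the
# source encoding of this file cannot change them).
E_VARIANTS = ("e", "\u00e9", "\u00e8", "\u00ea")

def accent_variants(word):
    """Yield every spelling obtained by replacing each e-like letter
    with any of the four variants (hypothetical helper)."""
    # Use precomposed characters so each accented letter is one code point.
    word = unicodedata.normalize("NFC", word)
    choices = [E_VARIANTS if ch in E_VARIANTS else (ch,) for ch in word]
    for combo in product(*choices):
        yield "".join(combo)

variants = set(accent_variants("éphémère"))
print(len(variants))
```

The count grows as 4 to the power of the number of "e" positions, which
is the blow-up the quoted text is worried about.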

Consider the soundex and metaphone analogy I brought up earlier.
Any "sound" may be produced by many possible letters or letter
combinations. Applied to long words, that gives even more possible
words than your "éphémère" example above. But soundex and metaphone
don't generate ALL possible words. They look at the words that have
been indexed, and record the canonical forms of those words only,
so that when you look up a given word, the search also covers the other
words known to be in the index that have a similar sound.

I'm suggesting the same approach for accents. This algorithm wouldn't
have to generate words like ephèmere if that word doesn't appear in
any indexed document. Only the actual spellings used in documents,
or by the user in the search query, would be mapped. This is quite
different from the synonyms and endings algorithms, which produce static
mappings from pre-defined dictionaries, and don't change their mappings.
The soundex and metaphone algorithms must rebuild their databases each
time you reindex, to get any new words, and the accents algorithm should
do likewise.

For example, after indexing some English and French documents together,
and running "htfuzzy accents", the new database may contain "résumé",
"résume", "resumé", and "resume", which would all have the same canonical
form, but it's pretty unlikely it would find "rêsumé" or "rèsumê", or
any other silly variations, in any documents (unless you're indexing
the htdig mailing list archives :-P), so these just won't enter the
picture in fuzzy matching.
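
The approach described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not htdig code:
the function names strip_accents, build_accent_db and expand_query are
made up for the example, and the canonical form is computed with Unicode
decomposition from Python's standard library rather than with a
user-configurable Latin-x mapping table.

```python
import unicodedata
from collections import defaultdict

def nfc(word):
    """Normalise to precomposed characters so equal spellings compare equal."""
    return unicodedata.normalize("NFC", word)

def strip_accents(word):
    """Canonical form: decompose (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_accent_db(indexed_words):
    """Map each canonical form to the spellings actually seen in the index.
    Like the soundex/metaphone databases, this is rebuilt after reindexing."""
    db = defaultdict(set)
    for word in indexed_words:
        db[strip_accents(word)].add(nfc(word))
    return db

def expand_query(word, db):
    """Return the query word plus every indexed spelling that shares its
    canonical form. Spellings nobody used are never generated."""
    return {nfc(word)} | db.get(strip_accents(word), set())

# Words assumed to have been found while indexing (example data only).
indexed = ["résumé", "résume", "resumé", "resume", "éphémère"]
db = build_accent_db(indexed)
print(sorted(expand_query("resume", db)))
```

Searching for "resume" expands to the four spellings that occur in the
example index and nothing else; a silly variation only shows up if the
user types it in the query, and then only as the query word itself.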

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
