Subject: [htdig] accents mapping
From: Robert Marchand (robert.marchand@UMontreal.CA)
Date: Wed Feb 23 2000 - 13:59:48 PST


At 11:03 00-02-23 -0600, Gilles Detillieux wrote:
>
>The upper to lower case mapping is a little different in that it's
>handled pretty much the same way for all languages using an ISO-Latin x
>encoding, and the locale, if properly defined, gives the mapping.
>Locales don't give any information for mapping accented to unaccented
>letters, so that information will have to be provided elsewhere - either
>hardcoded Latin 1 mappings in the code (which would limit its usefulness),
>or configurable by the user.
>
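
For illustration, a rough sketch of that locale-driven mapping (my own
example, not htdig code; the locale name is system-dependent): with a
Latin-1 French locale installed, tolower() maps the accented capitals too.

    #include <locale.h>
    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        /* locale name varies by system; this one is a guess */
        setlocale(LC_CTYPE, "fr_FR.ISO-8859-1");
        unsigned char c = 0xC9;    /* capital E-acute in ISO-8859-1 */
        /* prints "C9 -> E9" (lowercase e-acute) if the locale loaded */
        printf("%02X -> %02X\n", (unsigned)c, (unsigned)tolower(c));
        return 0;
    }
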
>Also, as we've discussed previously in many threads, it would be much
>better to implement the accent handling as a fuzzy match, rather than
>like the case mapping. As you've realised yourself, patching the code
>elsewhere would require many, many changes in many parts of the code,
>even for a quick-and-dirty solution, so you'd probably end up doing more
>work than it would take to write one new fuzzy match algorithm, with
>less satisfactory results that would be less likely to be incorporated
>into the distribution source.
>

Well, I've pretty much decided to replace some of the "lowercase" calls
in WordList.cc and parser.cc with a similar function that also does
accent flattening. I'll see tomorrow if it does what we want.
Weighting information is lost (that is, an exact match scores no better
than a flattened match), but my colleague and I are not sure a fuzzy
algorithm would be best. Maybe it has already been discussed, but
consider the word "éphémère" (it means something that does not last
long).
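
To make that concrete, here is a rough sketch of such a flattening
function for ISO-Latin-1 (the table is only partial, and the real change
would go where WordList.cc and parser.cc call "lowercase"):

    #include <ctype.h>

    /* Sketch only: lowercase plus Latin-1 accent flattening,
     * covering just the common French letters. */
    char flatten_lower(unsigned char c)
    {
        c = tolower(c);    /* assumes a Latin-1 locale is set */
        switch (c) {
        case 0xE0: case 0xE2: case 0xE4:             /* a grave/circumflex/diaeresis */
            return 'a';
        case 0xE8: case 0xE9: case 0xEA: case 0xEB:  /* e grave/acute/circumflex/diaeresis */
            return 'e';
        case 0xEE: case 0xEF:                        /* i circumflex/diaeresis */
            return 'i';
        case 0xF4: case 0xF6:                        /* o circumflex/diaeresis */
            return 'o';
        case 0xF9: case 0xFB: case 0xFC:             /* u grave/circumflex/diaeresis */
            return 'u';
        case 0xE7:                                   /* c cedilla */
            return 'c';
        default:
            return (char)c;
        }
    }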

In order to match it the fuzzy way, I think you would have to generate
all the possible words: éphemere, éphèmére, ephèmere, etc. There are
4 "e"s, and each can be replaced with 3 other possible characters:
"é", "è", "ê". That means 4x4x4x4 = 256 possible words. Of course,
anyone familiar with French knows the last "e" is unlikely to carry an
accent, but there are exceptions, like "résumé". I'm not saying it
can't be done, nor that the case I described comes up often, but the
"lowercase" option may be sufficient and better for us.
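
To put a number on it, a brute-force expansion of "éphémère" would look
like this sketch, which just enumerates the 4x4x4x4 = 256 candidates:

    #include <stdio.h>

    /* Each of the 4 e-positions in "ephemere" can be one of
     * e, é, è, ê (Latin-1 bytes below): 4^4 = 256 spellings. */
    int main(void)
    {
        const char *e[4] = { "e", "\xe9", "\xe8", "\xea" };
        int count = 0;
        for (int a = 0; a < 4; a++)
            for (int b = 0; b < 4; b++)
                for (int c = 0; c < 4; c++)
                    for (int d = 0; d < 4; d++) {
                        printf("%sph%sm%sr%s\n", e[a], e[b], e[c], e[d]);
                        count++;
                    }
        printf("%d candidates\n", count);    /* prints 256 */
        return 0;
    }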

Thanks for responding to me.
One of ht://Dig's strong points compared to other search engines has
been the quality and level of discussion on this list. It's good to
have the experts online.

>Geoff suggests basing it on the Substring fuzzy algorithm, but the more
>I think about it, the more I think it should be done like soundex and
>metaphone. Consider the similarities: soundex and metaphone look at all
>the indexed words, and build a database of these words in a canonical
>form that represents how the word sounds, mapping a given word "sound"
>to all the forms and spellings in the indexed documents that yield that
>sound. The accents algorithm is the same idea, except instead of going
>by sounds, the canonical form is the unaccented equivalent of the word.
>Most of the fuzzy match infrastructure is already in place, so by adding
>an algorithm, you automatically get the search for all alternate forms
>of a word, in both the database and the excerpt highlighting. Plus the
>feature can be selected at run time via the search_algorithms attribute.
>
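
If I understand the soundex-style idea, it avoids that 256-way expansion
entirely. A rough sketch of my own (not the actual fuzzy code): index
each word under its unaccented canonical form, then a single lookup at
search time retrieves every accented spelling that actually occurred.

    #include <ctype.h>
    #include <map>
    #include <set>
    #include <string>

    /* tiny Latin-1 flattening, enough for this sketch */
    static char flat(unsigned char c)
    {
        c = tolower(c);
        if (c >= 0xE8 && c <= 0xEB) return 'e';  /* e grave/acute/circ/diaeresis */
        if (c == 0xE7) return 'c';               /* c cedilla */
        return (char)c;
    }

    static std::string canonical(const std::string &w)
    {
        std::string out;
        for (size_t i = 0; i < w.size(); i++)
            out += flat((unsigned char)w[i]);
        return out;
    }

    /* canonical form -> every spelling seen while indexing */
    static std::map< std::string, std::set<std::string> > accent_db;

    void index_word(const std::string &w)
    {
        accent_db[canonical(w)].insert(w);
    }

    /* searching "ephemere" finds "éphémère" in one lookup,
     * with no need to generate candidate spellings */
    std::set<std::string> lookup(const std::string &query)
    {
        return accent_db[canonical(query)];
    }
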
>The actual canonicalisation of words in that accents algorithm could be
>even simpler than the soundex or metaphone algorithms. You could do it
>with a lookup table that gives one-to-one character mappings. However,
>to provide the added flexibility of one-to-many, many-to-one and
>many-to-many mappings as well, as we had discussed in a couple earlier
>threads, would make this algorithm even more useful, without adding
>much complexity. The HtWordCodec code that Hans-Peter developed could
>do all of these string mappings for us. I envision using it very much
>like how the url_part_aliases mappings are handled. When you look at
>it this way, implementing this really involves gluing together existing
>pieces, so it shouldn't be that hard, and then configuring it with a
>new config attribute that contains all the accent to plain letter mappings,
>as string pairs in a string list.
>
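
Something like this, perhaps (the attribute name is my invention; the
pair format follows url_part_aliases, and the "ß ss" entry shows a
one-to-many mapping):

    # hypothetical config attribute: string pairs in a string list,
    # accented form first, plain form second
    accent_to_plain: é e  è e  ê e  ë e  à a  â a  ç c  ß ss
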
>There have been a few kludgy approaches to mapping accents before, but
>none did a good job of it (one of them involved blindly zapping data in
>one of the .db files, regardless of whether it was text or not), and so
>none were good enough to include in the distributions. I really don't
>think it would be much more effort to implement a proper fuzzy algorithm,
>and the end result would be so much better.
>
>--
>Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
>Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
>Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
>Winnipeg, MB R3E 3J7 (Canada)     Fax:    (204)789-3930
>
-------
Robert Marchand           tel: 343-6111 ext. 5210
DiTER-SDI                 e-mail: marchanr@diter.umontreal.ca
Université de Montréal    Montréal, Canada
