Re: [htdig] problems with accents


Geoff Hutchison (ghutchis@wso.williams.edu)
Thu, 20 May 1999 17:25:21 -0400


At 1:20 PM -0400 5/20/99, Gilles Detillieux wrote:
>In any case, the list you generate should probably be sorted and merged
>with your existing synonyms file, before you build the database.

This sounds like a good idea, though the list doesn't really need to be
sorted. If the generator is written as a Perl script, it could easily sort
the list and/or merge it with the current synonym file.
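
As a rough illustration of the merge step (in Python here rather than Perl,
and only a sketch): it assumes each line of the synonyms file is one group
of whitespace-separated words, and the function name merge_synonym_lines is
made up for this example. Sorting is optional, as noted above, but it makes
the output easy to diff.

```python
def merge_synonym_lines(existing_lines, generated_lines):
    """Merge two lists of synonym lines, dropping blanks and duplicate groups.

    Assumption: one synonym group per line, words separated by whitespace.
    """
    seen = set()
    merged = []
    for line in list(existing_lines) + list(generated_lines):
        words = line.split()
        if not words:
            continue  # skip blank lines
        key = frozenset(words)  # the same words in another order count as a duplicate
        if key in seen:
            continue
        seen.add(key)
        merged.append(" ".join(words))
    return sorted(merged)
```

Feeding it the lines of the current synonyms file plus the generated list
gives back one line per distinct group, ready to be written out before the
database is built.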

>Does this sound like a workable solution, or would it result in a huge,
>unwieldy synonyms database that would cause really poor performance?
>Geoff, as you're more familiar with the synonyms algorithm than me,
>maybe you'd have an idea about this?

The synonym database maps each word in the synonym file to all of the other
words on the same line. Since a search just looks up the word and returns
all of its synonyms, it's a very fast technique. Unless we're talking about
something on the order of several hundred thousand words, it's probably
going to be fine performance-wise. (Well, a nice fast disk never hurt anyone
either.)
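
To make the lookup idea concrete, here is a minimal in-memory sketch in
Python. It is not the actual ht://Dig code: the real database lives on disk,
and the name build_synonym_table and the whitespace-separated line format
are assumptions for illustration.

```python
def build_synonym_table(lines):
    """Map every word to the other words that share a line with it.

    Assumption: one synonym group per line, words separated by whitespace.
    """
    table = {}
    for line in lines:
        words = line.split()
        for word in words:
            entry = table.setdefault(word, [])
            for other in words:
                if other != word and other not in entry:
                    entry.append(other)
    return table


table = build_synonym_table(["resume résumé cv"])
# A search is then a single keyed lookup, whatever the size of the table.
synonyms = table.get("cv", [])
```

Since each search costs one keyed lookup, the size of the file mostly
affects build time and disk space rather than search time.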

However, a separate algorithm to deal with accents may be faster. I'll be
glad to take that up with interested parties on the htdig3-dev list, but
the basic idea is that you have a small set of accent/unaccent rules and
you use them to generate a small number of possible variants of a search
word. This technique wouldn't need a separate database file; it would
probably be fast enough on its own. For example:

resume -> résumé
résumé -> resume
Schrödinger -> Schroedinger
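
One possible way to sketch the rule-based idea, in Python. Everything here
is hypothetical: the RULES table, the accent_variants name, and the limit
parameter are invented for illustration and are not a proposed ht://Dig
interface. Each character is looked up in a small rule table and every
combination of alternatives is generated, up to a cap.

```python
import unicodedata
from itertools import product

# Hypothetical rule table: each character maps to the spellings it may take.
RULES = {
    "e": ["e", "é"],
    "é": ["é", "e"],
    "ö": ["ö", "oe", "o"],
}


def accent_variants(word, rules=RULES, limit=64):
    """Return alternative spellings of word, excluding word itself."""
    # Use precomposed characters so "é" is one code point, matching the table.
    word = unicodedata.normalize("NFC", word)
    choices = [rules.get(ch, [ch]) for ch in word]
    variants = []
    for combo in product(*choices):
        candidate = "".join(combo)
        if candidate != word and candidate not in variants:
            variants.append(candidate)
        if len(variants) >= limit:
            break  # cap the expansion for words with many rule hits
    return variants
```

With this table, accent_variants("resume") includes "résumé" and
accent_variants("Schrödinger") includes "Schroedinger". The number of
variants grows with the product of the alternatives per character, which is
why the rule set needs to stay small. Going from "oe" back to "ö" would need
multi-character rules, which this sketch does not attempt.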

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/


This archive was generated by hypermail 2.0b3 on Thu May 20 1999 - 13:44:36 PDT