Re: [htdig] A Suggestion on Accents


Subject: Re: [htdig] A Suggestion on Accents
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue May 16 2000 - 09:35:21 PDT


According to D.J.Adams@soton.ac.uk:
> > >Rather than a fuzzy accents search method, why not make the htdig database
> > >accent independent? After all, it is case independent already!
> > >For example:
> > >
> > >Garçon -> Garçon -> garçon -> garcon
> >
> > I would make the analogy to word suffixes rather than to case. There
> > is an endings fuzzy rather than a general stemming step during
> > indexing. IMHO, this makes searches a bit more precise because the
> > alternatives will get less weight than what the user actually
> > entered. (Remember the old maxim "the customer is always right?")
> >
> > Besides, there are some situations where the unaccented word and the
> > accented word do *not* mean the same thing.
>
> Yes, and when I search for 'garçon' am I looking for a waiter or a school boy?

Yes, homonyms pose a problem in searches. However, while we can't do
anything to solve the ambiguities in languages, we can avoid introducing
further ambiguities by not stripping out information which may be
relevant.
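
To make the point about lost information concrete, here is a minimal
sketch (not ht://Dig code; the helper name strip_accents and the word
list are just illustrative assumptions) of what stripping accents at
indexing time does to four distinct French words:

```python
import unicodedata

def strip_accents(word: str) -> str:
    """Drop combining marks after NFD decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Four distinct French words: rating, rated, coast/rib, side.
words = ["cote", "coté", "côte", "côté"]

# An index that strips accents at indexing time keeps a single key
# for all four, so the distinction cannot be recovered at search time.
stripped_index = {strip_accents(w) for w in words}
print(stripped_index)  # prints {'cote'}
```

Once the index holds only the stripped form, no search-time option can
tell these words apart again.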

The big advantage of dealing with accents as a fuzzy match algorithm
(apart from the fact that most of the infrastructure for this was already
in place) is that it can be selectable and configurable at search time.
For some searches, you may want to treat accented and unaccented letters
as equivalent, but in some circumstances that is not desired.

By putting the search_algorithm attribute in the hands of the user (which
is very easy to do), one can select what weight, if any, the unaccented
counterparts of an accented word will have in a search. Stripping out
accents at indexing time, by contrast, is largely irrevocable, and takes
the choice completely out of the hands of the configurator or user at
search time.

I'm having a hard time seeing what the downside of the fuzzy algorithm is,
other than the extra step of building the accents database, and the space
it will take.
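
As a rough illustration of the idea (a sketch under my own assumptions,
not the actual ht://Dig implementation; build_accents_db, expand and the
weight parameters are hypothetical names), the accents database can be
thought of as a map from each unaccented form to the indexed words that
reduce to it, consulted at search time with a configurable weight:

```python
import unicodedata
from collections import defaultdict

def strip_accents(word):
    """Drop combining marks after NFD decomposition (e.g. 'ç' -> 'c')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_accents_db(indexed_words):
    """Map each unaccented key to the set of indexed words that reduce to it."""
    db = defaultdict(set)
    for w in indexed_words:
        db[strip_accents(w)].add(w)
    return db

def expand(query, db, exact_weight=1.0, accents_weight=0.5):
    """Return {candidate word: weight} for a query word.

    The word as typed always gets exact_weight; words that differ only
    in accents get accents_weight. Setting accents_weight to 0 turns
    the fuzzy match off entirely.
    """
    candidates = {query: exact_weight}
    if accents_weight > 0:
        for w in db.get(strip_accents(query), ()):
            # setdefault keeps the higher exact weight for the typed word
            candidates.setdefault(w, accents_weight)
    return candidates

# The index itself keeps the words exactly as they appeared in documents.
db = build_accents_db(["garçon", "garcon", "côte", "cote"])

print(expand("garçon", db))                    # typed word plus 'garcon' at lower weight
print(expand("garçon", db, accents_weight=0))  # typed word only
```

The extra cost is the side table itself; the main index is untouched, so
the same data serves both accent-sensitive and accent-insensitive searches.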

Personally, I don't think the analogy between letter case and accents
is a valid one. In most languages accents convey more information than
letter case does. Indeed, in some languages, an accented letter is a
completely different letter than its unaccented counterpart, occupying
its own slot in that language's alphabet. In Swedish, for example,
ö is not the same as o, and it comes towards the end of the alphabet.
The analogy probably holds true only in English, where accents are viewed
as little more than embellishments that convey little information.

Using a fuzzy algorithm for accents makes sense because the relationship
between accented and unaccented letters is just that - fuzzy. In some
cases you want to treat them as equivalent (or close to it) because
accents are often incorrectly omitted in indexed documents, or because
they are often difficult for users to enter correctly. In other cases,
you want to make a clear distinction.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
