Re: [htdig] Re: accents mapping

Subject: Re: [htdig] Re: accents mapping
From: Robert Marchand (robert.marchand@UMontreal.CA)
Date: Thu Feb 24 2000 - 09:05:49 PST

At 18:28 00-02-23 -0600, Geoff Hutchison wrote:
>At 4:35 PM -0600 2/23/00, Gilles Detillieux wrote:
>>Consider the soundex and metaphone analogy I brought up earlier.
>>Any "sound" may have many possible letters or letter combinations to
>>produce them. When applied to long words, you'd have even more possible
>>words than for your "éphémère" example above. But soundex and metaphone
>>don't generate ALL possible words. They look at all the words that have
>>been indexed, and record all the canonical forms of these words only,
>>so that when you look up a given word, it will also search for other
>>words that it knows are in the index that have a similar sound.
>Yeah, I think you're right that an on-the-fly fuzzy isn't going to be
>very fast. Of course the problem with something based on the soundex
>or metaphone algorithms is that you have to be sure to run htfuzzy
>periodically, but the lookups would be pretty fast.
>But to echo what Gilles said, you really don't want to be messing
>around in WordList or parser, especially if you don't know what
>you're doing. I think the Fuzzy class is pretty self-explanatory and
>almost anyone could write a fuzzy class. The key for the Soundex and
>Metaphone variety is the generateKey() method. The key for the
>Speling and Substring variety is the getWords() method.
>-Geoff Hutchison
>Williams Students Online
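
To make the quoted idea concrete, here is a minimal sketch of an index-driven fuzzy lookup applied to accents. It is not ht://Dig code: the names `generate_key`, `build_accent_index` and `lookup` are hypothetical (the first only echoes the generateKey() method mentioned above), and the word list is made up. The point is that canonical keys are recorded only for words already in the index, so a query never enumerates every possible accented spelling.

```python
import unicodedata
from collections import defaultdict


def generate_key(word):
    """Fold a word to a canonical form: lowercase, accents removed."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Drop the combining marks left over after decomposition (e.g. the acute in "é").
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_accent_index(indexed_words):
    """Map each canonical key to the set of indexed words that share it."""
    index = defaultdict(set)
    for word in indexed_words:
        index[generate_key(word)].add(word)
    return index


def lookup(index, query):
    """Return the indexed words whose canonical key matches the query's key."""
    return sorted(index.get(generate_key(query), ()))


# Hypothetical set of words found during indexing.
indexed = ["université", "universités", "éphémère", "ephemere", "montréal"]
accent_index = build_accent_index(indexed)

print(lookup(accent_index, "Universite"))
print(lookup(accent_index, "ephemere"))
```

As with soundex and metaphone, the table would have to be rebuilt whenever the word index changes, but each lookup is a single key computation plus one dictionary access. Note that the plural "universités" has its own key, so matching it from "Universite" would still depend on endings expansion.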


  What are the steps to create a new fuzzy algorithm?
  I mean, apart from creating a new class, what needs to be changed
  in order to register a name to be used in the configuration files?

  Does the main htsearch also need to be changed?
  Is there documentation for this process?

For the record, the modifications I've made in WordList and parser seem
to work, and they were pretty easy, but there are problems and I want to
take a look at the 'fuzzy way', which is certainly more elegant.

One problem I've seen with my approach is that the endings database is
untouched, so a search for "Université" is expanded into (université or
universités) while a search for "Universite" is not. This was the case
before the patch but it is more apparent now. I'm not sure the fuzzy
algorithm would cure it unless fuzzy algorithms are applied on each


P.S.: I have my patches available should anybody want to look at them.

Robert Marchand tél: 343-6111 poste 5210
DiTER-SDI e-mail:
Université de Montréal Montréal, Canada

