Re: [htdig] One possible solution for french accents support

Salim Gasmi
Fri, 29 Oct 1999 23:55:45 +0200

At 14:56 29/10/99 -0500, you wrote:

>Yikes! I have a hard time believing that your patch_accents program would
>not start clobbering all sorts of data in db.docdb that it shouldn't.
>I'm assuming the whole point of this is to strip out the accents from
>the document excerpts, so that excerpt highlighting works for unaccented
>search words.

>If so, why not just strip out the accents on the fly in
>htsearch/, before doing any searches on the excerpt, or
>better yet, just poke in some entries in the translate table, set in
>StringMatch::IgnoreCase() (in htlib/, to map accented
>letters to equivalent lower-case unaccented letters? The letter mapping
>in could also be done much more efficiently with a mapping

>The best approach, though, would be to define a new "accent" fuzzy match
>algorithm, which, when given a word, would search the word database
>for all accented and unaccented equivalents. The main engine of this
>would be very much like the current htfuzzy/ algorithm.
>It would be more work, but you'd have something that would be selectable
>by the search_algorithm config attribute, and would fit in well with
>the existing code.
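
The "accent" fuzzy match idea quoted above could be sketched roughly as follows. This is only an illustration in Python, not htfuzzy code: the function names (strip_accents, build_accent_index, accent_matches) and the in-memory index are hypothetical, and a real implementation would sit on top of the ht://Dig word database instead.

```python
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Lower-case a word and drop its combining accent marks."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_accent_index(words):
    """Group every indexed word under its unaccented key (hypothetical index)."""
    index = defaultdict(set)
    for word in words:
        index[strip_accents(word)].add(word)
    return index


def accent_matches(index, query):
    """Return all indexed words equal to the query once accents are ignored."""
    return sorted(index.get(strip_accents(query), ()))


if __name__ == "__main__":
    idx = build_accent_index(["élève", "eleve", "élevé", "maison"])
    print(accent_matches(idx, "eleve"))
```

The query word is reduced to its unaccented key, and every word sharing that key is returned, so accented and unaccented spellings find each other in both directions.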


I agree with all of your remarks.

I was also amazed that my patch_accent
did not totally corrupt the db file ;)

It looks like the character codes I modify are not used as separators or
attributes. Please note that I took care to modify only bytes that were
inside an ASCII string :)

In fact, I only wrote this patch to suit my own purposes.

I made it public because, after searching for "accents french"
on the htdig site, I found a huge number of people looking
for a solution ....

Don't get me wrong: this patch is not an academic one;
it is a dirty and straightforward one (as I said on my page).

My idea of a *good* patch is something like a conf file, let's
call it transcode.conf,
which would contain character equivalences.
This file would be used by htsearch and htfuzzy.
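
To make the transcode.conf idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the file syntax ("plain: accented accented ..." with "#" comments), the function names, and the choice of ISO-8859-1 as the byte encoding. None of it exists in ht://Dig today.

```python
def parse_transcode(text):
    """Parse 'plain: accented accented ...' lines into {accented: plain}."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        plain, _, accented = line.partition(":")
        for ch in accented.split():
            mapping[ch] = plain.strip()
    return mapping


def build_table(mapping, encoding="latin-1"):
    """Build a 256-entry byte table; unmapped bytes map to themselves."""
    table = bytearray(range(256))
    for accented, plain in mapping.items():
        src = accented.encode(encoding)
        dst = plain.encode(encoding)
        if len(src) != 1 or len(dst) != 1:
            raise ValueError("only single-byte equivalences are supported")
        table[src[0]] = dst[0]
    return bytes(table)


def fold(data, table):
    """Apply the equivalence table to a byte string in one pass."""
    return data.translate(table)


if __name__ == "__main__":
    conf = "# hypothetical transcode.conf\ne: é è ê ë\na: à â\nc: ç\n"
    table = build_table(parse_transcode(conf))
    print(fold("élève français".encode("latin-1"), table))
```

Because each equivalence is one byte to one byte, the folded text keeps the same length as the original, so positions found in the folded excerpt still line up with the original excerpt for highlighting.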

Best regards,


Salim Gasmi
System and network administrator.
SdV Plurimedia


