Re: [htdig] One possible solution for french accents support


Salim Gasmi (salim@gasmi.net)
Fri, 29 Oct 1999 23:55:45 +0200


At 14:56 29/10/99 -0500, you wrote:

>Yikes! I have a hard time believing that your patch_accents program would
>not start clobbering all sorts of data in db.docdb that it shouldn't.
>I'm assuming the whole point of this is to strip out the accents from
>the document excerpts, so that excerpt highlighting works for unaccented
>search words.

>If so, why not just strip out the accents on the fly in
>htsearch/Display.cc, before doing any searches on the excerpt, or
>better yet, just poke in some entries in the translate table, set in
>StringMatch::IgnoreCase() (in htlib/StringMatch.cc), to map accented
>letters to equivalent lower-case unaccented letters? The letter mapping
>in String.cc could also be done much more efficiently with a mapping
>table.

>The best approach, though, would be to define a new "accent" fuzzy match
>algorithm, which, when given a word, would search the word database
>for all accented and unaccented equivalents. The main engine of this
>would be very much like the current htfuzzy/Substring.cc algorithm.
>It would be more work, but you'd have something that would be selectable
>by the search_algorithm config attribute, and would fit in well with
>the existing code.

Gilles,

I agree with all of your remarks.

I was also amazed that my patch_accent
was not totally corrupting the db file ;)

It looks like the ASCII codes it modifies are not used as separators or
attributes. Please note that I took care to modify only bytes that are
inside an ASCII string :)

In fact, I wrote this patch only to suit my own purposes.

I made it public because, after searching for "accents french"
on the htdig site, I found a huge number of people trying
to find a solution ....

Don't get me wrong: this patch is not an academic one,
it is a dirty and straightforward one (as I said on my page).

My idea of a *good* patch is something like a conf file, let's
call it transcode.conf,
which would contain character equivalences.
This file would be used by htsearch and htfuzzy.
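
To make the transcode.conf idea a bit more concrete, here is a rough sketch
of how such a file might be loaded into a byte mapping table. Nothing below
exists in htdig: the file format (one "<decimal byte code> <replacement
char>" pair per line, '#' for comments) and the names TranscodeTable,
load_transcode and transcode are my own assumptions, and it only handles
single-byte charsets such as ISO-8859-1.

```cpp
#include <array>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical type: a 256-entry byte-to-byte table, identity by default.
using TranscodeTable = std::array<unsigned char, 256>;

// Parse lines of the assumed form "<decimal byte code> <replacement char>",
// e.g. "233 e" to map ISO-8859-1 e-acute to a plain 'e'.
// Empty lines, lines starting with '#', and malformed lines are skipped.
TranscodeTable load_transcode(std::istream& in) {
    TranscodeTable table;
    for (int i = 0; i < 256; ++i)
        table[i] = static_cast<unsigned char>(i);

    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#')
            continue;
        std::istringstream fields(line);
        int code;
        char replacement;
        if (fields >> code >> replacement && code >= 0 && code < 256)
            table[code] = static_cast<unsigned char>(replacement);
    }
    return table;
}

// Apply the table to every byte of a word or an excerpt.
std::string transcode(const TranscodeTable& table, const std::string& text) {
    std::string out(text);
    for (char& c : out)
        c = static_cast<char>(table[static_cast<unsigned char>(c)]);
    return out;
}
```

With a file containing "233 e" and "224 a", transcode() would turn the
ISO-8859-1 bytes of "café" into "cafe". The same table could in principle be
shared by htsearch and htfuzzy, which is the whole point of keeping the
equivalences in one conf file instead of hard-coding them.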

Best regards,

Salim

***********************************************
Salim Gasmi <http://www.gasmi.net>
System and network administrator.
SdV Plurimedia <http://www.sdv.fr>

PGP Key: http://www.gasmi.net/pgp.txt
***********************************************



