[htdig] Troubles with accents on a french website


Subject: [htdig] Troubles with accents on a french website
From: Saad Kadhi (Saad.Kadhi@neurocom.com)
Date: Tue Dec 19 2000 - 11:08:39 PST


Hi List,

This is a rather long email, so here is the "executive version": how can
I get htdig to search a French website that contains accented characters
when it runs on a system that doesn't support locales?

The "tech version":
I'm trying to get ht://dig 3.1.5 to index/search a French website
containing accents. I have *extensively* RTFMed the mail archives &
the contributed work/guides. My first try was "le kit de francisation"
made by Didier Lebrun. After running the rundig script it provides, I
realized that this kit cannot be the solution, since htdig is running on
an OpenBSD machine which doesn't support locales :(

I then tried Daniele Bufarini's accents.zip patch, which is supposed to
make htdig strip the accents from the documents it digs before putting
the words into the words db. It also modifies htsearch so that it strips
accents from the search keywords & searches for the unaccented
equivalents.

If I'm getting it straight, this patch == my solution since:
1. htdig "digs" into the French documents & strips the accents before
building the dbs (vulnérabilité will then be indexed as vulnerabilite)
2. when a user enters "vulnérabilité", htsearch interprets it as
"vulnerabilite" & matches because of 1
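The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from ht://dig or from any of the patches: `strip_accents`, `build_index` and `search` are hypothetical names, and the in-memory dict stands in for the words db. The point is that indexing and searching must apply exactly the same normalization.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed (é -> e, â -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(documents):
    """Step 1: map each unaccented, lowercased word to the documents containing it."""
    index = {}
    for name, body in documents.items():
        for word in body.split():
            index.setdefault(strip_accents(word).lower(), set()).add(name)
    return index

def search(index, query):
    """Step 2: normalize the query the same way as the index, then look it up."""
    return index.get(strip_accents(query).lower(), set())

# Hypothetical sample pages, used only for this illustration.
docs = {
    "page1.html": "une vulnérabilité critique",
    "page2.html": "la sécurité du réseau",
}
index = build_index(docs)
print(search(index, "vulnérabilité"))
print(search(index, "vulnerabilite"))
```

With this sketch, the accented and unaccented queries both find page1.html, because both end up as "vulnerabilite" before the lookup.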

But it doesn't work at all! In fact, after some debugging, we
discovered that the patch only strips the first occurrence of each
accented letter (so sécurité becomes securité & tétâtétâ becomes
tetatétâ). So a rather lame thought occurred to me: I mixed Daniele's
patches with the files from "le kit de francisation" (synonyms.fr,
francais.0, francais.aff, bad_words.fr ...). It did no good (I dunno why
I did that).
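The symptom described above is what one would see if each accented letter were replaced only once per word. The Python sketch below reproduces that behaviour next to a version that replaces every occurrence. It is a guess at the kind of bug involved, not the patch's actual code: `ACCENT_MAP`, `strip_first_only` and `strip_all` are hypothetical names made up for this illustration.

```python
# Small, deliberately incomplete mapping, assumed for this illustration.
ACCENT_MAP = {
    "é": "e", "è": "e", "ê": "e", "à": "a", "â": "a",
    "ç": "c", "ù": "u", "û": "u", "î": "i", "ô": "o",
}

def strip_first_only(word):
    """Hypothetical buggy variant: replaces only the first occurrence of each accented letter."""
    for accented, plain in ACCENT_MAP.items():
        word = word.replace(accented, plain, 1)  # count=1 stops after one replacement
    return word

def strip_all(word):
    """Replaces every accented letter found in the mapping."""
    return "".join(ACCENT_MAP.get(ch, ch) for ch in word)

for sample in ("sécurité", "tétâtétâ"):
    print(sample, strip_first_only(sample), strip_all(sample))
```

Here `strip_first_only` turns sécurité into securité and tétâtétâ into tetatétâ, matching what we observed, while `strip_all` gives securite and tetateta.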

I ended up testing the remaining accents patch, accents.5 by Robert
Marchand. It doesn't work either, although it may still be a lead to a
solution. As I said before, my system doesn't support locales. So if I
can get htdig to index accented words as their unaccented equivalents &
if htsearch strips accents from the queries before looking them up,
it'll be great.
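Since the constraint here is the lack of locale support, one locale-free way to do this is a fixed 256-entry byte table applied to the raw text. The Python sketch below assumes the pages are encoded in ISO-8859-1 (an assumption, it is not stated above); `GROUPS`, `make_table` and `strip_accents_latin1` are hypothetical names, and none of this is taken from ht://dig or the patches.

```python
# Unaccented letter -> accented ISO-8859-1 letters that should map to it.
GROUPS = {
    "a": "àáâãäå", "c": "ç", "e": "èéêë", "i": "ìíîï",
    "n": "ñ", "o": "òóôõö", "u": "ùúûü", "y": "ýÿ",
    "A": "ÀÁÂÃÄÅ", "C": "Ç", "E": "ÈÉÊË", "I": "ÌÍÎÏ",
    "N": "Ñ", "O": "ÒÓÔÕÖ", "U": "ÙÚÛÜ", "Y": "Ý",
}

def make_table():
    """Build a 256-byte translation table; bytes not listed map to themselves."""
    table = bytearray(range(256))
    for plain, accented in GROUPS.items():
        for byte in accented.encode("latin-1"):
            table[byte] = ord(plain)
    return bytes(table)

TABLE = make_table()

def strip_accents_latin1(raw):
    """Translate every byte through the table, so all occurrences are handled."""
    return raw.translate(TABLE)

raw = "tétâtétâ sécurité".encode("latin-1")
print(strip_accents_latin1(raw).decode("latin-1"))
```

Because the table is a plain array indexed by byte value, it needs no help from the C library's locale machinery; the same lookup would have to be applied both when digging and when searching.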

I've been working on this for more than 4 days non-stop now & I may end
up in an asylum :-( or reinstalling the machine with Linux, which has
locales, but I'd really luv to see htdig humming on my OpenBSD box
instead. So if you have any idea/patch/tip/whatever to get this to work,
I'll be *extremely* grateful.

Thanks in advance for *any* help you can provide!
Regards,
Saad.

-- 
Saad Kadhi -- Network & Security Engineer
----------------------------------------------------------------
preferred TV Program: OpenBSD Vs. Script Kitties starring RamBlow
outstanding Unix features: immutable & sappnd
preferred kernel security level: 2
preferred holy saying: RTFM 
nodisclaimer



