Re: [htdig] Accent problem.

Subject: Re: [htdig] Accent problem.
From: Gilles Detillieux
Date: Tue May 16 2000 - 10:36:02 PDT

According to "NEPOTE Charles (Neuilly Gestion)":
> I ran a very careful test: with only 7 documents, always regenerating the
> database from scratch to rule out corruption of the database. To do
> so I used:
>   time rundig -v -s -a -c /etc/htdig/htdig.essai.conf | tee /var/lib/htdig/essai1.txt
> and I always monitored the process. The rundig script is the original
> script (not modified).
> I am quite sure the database is not corrupted.
> So it should be a sorting problem...

Yes, it seems that way.

> My config :
> Pentium Pro 200
> Linux Mandrake 7.0; automatic install in French.
> (As I am a Linux newbie, I don't know which details would help you. One thing
> I am quite sure of is that I didn't make many changes to the original config.
> In particular, I didn't make "locale" changes (I don't know how to do it
> !...)).
> ht://Dig 3.1.5 installed via an RPM specially made for Mandrake 7.0, by
> MandrakeSoft, downloaded at :
> (note is an official mirror for MandrakeSoft).

I'm not familiar with this RPM, but it sounds to me, from other messages
on the list, that it was properly built, so we'll assume it's OK.
It's been reported that the Red Hat RPMs on do NOT work
correctly on Mandrake systems. I think we've ruled out such a mismatch
in your case, however.

> I did a normal install of the RPM without changing anything but the
> htdig.conf file:
> -- I added locale: fr_FR
> -- I modified other attributes that have nothing to do with the locale problem.
> > You may also want to try setting your
> > LOCALE environment variable to something other than fr_FR
> > (e.g. en_US),
> > so that the sort will not do any accent folding, if indeed that is
> > the problem.
> Strange thing : when I put locale: en_US in htdig.essai.conf, the result is
> the same !
> And accented chars are still in db.wordlist, in the same order as before...

That's because the locale that you set in your htdig config file does
not get passed via the environment to the sort program. I'm suspecting
that your sort program is collating based on collating rules for French.
What htmerge and htsearch require, though, is a straight sort of the
data based on numerical values of characters, as they don't look at the
LC_COLLATE information. If your LOCALE environment variable is normally
set to fr_FR, then that's what sort will see, regardless of what you
set in your htdig.conf. Instead, try:

        export LOCALE=en_US # for sh, bash, ksh, zsh, etc.
        setenv LOCALE en_US # for csh or tcsh

before running rundig, to see if that affects the sorting order. It may
be sufficient to set just the environment variable LC_COLLATE to en_US,
so that only the collating sequence is affected.
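
A hedged aside for anyone trying this: on POSIX systems, sort takes its
collating order from LC_ALL, then LC_COLLATE, then LANG, so those are the
variables to try if setting LOCALE has no visible effect. The "C" (or
"POSIX") locale compares raw byte values, which is the straight numerical
ordering described above. A minimal sketch, using a made-up word list
rather than a real db.wordlist:

```shell
#!/bin/sh
# Hypothetical sample words (not taken from a real db.wordlist): one
# accented, one capitalized, two plain lowercase.
words=$(printf '%s\n' zebre école Zoo abc)

# Setting LC_ALL for this one command overrides LANG and LC_COLLATE.
# In the "C" locale, sort compares byte values, so capitals come before
# lowercase letters and accented characters come after "z".
sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)

printf '%s\n' "$sorted"
```

The same prefix form could be tried on the indexing run itself, e.g.
"LC_ALL=C rundig -c /etc/htdig/htdig.essai.conf", on the assumption that
the sort program started by rundig inherits the environment in the usual
way.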

> > Some French teachers in Canada also taught not to put accents
> > on capitals,
> > but it didn't really catch on. I never realized that
> > convention came about
> > just because of the difficulty of using accents on typewriters.
> Today's machines still work against cultural diversity: there is
> no easy way to type accented UPPERCASE on our French (and probably even
> Quebec) keyboards. (You have to remember Alt+0201 for an ...).

That's unfortunate. I think some French Canadian keyboard layouts give
the ability to put accents on capitals, but in this day and age there's
no reason for any layout not to do so - although it can be an effort to
relearn a new layout after using a crippled one for years. It would make
sense for all accents to be treated as dead keys, including the acute
accent (aigu) and the cedilla. The manual typewriter I used in typing
class had and on one key, just right of the period, and I thought
that was a poor design.

Gilles R. Detillieux              E-mail: <>
Spinal Cord Research Centre       WWW:
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

This archive was generated by hypermail 2b28 : Tue May 16 2000 - 08:24:07 PDT