RE: [htdig] Accent problem.


Subject: RE: [htdig] Accent problem.
From: NEPOTE Charles (Neuilly Gestion) (charles.nepote@cetelem.fr)
Date: Tue May 16 2000 - 08:31:21 PDT


According to Gilles Detillieux:

> According to "NEPOTE Charles (Neuilly Gestion)":
> > I am trying to solve some problems in ht://Dig 3.1.5.
> >
> > I tested and reproduced the following:
> >
> > If:
> > -- more than one HTML file contains both words "tué" and "tue"
> > (both in the same file);
> > -- or an HTML file contains the word "tue" and the HTML file which
> > refers to it contains the word "tué" (or the reverse case)
> > [example: d0.htm containing "<a href="d1.htm">UN HOMME TUE</a>" and
> > d1.htm containing "tué"]
> >
> > Then a search for "tué" or a search for "tue" will only find the
> > last file indexed which contains both "tué" and "tue".
> >
> > In the file db.wordlist we can see for example :
> > tue i:0 [...]
> > tue i:1 [...]
> > tué i:1 [...]
> > tue i:2 [...]
> > tué i:2 [...]
> >
> > (only the file which corresponds to "i:2" will be found).
> >
> > Can this be solved?
> > (Note I have in htdig.conf :
> > locale: fr_FR
> > )
>
> Yes, the locale seems to be working fine, as accented letters are
> taken as part of the words in the word list. I assume the entries
> from db.wordlist are as you find them after running htmerge.

Yes.

> It's odd, but the sort seems to treat accented and
> unaccented letters as equivalent, and I wonder if
> that isn't throwing off htmerge's creation of the db.words.db
> database.
> Otherwise, it seems all the "tué" entries should come after
> the "tue" entries.

Yes, that's it.
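
For illustration, here is a minimal Python sketch of the two orderings discussed above. It is not ht://Dig code: the helper fold_accents and the (word, document id) tuples are assumptions made for this example only, standing in for the "word i:N" entries of db.wordlist. A plain code-point sort puts every "tue" entry before every "tué" entry, while a sort whose primary key ignores accents interleaves them in the same way as the listing quoted above.

```python
import unicodedata


def fold_accents(word):
    """Return the word with accents removed (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# (word, document id) pairs standing in for the db.wordlist entries.
# "tu\u00e9" is "tué" written with an explicit escape for the accented e.
entries = [("tu\u00e9", 2), ("tue", 0), ("tu\u00e9", 1), ("tue", 2), ("tue", 1)]

# Plain code-point order: all "tue" entries sort before all "tué" entries.
plain_order = sorted(entries)

# Accent-folding order: the accent only breaks ties after the document id,
# so "tue" and "tué" entries end up interleaved.
folded_order = sorted(entries, key=lambda e: (fold_accents(e[0]), e[1], e[0]))

for word, doc in folded_order:
    print(f"{word} i:{doc}")
```

If the word database expects the first ordering but is fed the second, entries for the same word are no longer contiguous, which would be consistent with only the last document being found. This is only a hypothesis about the cause, not a confirmed diagnosis.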

> Either that, or the latter database is corrupted, and so isn't
> working right.
>
> Does the problem persist even after you regenerate the database from
> scratch? (htdig -i; htmerge)

I ran a very careful test: with only 7 documents, always regenerating the
database from scratch to rule out database corruption. To do so I used:
time rundig -v -s -a -c /etc/htdig/htdig.essai.conf | tee /var/lib/htdig/essai1.txt

and I checked the process each time. The rundig script is the original
script (not modified).
I am quite sure the database is not corrupted.
So it should be a sorting problem...

My config :
Pentium Pro 200
Linux Mandrake 7.0 ; automatic install in french.
(As I am a Linux newbie, I don't know which details would help you. One thing
I am quite sure of is that I didn't make many changes to the original config.
In particular, I didn't make any "locale" changes (I don't know how to do
that!...)).

ht://Dig 3.1.5 installed via an RPM made specifically for Mandrake 7.0 by
MandrakeSoft, downloaded from:
ftp://ftp.ciril.fr/pub/linux/mandrake-devel/contrib/RPMS/htdig-3.1.5-2mdk.i586.rpm
(note ftp.ciril.fr is an official mirror for MandrakeSoft).
I did a normal install of the RPM without changing anything but the
htdig.conf file:
 -- I added locale: fr_FR
 -- I modified other attributes which are not related to the locale problem.

> You may also want to try setting your
> LOCALE environment variable to something other than fr_FR
> (e.g. en_US),
> so that the sort will not do any accent folding, if indeed that is
> the problem.

Strange thing : when I put locale: en_US in htdig.essai.conf, the result is
the same !
And accented chars are still in db.wordlist, in the same order as before...
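
One way to check which ordering a word list actually has, independent of any locale setting, is to compare the raw bytes of consecutive words. The sketch below assumes the words are ISO-8859-1 encoded and have already been extracted from the list; the function name first_out_of_order is hypothetical and not part of ht://Dig.

```python
def first_out_of_order(words, encoding="iso-8859-1"):
    """Return the index of the first word that sorts before its
    predecessor in plain byte order, or None if the list is sorted."""
    keys = [w.encode(encoding) for w in words]
    for i in range(1, len(keys)):
        if keys[i] < keys[i - 1]:
            return i
    return None


# Words in the order they appear in the db.wordlist excerpt quoted earlier.
observed = ["tue", "tue", "tu\u00e9", "tue", "tu\u00e9"]

position = first_out_of_order(observed)
print(position)
```

A result other than None means the list is not in plain byte order, whatever the locale attribute in the configuration file says.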

 
> > <cultural parenthesis>
> > In the early days of typewriters (first half of the century),
> > there were no accented uppercase letters such as ÉÈ (the machines
> > were Anglo-Saxon) and so accented lowercase disappeared from common
> > usage: nowadays, many teachers in France teach that "there is never
> > an accent on a lowercase". (In fact there are accented lowercase
> > letters in all newspapers and books printed by professionals, who
> > know the rule that there must be accented lowercase -- there have
> > been accented lowercase letters in France since the beginning of
> > printing).
> > This is a problem, as accents carry meaning:
> > "un homme tué" : means "a man killed"
> > "un homme tue" : means "a man kills".
> > How to understand "UN HOMME TUE" if there is no accented lowercase?
> > </cultural parenthesis>.
>
> I believe you mean uppercase where you say lowercase. Uppercase letters
> are capitals (majuscules), while lowercase letters are small (minuscules).

Ooops, yes ! Sorry.

> Some French teachers in Canada also taught not to put accents on capitals,
> but it didn't really catch on. I never realized that convention came about
> just because of the difficulty of using accents on typewriters.

Today's machines still work against cultural diversity: there is no easy
way to type accented UPPERCASE on our French (and probably even Quebec)
keyboards. (You have to remember Alt+0201 for an É...).

 
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW:
> http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930



This archive was generated by hypermail 2b28 : Tue May 16 2000 - 06:20:44 PDT