Re: [htdig] RE: [Cooker] SORT and locale


Subject: Re: [htdig] RE: [Cooker] SORT and locale
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed May 17 2000 - 14:31:53 PDT


According to "NEPOTE Charles (Neuilly Gestion)":
> > On Wed, May 17, 2000 at 01:45:12PM +0200, NEPOTE Charles
> > (Neuilly Gestion) wrote:
> >
> > > With Mandrake 7.0 or a cooker installed in french.
> > >
> > > when I :
> > >
> > > "sort db.wordlist"
> >
> > > I obtain :
> >
> > > tue i:1
> > > tué i:1
> > > tue i:2
> > > tué i:2
> >
> > > Shouldn't it be :
> >
> > > tue i:1
> > > tue i:2
> > > tué i:1
> > > tué i:2
> >
> > No.
> > The whole line is used for sorting; and "1" < "2". while "e"
> > is equal to "é"
> > ("é" becomes > "e" only when all other things are equal).

That's only true when sorting within the collating rules of a specific
language, and the rules vary from language to language. What's more, I
think it's wrong for sort to make the assumption that what you're sorting
is always text in the language of your current locale - what if you're
sorting strings of binary data? Sort has a -d option to sort in phone
directory (or dictionary) order, and a -f option to fold lower case into
the equivalent uppercase, so it seems to me that folding of accented to
unaccented characters should similarly be optional. In any case, I'm
not looking to redesign the sort command to correct design flaws, or to
debate what new features (which affect results) ought to be optional.
I'm just looking for a workaround for this specific problem.

It so happens that htmerge expects the sort program to give a straight
binary data sort, and on most systems that is indeed the case.

> > To get the result you want, you must tell sort to use only the
> > first column:
> >
> > sort -k1 db.wordlist
> >
> > which gives:
> >
> > tue i:1
> > tue i:2
> > tué i:1
> > tué i:2

I don't know if this would be sufficient for htmerge and htsearch, because
the ID order may matter too! I can confirm that htmerge/words.cc does require
like words to be adjacent in the file. I haven't dug deep enough into the
code to see how important the ID order is, but I'm inclined to think it will
matter.

> > > Because an accent in French *carries meaning*: "tué" is very
> > > different from "tue".
> >
> > But, for sorting order, "é" is the same as "e" when major differences
> > exist (which is the case here, "2" > "1").
> > Consider "tuéa" and "tueb" (I know, those words don't exist, it is just
> > an example). You expect "tuéa" to be before "tueb", right?
> > It is the same thing here.

Again, that depends on the language, or the nature of the data you're sorting.
Words are not the only data you'd ever want to sort, and even when you're
sorting words, it may matter whether such folding takes place, depending on
what the sorted data is being fed into. If the goal is to group identical
words, then folding case or ignoring accents gets in the way of that goal.
Also, one should not have to sacrifice sorting secondary fields in order to
get the desired ordering of the primary field.

> > > This result makes ht://Dig work badly with some documents...
> >
> > Because your logic is wrong. If you intend to sort on only the first
> > field and not on the i:number thing, you must say so with the -k
> > parameter to the command 'sort'.

No, the logic is not wrong. The intention is to sort on the first field
as the primary sort key, to get identical words together, and to sort
on the ID field as the secondary sort key. The ID field appears after
the word in the record so that it carries less weight in the sort.
Most Unix sort programs give a straight ASCII sort by default, so
htmerge's logic is correct given a consistent sort program. The logic
breaks down only because some sort programs have modified the default
behaviour to introduce folding, assuming this is always the desired
behaviour. It's that assumption that is the incorrect logic.
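
As a rough illustration of that intent, the sketch below sorts a few
records from this thread with the word as the primary key and the rest of
the line as the secondary key, forcing byte order with LC_ALL=C. The helper
name sort_by_word_then_id and the sample records are made up for the
example; this is not how htmerge itself invokes sort, and the IDs are
compared as bytes here, not as numbers.

```shell
#!/bin/sh
# sort_by_word_then_id is a hypothetical helper, not part of ht://Dig.
sort_by_word_then_id() {
    # LC_ALL=C forces plain byte-order comparison for this one command.
    # -t ' ' splits fields on a single space; -k1,1 is the word alone,
    # -k2 is everything from the second field to the end of the line.
    LC_ALL=C sort -t ' ' -k1,1 -k2
}

# Sample records in the "word id" layout quoted in this thread.
printf '%s\n' 'tué i:2' 'tue i:2' 'tué i:1' 'tue i:1' | sort_by_word_then_id
# Prints:
#   tue i:1
#   tue i:2
#   tué i:1
#   tué i:2
```

Identical words end up adjacent, and within each word the IDs are in
order, which is the grouping described above.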

> > > I tried to put :
> > > LC_ALL=en
> >
> > Won't change anything; 'en' locale uses the same LC_COLLATE data.
>
> That's what Gilles Detillieux from the ht://Dig group told me to try.
> I'm sending a copy to him.

I had incorrectly assumed that the en or en_US locale would use different
collating data. I've since been able to reproduce the problem on Red Hat
Linux. Red Hat 6.0 didn't have this problem, but 6.1 does. It seems that
textutils-2.0 introduced the new, locale-aware sort program. All
debates about design flaws aside, the workaround to this problem is to run
sort with LC_ALL set to "C", to get the correct behaviour from the program.
So, if you do an "export LC_ALL=C" before running htmerge, it should build
the database correctly.
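
One quick way to check whether a given system is affected, sketched on the
assumption that the word list holds accented words like the ones quoted
above: sort a two-line probe and see which line comes first. The function
name sort_is_bytewise is hypothetical, and the htmerge call is left as a
comment because its options depend on your setup.

```shell
#!/bin/sh
# sort_is_bytewise is a hypothetical helper name used for illustration.
# It succeeds when sort, in the current environment, compares raw bytes.
sort_is_bytewise() {
    # In byte order "tue i:2" sorts first, because "e" is a lower byte
    # value than any byte of "é". A sort that folds accents compares
    # "1" with "2" instead and puts "tué i:1" first.
    probe=$(printf '%s\n' 'tué i:1' 'tue i:2' | sort | head -n 1)
    [ "$probe" = 'tue i:2' ]
}

if sort_is_bytewise; then
    echo "sort already compares bytes in this environment"
else
    echo "sort is locale-aware here; setting LC_ALL=C"
    LC_ALL=C
    export LC_ALL
fi

# htmerge would be run at this point, with your usual options.
```

Setting LC_ALL only in the shell that runs htmerge keeps the rest of the
session in its normal locale.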

Fortunately, version 3.2 of ht://Dig does away with the external sort
program, and so it should avoid such problems.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
