Subject: Re: [htdig] RE: [Cooker] SORT and locale
From: Pablo Saratxaga
Date: Thu May 18 2000 - 03:57:17 PDT


On Wed, May 17, 2000 at 04:31:32PM -0500, Gilles Detillieux wrote:

> It so happens that htmerge expects the sort program to give a straight
> binary data sort, and on most systems that is indeed the case.

Ah... I thought that the words had to be sorted according to the user's
locale. If they only have to be sorted by their code value, that's easy:
set LC_ALL=C and you are done!

OTOH, if you need to sort words, it makes more sense to use the proper
locale, and the -k switch if needed.

> > > > This result make ht://Dig working bad with some documents...
> > >
> > > Because your logic is wrong. If you intend to sort on only the first
> > > field and not on the i:number thing; you must tell it; with the -k
> > > parameter to the command 'sort'.

> No, the logic is not wrong. The intention is to sort on the first field
> as the primary sort key, to get identical words together, and to sort
> on the ID field as the secondary sort key.

Then the logic *is* wrong :)
Sorting the whole line as one atomic unit is not the same as sorting first
on the 1st field, then on the 2nd field to differentiate those lines whose
first field is the same.

'sort file' will sort whole *lines*, regardless of fields.
'sort -k1,1 -k2,2 file' will sort according to what you tell it: 1st field
        as primary key, then 2nd field as secondary key.

The result may be the same when you treat values as arbitrary codes, but
it is not the same (as this thread proves) when the things to be sorted
are words of a language (which seemed to be the intent of the post that
started it all).
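
To make the distinction concrete, here is a small sketch. The sample records
are made up for illustration (a word, a space, then an "i:number" field); they
are not taken from a real ht://Dig word list. In the C locale both commands
give the same order on this data. Under a language locale the whole-line sort
may order them differently, depending on how that locale's collation rules
weigh the separator and punctuation.

```shell
# Hypothetical sample records: a word, a space, then an "i:<number>" ID field.
records='apples i:1
apple i:2
Apple i:3
apple i:1'

# Whole-line sort: each line is compared as one string.
whole_line=$(printf '%s\n' "$records" | LC_ALL=C sort)

# Keyed sort: field 1 is the primary key, field 2 the secondary key.
keyed=$(printf '%s\n' "$records" | LC_ALL=C sort -k1,1 -k2,2)

printf 'whole-line:\n%s\n\nkeyed:\n%s\n' "$whole_line" "$keyed"
```

Dropping LC_ALL=C from both commands lets you compare the two forms under
your own locale.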

> The ID field appears after
> the word in the record so that it carries less weight in the sort.
> Most Unix sort programs give a straight ASCII sort by default, so
> htmerge's logic is correct given a consistent sort program.

No, the logic is correct given an assumption that the things sorted are
not words of a given language.

> The logic
> breaks down only because some sort programs have modified the default
> behaviour to introduce folding, assuming this is always the desired
> behaviour. It's that assumption that is the incorrect logic.

Well, it makes sense to assume that by default people want to sort words;
that is surely the most common usage.
Anyway, if you don't like it, complain to POSIX, not to us.

The real fix would be to set LC_ALL=C for the call to sort.
But you should first check whether some people want the option of sorting
by words (it seems that is indeed the case). In that case the locale must
remain untouched, and the sort command should be checked to see whether it
accepts the POSIX -k parameter; if it does, use
-k1,1 -k2,2
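
A rough sketch of that idea as a shell function. The name sort_wordlist and
its "bytes"/"words" modes are invented here for illustration; nothing like
this exists in ht://Dig as far as this thread shows, and the -k probe is only
one possible way to do the check.

```shell
# sort_wordlist is a hypothetical helper, not part of ht://Dig.
# Mode "bytes": force the C locale so lines compare by code value.
# Mode "words": keep the user's locale and sort on explicit keys,
#               after checking that this sort accepts POSIX -k.
sort_wordlist() {
    mode=$1
    if [ "$mode" = "bytes" ]; then
        LC_ALL=C sort
    elif printf 'a 1\n' | sort -k1,1 -k2,2 >/dev/null 2>&1; then
        sort -k1,1 -k2,2
    else
        echo "sort does not accept -k; cannot sort on fields" >&2
        return 1
    fi
}

# Example: byte-wise sort of three made-up records read from stdin.
printf 'b i:2\nB i:1\na i:3\n' | sort_wordlist bytes
```

The "bytes" branch sets LC_ALL only for that one sort invocation, so the
rest of the calling program keeps the user's locale.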

> > > > I tried to put :
> > > > LC_ALL=en
> > >
> > > Won't change anything; 'en' locale uses the same LC_COLLATE data.
> I had incorrectly assumed that the en or en_US locale would use different
> collating data.

It is the 'C' locale that does it.
'en' is a human-language locale for English and uses normal English
sorting rules, while the 'C' locale is intended for computer sorting by
code value, apparently without any human-language meaning.
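
A quick way to see what the 'C' locale does, using a made-up word list:
sorting goes by code value, so every uppercase ASCII letter comes before
every lowercase one. An English locale, if one is installed on your system,
would normally put "apple" before "Banana" instead.

```shell
# Hypothetical word list mixing upper and lower case.
words='cherry
apple
Banana'

# In the C locale, 'B' (0x42) sorts before 'a' (0x61).
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```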

> I've since been able to reproduce the problem on Red Hat
> Linux. Red Hat 6.0 didn't have this problem, but 6.1 does.

Because the version of sort in RH 6.0 lacked locale support
(which made it almost useless for languages like Russian or Greek, btw).

> It seems that
> the textutils-2.0 introduced the new, locale-aware sort program. All
> debates about design flaws aside, the workaround to this problem is to run
> sort with LC_ALL set to "C", to get the correct behaviour from the program.
> So, if you do an "export LC_ALL=C" before running htmerge, it should build
> the database correctly.

> Fortunately, version 3.2 of ht://Dig does away with external sort program,
> and so it should avoid such problems.

Ki ça vos våye bén,
Pablo Saratxaga PGP Key available, key ID: 0x8F0E4975

