Subject: Re: Re: Re: [htdig3-dev] Characters like 'à'
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Tue Dec 14 1999 - 12:50:36 PST
According to Gabriele Bartolini:
> >libc5 :
> > ftp://ftp.lip6.fr/pub/linux/GCC/WG15-collection.linux.tar.gz
> OK. I installed the locale specifications for Italy by using the localedef
> program. They are now in the directory /usr/share/locale/it_IT and these
> are the files:
> LC_COLLATE LC_CTYPE LC_MESSAGES LC_MONETARY LC_NUMERIC LC_TIME
> BUT I STILL KEEP ON RECEIVING THE MESSAGE: "Unknown locale !!!". :-||
> I think it's a problem of the setlocale function, which returns NULL.
> Why can this call return a NULL value? Doesn't it find the locale files? Or
> are they incomplete?
What C library are you using? Your message above implies libc5
on Linux. I'm using Red Hat Linux 4.2 on my web server, which comes
with libc 5.3.12. I've tried all sorts of things, and I've come to the
conclusion that locale support in this C library is hopelessly broken.
I could not get it to work despite all my attempts.
On the other hand, with Red Hat Linux 5.2, which uses glibc, locales
seem to work without any difficulties at all.
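One quick way to narrow this down is to take ht://Dig out of the picture and ask the C library directly. The following is only a minimal sketch: try_locale is a made-up helper name, and it relies on nothing beyond the standard setlocale() call, which returns NULL when the requested locale cannot be honored (for example, when the C library cannot find or read the locale's data files).

```cpp
#include <clocale>
#include <cstdio>

// Hypothetical helper (not part of ht://Dig): try to select a locale by name
// and report whether the C library accepted it.
bool try_locale(const char *name)
{
    // setlocale() returns NULL if the request cannot be honored.
    const char *result = std::setlocale(LC_ALL, name);
    if (result == NULL) {
        std::fprintf(stderr, "setlocale(LC_ALL, \"%s\") failed\n", name);
        return false;
    }
    // On success it returns a string naming the locale now in effect.
    std::fprintf(stderr, "setlocale(LC_ALL, \"%s\") -> %s\n", name, result);
    return true;
}
```

Calling try_locale("it_IT") from a small test program on the affected machine would show whether the failure is in the C library itself rather than in ht://Dig. The "C" locale should always be accepted, so it makes a useful control.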
I've thought of how ht://Dig could be fixed to work with broken locales.
The extra_word_characters attribute is a good first step. If you add
all the accented characters to this, they'll get indexed. The problem
is ht://Dig won't know how to convert them from uppercase to lowercase,
or vice versa. I've thought of adding extra_word_casemap as a means
of specifying these mappings. In this way, the HtWordType functions
would supplement all the ctype stuff, in a way that's user configurable.
It's a shame that we'd need to resort to this, because this is exactly
what the locale stuff is supposed to do for us, but with so many broken
locales out there, I think there's a need for this.
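To make the extra_word_casemap idea a bit more concrete, here is a rough sketch of what such a table could look like. Everything in it is an assumption: the attribute does not exist yet, the CaseMap class is hypothetical, and the format (a string of uppercase/lowercase byte pairs in a single-byte character set such as ISO-8859-1) is just one possibility.

```cpp
#include <string>

// Hypothetical class: a byte-oriented case table for single-byte character
// sets. ASCII letters are mapped by default (an ASCII-compatible charset is
// assumed); the user-supplied string adds pairs of bytes, each pair being
// an uppercase byte followed by its lowercase counterpart.
class CaseMap {
public:
    explicit CaseMap(const std::string &pairs)
    {
        // Start with the identity mapping for every byte value.
        for (int c = 0; c < 256; ++c) {
            lower_[c] = static_cast<unsigned char>(c);
            upper_[c] = static_cast<unsigned char>(c);
        }
        // Plain ASCII letters.
        for (int c = 'A'; c <= 'Z'; ++c) {
            lower_[c] = static_cast<unsigned char>(c - 'A' + 'a');
            upper_[c - 'A' + 'a'] = static_cast<unsigned char>(c);
        }
        // User-configured pairs override or extend the defaults.
        for (std::string::size_type i = 0; i + 1 < pairs.size(); i += 2) {
            unsigned char up = static_cast<unsigned char>(pairs[i]);
            unsigned char lo = static_cast<unsigned char>(pairs[i + 1]);
            lower_[up] = lo;
            upper_[lo] = up;
        }
    }

    unsigned char to_lower(unsigned char c) const { return lower_[c]; }
    unsigned char to_upper(unsigned char c) const { return upper_[c]; }

private:
    unsigned char lower_[256];
    unsigned char upper_[256];
};
```

With ISO-8859-1 data, CaseMap("\xC0\xE0\xC9\xE9") would teach the table that 'À' lowers to 'à' and 'É' to 'é', without relying on the C library's locale support at all.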
As for mapping accented to unaccented letters, as Geoff said, this was
discussed at some length about a week or so ago. My suggestion
was to implement it something like soundex, where it will go through the
word database after htdig/htmerge, and create another database keyed on
the canonical (unaccented) form of all of these words. This algorithm
could be configured either through a file, or perhaps better still,
a config attribute (which could be taken from a file if desired) such
as accent_map. This map would allow you to specify precisely how to map
various accented letters or digraphs to certain canonical representations.
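As a rough illustration of the soundex-like approach, the sketch below reduces each word to a canonical form and builds a second table keyed on that form. It is not existing ht://Dig code: AccentMap, AccentIndex, canonical_form and build_accent_index are all hypothetical names, an in-memory std::map stands in for the real database, and the map is assumed to go from a single ISO-8859-1 byte to a replacement string. Multi-character source sequences would need a smarter matcher.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical map type: one accented byte to its canonical replacement,
// which may be longer than one letter.
typedef std::map<unsigned char, std::string> AccentMap;

// Hypothetical index type: canonical form -> all indexed words sharing it.
typedef std::map<std::string, std::vector<std::string> > AccentIndex;

// Reduce a word to its canonical form by replacing every mapped byte.
std::string canonical_form(const std::string &word, const AccentMap &accents)
{
    std::string out;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(word[i]);
        AccentMap::const_iterator it = accents.find(c);
        if (it != accents.end())
            out += it->second;   // mapped: append the canonical replacement
        else
            out += word[i];      // unmapped: keep the byte as it is
    }
    return out;
}

// Walk the word list (standing in for the word database) and group the
// original words under their canonical forms.
AccentIndex build_accent_index(const std::vector<std::string> &words,
                               const AccentMap &accents)
{
    AccentIndex index;
    for (std::vector<std::string>::size_type i = 0; i < words.size(); ++i)
        index[canonical_form(words[i], accents)].push_back(words[i]);
    return index;
}
```

At search time, the query word would be run through the same canonical_form() and looked up in this table, and the search would then be expanded to all the original words found there. The main word database is left untouched.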
The patch just posted puts the accent stripping in SGMLEntities.cc,
which is altogether the wrong place for this, for at least 3 reasons:
1) this module no longer exists in 3.2, 2) accents may not always be
specified using SGML entities, 3) for the words it does affect, accents
will also be stripped out of the excerpts that htsearch displays, which is
not ideal. Implementing accent mapping as a fuzzy match method overcomes
all of these problems, plus it's selectable by the search configuration.
It's also a "good fit", as this is exactly the sort of thing that fuzzy
matching is all about - you're specifying exactly in which way the search
should be less "exact", in this case by blurring the distinction between
accented and unaccented letters. It just doesn't make sense to implement
it any other way.
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930