Re: [htdig] Two languages and accentuated words


Subject: Re: [htdig] Two languages and accentuated words
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Sep 21 2000 - 09:46:36 PDT


According to Manuel Monteiro:
> > The files you mention are normally in your "common" directory.
> > The db.wordlist file should be in your "db" directory, as defined by
> > the database_dir attribute.
>
> I've checked the db file but the word 'Seminário' is not present. This is
> valid for all accentuated words.
>
> I've tried to use both en_US and en_US.ISO8859-1 without success. I'll try
> to learn how to add another locale setting.
> (After changing anything in the config file I run rundig; must I do anything
> else?)

Just running rundig should be sufficient, if you're running the standard
rundig script that came with the package. The only little snag, when
you're running a setup for a different language, is that rundig contains
hardcoded references to the standard english.0 and synonyms dictionaries.
So, it may not run the "htfuzzy endings" and "htfuzzy synonyms" commands
the first time they're needed (or it may needlessly run them each time -
I'm not sure which behaviour will occur, but it would likely depend on
whether you ran rundig once before customising for another language).
That is fairly easily remedied by running the commands once manually,
by modifying the script to fix the hardcoded references, or both.

So, is htdig splitting all accented words at the letter with an accent?
I.e., for 'seminário', is it making two entries in db.wordlist for 'semin'
and 'rio'? If so, it is treating the accented letter as punctuation
(actually as a control character, which htdig processes just like
punctuation), which is the standard behaviour when your locale is not
set up correctly.
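
To illustrate the splitting described above, here is a minimal sketch in C.
It is not htdig's actual parser; split_words and MAX_WORD_LEN are
hypothetical names made up for this example. It only assumes that a word
is broken wherever the C library's isalnum() rejects a byte, which is what
happens to ISO-8859-1 accented letters while LC_CTYPE is still the default
"C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_WORD_LEN 64  /* hypothetical limit, for this sketch only */

/*
 * Hypothetical helper (not part of htdig): split `text` into words,
 * breaking at every byte that isalnum() rejects under the current
 * LC_CTYPE.  Stores at most max_words words in `out` (over-long words
 * are truncated) and returns the number of words stored.
 */
size_t split_words(const char *text, char out[][MAX_WORD_LEN],
                   size_t max_words)
{
    size_t count = 0;  /* words stored so far */
    size_t len = 0;    /* length of the word being built */

    assert(text != NULL && out != NULL);

    for (const char *p = text; ; p++) {
        /* Cast to unsigned char: passing a negative char to the
         * <ctype.h> functions is undefined behaviour. */
        unsigned char c = (unsigned char)*p;

        if (c != '\0' && isalnum(c)) {
            if (count < max_words && len < MAX_WORD_LEN - 1)
                out[count][len++] = (char)c;
        } else {
            if (len > 0) {          /* a word just ended */
                out[count][len] = '\0';
                count++;
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return count;
}
```

Called on the Latin-1 bytes "semin\xe1rio" without any prior setlocale()
call, this stores two words, 'semin' and 'rio', because byte 0xE1 ('á') is
not alphanumeric in the "C" locale. With a working Latin-1 locale selected
first, the same input should come back as a single word.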

I find it unusual, but not unbelievable, that the en_US locale doesn't
handle accents properly. On most systems, and certainly on glibc-based
Linux systems, most western-European locales and the en_US locale all
use the same LC_CTYPE map, which recognises all ISO-8859-1 (Latin 1)
accented characters as letters. However, there are some systems that
impose a stricter and more language-specific ctype map. I forget
which system it was, but there was one in which the fr_FR locale
recognised only accented letters that are actually used in the French
language, so that a letter like 'á' was not recognised as a letter in
that locale.
On some systems, such as most libc5-based Linux systems, locale support
seems to be hopelessly broken, so no locale will give proper support
for accents. So, as the saying goes, your mileage may vary.
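
One way to find out what a given locale does on a particular system is to
ask the C library directly. The following is a rough sketch, not something
shipped with htdig; byte_is_letter_in_locale is a hypothetical name. It
uses only the standard setlocale() and isalpha() calls, and the answer it
gives depends entirely on the locale data installed on the machine.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper (not part of htdig): report whether `byte` is
 * classified as a letter under the LC_CTYPE of `locale_name`.
 * Returns 1 if it is a letter, 0 if it is not, and -1 if the locale
 * cannot be selected on this system.  The previous LC_CTYPE setting
 * is restored before returning.
 */
int byte_is_letter_in_locale(const char *locale_name, unsigned char byte)
{
    char saved[256];
    const char *current;
    int result;

    assert(locale_name != NULL);

    /* Copy the current setting: the string returned by setlocale()
     * may be overwritten by a later call. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;  /* locale not available here */

    result = isalpha(byte) ? 1 : 0;

    setlocale(LC_CTYPE, saved);
    return result;
}
```

For example, byte_is_letter_in_locale("en_US.ISO8859-1", 0xE1) asks whether
'á' counts as a letter in that locale. A result of 0 or -1 would suggest
that the locale is of no help for accented words on that system; the exact
spelling of locale names also varies from one system to another.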

I don't believe you ever mentioned, in any of your e-mails, which OS
you're running, or which version and distribution. If you do, maybe
someone on the list with a similar system can shed some light as to how
to get locale support working, if indeed that is possible on your system.
(I've given up on locale support on my old Red Hat 4.2 system, which
uses a broken libc5 C library.) Also, if you're running htdig from
an RPM distribution, and you installed the wrong build, that may cause
locale problems even if locales otherwise work on your system.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



