Re: [htdig] Two languages and accentuated words


Subject: Re: [htdig] Two languages and accentuated words
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Sep 22 2000 - 10:02:04 PDT


According to Manuel Monteiro:
> Dear Gilles Detillieux,
>
> Thanks for your help.
>
> First let me say that I'm running Tru64 Unix 4.0f on Compaq Alpha computers
> and had to compile ht://dig from source.
>
> My problem was indeed the locale setting. After installing one subset from the
> OS (Single Byte European Subset) I could set LANG to pt_PT.ISO8859-1, and after
> this ht://dig worked without changing anything else.

Hey, that's good news. Come to think of it, it may well have been
a Compaq or Digital system that had been reported previously to have
fairly restrictive ctype maps in some locales. At least it's a fairly
easy fix to install the right locale for your needs.
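
As a rough illustration of what a restrictive ctype map does at indexing
time, here is a minimal Python sketch. It is not htdig's code: the two
letter sets are hypothetical stand-ins for an ASCII-only LC_CTYPE map and
a Latin-1 aware one, and split_words is an invented helper that breaks
a string at every character the map does not classify as a letter.

```python
import string

# Hypothetical stand-ins for two LC_CTYPE maps: one that only knows ASCII
# letters, and one that also accepts ISO-8859-1 (Latin-1) accented letters.
ASCII_LETTERS = set(string.ascii_letters)
LATIN1_LETTERS = ASCII_LETTERS | {
    chr(c) for c in range(0xC0, 0x100) if c not in (0xD7, 0xF7)  # skip × and ÷
}


def split_words(text, letters):
    """Split text into lowercase words, breaking at any non-letter character."""
    words, current = [], []
    for ch in text:
        if ch in letters:
            current.append(ch.lower())
        elif current:
            # A character outside the map ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


word = "Semin\u00e1rio"  # 'Seminário' with a precomposed 'á'
print(split_words(word, ASCII_LETTERS))   # ['semin', 'rio']
print(split_words(word, LATIN1_LETTERS))  # ['seminário']
```

With the ASCII-only set the accented letter acts as a separator, which
matches the 'semin' / 'rio' symptom; with the Latin-1 set the word
survives intact.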

> Now I'll try to handle the possibility of a user writing a word with or
> without an accent. For instance, seminário
> and seminario would be the same word. I'll also check the things you said
> about rundig.

To treat accented and unaccented letters as equivalent, you should install
Robert Marchand's patch that adds an "accents" fuzzy match algorithm, and
then build your accents database with a "htfuzzy accents" command every
time you reindex or update your word database (this too can be added to
the rundig script if you want). You'll also need to add the "accents"
algorithm to your search_algorithm attribute. The patch is available at

    ftp://ftp.ccsf.org/htdig-patches/3.1.5/accents.5

and can be installed, in your main 3.1.5 source directory, using the
command "patch -p1 < accents.5", followed by ./configure and make.
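
To give an idea of what accent-insensitive matching involves, here is
a minimal Python sketch. It is only an illustration under assumptions,
not the code in Robert Marchand's patch: strip_accents and
build_accents_map are hypothetical helpers, and the assumption is simply
that an accents table maps each unaccented form to the indexed words
that reduce to it.

```python
import unicodedata


def strip_accents(word):
    """Return word with combining accent marks removed (e.g. 'á' -> 'a')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_accents_map(words):
    """Map each unaccented form to the set of indexed words that reduce to it."""
    table = {}
    for w in words:
        table.setdefault(strip_accents(w), set()).add(w)
    return table


# Hypothetical word list standing in for the contents of a word database.
indexed = ["semin\u00e1rio", "seminario", "rio"]
table = build_accents_map(indexed)

# A query for either spelling is reduced to the same key before lookup.
print(sorted(table[strip_accents("seminario")]))
print(sorted(table[strip_accents("semin\u00e1rio")]))
```

Both lookups return the same pair of words. Because the table is derived
from the word list, it has to be rebuilt whenever the word list changes,
which is why the fuzzy database is regenerated after each reindex.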

> Once again thank you.
>
> Best regards,
>
> Manuel
>
>
>
> ----- Original Message -----
> From: "Gilles Detillieux" <grdetil@scrc.umanitoba.ca>
> To: "Manuel Monteiro" <nelo@astro.up.pt>
> Cc: <htdig@htdig.org>
> Sent: Thursday, September 21, 2000 5:46 PM
> Subject: Re: [htdig] Two languages and accentuated words
>
>
> > According to Manuel Monteiro:
> > > > The files you mention are normally in your "common" directory.
> > > > The db.wordlist file should be in your "db" directory, as defined by
> > > > the database_dir attribute.
> > >
> > > I've checked the db file but the word 'Seminário' is not present. This is
> > > valid for all accentuated words.
> > >
> > > I've tried to use both en_US and en_US.ISO8859-1 without success. I'll try
> > > to learn how to add another locale setting.
> > > (After changing anything in the config file I run rundig; must I do anything
> > > else?)
> >
> > Just running rundig should be sufficient, if you're running the standard
> > rundig script that came with the package. The only little snag, when
> > you're running a setup for a different language, is that rundig contains
> > hardcoded references to the standard english.0 and synonyms dictionaries.
> > So, it may not run the "htfuzzy endings" and "htfuzzy synonyms" commands
> > the first time they're needed (or it may needlessly run them each time -
> > I'm not sure which behaviour will occur, but it would likely depend on
> > whether you ran rundig once before customising for another language).
> > That is fairly easily remedied by running the commands once manually,
> > by modifying the script to correct this problem, or both.
> >
> > So, is htdig splitting all accented words at the letter with an accent?
> > I.e., for 'seminário', is it making two entries in db.wordlist for 'semin'
> > and 'rio'? If so, it is treating the accented letter as punctuation
> > (actually as a control character, which htdig processes just like
> > punctuation), which is the standard behaviour when your locale is not
> > set up correctly.
> >
> > I find it unusual, but not unbelievable, that the en_US locale doesn't
> > handle accents properly. On most systems, and certainly on glibc-based
> > Linux systems, most western-European locales and the en_US locale all
> > use the same LC_CTYPE map, which recognises all ISO-8859-1 (Latin 1)
> > accented characters as letters. However, there are some systems that
> > impose a stricter and more language-specific ctype map. I forget
> > which system it was, but there was one in which the fr_FR locale recognised
> > only accented letters that are actually used in the French language,
> > so that a letter like 'á' was not recognised as a letter in that locale.
> > On some systems, such as most libc5-based Linux systems, locale support
> > seems to be hopelessly broken, so no locale will give proper support
> > for accents. So, as the saying goes, your mileage may vary.
> >
> > I don't believe you ever mentioned, in any of your e-mails, which OS
> > you're running, or which version and distribution. If you do, maybe
> > someone on the list with a similar system can shed some light as to how
> > to get locale support working, if indeed that is possible on your system.
> > (I've given up on locale support on my old Red Hat 4.2 system, which
> > uses a broken libc5 C library.) Also, if you're running htdig from
> > an RPM distribution, and you installed the wrong build, that may cause
> > locale problems even if they work on your system.
> >
> > --
> > Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
> > Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
> > Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> > Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
> >
>

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Fri Sep 22 2000 - 10:05:48 PDT