Re: SV: [htdig] Foreign chars (Swedish)


Subject: Re: SV: [htdig] Foreign chars (Swedish)
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Nov 26 1999 - 13:33:56 PST


According to Philippe Ramkvist-Henry:
> On Thu, 25 Nov 1999, Gilles Detillieux wrote:
> > OK, so the word Ättestupan appears in there as ättestupan, correct?
> > Very strange. So searches for words containing Ä will find words with
> > ä in its place, as expected, but searches for words containing ä will
> > match neither ä nor Ä, is that right? I'm at a bit of a loss to explain
> > it, but at some point it would seem that htsearch is mangling the lower
> > case ä. Do you have any documents containing a lower case ä somewhere
> > in a word, and if so, does that word make it into db.wordlist correctly?
>
> All correct and the words make it into the db.wordlist correctly.
> Example:
>
> anlände i:269 l:150 w:1652 c:2 a:4
> anlände i:475 l:285 w:715
> anlände i:581 l:295 w:705 a:1
> anlände i:586 l:394 w:606
> anländer i:146 l:466 w:534
> anländer i:282 l:466 w:534
>
> and
>
> äter i:576 l:606 w:394 a:14
> ätit i:531 l:603 w:397
> ätit i:586 l:636 w:364
> ättestupan i:109 l:558 w:442
> ättestupan i:126 l:465 w:535

That all looks the way it should, as far as I'm concerned. I guess we
need to focus on htsearch, as it appears to be the culprit. (Either that
or htmerge.) Could you try running htsearch from the command line,
and searching first for ANLÄNDE, and then for anlände? I'd like to see
what it finds in both cases.

> > I still suspect a problem with ctype for your locale. Could you compile
> > and run the following C program on your system, and send me the output?
> > (Run it with the name of your locale, "sv", as an argument.)
>
> Ok, here you go:
>
> su10-6 <6> cc test.c
> su10-6 <7> a.out sv
...
> 224 0xE0: à -al-n--gt---?
> 225 0xE1: á -al-n--gt---?
> 226 0xE2: â -al-n--gt---?
> 227 0xE3: ã -al-n--gt---?
> 228 0xE4: ä -al-n--gt---?
> 229 0xE5: å -al-n--gt---?
...

OK, your ctype info for the sv locale looks fine. Again, I suspect
htsearch, or possibly a corrupt database. If we can't nail down something
specific in htsearch, or if it's not too difficult to reindex everything
from scratch, I'd suggest you do just that.

> > Does using a locale of sv_SE (or even something entirely different
> > like fr or fr_FR) make any difference in your results?
>
> I can't set locale to sv_SE in the htdig.conf file because I get "unknown
> locale". The available (Swedish) locales are:
>
> sv
> sv.ISO8859-15
> sv.ISO8859-15@euro
> sv.UTF-8
> sv.UTF-8@euro

OK, sv is the one you want.

> > Do your documents use ISO 8859-1 (Latin 1) encoding, or are there some
> > that use a 7-bit encoding for Sweden?
>
> Eh, I would guess that all use Latin-1, most indexed documents (99%) are
> plain HTML files.

Yes, and your db.wordlist looks fine (at least what you showed me),
so it should work, as long as you're also feeding Latin 1 characters
into htsearch. If you are, then it's a bug or a corrupt database (the
index, not the word list).

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


