Re: SV: [htdig] Foreign chars (Swedish)


Subject: Re: SV: [htdig] Foreign chars (Swedish)
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Nov 25 1999 - 13:27:05 PST


According to Philippe Ramkvist-Henry:
> > Are the hits all capitalized, or do some of them have the lowercase ä?
> > Does this problem happen consistently with certain accented letters, and
> > not others? Do you have certain uppercase letters appearing in db.wordlist?
>
> With hits you mean the actual words from the document I guess. Well only those
> which are supposed to be capitalized are. For example: A search for "ättestupan"
> renders 0 hits while a search for "Ättestupan" renders 18. The word is in the documents
> always written as "Ättestupan" so this would be natural if the search was case sensitive.
> The problem is that "Åsa" and "åsa" gives the exact same hits and it's also always
> referred to as "Åsa". The problem only exists (as far as I can test) for "äÄ".
>
> The db.wordlist only contain lowercase letters.

OK, so the word Ättestupan appears in there as ättestupan, correct?
Very strange. So searches for words containing Ä will find words with
ä in its place, as expected, but searches for words containing ä will
match neither ä nor Ä, is that right? I'm at a bit of a loss to explain
it, but at some point it would seem that htsearch is mangling the lower
case ä. Do you have any documents containing a lower case ä somewhere
in a word, and if so, does that word make it into db.wordlist correctly?

I still suspect a problem with ctype for your locale. Could you compile
and run the following C program on your system, and send me the output?
(Run it with the name of your locale, "sv", as an argument.)

Does using a locale of sv_SE (or even something else entirely like fr or
fr_FR) make any difference in your results? And for the long-shot question,
do your documents all use ISO 8859-1 (Latin 1) encoding, or are there some
that use a 7-bit encoding for Swedish?

-----------------------
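#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Print the ctype classification of every 8-bit character code
 * under the locale named on the command line. */
int main(int ac, char **av)
{
        int i;
        unsigned char c;

        /* Without an argument, the default "C" locale stays in effect. */
        if (ac > 1 && setlocale(LC_ALL, av[1]) == NULL)
                fprintf(stderr, "setlocale failed for \"%s\"\n", av[1]);

        for (i = 0; i < 256; ++i) {
                printf("%3d 0x%02X: ", i, i);
                c = i;
                /* Show the character itself, or ^X / ~X notation for
                 * control codes and their high-bit counterparts. */
                if (isprint(c))
                        printf(" %c", c);
                else if (c < 0x80 && isprint(c ^ '@'))
                        printf("^%c", c ^ '@');
                else if (isprint((c & 0x7F) ^ '@'))
                        printf("~%c", (c & 0x7F) ^ '@');
                else
                        printf("  ");   /* two spaces keep the columns aligned */
                /* One flag per ctype test; '-' means the test failed. */
                printf(" %c%c%c%c%c%c%c%c%c%c%c%c%c\n",
                        isascii(c) ? 'A' : '-',
                        isalpha(c) ? 'a' : '-',
                        islower(c) ? 'l' : '-',
                        isupper(c) ? 'u' : '-',
                        isalnum(c) ? 'n' : '-',
                        isdigit(c) ? 'd' : '-',
                        isxdigit(c) ? 'x' : '-',
                        isgraph(c) ? 'g' : '-',
                        isprint(c) ? 't' : '-',
                        ispunct(c) ? 'p' : '-',
                        iscntrl(c) ? 'c' : '-',
                        isspace(c) ? 's' : '-',
#ifdef isblank
                        isblank(c) ? 'b' : '-'
#else
                        '?'
#endif
                        );
        }
        return 0;
}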
-----------------------
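
If the full table is more than you want to wade through, a narrower check
might look something like the sketch below. It only tests whether the three
Swedish letter pairs fold to each other under a given locale, assuming the
ISO 8859-1 byte values (0xC4/0xE4 for Ä/ä, and so on). The function names
case_pair_ok and check_swedish_pairs are made up for illustration; they are
not part of ht://Dig, and I'm not claiming htsearch does its case folding
exactly this way.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Returns 1 if, in the current locale, `upper` is classified as an
 * uppercase letter, `lower` as a lowercase letter, and each one
 * folds to the other; returns 0 otherwise. */
int case_pair_ok(unsigned char upper, unsigned char lower)
{
        return isupper(upper) && islower(lower)
                && tolower(upper) == lower
                && toupper(lower) == upper;
}

/* Hypothetical helper: switches LC_CTYPE to the named locale and
 * reports on the three Swedish letter pairs, assuming ISO 8859-1
 * byte values.  Returns the number of pairs that fail the check,
 * or -1 if the system does not know the locale at all. */
int check_swedish_pairs(const char *locale_name)
{
        static const unsigned char pairs[3][2] = {
                { 0xC5, 0xE5 },         /* A-ring     */
                { 0xC4, 0xE4 },         /* A-diaeresis */
                { 0xD6, 0xF6 },         /* O-diaeresis */
        };
        int i, failures = 0;

        if (setlocale(LC_CTYPE, locale_name) == NULL)
                return -1;

        for (i = 0; i < 3; ++i) {
                int ok = case_pair_ok(pairs[i][0], pairs[i][1]);

                printf("0x%02X <-> 0x%02X: %s\n",
                        pairs[i][0], pairs[i][1], ok ? "ok" : "MISMATCH");
                if (!ok)
                        ++failures;
        }
        return failures;
}
```

Calling check_swedish_pairs("sv") and then check_swedish_pairs("sv_SE") from
a small main() would show whether the two locale names behave differently.
A return of -1 would mean the locale name itself isn't recognized, which
would be worth knowing too. In the plain "C" locale all three pairs should
be reported as mismatches, since only A-Z and a-z count as letters there.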

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b25 : Thu Nov 25 1999 - 13:38:56 PST