Re: [htdig] Foreign chars (Swedish)


Subject: Re: [htdig] Foreign chars (Swedish)
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Nov 25 1999 - 11:47:41 PST


According to Philippe Ramkvist-Henry:
> I'm having problems with some foreign chars when using htdig to index and
> search a Swedish site. The locale is set right (sv) and is working in
> other applications. The problem I have is somewhat weird, maybe it has
> something to do with "uppercase" "lowercase"?
>
> Well, I can search words like "Åsa,åsa,Öl,öl" and get the same matches.
> But when I try to search "bäst" I get no hits. With "bÄst" I get several
> hits...

Are the hits all capitalized, or do some of them have the lowercase ä?
Does this problem happen consistently with certain accented letters, and
not others? Do you have certain uppercase letters appearing in db.wordlist?

> I asked a guy here at the University and he said that there might be
> complications with "unsigned char" and "char". He gave me the example
> below. Please answer at a novice level, my C++ and Unix knowledge is very
> limited.

Good hunch, but given that some accented letters work and some give
problems, I wouldn't expect that it's a problem with sign extension.
This seems to point to a problem with the ctype tables for your locale,
but there could be something else that I'm missing here. Please keep
us posted.
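
One way to test the ctype theory is to ask the C library directly whether
it treats an accented pair as upper/lower case. This is only a rough
sketch, not ht://Dig code: the helper names are made up for illustration,
and the locale name you pass in (something like "sv_SE.ISO-8859-1") is an
assumption that depends on what is installed on your system. The byte
values assume ISO-8859-1, where Ä is 0xC4 and ä is 0xE4.

```cpp
#include <cctype>
#include <clocale>

// Hypothetical helper: true if the current locale's ctype tables map
// the two bytes to each other as an upper/lower case pair.
bool is_case_pair(unsigned char upper, unsigned char lower)
{
    return std::tolower(upper) == lower && std::toupper(lower) == upper;
}

// Hypothetical helper: switch LC_CTYPE to the named locale, then test
// the pair. Returns false if the locale cannot be selected at all.
bool case_pair_in_locale(const char *locale_name,
                         unsigned char upper, unsigned char lower)
{
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr)
        return false;
    return is_case_pair(upper, lower);
}
```

If case_pair_in_locale() with your Swedish locale name returns true for
(0xC5, 0xE5) and (0xD6, 0xF6) but false for (0xC4, 0xE4), that would
point at the locale tables rather than at htdig.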

> htlib/StringMatch.cc
>
> while ((unsigned char)string[pos])
> {
>     new_state = table[trans[string[pos]]][state];
>
> Should be? or?
>
> while (string[pos])

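You don't need to remove the type cast from the "while" condition above,
but the trans[] array subscript below definitely should be type cast!
Without it, on systems where plain char is signed, a character above 127
becomes a negative subscript.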
I'll fix this in the source. However, this seems to be a problem only
in the StringMatch::Compare() method, which isn't used for looking at
words in documents or in the database. It only affects a few internal
ASCII-only string matches, and the robots.txt disallow comparisons, so
unless you use upper-half characters in URLs, this bug shouldn't be a
problem (which explains how it's evaded detection this long).

> {
>     new_state = table[trans[(unsigned char)string[pos]]][state];
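
For anyone wondering why the cast matters, here is a small sketch of
what happens to the subscript. It is not the StringMatch code itself;
table_index() and raw_index() are hypothetical names used only to show
the two cases, assuming a 256-entry table indexed by character value.

```cpp
#include <cstddef>

// With the cast: the byte is read as a value in 0..255, so it is always
// a valid subscript for a 256-entry table.
inline std::size_t table_index(char c)
{
    return static_cast<unsigned char>(c);
}

// Without the cast: the char is promoted to int as-is. Where plain char
// is signed, bytes of 0x80 and above come out negative, which would
// index memory in front of the table.
inline int raw_index(char c)
{
    return c;
}
```

On a system where plain char is signed, raw_index() gives -28 for the
ISO-8859-1 byte 0xE4 (ä), while table_index() gives 228 everywhere.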

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



This archive was generated by hypermail 2b25 : Thu Nov 25 1999 - 11:59:33 PST