Re: htdig: [Patch] non english text parser broken


Vadim Chekan (vadim@gc.lviv.ua)
Thu, 5 Nov 1998 10:44:50 +0200


-----Original Message-----
From: Nuno Grilo <nmg@publico.publico.pt>
To: htdig@sdsu.edu <htdig@sdsu.edu>
Date: 4 November 1998, 21:33
Subject: Re: htdig: [Patch] non english text parser broken

>
>
>On Wed, 4 Nov 1998, Geoff Hutchison wrote:
>
>> At 9:08 AM -0500 11/4/98, Vadim Chekan wrote:
>>
>> >I found a bug in the current (3.1.0.b2) release: I can't index Cyrillic
>> >text files. This is because of declaring "char" instead of "unsigned char".
>> >The "isalpha" function doesn't work with char > 127.
>>
>> Is this just a problem with text files? In other words, is the problem
>> with the Plaintext parser, or also with the HTML parser?
>>
>I get no matches for non-English words in HTML documents, with or without
>the patch. This is on Digital Unix 4.0.

1. Did you add a "locale: xxx" line to your configuration file?
For example, for Russian on FreeBSD I use:
locale: ru_RU.KOI8-R

2. Do you know which encoding the HTTP server uses for the HTML pages it
sends you? Does that encoding match the one named in "locale"?
For example, several different encodings of the Cyrillic character set are
in use in Russia, and I use "Russian Apache" (http://apache.lexa.ru), which
can work with several of them.

3. You can check whether it works by looking in db.wordlist.
As long as there are no non-English words in it, you have a problem.
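
The check in point 3 can be roughly automated. The sketch below is not part of ht://Dig; the helper names are hypothetical, and it assumes db.wordlist can be read as plain bytes in a single-byte encoding such as KOI8-R, where every Cyrillic letter has a value above 127. It only counts such bytes and does not parse the file format.

```c
#include <stdio.h>

/* Count bytes with the high bit set (values 128..255) in an open stream.
 * Hypothetical helper for illustration; not an ht://Dig function. */
long count_high_bytes(FILE *fp)
{
    long count = 0;
    int ch;  /* int, not char, so EOF stays distinct from byte 0xFF */

    while ((ch = getc(fp)) != EOF) {
        if (ch >= 0x80)
            count++;
    }
    return count;
}

/* Open the file at path in binary mode and count its high bytes.
 * Returns -1 if the file cannot be opened. */
long count_high_bytes_in_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    long n;

    if (fp == NULL)
        return -1;
    n = count_high_bytes(fp);
    fclose(fp);
    return n;
}
```

If count_high_bytes_in_file("db.wordlist") returns 0 after indexing pages that contain Cyrillic text, no non-English words reached the word list, and the locale or the encoding is the likely cause.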

I have 3.1.0b2.
HTML indexing works fine; only text files need this patch.
You are the second person to ask me about non-English indexing.
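
For readers without the patch at hand, here is a minimal sketch of the kind of change it makes. This is not the patch itself, and the function names are hypothetical. Where plain "char" is signed, a KOI8-R byte such as 0xC1 becomes a negative value, and passing a negative value other than EOF to isalpha is undefined behaviour in C. Converting to "unsigned char" first avoids that.

```c
#include <ctype.h>
#include <stddef.h>

/* Return 1 if c is alphabetic in the current locale, 0 otherwise.
 * The cast matters: <ctype.h> functions expect a value representable
 * as unsigned char (or EOF), and a plain char above 127 may be negative.
 * Hypothetical helper; not the actual ht://Dig code. */
int is_word_char(char c)
{
    return isalpha((unsigned char)c) != 0;
}

/* Length of the leading run of alphabetic bytes in s. */
size_t leading_word_length(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0' && is_word_char(s[n]))
        n++;
    return n;
}
```

Whether a byte above 127 counts as a letter still depends on the active locale. In the default "C" locale it normally does not, so a program would also need to call setlocale(LC_CTYPE, ...) with a locale name the system actually provides, such as the ru_RU.KOI8-R name from the configuration line above.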

Vadim Chekan.
SysAdm "Galitsky Kontrakty" newspaper



