Re: htdig: [Patch] non english text parser broken


Nuno Grilo (nmg@publico.publico.pt)
Thu, 5 Nov 1998 11:24:17 +0000 (WET)


On Thu, 5 Nov 1998, Vadim Chekan wrote:

>
> -----Original Message-----
> From: Nuno Grilo <nmg@publico.publico.pt>
> To: htdig@sdsu.edu <htdig@sdsu.edu>
> Date: 4 November 1998, 21:33
> Subject: Re: htdig: [Patch] non english text parser broken
>
>
> >
> >
> >On Wed, 4 Nov 1998, Geoff Hutchison wrote:
> >
> >> At 9:08 AM -0500 11/4/98, Vadim Chekan wrote:
> >>
> >> >I found a bug in the current (3.1.0.b2) release: I can't index Cyrillic
> >> >text files. This is because "char" is declared instead of "unsigned char".
> >> >The function "isalpha" doesn't work with char > 127.
> >>
> >> Is this just a problem with text files? In other words, is the problem
> >> with the Plaintext parser, or also with the HTML parser?
> >>
> >I get no matches for non-English words in HTML documents with or without
> >the patch. This is on Digital Unix 4.0.
>
>
> 1. Did you insert a "locale: xxx" line in your configuration file?
> For my Russian setup on FreeBSD, for example:
> locale: ru_RU.KOI8-R

No, but the encoding I'm using is iso-latin1, which is the default.
The non-ASCII characters are all in iso-latin1, not HTML entities.
I'm using the same configuration file I used for 3.1.0b1, and things
worked fine in 3.1.0b1.

> 2. Do you know in which encoding you get HTML pages from the HTTP server?
> Does this encoding match the one given in "locale"?
> For example, in Russia there are several encodings for the Cyrillic charset,
> and I use "Russia Apache" http://apache.lexa.ru, which can work with several
> encodings.

I'm using Apache's default, which I think is iso-latin1.

> 3. You can check whether it works by looking in the db.wordlist file.
> As long as there are no non-English words in it, you have a problem.
>
> I have 3.1.0b2.
> HTML indexing works fine and only text files need this patch.
> You are the second person to ask me about non-English indexing.
>
>
> Vadim Chekan.
> SysAdm "Galitsky Kontrakty" newspaper
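
As a rough illustration of the "char" versus "unsigned char" problem quoted
above, here is a minimal C sketch. It is not taken from the ht://Dig source;
the helper names is_word_char and count_word_chars are hypothetical. On
platforms where plain char is signed, a byte above 127 becomes a negative
value, and passing a negative value (other than EOF) to isalpha() is
undefined behaviour. Converting to unsigned char first avoids that.

```c
#include <ctype.h>

/* Hypothetical helper: returns nonzero if c is alphabetic in the
   current locale. The cast to unsigned char keeps bytes above 127
   from being passed to isalpha() as negative values. */
static int is_word_char(char c)
{
    return isalpha((unsigned char)c) != 0;
}

/* Hypothetical helper: counts the bytes of s that is_word_char() accepts. */
static int count_word_chars(const char *s)
{
    int n = 0;
    for (; *s != '\0'; ++s) {
        if (is_word_char(*s)) {
            ++n;
        }
    }
    return n;
}
```

Whether a byte such as 0xE9 (e-acute in iso-latin1) counts as a letter still
depends on the active locale, so a caller would also need something like
setlocale(LC_CTYPE, "") with a matching locale before using these helpers.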

This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:28:45 PST