Nuno Grilo (firstname.lastname@example.org)
Thu, 5 Nov 1998 11:24:17 +0000 (WET)
On Thu, 5 Nov 1998, Vadim Chekan wrote:
> -----Original Message-----
> From: Nuno Grilo <email@example.com>
> To: firstname.lastname@example.org <email@example.com>
> Date: 4 November 1998, 21:33
> Subject: Re: htdig: [Patch] non english text parser broken
> >On Wed, 4 Nov 1998, Geoff Hutchison wrote:
> >> At 9:08 AM -0500 11/4/98, Vadim Chekan wrote:
> >> >I found a bug in the current (3.1.0.b2) release: I can't index Cyrillic
> >> >text files. This is because the code declares "char" instead of
> >> >"unsigned char". The "isalpha" function doesn't work with char values > 127.
> >> Is this just a problem with text files? In other words, is the problem
> >> the Plaintext parser, or also with the HTML parser?
> >I get no matches for non-English words in HTML documents, with or without
> >the patch. This is on Digital Unix 4.0.
> 1. Did you add a "locale: xxx" line to your configuration file?
> For example, for Russian on FreeBSD I use:
> locale: ru_RU.KOI8-R
No, but the encoding I'm using is ISO Latin-1, which is the default.
The non-ASCII characters are all in ISO Latin-1, not HTML entities.
I'm using the same configuration file I used for 3.1.0b1, and things
worked fine in 3.1.0b1.
> 2. Do you know which encoding the HTTP server uses for your HTML pages?
> Does that encoding match the one described in "locale"?
> For example, in Russia there are several encodings for the Cyrillic charset, and I
> use "Russia Apache" http://apache.lexa.ru which can work with several
I'm using Apache's default, which I think is ISO Latin-1.
> 3. You can check whether it works by looking in db.wordlist.
> If there are no non-English words in it, you have a problem.
> I have 3.1.0b2.
> HTML indexing works fine and only text files need this patch.
> You are the second person to ask me about non-English indexing.
> Vadim Chekan.
> SysAdm "Galitsky Kontrakty" newspaper
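
For reference, the "char" versus "unsigned char" problem described in the original report can be sketched in a few lines of C. This is only a minimal illustration and is not taken from the ht://Dig sources; the helper names is_alpha_byte and count_alpha_bytes are hypothetical. On platforms where plain char is signed, bytes above 127 become negative, and passing a negative value (other than EOF) to isalpha() is undefined, so the byte is cast to unsigned char first.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from ht://Dig): classify one byte safely.
 * Casting to unsigned char keeps bytes above 127 in the range that
 * isalpha() accepts, even when plain char is a signed type. */
static int is_alpha_byte(char c)
{
    return isalpha((unsigned char)c) != 0;
}

/* Hypothetical helper: count the alphabetic bytes in a NUL-terminated
 * string, using the safe classifier above. */
static size_t count_alpha_bytes(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (is_alpha_byte(*s))
            ++n;
    }
    return n;
}
```

Whether a byte above 127 actually counts as a letter still depends on the active locale, so a program would presumably also need a setlocale() call matching the document encoding, such as the ru_RU.KOI8-R setting mentioned above, before these checks return true for Cyrillic or accented Latin-1 letters.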
This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:28:45 PST