Re: [htdig] indexing dem cyrillic letters along w/ latin ones

Subject: Re: [htdig] indexing dem cyrillic letters along w/ latin ones
From: Max Pyziur
Date: Sat Dec 09 2000 - 09:30:09 PST

Over a year and a half ago, the dialog went like this:
> According to Max Pyziur:
> >Greetings All,
> >
> >I'm still a newbie to ht://dig. I've installed it both on my home Linux
> >box (RPMs on RedHat 5.2) and on our server (running Solaris 2.6; yes, had
> >to find the necessary libstdc++ library and get a copy of gnu-make; the
> >address of the website is; our first uses of ht://dig
> >can be found at I'm still in testing
> >mode and haven't begun to try and index the whole server, just one or two
> >directories. Our problem is that our website is trilingual - more than 50%
> >English, the rest mostly Ukrainian, with a bit here and there in Russian.
> >The other problem is that the Character set we're using for the Ukrainian
> >and Russian language pages is CP1251, not KOI8 (the Unix guys' and gals'
> >favorite). This is because CP1251 exists in one form whereas KOI8 exists
> >in several (KOI8-R, KOI8-U, KOI8-RU), all overlapping on a core set of
> >characters, but differing on about five or six, making use of any variant
> >of KOI8 just a bit unnerving.
> >
> >I've seen the references to dictionaries available at
> > and
> >have picked up the Russian one (pretty much a cinch to change Russian
> >koi8 to cp1251); however, does anyone know of Ukrainian dictionaries?

> Not yet - Ukrainian is not very widely used outside Ukraine itself, I
> think. Maybe you can get some information from the computer science or
> math departments of Ukrainian universities? At least this is where I would
> look, since there is a good chance that the people there use iSpell for
> checking TeX documents.
> >Last, do the compilations of ht://dig have to be done separately for each
> >language (clearly a newbie question).
> No. Setting the "locale" directive in the configuration file should be
> sufficient.

Sometime around the end of 1999, a Ukrainian dictionary appeared on a server in
Ukraine. It is in the KOI8 encoding. You can find it here:
or here:

I downloaded it, wrote a Perl script for converting KOI8 to cp1251 (available on
my website), and used it to convert the dictionary to cp1251.
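For anyone who wants to try the same conversion without the Perl script, here is a minimal sketch in Python. It is not the script mentioned above. It assumes the dictionary is in the KOI8-U variant (the post only says "KOI8"), and the function names are made up for illustration.

```python
def koi8u_to_cp1251(data: bytes) -> bytes:
    """Re-encode KOI8-U bytes as CP1251 bytes.

    Assumption: the input really is KOI8-U. Encoding is strict, so a
    character with no CP1251 equivalent (e.g. a KOI8 box-drawing glyph)
    raises UnicodeEncodeError instead of being silently dropped.
    """
    text = data.decode("koi8_u")
    return text.encode("cp1251")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a whole word list; both paths are supplied by the caller."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(koi8u_to_cp1251(raw))
```

ASCII content such as ispell affix flags passes through unchanged, since both encodings agree on the lower 128 code points.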

I'll also make both available at for those who might be interested.

I also set up a Ukrainian-language locale on my RH6.2 server using the following
command:
localedef -c -f CP1251 -i uk_UA -u mnemonic.ds /usr/share/locale/uk_UA.cp1251
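A quick way to check that a locale built this way is actually visible to programs is to try activating it. The following is a rough sketch in Python; the helper name is hypothetical, and it only tests whether the C library accepts the locale name, not whether htdig itself will use it.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if the C library accepts `name` for LC_CTYPE."""
    previous = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    finally:
        # Always restore whatever was active before the probe.
        locale.setlocale(locale.LC_CTYPE, previous)
    return True


if __name__ == "__main__":
    # Name taken from the localedef command above.
    print(locale_available("uk_UA.cp1251"))
```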

I then put the following lines in my conf files:
locale: uk_UA.cp1251
lang_dir: ${common_dir}/ukrainian
bad_word_list: ${lang_dir}/ukr_badwords
endings_affix_file: ${lang_dir}/ukrainian.aff
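To see what paths the ${...} references in these lines resolve to, here is a small sketch in Python. It is not htdig's own parser, only an illustration of the substitution. The common_dir value used in the usage lines is a made-up example; use whatever your installation defines.

```python
import re


def parse_attributes(text: str) -> dict:
    """Parse 'name: value' lines, skipping blanks and '#' comments."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(":")
        attrs[name.strip()] = value.strip()
    return attrs


_REF = re.compile(r"\$\{(\w+)\}")


def expand(attrs: dict, name: str, depth: int = 0) -> str:
    """Recursively substitute ${other} references in the value of `name`."""
    if depth > 10:
        raise ValueError("reference chain too deep (circular definition?)")
    return _REF.sub(lambda m: expand(attrs, m.group(1), depth + 1), attrs[name])


if __name__ == "__main__":
    conf = (
        "common_dir: /opt/htdig/common\n"  # hypothetical path
        "lang_dir: ${common_dir}/ukrainian\n"
        "endings_affix_file: ${lang_dir}/ukrainian.aff\n"
    )
    print(expand(parse_attributes(conf), "endings_affix_file"))
```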

The funny thing (head scratching) is that I'm not totally convinced that the
dictionary is necessary. There are about 40,000 words in the dictionary, but
case-insensitive searches also work for words which don't occur there. I guess
this is still one of the things I don't fully understand about the
configuration of htdig.

Anyway, I'm very pleased with the results so far.

> hth,
> Torsten
> --
> InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
> Waldhofstraße 14 Tel: +49-4101-403605
> D-25474 Ellerbek Fax: +49-4101-403606
> E-Mail: Internet:

Max Pyziur
BRAMA - Gateway Ukraine

