Re: [htdig] indexing dem cyrillic letters along w/ latin ones


From: Max Pyziur (pyz@panix.com)
Date: Sat Dec 09 2000 - 09:30:09 PST


Over a year and a half ago, the dialog went like this:
>
> According to Max Pyziur:
> >Greetings All,
> >
> >I'm still a newbie to ht://dig. I've installed it both on my home Linux
> >box (RPMs on RedHat 5.2) and on our server (running Solaris 2.6; yes, had
> >to find the necessary libstdc++ library and get a copy of gnu-make; the
> >address of the website is http://www.brama.com; our first uses of ht://dig
> >can be found at http://www.brama.com/search.html). I'm still in testing
> >mode and haven't begun to try and index the whole server, just one or two
> >directories. Our problem is that our website is trilingual - more than 50%
> >English, the rest mostly Ukrainian, with a bit here and there in Russian.
> >The other problem is that the character set we're using for the Ukrainian
> >and Russian language pages is CP1251, not KOI8 (the Unix guys' and gals'
> >favorite). This is because CP1251 exists in one form whereas KOI8 exists
> >in several (KOI8-R, KOI8-U, KOI8-RU), all overlapping on a core set of
> >characters, but differing on about five or six, making use of any variant
> >of KOI8 just a bit unnerving.
> >
> >I've seen the references to dictionaries available at
> >http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell-dictionaries.html and
> >have picked up the Russian one (pretty much a cinch to change Russian
> >koi8 to cp1251); however, does anyone know of Ukrainian dictionaries?

> Not yet - Ukrainian is not very widely used outside Ukraine itself, I
> think. Maybe you can get some information from the computer science or
> math departments of Ukrainian universities? At least that is where I would
> look, since there is a good chance that the people there use iSpell for
> checking TeX documents.
>
> >Last, do the compilations of ht://dig have to be done separately for each
> >language? (Clearly a newbie question.)
>
> No. Setting the "locale" directive in the configuration file should be
> sufficient.

Sometime around the end of 1999, a Ukrainian ispell dictionary appeared on a
server in Ukraine. It is in the KOI8 encoding. You can find it here:
ftp://cad.ntu-kpi.kiev.ua/soft/lingvist/UkrIspell/
or here:
http://www.physics.mcgill.ca/WWW/oleh/emacs/ispell.html

I downloaded it, wrote a Perl script for converting KOI8 to cp1251 (available
on my website), and used it to convert the dictionary to cp1251.
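
For anyone who wants to try the same conversion without the Perl script, here
is a minimal sketch in Python. It is not the script mentioned above. It assumes
the dictionary is in the KOI8-U variant (the post only says "KOI8"), and the
function names koi8_to_cp1251 and convert_file are made up for illustration.

```python
from pathlib import Path


def koi8_to_cp1251(data: bytes, source_encoding: str = "koi8_u") -> bytes:
    """Re-encode KOI8 bytes as CP1251 by decoding to Unicode first.

    source_encoding is an assumption: pass "koi8_r" if the file turns
    out to be in the Russian variant instead.
    """
    return data.decode(source_encoding).encode("cp1251")


def convert_file(src: str, dst: str, source_encoding: str = "koi8_u") -> None:
    """Read src as raw bytes, convert, and write the result to dst."""
    converted = koi8_to_cp1251(Path(src).read_bytes(), source_encoding)
    Path(dst).write_bytes(converted)
```

Working on raw bytes leaves line endings untouched. A character that exists in
KOI8-U but not in CP1251 (the box-drawing characters, for example) raises
UnicodeEncodeError rather than being silently dropped; a plain word list should
not contain any.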

I'll also make both things available at brama.com for those who might be
interested.

I also set up a Ukrainian language locale on my RH6.2 server using the
following command:
localedef -c -f CP1251 -i uk_UA -u mnemonic.ds /usr/share/locale/uk_UA.cp1251

I then put the following lines in my conf files:
locale: uk_UA.cp1251
lang_dir: ${common_dir}/ukrainian
bad_word_list: ${lang_dir}/ukr_badwords
endings_affix_file: ${lang_dir}/ukrainian.aff

The funny thing (head scratching) is that I'm not totally convinced that the
dictionary is necessary. There are about 40,000 words in the dictionary, but
case-insensitive searches also work for words that don't occur in it. I guess
this is still one of the things I don't fully understand about configuring
htdig.
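
One possible explanation, offered as a guess and not checked against htdig's
source: case-insensitive matching only needs the upper/lower case mapping that
comes with the locale's character set, not a word list. The Python sketch below
shows that idea for CP1251. The names cp1251_lower_table and fold_case are
hypothetical, and the table is built from Python's own codec, not from the
uk_UA.cp1251 locale created with localedef.

```python
def cp1251_lower_table() -> bytes:
    """Build a 256-entry byte table mapping each CP1251 byte to lower case."""
    table = bytearray(range(256))
    for b in range(256):
        try:
            ch = bytes([b]).decode("cp1251")
        except UnicodeDecodeError:
            continue  # byte is unassigned in CP1251, leave it unchanged
        try:
            lowered = ch.lower().encode("cp1251")
        except UnicodeEncodeError:
            continue  # lower-case form does not exist in CP1251
        if len(lowered) == 1:
            table[b] = lowered[0]
    return bytes(table)


def fold_case(word: bytes, table: bytes = cp1251_lower_table()) -> bytes:
    """Lower-case a CP1251-encoded word using only the byte table."""
    return word.translate(table)
```

With this, fold_case applied to the CP1251 bytes of a capitalized word and of
its lower-case form gives the same result, for Cyrillic and Latin letters alike,
and no dictionary is consulted at any point.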

Anyway, I'm very pleased with the results so far.
 
> hth,
> Torsten
>
> --
> InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
> Waldhofstraße 14 Tel: +49-4101-403605
> D-25474 Ellerbek Fax: +49-4101-403606
> E-Mail: info@inwise.de Internet: http://www.inwise.de

-- 
Max Pyziur                                     BRAMA - Gateway Ukraine
pyz@brama.com                                  http://www.brama.com/




This archive was generated by hypermail 2b28 : Sat Dec 09 2000 - 15:47:36 PST