Re: [htdig] using 2 languages at the same time?


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Nov 02 2000 - 07:54:43 PST


This whole issue was discussed previously in a bit more detail. See the
thread entitled "Endings databases of two languages":

        http://www.htdig.org/mail/2000/05/index.html#223

Unless you work out a way of merging endings databases of two or more
languages, as Geoff and I had previously discussed, you won't get the
endings fuzzy matching to work in multiple languages simultaneously.

Of course, this doesn't prevent you from indexing documents in multiple
languages, as long as they use the same character encoding, ISO-8859-1
in this case. You can even put together combined bad_words lists and
synonyms dictionaries, but if you want the endings algorithm to work in a
multi-lingual setting, you either need to figure out how to combine the
databases, or let the user select the language preference via different
config files.
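
As a rough illustration of those two workarounds, here is a minimal
Python sketch (not part of ht://Dig). It merges several one-word-per-line
bad_words files into a single list, and writes out the text of a
per-language config file that points the endings attributes at that
language's files. The helper names, the paths, the lowercasing, and the
use of "include:" to pull in a shared base config are assumptions of
mine; check the attribute names against the documentation for your
version before relying on them.

```python
from pathlib import Path


def merge_word_lists(paths):
    """Return the sorted union of words from one-word-per-line files."""
    words = set()
    for path in paths:
        # Assumption: the lists share one encoding, ISO-8859-1 here.
        text = Path(path).read_text(encoding="iso-8859-1")
        for line in text.splitlines():
            word = line.strip().lower()  # lowercasing is an assumption
            if word:
                words.add(word)
    return sorted(words)


def render_language_config(base_conf, lang_dir, affix_name, dict_name):
    """Build the text of a config file for one language.

    All file names are hypothetical; only the attribute names are meant
    to match ht://Dig's endings and bad-word settings.
    """
    lines = [
        f"include: {base_conf}",
        f"bad_word_list: {lang_dir}/bad_words",
        f"endings_affix_file: {lang_dir}/{affix_name}",
        f"endings_dictionary: {lang_dir}/{dict_name}",
        f"endings_root2word_db: {lang_dir}/root2word.db",
        f"endings_word2root_db: {lang_dir}/word2root.db",
    ]
    return "\n".join(lines) + "\n"
```

With one such file per language, the search form can let the user pick
the language by choosing which config file htsearch loads.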

According to Geoff Hutchison:
> At 4:47 PM +0800 11/2/00, Mathias Körber wrote:
> >a) index pages which may occur in any of 2 or more languages
>
> Well, sure.
>
> >b) automatically identify which language the files are in (no,
> >there is no identifier, this is an email archive which has
> >mails in English, German and a few other languages)
>
> No, I'm afraid not. There isn't much "intelligence" in this regard.
> Even so, you pose a difficult problem--the code would need to
> "recognize" the language from the text, which is one of the harder
> problems in text processing. The HTML standard offers several methods
> for indicating the language of a document, which would help, but from
> what you say, these are not used on your pages.
>
> >c) use more than one .aff file, the correct one for each language?
>
> Certainly it would help if ht://Dig kept some metadata for the
> language of a document--this would enable language-specific searches
> and language-specific fuzzy matching as you describe. But this would
> likely be dependent on the META information available in the
> documents themselves.
>
> >The FAQ seems to say that I should create a subdir $COMMON/german
> >and install the german language files there, but that would make the
> >English ones unused, no?
>
> That is correct. Of course you can perform searches on all languages
> at the same time--the only restriction is that most fuzzy algorithms
> won't work well.
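
To give a feel for why automatic identification is hard to do well, here
is a crude Python sketch that guesses a message's language by counting
common stop words. Nothing like this exists in ht://Dig; the word lists
and the function name are made up for illustration, and a real archive
with short or mixed-language mails would fool it easily.

```python
import re

# Tiny, hand-picked stop-word samples (an assumption, far from complete).
STOP_WORDS = {
    "english": {"the", "and", "of", "to", "is", "that", "in", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}


def guess_language(text, stop_words=STOP_WORDS):
    """Return the language whose stop words occur most often, or None."""
    # Split into runs of letters; digits and punctuation are ignored.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in stop_words.items()
    }
    best = max(scores, key=scores.get)
    # On a tie, the language listed first wins; no hits at all gives None.
    return best if scores[best] > 0 else None
```

A pre-processing step along these lines could sort mails into
per-language directories before indexing, each with its own config file.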

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



