Re: [htdig] Endings databases of two languages


Subject: Re: [htdig] Endings databases of two languages
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed May 17 2000 - 10:15:08 PDT


According to Andreas Hudzieczek:
> Therefore, I am now looking for a possibility to sort of have two
> "endings databases".
> Do I need a specific english.0 file, although regular english indexing
> return good endings, whenever I include a secondary language beside
> english?

You can certainly have two different endings databases. The real problem
is that htsearch can't use more than one at a time. You could make it
selectable by the user, I suppose.
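
As a rough sketch of the user-selectable approach, you could keep one
config file per language, each pointing at its own endings databases, and
let the search form pick between them through htsearch's "config" input.
The file names english.conf and german.conf and the directory layout under
${common_dir} are made up for illustration; only the attribute names come
from your own definitions.

```
# --- german.conf (hypothetical file name) ---
# Assumption: German language files live in their own directory
lang_dir:             ${common_dir}/german
endings_affix_file:   ${lang_dir}/german.aff
endings_dictionary:   ${lang_dir}/german.0
endings_root2word_db: ${lang_dir}/root2word.db
endings_word2root_db: ${lang_dir}/word2root.db

# --- english.conf (hypothetical file name) ---
# Assumption: English language files live in their own directory
lang_dir:             ${common_dir}/english
endings_affix_file:   ${lang_dir}/english.aff
endings_dictionary:   ${lang_dir}/english.0
endings_root2word_db: ${lang_dir}/root2word.db
endings_word2root_db: ${lang_dir}/word2root.db
```

Each endings database would be built separately by running htfuzzy with
-c on the matching config file and the endings algorithm. The rest of the
two files would presumably stay identical, so that both searches run
against the same index.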

I don't know how feasible it would be to produce a "merged" endings
database for two or more languages. I somehow doubt it would be as
trivial as concatenating your .0 dictionary files and your .aff files,
then rerunning htfuzzy endings on the combined files. Can someone
with a better understanding of this algorithm and the affix files
shed any light on this? Has anyone ever tried it?

> I am further assuming that the indexing as well as the endings algos
> will look at the lang_dir variable, but if I want to have two languages
> and their endings, how do I present two lang_dir variables, for example?

None of the programs in the ht://Dig suite will use the lang_dir variable
internally. It's only used as referenced in other attribute definitions,
e.g.:

> bad_word_list: ${lang_dir}/bad_words
> endings_affix_file: ${lang_dir}/german.aff
> endings_dictionary: ${lang_dir}/german.0
> endings_root2word_db: ${lang_dir}/root2word.db
> endings_word2root_db: ${lang_dir}/word2root.db

The indexing phase (htdig and htmerge) will only use the bad_word_list,
which could be made by combining English and German bad word lists. The
other four definitions are all for the endings database, generated by
htfuzzy and used only by htsearch. It is this database that is the tricky
one. You didn't seem to include an alternate synonyms file and database
in your definitions. This too could easily be set up as a combined file,
if you have a German synonyms list to merge in with the English one.
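
A combined setup along these lines might look like the following. The
merged file names (bad_words.en_de, synonyms.en_de) are hypothetical, and
the merged lists themselves would have to be put together by hand or with
a small script.

```
# Hypothetical merged English + German lists; the file names are assumptions
bad_word_list:      ${lang_dir}/bad_words.en_de
synonym_dictionary: ${lang_dir}/synonyms.en_de
synonym_db:         ${lang_dir}/synonyms.en_de.db
```

The synonyms database would then be regenerated from the merged list by
running htfuzzy with the synonyms algorithm.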

> Btw, I did not state any locale varible (after all, the German endings
> database worked fine without that).
> If anyone thinks that it is necessary, how do I specify two different locales?

The German endings database may be fine, but does your words database
include any umlauts? Generally, unless you explicitly specify a locale,
htdig will treat all accented letters as control characters, and break
up the words at that point. You don't need to specify two different
locales, nor is it possible to do so. This isn't a problem, as most
locales for Western European languages will be pretty much equivalent
as far as htdig is concerned. The main thing is that the LC_CTYPE tables
for your locale recognize all ISO-8859-1 (Latin 1) accented letters as
alphabetic characters. You can probably set your locale to de_DE, and
it should work fine for German and English, and probably other languages
that use the same character encoding.
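
In the config file that would be a single line, something like the
following. This assumes a locale is actually installed on your system
under the name de_DE; the exact name varies between systems.

```
# Assumption: the system provides a locale with this name
locale: de_DE
```

Since the locale affects how htdig splits words, you would presumably need
to reindex after changing it.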

> Oh, I looked at the FAQ and searched the mailing list, but didn't find
> enough answers to similar questions to solve the puzzle.

Some of these points were discussed on the mailing list just a few days
ago, so unless the archiving isn't working those messages should be there.
However, the whole issue of whether the endings database can be set up to
support two languages simultaneously is still unresolved. As far as I recall,
you're only the second person to request such a feature.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



