Subject: Re: [htdig] indexing a bi-lingual site
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Thu May 11 2000 - 10:35:22 PDT
According to Gerard GACHELIN:
> I'd like to index a bilingual site (french and english) with htdig 3.1.5.
> english and french data are mixed.
> What is the best way to do this ?
Indexing the site should be easy, as long as your system supports locales
correctly. Set your locale in htdig.conf to one that your system provides.
This should work fine for English, French, and any Western European
language.
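As a rough illustration of the locale setting: the locale name fr_FR below is only an assumption, since names vary between systems ("locale -a" lists the installed ones on most Unix systems).

```
# htdig.conf fragment (sketch). fr_FR is a placeholder locale name;
# substitute one that your system actually provides.
locale:			fr_FR
```

Since the locale affects how international characters are handled, it is probably safest to reindex after changing it.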
The trickier part is supporting the fuzzy match algorithms "endings" and
"synonyms" in more than one language. You could always concatenate
English and French synonym files into one database, and even add
translations of words as synonyms, but you'd have to build that up
yourself. As it is, the synonyms file bundled with htdig only contains
alternate spellings and common misspellings of many English words.
It's not a true thesaurus of synonyms.
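A sketch of how a combined synonym database might be configured. The paths and the file name combined_synonyms are hypothetical; synonym_dictionary and synonym_db are the htdig.conf attributes for the source text file and the generated database.

```
# htdig.conf fragment (sketch). All paths are placeholders.
# combined_synonyms is assumed to be the English and French synonym
# files concatenated into one text file (for example with cat).
synonym_dictionary:	/opt/htdig/common/combined_synonyms
synonym_db:		/opt/htdig/common/combined_synonyms.db
# Weight synonym matches lower than exact matches.
search_algorithm:	exact:1 synonyms:0.5
```

After changing the dictionary, running "htfuzzy synonyms" rebuilds the database from it.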
For the endings algorithm, you could obtain a French dictionary and
affix file, to build a French root2word.db and word2root.db in a separate
directory (or as separate files) from the English ones, and set up the
search form to allow the user to select one or the other. I don't think
you could easily build a dictionary that combines both in one.
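One possible layout, as a sketch only: keep a French copy of htdig.conf whose endings attributes point at the French files. The directory and file names are hypothetical; the attribute names are the standard endings ones.

```
# Fragment for a French copy of htdig.conf (sketch; paths are placeholders).
endings_affix_file:	/opt/htdig/common/french/french.aff
endings_dictionary:	/opt/htdig/common/french/french.0
endings_root2word_db:	/opt/htdig/common/french/root2word.db
endings_word2root_db:	/opt/htdig/common/french/word2root.db
# Weight endings matches lower than exact matches.
search_algorithm:	exact:1 endings:0.1
```

Running "htfuzzy -c" with that config file and the "endings" algorithm should then build the two French databases, and the search form can select between the English and French configurations through htsearch's "config" input parameter.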
You may also want to apply the accents fuzzy match algorithm patch, which
adds an "accents" algorithm so that accented and unaccented forms of a word
match each other. Using it requires running "htfuzzy accents" after you
reindex your site.
Of course, none of the fuzzy algorithms are essential, and exact matches
will work regardless of how you set up the fuzzy algorithms and their
dictionaries and databases.
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930