Re: [htdig] indexing a bi-lingual site

Subject: Re: [htdig] indexing a bi-lingual site
From: Gilles Detillieux
Date: Thu May 11 2000 - 10:35:22 PDT

According to Gerard GACHELIN:
> I'd like to index a bilingual site (french and english) with htdig 3.1.5.
> english and french data are mixed.
> What is the best way to do this ?

Indexing the site should be easy, as long as your system supports locales
correctly. Set your locale in htdig.conf, e.g.:

locale: fr_FR

This should work fine for English, French, and any other Western European
language.

The trickier part is supporting the fuzzy match algorithms "endings" and
"synonyms" in more than one language. You could always concatenate
English and French synonym files into one database, and even add
translations of words as synonyms, but you'd have to build that up
yourself. As it is, the synonyms file bundled with htdig only contains
alternate spellings and common misspellings of many English words.
It's not a true thesaurus of synonyms.
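
As a rough sketch of the combined-file approach, the attributes below
point htdig at a merged synonyms file. The name synonyms.en-fr is made
up for illustration, and the file itself is something you would assemble
by hand, in the same format as the bundled synonyms file:

# Hypothetical merged English + French synonyms file (built by hand)
synonym_dictionary: ${common_dir}/synonyms.en-fr
# Database that htfuzzy generates from the file above
synonym_db: ${common_dir}/synonyms.en-fr.db

After editing the merged file, running "htfuzzy synonyms" should rebuild
the database.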

For the endings algorithm, you could obtain a French dictionary and
affix file, to build a French root2word.db and word2root.db in a separate
directory (or as separate files) from the English ones, and set up the
search form to allow the user to select one or the other. I don't think
you could easily build a dictionary that combines both in one.
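
One way to lay this out, as a sketch only, is a second config file that
is a copy of your main one with just the endings attributes changed.
The name french.conf, the fr subdirectory, and the French file names are
assumptions; use whatever your French dictionary and affix files are
actually called:

# french.conf: same as htdig.conf except for the lines below
# Hypothetical French affix and dictionary files
endings_affix_file: ${common_dir}/fr/francais.aff
endings_dictionary: ${common_dir}/fr/francais.dict
# Separate databases, so the English ones are left untouched
endings_root2word_db: ${common_dir}/fr/root2word.db
endings_word2root_db: ${common_dir}/fr/word2root.db

Running htfuzzy with "-c" pointing at french.conf and the "endings"
argument should then build the French databases. For the search form,
I believe the "config" input parameter of htsearch can be offered as
a choice between the two config files, but check the htsearch
documentation for your version.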

You may also want to apply the accents fuzzy match patch, which adds an
"accents" algorithm alongside the others. With it, you will need to run
"htfuzzy accents" each time you reindex your site.
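
Assuming the patched htdig registers the algorithm under the name
"accents" (as the htfuzzy command above suggests), enabling it at search
time would look something like this. The weight is arbitrary and only
for illustration:

# "accents" is only available once the patch has been applied
search_algorithm: exact:1 accents:0.7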

Of course, none of the fuzzy algorithms are essential, and exact matches
will work regardless of how you set up the fuzzy algorithms and their
dictionaries and databases.
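
For reference, an exact-match-only setup needs just the line below,
which I believe is also the built-in default when the attribute is
not set at all:

# No fuzzy algorithms: only exact word matches are returned
search_algorithm: exact:1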

Gilles R. Detillieux              E-mail: <>
Spinal Cord Research Centre       WWW:
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
