Re[2]: [htdig] A language issue.. Could you give me a favor?

From: Oskar Bartenstein
Date: Wed Mar 22 2000 - 16:38:12 PST

Hi internationals, nationals and locals:

There are 4 separate issues:
- languages (English, Mandarin, Cantonese)
- fonts for presentation (armani's *.ttf files)
- character encodings (e.g. EUC, Big-5...)
- algorithms of htdig

Now, ignoring all presentation issues (fonts, output HTML tags, etc.),
ignoring the language issues (fuzzy search, bad word lists),
and ignoring browser issues (how to interpret a keyword that the
browser sent) leaves only the character encoding.

A correct HTML page declares its encoding, so htdig on the
receiving end can convert it to whatever encoding it likes.
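
A minimal sketch of that conversion step, in Python rather than in
htdig's own code. The helper names (declared_charset, transcode), the
1024-byte scan window and the choice of EUC-JP as the target are all
assumptions made for illustration; nothing here is taken from htdig.

```python
import re

# Hypothetical helpers for illustration only; this is not htdig code.
# Matches charset=... as it appears in a <meta http-equiv="Content-Type"> tag.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def declared_charset(page: bytes, default: str = "iso-8859-1") -> str:
    """Return the charset named near the top of the page, or a default."""
    match = META_CHARSET.search(page[:1024])  # assumed scan window
    return match.group(1).decode("ascii") if match else default


def transcode(page: bytes, target: str = "euc_jp") -> bytes:
    """Decode the page with its declared charset and re-encode it as `target`."""
    text = page.decode(declared_charset(page), errors="replace")
    return text.encode(target, errors="replace")


# A small Shift_JIS page that declares its own encoding.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=Shift_JIS"></head><body>'
    + "検索 search".encode("shift_jis")
    + b"</body></html>"
)

# The body is now EUC-JP; the meta tag text itself is left untouched.
converted = transcode(page)
```

A page that declares no charset falls back to the default here, which
is exactly the case where the conversion can go wrong.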

If htdig uses a character encoding like EUC that is context
independent and coexists with seven-bit single-byte characters,
what actually prevents htdig from doing its thing?
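
The property in question can be checked with a small Python sketch
(again an illustration, not htdig code; high_bit_only is a made-up
helper name). In the EUC encodings every byte of a multi-byte character
has the high bit set, so a seven-bit byte is always a real ASCII
character. Big-5 is shown for contrast, because its second byte may
fall in the ASCII range.

```python
def high_bit_only(data: bytes) -> bool:
    """True if no byte of `data` falls in the seven-bit ASCII range."""
    return all(byte >= 0x80 for byte in data)


# Python's "gb2312" codec produces the EUC-CN byte form of GB2312.
euc_cn = "中文".encode("gb2312")
euc_jp = "検索".encode("euc_jp")

# In Big-5 this character's second byte is 0x5C, the ASCII backslash.
big5 = "功".encode("big5")

for name, data in (("EUC-CN", euc_cn), ("EUC-JP", euc_jp), ("Big-5", big5)):
    print(name, data.hex(), high_bit_only(data))
```

For an 8-bit-clean program this means ASCII delimiters such as spaces,
quotes or angle brackets can be searched for bytewise in EUC text
without hitting the middle of a multi-byte character.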

It boils down to 2 questions (sorry, I never looked at the source code):
        - is htdig 8-bit clean?
        - are htdig's words and dictionaries sequences of bytes?
If the answer to both is yes, then I would guess the core is ok,
and we only have to look at how to use it properly.
Hope I did not overlook a parsing issue.
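
What "words and dictionaries as sequences of bytes" would buy can be
sketched as follows. This is a toy Python index, not htdig's data
structures, and index_words is a hypothetical name. It assumes the
words are already separated by ASCII whitespace, which ordinary Chinese
text is not, so word segmentation is one parsing issue the sketch
leaves open.

```python
from collections import defaultdict


def index_words(doc_id: int, body: bytes, index: dict) -> None:
    """Add each whitespace-delimited byte string in `body` to the index."""
    # bytes.split() with no argument splits on ASCII whitespace only,
    # so bytes with the high bit set pass through unchanged.
    for word in body.split():
        index[word].add(doc_id)


# Toy inverted index: byte-string word -> set of document ids.
index = defaultdict(set)
index_words(1, "htdig 中文 search".encode("gb2312"), index)
index_words(2, "中文 engine".encode("gb2312"), index)

# The query keyword must arrive in the same encoding as the indexed text.
hits = index["中文".encode("gb2312")]
```

Because keys are compared byte for byte, the lookup only works when the
query and the documents share one encoding, which is why the conversion
step at indexing time matters.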


Wed, 22 Mar 2000 09:06:16 -0600 Geoff Hutchison said:
> At 6:48 AM +0800 3/22/00, armani wrote:
> >After I build htdig this search engines, it will work fastly on my
> >web server except Chinese words.
> The problem is that Chinese words (and many other languages) use
> multi-byte characters. Currently, ht://Dig does not support
> multi-byte characters, so it cannot be used to index Chinese.

Dr. Oskar Bartenstein       
IF Computer Japan               
