Re[2]: [htdig] A language issue.. Could you give me a favor?


Subject: Re[2]: [htdig] A language issue.. Could you give me a favor?
From: Oskar Bartenstein (oskar@ifcomputer.co.jp)
Date: Wed Mar 22 2000 - 16:38:12 PST


Hi internationals, nationals and locals:

There are 4 separate issues:
- languages (English, Mandarin, Cantonese)
- fonts for presentation (armani's *.ttf files)
- character encodings (e.g. EUC, Big-5...)
- algorithms of htdig

Now, ignoring all presentation issues (fonts, output HTML tags etc.),
the language issues (fuzzy search, bad-word lists)
and the browser issues (how to interpret a keyword that the
browser sent) leaves only the character encoding.

A correct HTML page includes info about its encoding, therefore
htdig on the receiving end can convert it to any code it likes.
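
To illustrate that conversion step, here is a minimal sketch in Python
(not htdig code). It assumes the page declares its encoding in a
<meta> tag; the helper names declared_charset and recode, the
ISO-8859-1 fallback and the EUC-JP target are all my own choices for
the example.

```python
import re

# Matches the charset value of a <meta ... charset=...> declaration.
_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
                         re.IGNORECASE)

def declared_charset(raw_html, default="iso-8859-1"):
    """Return the charset named in the first 2 KB of the page, or a default."""
    match = _CHARSET_RE.search(raw_html[:2048])
    return match.group(1).decode("ascii") if match else default

def recode(raw_html, target="euc_jp"):
    """Decode the page from its declared charset and re-encode it as `target`."""
    text = raw_html.decode(declared_charset(raw_html))
    return text.encode(target)

# A small Shift_JIS page, converted to EUC-JP on the receiving end.
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=Shift_JIS"></head>'
        '<body>検索</body></html>').encode("shift_jis")
converted = recode(page)
```

The sketch does not rewrite the <meta> tag itself, and a real crawler
would presumably also consult the HTTP Content-Type header.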

If htdig uses a character encoding like EUC that is context
independent and coexists with seven-bit single-byte characters,
what actually prevents htdig from doing its thing?
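
A quick way to see what "coexists with seven-bit single-byte
characters" means: in the EUC encodings every byte of a multi-byte
character has the high bit set, so no ASCII byte can turn up inside
one. The sketch below checks this in Python. The function name is
made up for the example, and the Big-5 line is my own contrast, since
Big-5 trail bytes can fall into the ASCII range.

```python
def stray_ascii_bytes(text, encoding):
    """List the bytes below 0x80 that occur inside multi-byte characters."""
    stray = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1:  # only look inside multi-byte characters
            stray.extend(b for b in encoded if b < 0x80)
    return stray

print(stray_ascii_bytes("日本語 text", "euc_jp"))  # []
print(stray_ascii_bytes("中文", "gb2312"))         # []  (EUC-CN)
print(stray_ascii_bytes("功", "big5"))             # [92], i.e. 0x5C, a backslash
```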

It boils down to 2 questions (sorry, I have never looked at the source code):
        - is htdig 8-bit clean?
        - are htdig's words and dictionaries sequences of bytes?
If the answer to both is yes, then I would guess the core is OK,
and we only have to look at how to use it properly.
I hope I did not overlook a parsing issue.
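
To make the two questions concrete, here is a rough sketch in Python
of an indexer that treats words purely as byte sequences. This is not
how htdig works internally (I have not read its source); tokenize,
build_index and the rule "any byte >= 0x80 is a word byte" are
assumptions for illustration only.

```python
import re
from collections import defaultdict

# A "word" is a run of ASCII alphanumerics and/or high-bit bytes.
# Nothing here needs to know which EUC variant the input uses.
WORD_RE = re.compile(rb"[0-9A-Za-z\x80-\xff]+")

def tokenize(raw):
    """Split raw page bytes into words without decoding them."""
    return WORD_RE.findall(raw)

def build_index(docs):
    """Map each word (as bytes) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, raw in docs.items():
        for word in tokenize(raw):
            # bytes.lower() only touches ASCII letters; high-bit bytes pass through.
            index[word.lower()].add(doc_id)
    return index

docs = {
    "a.html": "htdig 検索 engine".encode("euc_jp"),
    "b.html": "検索 test".encode("euc_jp"),
}
index = build_index(docs)
print(sorted(index["検索".encode("euc_jp")]))  # ['a.html', 'b.html']
```

Text written without spaces comes out as one long token under this
rule; splitting it into words is one of the language issues set aside
above.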

Oskar

Wed, 22 Mar 2000 09:06:16 -0600 Geoff Hutchison
<ghutchis@wso.williams.edu> said:
> At 6:48 AM +0800 3/22/00, armani wrote:
> >After I build htdig this search engines, it will work fastly on my
> >web server except Chinese words.
>
> The problem is that Chinese words (and many other languages) use
> multi-byte characters. Currently, ht://Dig does not support
> multi-byte characters, so it cannot be used to index Chinese.

--
Dr. Oskar Bartenstein                 oskar@ifcomputer.co.jp
IF Computer Japan                         www.ifcomputer.com



