Subject: Re: [htdig] A language issue.. Could you give me a favor?
From: Oskar Bartenstein (firstname.lastname@example.org)
Date: Wed Mar 22 2000 - 16:38:12 PST
Hi internationals, nationals and locals:
There are 4 separate issues:
- languages (English, Mandarin, Cantonese)
- fonts for presentation (armani's *.ttf files)
- character encodings (e.g. EUC, Big-5...)
- algorithms of htdig
Now, ignoring all presentation issues (fonts, output HTML tags, etc.),
the language issues (fuzzy search, bad-word lists),
and browser issues (how to interpret a keyword that the
browser sent), what remains is the character encoding.
A correct HTML page declares its encoding, so
htdig on the receiving end can convert it to whatever encoding it likes.
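
To illustrate the conversion step, here is a rough Python sketch (not htdig code). The helper names declared_charset and to_internal, the ISO-8859-1 fallback, and the choice of GB2312 (EUC-CN) as the internal encoding are all my assumptions for the example:

```python
import re

# Hypothetical helper: find a charset declaration near the top of a page.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)

def declared_charset(raw, default="iso-8859-1"):
    """Return the charset named in the page, or an assumed default."""
    match = _CHARSET_RE.search(raw[:2048])
    return match.group(1).decode("ascii").lower() if match else default

def to_internal(raw, internal="gb2312"):
    """Re-encode a page from its declared charset into one internal encoding."""
    text = raw.decode(declared_charset(raw), errors="replace")
    return text.encode(internal, errors="replace")

# A Big-5 page that declares its own encoding in a meta tag.
page = ('<meta http-equiv="Content-Type" content="text/html; charset=Big5">'
        '<p>中文</p>').encode("big5")
converted = to_internal(page)  # same text, now as EUC-CN bytes
```

A real crawler would also have to consult the HTTP Content-Type header and cope with pages that declare nothing or declare the wrong thing.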
If htdig uses a character encoding like EUC that is context
independent and coexists with seven-bit single-byte characters,
what actually prevents htdig from doing its thing?
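
What I mean by "coexists with seven-bit characters" can be checked with a few lines of Python. This is only a sketch: high_bit_only is a made-up name, and GB2312 stands in for the EUC family here. In an EUC encoding every byte of a multi-byte character has its high bit set, whereas in Big-5 the second byte may fall in the ASCII range:

```python
def high_bit_only(text, encoding):
    """True if every byte of every non-ASCII character has its high bit set."""
    return all(
        byte >= 0x80
        for ch in text if ord(ch) >= 0x80
        for byte in ch.encode(encoding)
    )

sample = "功"
euc_safe = high_bit_only(sample, "gb2312")   # EUC-CN: both bytes >= 0x80
big5_safe = high_bit_only(sample, "big5")    # Big-5: trail byte is 0x5C, a backslash
```

So code that scans for ASCII delimiters byte by byte cannot cut an EUC character in half, but it can do so with Big-5.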
It boils down to 2 questions (sorry, I have never looked at the source code):
- is htdig 8-bit clean?
- are htdig's words and dictionaries sequences of bytes?
If the answer to both is yes, then I would guess the core is OK,
and we only have to look at how to use it properly.
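
As a sketch of what "8-bit clean" and "words are sequences of bytes" would mean in practice (again Python, and again not htdig's actual parser; split_words and the delimiter set are assumptions of mine), a word splitter can break only on ASCII non-alphanumeric bytes and pass everything with the high bit set through untouched:

```python
import re

# Split only on ASCII bytes that are not letters or digits;
# bytes 0x80-0xFF are never treated as delimiters and never modified.
_DELIMS = re.compile(rb"[\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+")

def split_words(data):
    """Return the words of an EUC-encoded byte string as byte strings."""
    return [word for word in _DELIMS.split(data) if word]

doc = "htdig 中文 search, ok".encode("gb2312")
words = split_words(doc)
# A non-8-bit-clean program might strip the high bit, which destroys the text:
mangled = bytes(b & 0x7F for b in doc)
```

Note that a run of Chinese characters without spaces comes out as one "word"; segmenting it is one of the language issues set aside above.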
Hope I did not overlook a parsing issue.
Wed, 22 Mar 2000 09:06:16 -0600 Geoff Hutchison
> At 6:48 AM +0800 3/22/00, armani wrote:
> >After I build htdig this search engines, it will work fastly on my
> >web server except Chinese words.
> The problem is that Chinese words (and many other languages) use
> multi-byte characters. Currently, ht://Dig does not support
> multi-byte characters, so it cannot be used to index Chinese.
-- Dr. Oskar Bartenstein email@example.com IF Computer Japan www.ifcomputer.com
This archive was generated by hypermail 2b28 : Wed Mar 22 2000 - 15:38:23 PST