Re[3]: [htdig] A language issue.. Could you give me a favor?


Subject: Re[3]: [htdig] A language issue.. Could you give me a favor?
From: Oskar Bartenstein (oskar@ifcomputer.co.jp)
Date: Wed Mar 22 2000 - 20:54:07 PST


So this is promising - great.

Wed, 22 Mar 2000 20:59:06 -0600 Geoff Hutchison
<ghutchis@wso.williams.edu> said:

> It is 8-bit clean, but it treats characters as synonymous with 8
> bits. Many parts of the code (the String class in particular) assume
> that a character is only 1 byte and keeps going. In many encodings,
> this is *not* the case, and so you're stuck.

Yes, in general a character is not a byte. Still, I don't see,
at least for clean encodings like EUC, where this difference
would break the workings of htdig.
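
To illustrate what I mean by "clean": in EUC-JP every byte of a
multibyte character has the high bit set, so a byte that looks like
ASCII really is ASCII. A small sketch, using Python's euc_jp codec
purely for illustration (this is not htdig code, and the sample text
is my own):

```python
# Illustration only: byte-level properties of EUC-JP text.
text = "htdig 検索"            # 5 ASCII letters, a space, 2 kanji
data = text.encode("euc_jp")

# A character is not a byte: 8 characters become 10 bytes.
print(len(text), len(data))

# Splitting on an ASCII space cannot cut a kanji in half, because
# no byte inside a kanji is below 0x80.
words = data.split(b" ")
assert all(byte >= 0x80 for byte in words[1])
print([w.decode("euc_jp") for w in words])
```

Code that counts or truncates by bytes can still go wrong, but
word splitting on ASCII separators should survive.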

> >A correct HTML page includes info about its encoding, therefore
> >htdig on the receiving end can convert it to any code it likes.
>
> Yes, provided that it has code to convert from one encoding into
> another. :-) This is the crux of the problem.

I would use an external converter. There is good code available,
e.g. nkf, tcs, and many others. See http://ftp.monash.edu.au/pub/nihongo/
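
A rough sketch of what "external converter" could look like: pipe the
raw page through a separate program and read back the result. The
function name run_filter is hypothetical, and the commented nkf call
assumes nkf is installed (its -e option selects EUC-JP output).

```python
import subprocess

def run_filter(command: list[str], page: bytes) -> bytes:
    """Pipe raw page bytes through an external converter, return its output."""
    result = subprocess.run(
        command,
        input=page,               # page goes to the converter's stdin
        stdout=subprocess.PIPE,   # converted bytes come back on stdout
        check=True,               # raise if the converter exits non-zero
    )
    return result.stdout

# Example, assuming nkf is on the PATH:
# euc_page = run_filter(["nkf", "-e"], raw_page)
```

Keeping the converter outside the indexer means the choice of tool
stays a configuration matter.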

> Currently ht://Dig
> assumes the host system has working locale support and is getting the
> pages in the default encoding of the system. If they're not, it
> assumes they are anyway. :-) It makes no attempt to convert character
> encodings.

> Basically, if you have a Latin-1 encoding for your character-set,
> you're OK. That's the limit of the current i18n.

To the best of my knowledge, one HTML page can only have one encoding,
but a web server can serve international pages in many encodings.
These have nothing to do with the encoding used on the machine
that runs the search engine.

A person who carefully serves an international audience will include
something like this example for EUC:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=x-euc-jp">
to allow a browser to display the page properly.
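
Reading that tag need not be complicated. A minimal sketch, assuming
the tag appears near the top of the page; the name sniff_charset and
the Latin-1 fallback are my own assumptions, not anything htdig does
today:

```python
import re

# Finds charset=... inside a META tag. Works on raw bytes, since the
# encoding is not known until the tag has been read.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)

def sniff_charset(page: bytes, default: str = "iso-8859-1") -> str:
    """Return the charset label declared in the page, lowercased."""
    match = META_CHARSET.search(page[:4096])  # only look near the top
    if match is None:
        return default                        # nothing declared
    return match.group(1).decode("ascii").lower()

tag = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=x-euc-jp">'
print(sniff_charset(tag))
```

Labels such as x-euc-jp are not standard codec names everywhere, so
a converter may need a small alias table in front of it.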

That leaves 3 tasks:
1 - Convince htdig to read these tags to get the encoding
    of the incoming page.
2 - Find a good place to attach an external converter to filter
    incoming pages.
3 - Determine whether the CGI input is understood by htsearch as it is,
    or whether it also needs special attention.
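
On task 3, the question is what bytes actually arrive. Browsers
commonly submit form data in the encoding of the page holding the
form, percent-escaped. A sketch of what the search side would see,
assuming an EUC-JP form and a query parameter called "words"; the
helper decode_query is hypothetical:

```python
from urllib.parse import parse_qs, quote

def decode_query(query_string: str, form_charset: str) -> dict[str, list[str]]:
    """Decode a percent-escaped query string using the form's charset."""
    return parse_qs(query_string, encoding=form_charset, errors="replace")

# What a browser would send for a search term typed into an EUC-JP form.
query = "words=" + quote("検索".encode("euc_jp"))
print(query)

# Decoded with the right charset, the term comes back intact.
print(decode_query(query, "euc_jp")["words"])

# Decoded as Latin-1, every byte becomes its own character and the
# term no longer matches anything indexed as EUC-JP text.
print(decode_query(query, "latin-1")["words"])
```

If the index and the query are both handled as the same raw bytes,
matching may work without any decoding at all, which is what needs
to be checked.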

Oskar
# armani would not need to wait for (1) since they know the encoding.
# if pages are served in EUC, I believe you can skip (2).



