Re[3]: [htdig] A language issue.. Could you give me a favor?


From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Thu Mar 23 2000 - 15:51:07 PST


At 1:54 PM +0900 3/23/00, Oskar Bartenstein wrote:
>Yes, in general a character is not a byte. I still don't see,
>at least for clean encodings like EUC, where this difference
>should break the workings of htdig.

No, 8-bit encodings will probably work fine. But is there actually an
8-bit encoding for Chinese?
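
To make the character/byte distinction concrete, here is a rough sketch in
Python (not htdig code; the sample string is an assumption picked for
illustration). It shows that a multi-byte EUC-JP string has more bytes than
characters, that none of those bytes collide with ASCII, and that cutting at
an arbitrary byte offset can still split a character.

```python
# Sample text is an assumption chosen for illustration only.
text = "日本語"

# EUC-JP uses two bytes for each of these kanji.
euc = text.encode("euc_jp")
char_count = len(text)   # number of characters
byte_count = len(euc)    # number of bytes

# Every byte of these characters has the high bit set, so a byte-oriented
# scanner looking for ASCII delimiters (space, '<', '>') will not find a
# false match inside a character. This is what makes EUC "clean".
high_bit_only = all(b >= 0x80 for b in euc)

# Truncating on a byte count, however, can cut a character in half.
try:
    euc[:3].decode("euc_jp")
    truncation_safe = True
except UnicodeDecodeError:
    truncation_safe = False
```

Anything that counts, truncates, or classifies individual bytes (word length
limits, excerpts, case folding) is where a byte-oriented indexer would be
expected to need attention.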

> > >A correct HTML page includes info about its encoding, therefore
> > >htdig on the receiving end can convert it to any code it likes.
> >
> > Yes, provided that it has code to convert from one encoding into
> > another. :-) This is the crux of the problem.
>
>I would use an external converter. There is good code, e.g.
>nkf, tcs, many others. See http://ftp.monash.edu.au/pub/nihongo/

I would disagree--I'd use library code like iconv(), which is in
recent versions of glibc. It has recently been packaged with some
other nice UTF-8/Unicode support into a separate, platform-independent
library.
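
As a sketch of what in-process conversion looks like, the following uses
Python's built-in codecs as a stand-in for iconv(); it is not the iconv() API
itself. The function name `transcode` and the sample text are hypothetical,
chosen only for illustration.

```python
def transcode(data: bytes, from_charset: str, to_charset: str = "utf-8") -> bytes:
    """Re-encode bytes from one charset to another, in the spirit of iconv().

    Raises LookupError for an unknown charset name and UnicodeError for
    input that is not valid in from_charset.
    """
    return data.decode(from_charset).encode(to_charset)


# Hypothetical page content, encoded as EUC-JP the way a server might send it.
euc_page = "検索".encode("euc_jp")

# Convert once on the way in; everything downstream sees a single encoding.
utf8_page = transcode(euc_page, "euc_jp")
```

The point of doing this inside the program rather than through an external
filter is that the conversion can be driven per document by whatever charset
that document declares.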

>A person who carefully serves an international audience will include
>something like this example for EUC:
><META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=x-euc-jp">
>to allow a browser to display the page properly.

Sure, but as you say, this requires the HTML.cc parser to read those
tags. :-) It also needs to recognize the charset when it comes from
the server itself, in the HTTP Content-Type header.
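
A minimal sketch of that detection step, in Python rather than in HTML.cc:
one helper pulls the charset parameter out of a Content-Type value, and a
small parser finds the META tag. All names here are hypothetical, and
preferring the server's header over the META tag is an assumption, not
something htdig is known to do.

```python
import re
from html.parser import HTMLParser

# Matches charset=..., with or without quotes, in any letter case.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([^\s;\"']+)", re.IGNORECASE)


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    match = _CHARSET_RE.search(value or "")
    return match.group(1).lower() if match else None


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first META HTTP-EQUIV=Content-Type tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attr.get("content"))


def detect_charset(http_content_type, html_text):
    """Use the server's header if it names a charset, else the META tag."""
    charset = charset_from_content_type(http_content_type)
    if charset:
        return charset
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

With the META tag quoted above, `detect_charset(None, page)` yields the label
`x-euc-jp`. A label like that still has to be mapped to a name the conversion
library understands.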

>3 - Determine if the cgi input is understood by htsearch as it is,
> or also needs special attention?

I don't know; it would require considerable testing.

I guess my point is that you can push htdig into other character
sets, but this isn't the best all-around solution. I've seen it work
on an 8-bit Korean charset (I don't remember which one), but it
should really have built-in charset conversion and full
wide-character support. This would help considerably, especially in a
few languages (Russian springs to mind) where people serve the same
page in multiple character sets.
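
To illustrate the multiple-charset case, here is a small Python sketch. The
choice of KOI8-R and windows-1251 as the two Russian charsets is an
assumption for the example, as is the sample word. The same word produces
different bytes in each charset, so a byte-oriented index treats them as
different words, while decoding each with its declared charset gives
identical text.

```python
# Sample word and charset pair are assumptions chosen for illustration.
word = "поиск"

koi8_bytes = word.encode("koi8_r")
cp1251_bytes = word.encode("cp1251")

# Byte for byte, the two servings of the "same page" do not match.
same_bytes = koi8_bytes == cp1251_bytes

# Decoded with the right charset, both become the same text, so one
# index entry and one query form can serve both copies.
same_text = koi8_bytes.decode("koi8_r") == cp1251_bytes.decode("cp1251")

# Decoding with the wrong charset silently yields different text,
# which is why the declared charset has to be honored.
misread = koi8_bytes.decode("cp1251")
```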

Regards,

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/




This archive was generated by hypermail 2b28 : Thu Mar 23 2000 - 14:52:38 PST