Re[4]: [htdig] A language issue.. Could you give me a favor?


Subject: Re[4]: [htdig] A language issue.. Could you give me a favor?
From: Oskar Bartenstein (oskar@ifcomputer.co.jp)
Date: Thu Mar 23 2000 - 18:57:53 PST


Thu, 23 Mar 2000 17:51:07 -0600 Geoff Hutchison <ghutchis@wso.williams.edu> said:

> No, 8-bit encodings will probably work fine. But is there actually an
> 8-bit encoding for Chinese?

There is EUC for Chinese, Korean, Japanese.
Maybe we are not talking about the same thing: these are not 8-bit encodings;
all three languages above have many more than 256 characters.
But EUC produces byte strings where the 8th bit distinguishes
multibyte from single-byte characters, ASCII stays ASCII, sequence stays sequence,
and no context information is needed.
So if htdig is 8-bit clean and only chops byte strings
and does not e.g. shuffle bytes, then the hard part should be ok.
An EUC 16-bit character chopped in half would be imperfect,
but still a valid first or last byte in a string used for
fgrep-style string matching.
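
To make the EUC property concrete, here is a small sketch in Python (not htdig code). It assumes EUC-JP as the example encoding and uses Python's built-in "euc_jp" codec; the helper names and the sample text are made up for illustration.

```python
# Sketch only: illustrates the EUC byte-level property, not htdig internals.

def high_bit_set(b: int) -> bool:
    """True if the byte belongs to a multibyte EUC character."""
    return b >= 0x80

def ascii_bytes(data: bytes) -> bytes:
    """Keep only the bytes whose 8th bit is clear (plain ASCII)."""
    return bytes(b for b in data if not high_bit_set(b))

# Hypothetical page text and search term, both encoded as EUC-JP.
page = "htdig は日本語を検索".encode("euc_jp")
needle = "日本語".encode("euc_jp")

# ASCII characters keep their usual single-byte values.
ascii_only = ascii_bytes(page)

# fgrep-style matching works directly on the byte strings.
found_whole = needle in page

# Chop the last character in half: the tail is now imperfect,
# but the byte-wise match on the intact part still succeeds.
chopped = page[:-1]
found_in_chopped = needle in chopped

print(ascii_only, found_whole, found_in_chopped)
```

Note that this only shows why a byte-oriented matcher is not confused by EUC data; it says nothing about word splitting, which is a separate problem for these languages.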

> >I would use an external converter. There is good code, e.g.
> >nkf, tcs, many others. See http://ftp.monash.edu.au/pub/nihongo/
>
> I would disagree--I'd use library code like iconv() that's in later
> versions of the glibc. This has recently been packaged with some
> other nice UTF8/Unicode support into a separate, platform-independent
> library.

Just fine. I used "external" as "written by somesmartbody else".
http://clisp.cons.org/~haible/packages-libiconv.html
looks good enough to me.

Where do I have to plug it into the htdig and htsearch code?
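
As a rough picture of what such a conversion step would do, wherever it ends up being plugged in, here is a sketch in Python. It uses Python's built-in codecs as a stand-in for iconv(), and the function name `to_internal` and the choice of EUC-JP as the internal encoding are assumptions for illustration, not anything that exists in htdig.

```python
import codecs

def to_internal(raw: bytes, source_charset: str,
                internal_charset: str = "euc_jp") -> bytes:
    """Transcode document bytes into one internal encoding (hypothetical helper)."""
    # Nothing to do if the document is already in the internal encoding.
    if codecs.lookup(source_charset).name == codecs.lookup(internal_charset).name:
        return raw
    # Decode from the source charset, then re-encode; undecodable or
    # unencodable characters are replaced instead of aborting the document.
    text = raw.decode(source_charset, errors="replace")
    return text.encode(internal_charset, errors="replace")

# Example: a Shift_JIS page fragment converted for an EUC-JP index.
sjis_fragment = "日本語".encode("shift_jis")
euc_fragment = to_internal(sjis_fragment, "shift_jis")
print(euc_fragment.hex())
```

With real iconv the same step would be an iconv_open() / iconv() / iconv_close() sequence applied to the document buffer before it reaches the parser.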

> >A person who carefully serves an international audience will include
> >something like this example for EUC:
> ><META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=x-euc-jp">
> >to allow a browser to display the page properly.
>
> Sure, but as you say, this requires the HTML.cc parser to read those
> tags. :-)
Is there anybody who knows where and how to do this - and would?
This is not needed for people who htdig their own site, because
they know their encoding, but would match the power of iconv.
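
For reference, picking the charset out of such a META tag could look roughly like the sketch below. It is Python, not HTML.cc code; the regular expression is deliberately simple, and the function name and the alias table (mapping labels like x-euc-jp to codec names) are assumptions for illustration.

```python
import re
from typing import Optional

# Matches <META HTTP-EQUIV="Content-Type" ... CHARSET=label>, case-insensitively.
META_CHARSET = re.compile(
    rb"""<meta[^>]*http-equiv\s*=\s*["']?content-type["']?[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

# Hypothetical alias table: browser-style labels to converter names.
ALIASES = {"x-euc-jp": "euc_jp", "x-sjis": "shift_jis"}

def sniff_meta_charset(head: bytes) -> Optional[str]:
    """Return the charset declared in a META tag, or None if absent."""
    match = META_CHARSET.search(head)
    if match is None:
        return None
    label = match.group(1).decode("ascii").lower()
    return ALIASES.get(label, label)

sample = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=x-euc-jp">'
print(sniff_meta_charset(sample))
```

The search works on raw bytes because the tag has to be read before the encoding of the rest of the page is known.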

> It also needs to recognize them when they come from the
> server itself.
Not sure what you mean. Could you clarify?

> >3 - Determine if the cgi input is understood by htsearch as it is,
> > or also needs special attention?
> I don't know, it would require considerable testing.
Ok, I believe there are a few people on this list who would lend
a hand. The topic has been popping up every few months.
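
One thing worth testing on the cgi side is whether the query bytes arrive in the browser's encoding rather than the index's. The sketch below, in Python, shows the kind of normalisation that might be needed. It assumes the search terms arrive in a field called `words`, that the browser sent Shift_JIS, and that the index is EUC-JP; the function name is hypothetical.

```python
from urllib.parse import unquote_to_bytes

def query_words(query_string: str, browser_charset: str,
                internal_charset: str = "euc_jp",
                field: str = "words") -> bytes:
    """Extract one cgi field and transcode it to the index encoding."""
    for pair in query_string.split("&"):
        name, _, value = pair.partition("=")
        if name == field:
            # Undo form encoding first: '+' means space, %XX is a raw byte.
            raw = unquote_to_bytes(value.replace("+", " "))
            text = raw.decode(browser_charset, errors="replace")
            return text.encode(internal_charset, errors="replace")
    return b""

# The same two characters as a Shift_JIS browser might submit them.
query = "words=%93%FA%96%7B&method=and"
print(query_words(query, "shift_jis").hex())
```

The second Shift_JIS byte of the last character here is 0x7B, which is also ASCII "{". That is the kind of overlap EUC avoids, so SJIS input probably needs converting before any byte-wise matching.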

> I guess my point is that you can push htdig into other character
> sets, but this isn't the best solution all-around. I've seen it work
> on an 8-bit Korean charset (I don't remember what that was), but it
> should really have built-in charset conversion and full
> wide-character support. This would help considerably, esp. in a few
> languages (Russian springs to mind) where people serve the same page
> in multiple character sets.

As a next step, would you suggest good places to plug iconv
into htdig and htsearch? Then I will try to do some testing with
Netscape and Internet Explorer as browsers, EUC for htdig internal,
and run it against SJIS (Microsoft) and EUC web sites.

Oskar
