loic@ceic.com
Tue, 3 Aug 1999 11:58:24 +0200 (MEST)
>
> > * UTF-8/Unicode support ?
>
> Unless I hear someone volunteer for this ASAP, I'm cutting this from the
> list of 3.2 goals.
>
> > * Character-Set translation ?
>
> This doesn't need to be hard--just use HtWordCodec to load in
> translation tables. But it depends on the decision for the above...
>
The two are indeed related. The iconv/iconvdata functions and tables
that come with glibc-2.1 provide the most complete set I've found. Here is
the list of conversion tables available:
437, 500, 500V1, 850, 851, 852, 855, 857, 860, 861, 862, 863, 864, 865, 866,
869, 874, 904, 1026, 1047, 8859_1, 8859_2, 8859_3, 8859_4, 8859_5, 8859_6,
8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ANSI_X3.4-1968,
ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110, ARABIC, ARABIC7,
ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5, BIG-FIVE, BIG5, BIGFIVE, BS_4730,
CA, CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278, CP280,
CP281, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424, CP437, CP500,
CP737, CP775, CP819, CP850, CP851, CP852, CP855, CP857, CP860, CP861, CP862,
CP863, CP864, CP865, CP866, CP868, CP869, CP870, CP871, CP874, CP875, CP880,
CP891, CP903, CP904, CP905, CP918, CP932, CP949, CP1004, CP1026, CP1047,
CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, CP1258,
CP1361, CPIBM861, CSA7-1, CSA7-2, CSASCII, CSA_T500-1983, CSA_T500,
CSA_Z243.4-1985-1, CSA_Z243.4-1985-2, CSDECMCS, CSEBCDICATDE, CSEBCDICATDEA,
CSEBCDICCAFR, CSEBCDICDKNO, CSEBCDICDKNOA, CSEBCDICES, CSEBCDICESA,
CSEBCDICESS, CSEBCDICFISE, CSEBCDICFISEA, CSEBCDICFR, CSEBCDICIT, CSEBCDICPT,
CSEBCDICUK, CSEBCDICUS, CSEUCKR, CSEUCPKDFMTJAPANESE, CSHPROMAN8, CSIBM037,
CSIBM038, CSIBM273, CSIBM274, CSIBM275, CSIBM277, CSIBM278, CSIBM280,
CSIBM281, CSIBM284, CSIBM285, CSIBM290, CSIBM297, CSIBM420, CSIBM423,
CSIBM424, CSIBM599, CSIBM851, CSIBM855, CSIBM857, CSIBM860, CSIBM863,
CSIBM864, CSIBM865, CSIBM866, CSIBM868, CSIBM869, CSIBM870, CSIBM871,
CSIBM880, CSIBM891, CSIBM903, CSIBM904, CSIBM905, CSIBM918, CSIBM1026,
CSISO4UNITEDKINGDOM, CSISO10SWEDISH, CSISO11SWEDISHFORNAMES,
CSISO14JISC6220RO, CSISO15ITALIAN, CSISO16PORTUGESE, CSISO17SPANISH,
CSISO18GREEK7OLD, CSISO19LATINGREEK, CSISO21GERMAN, CSISO25FRENCH,
CSISO27LATINGREEK1, CSISO49INIS, CSISO50INIS8, CSISO51INISCYRILLIC,
CSISO58GB1988, CSISO60DANISHNORWEGIAN, CSISO60NORWEGIAN1, CSISO61NORWEGIAN2,
CSISO69FRENCH, CSISO84PORTUGUESE2, CSISO85SPANISH2, CSISO86HUNGARIAN,
CSISO88GREEK7, CSISO89ASMO449, CSISO90, CSISO92JISC62991984B, CSISO99NAPLPS,
CSISO103T618BIT, CSISO111ECMACYRILLIC, CSISO121CANADIAN1, CSISO122CANADIAN2,
CSISO139CSN369103, CSISO141JUSIB1002, CSISO143IECP271, CSISO150,
CSISO150GREEKCCITT, CSISO151CUBA, CSISO153GOST1976874, CSISO646DANISH,
CSISO2022JP, CSISO2022JP2, CSISO2022KR, CSISO2033, CSISO5427CYRILLIC,
CSISO5427CYRILLIC1981, CSISO5428GREEK, CSISO10367BOX, CSISOLATIN1,
CSISOLATIN2, CSISOLATIN3, CSISOLATIN4, CSISOLATIN5, CSISOLATIN6,
CSISOLATINARABIC, CSISOLATINCYRILLIC, CSISOLATINGREEK, CSISOLATINHEBREW,
CSKOI8R, CSKSC5636, CSMACINTOSH, CSNATSDANO, CSNATSSEFI, CSN_369103,
CSPC8CODEPAGE437, CSPC8LATINHEBREW, CSPC8MULTILINGUAL, CSPC775BALTIC,
CSPCP852, CSSHIFTJIS, CUBA, CWI-2, CWI, CYRILLIC, DE, DEC-MCS, DEC,
DIN_66003, DK, DS2089, DS_2089, E13B, EBCDIC-AT-DE-A, EBCDIC-AT-DE,
EBCDIC-BE, EBCDIC-BR, EBCDIC-CA-FR, EBCDIC-CP-AR1, EBCDIC-CP-AR2,
EBCDIC-CP-BE, EBCDIC-CP-CA, EBCDIC-CP-CH, EBCDIC-CP-DK, EBCDIC-CP-ES,
EBCDIC-CP-FI, EBCDIC-CP-FR, EBCDIC-CP-GB, EBCDIC-CP-GR, EBCDIC-CP-HE,
EBCDIC-CP-IS, EBCDIC-CP-IT, EBCDIC-CP-NL, EBCDIC-CP-NO, EBCDIC-CP-ROECE,
EBCDIC-CP-SE, EBCDIC-CP-TR, EBCDIC-CP-US, EBCDIC-CP-WT, EBCDIC-CP-YU,
EBCDIC-CYRILLIC, EBCDIC-DK-NO-A, EBCDIC-DK-NO, EBCDIC-ES-A, EBCDIC-ES-S,
EBCDIC-ES, EBCDIC-FI-SE-A, EBCDIC-FI-SE, EBCDIC-FR, EBCDIC-GREEK, EBCDIC-INT,
EBCDIC-INT1, EBCDIC-IS-FRISS, EBCDIC-IT, EBCDIC-JP-E, EBCDIC-JP-KANA,
EBCDIC-PT, EBCDIC-UK, EBCDIC-US, ECMA-114, ECMA-118, ECMA-CYRILLIC, ELOT_928,
ES, ES2, EUC-CN, EUC-JP, EUC-KR, EUC-TW, EUCCN, EUCJP, EUCKR, EUCTW, FI, FR,
GB, GB_1988-80, GOST_19768-74, GOST_19768, GREEK-CCITT, GREEK, GREEK7-OLD,
GREEK7, GREEK8, HEBREW, HP-ROMAN8, HU, IBM037, IBM038, IBM256, IBM273,
IBM274, IBM275, IBM277, IBM278, IBM280, IBM281, IBM284, IBM285, IBM290,
IBM297, IBM367, IBM420, IBM423, IBM424, IBM437, IBM500, IBM775, IBM819,
IBM850, IBM851, IBM852, IBM855, IBM857, IBM860, IBM861, IBM862, IBM863,
IBM864, IBM865, IBM866, IBM868, IBM869, IBM870, IBM871, IBM875, IBM880,
IBM891, IBM903, IBM904, IBM905, IBM918, IBM1004, IBM1026, IBM1047, IEC_P27-1,
INIS-8, INIS-CYRILLIC, INIS, ISO-2022-JP-2, ISO-2022-JP, ISO-2022-KR,
ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6,
ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13,
ISO-8859-14, ISO-8859-15, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4,
ISO-10646/UTF-?8, ISO-10646/UTF8, ISO-IR-4, ISO-IR-6, ISO-IR-8-1, ISO-IR-9-1,
ISO-IR-10, ISO-IR-11, ISO-IR-14, ISO-IR-15, ISO-IR-16, ISO-IR-17, ISO-IR-18,
ISO-IR-19, ISO-IR-21, ISO-IR-25, ISO-IR-27, ISO-IR-37, ISO-IR-49, ISO-IR-50,
ISO-IR-51, ISO-IR-54, ISO-IR-55, ISO-IR-57, ISO-IR-60, ISO-IR-61, ISO-IR-69,
ISO-IR-84, ISO-IR-85, ISO-IR-86, ISO-IR-88, ISO-IR-89, ISO-IR-90, ISO-IR-92,
ISO-IR-98, ISO-IR-99, ISO-IR-100, ISO-IR-101, ISO-IR-103, ISO-IR-109,
ISO-IR-110, ISO-IR-111, ISO-IR-121, ISO-IR-122, ISO-IR-126, ISO-IR-127,
ISO-IR-138, ISO-IR-139, ISO-IR-141, ISO-IR-143, ISO-IR-144, ISO-IR-148,
ISO-IR-150, ISO-IR-151, ISO-IR-153, ISO-IR-155, ISO-IR-156, ISO-IR-157,
ISO-IR-166, ISO-IR-179, ISO-IR-197, ISO646-CA, ISO646-CA2, ISO646-CN,
ISO646-CU, ISO646-DE, ISO646-DK, ISO646-ES, ISO646-ES2, ISO646-FI, ISO646-FR,
ISO646-FR1, ISO646-GB, ISO646-HU, ISO646-IT, ISO646-JP-OCR-B, ISO646-JP,
ISO646-KR, ISO646-NO, ISO646-NO2, ISO646-PT, ISO646-PT2, ISO646-SE,
ISO646-SE2, ISO646-US, ISO646-YU, ISO6937, ISO_646.IRV:1991, ISO_2033-1983,
ISO_2033, ISO_5427-EXT, ISO_5427, ISO_5427:1981, ISO_5428, ISO_5428:1980,
ISO_6937-2, ISO_6937-2:1983, ISO_6937, ISO_6937:1992, ISO_8859-1,
ISO_8859-1:1987, ISO_8859-2, ISO_8859-2:1987, ISO_8859-3, ISO_8859-3:1988,
ISO_8859-4, ISO_8859-4:1988, ISO_8859-5, ISO_8859-5:1988, ISO_8859-6,
ISO_8859-6:1987, ISO_8859-7, ISO_8859-7:1987, ISO_8859-8, ISO_8859-8:1988,
ISO_8859-9, ISO_8859-9:1989, ISO_8859-10, ISO_8859-10:1993, ISO_8859-14:1998,
ISO_8859-15:1998, ISO_9036, ISO_10367-BOX, IT, JIS_C6220-1969-RO,
JIS_C6229-1984-B, JOHAB, JP-OCR-B, JP, JS, JUS_I.B1.002, KOI-7, KOI-8,
KOI8-R, KOI8-U, KSC5636, L1, L2, L3, L4, L5, L6, L7, L8, LATIN-GREEK-1,
LATIN-GREEK, LATIN1, LATIN2, LATIN3, LATIN4, LATIN5, LATIN6, LATIN7, LATIN8,
MAC-IS, MAC-UK, MAC, MACINTOSH, MS-ANSI, MS-ARAB, MS-CYRL, MS-EE, MS-GREEK,
MS-TURK, MSCP949, MSCP1361, MSZ_7795.3, MS_KANJI, NAPLPS, NATS-DANO,
NATS-SEFI, NC_NC00-10, NC_NC00-10:81, NF_Z_62-010, NF_Z_62-010_(1973),
NF_Z_62-010_1973, NO, NO2, NS_4551-1, NS_4551-2, OS2LATIN1, OSF00010001,
OSF00010002, OSF00010003, OSF00010004, OSF00010005, OSF00010006, OSF00010007,
OSF00010008, OSF00010009, OSF0001000A, OSF00010020, OSF00010100, OSF00010101,
OSF00010102, OSF00010104, OSF00010105, OSF00010106, OSF00030010, OSF0004000A,
OSF0005000A, OSF05010001, OSF100201A4, OSF100201A8, OSF100201B5, OSF100201F4,
OSF100203B5, OSF1002011C, OSF1002011D, OSF1002035D, OSF1002035E, OSF1002035F,
OSF1002036B, OSF1002037B, OSF10010001, OSF10020025, OSF10020111, OSF10020115,
OSF10020116, OSF10020118, OSF10020122, OSF10020129, OSF10020352, OSF10020354,
OSF10020357, OSF10020359, OSF10020360, OSF10020364, OSF10020365, OSF10020366,
OSF10020367, OSF10020370, OSF10020387, OSF10020388, OSF10020396, OSF10020402,
OSF10020417, PT, PT2, R8, ROMAN8, SE, SE2, SEN_850200_B, SEN_850200_C,
SHIFT-JIS, SJIS, SS636127, ST_SEV_358-88, T.61-8BIT, T.61, TIS-620, TIS620-0,
TIS620.2529-1, TIS620.2533-0, TIS620, UCS-2, UCS-4, UCS2, UCS4, UHC, UJIS,
UK, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-8, UTF8,
WIN-SAMI-2, WINBALTRIM, WS2, YU
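
To make the conversion step concrete, here is a minimal sketch of calling the iconv interface with two of the names from the list above. The helper name convert_charset and the buffer handling are my own assumptions for illustration, not anything that exists in ht://Dig or glibc; only iconv_open, iconv and iconv_close are the real interface.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from
 * charset `from` to charset `to`, writing a NUL-terminated result
 * into `out`. Returns the number of bytes written (excluding the
 * NUL), or (size_t)-1 on any failure. */
static size_t convert_charset(const char *from, const char *to,
                              const char *in, char *out, size_t outsize)
{
    if (outsize == 0)
        return (size_t)-1;

    /* Note the argument order: target charset first, source second. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;       /* iconv() takes a non-const pointer */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1; /* keep room for the terminator */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        /* Flush any pending shift sequence (matters for stateful
         * encodings such as ISO-2022-JP). */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    *outp = '\0';
    return (size_t)(outp - out);
}
```

Calling convert_charset("ISO-8859-1", "UTF-8", "caf\xe9", buf, sizeof buf) should leave the five bytes 63 61 66 c3 a9 in buf, since the Latin-1 e-acute (0xE9) becomes a two-byte sequence in UTF-8.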

I think the internal charset of choice must be UTF-8 because it is ASCII
compatible and uses 8-bit chars instead of 16-bit chars.
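
The ASCII compatibility can be seen in a small encoder sketch: code points below 0x80 come out as the same single byte, and everything else becomes a sequence of bytes that all have the high bit set. The function name utf8_encode is hypothetical, and the sketch assumes the current four-byte definition of UTF-8 (code points up to U+10FFFF, surrogates rejected).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into `out`.
 * Returns the number of bytes written (1 to 4), or 0 if `cp` is a
 * surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* plain ASCII, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not valid */
        return 0;
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Because no byte of a multi-byte sequence falls in the ASCII range, existing code that scans for '/' or ' ' or a terminating NUL keeps working on UTF-8 data, which is not the case with 16-bit chars.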

The work involved is mostly porting. At present the iconv functions are
compiled into glibc itself. They would have to be extracted into a separate
library for portability, along with all the string manipulation routines
that are able to deal with UTF-8. This requires some work, but no new code
has to be written, i.e. no hard debugging process.
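
As an illustration of the kind of string routine meant here, a UTF-8 aware length function differs from strlen only in that it skips continuation bytes. This is a sketch under the assumption that the input is already valid UTF-8; the name utf8_strlen is hypothetical and is not taken from glibc.

```c
#include <stddef.h>

/* Hypothetical helper: count the code points in a NUL-terminated
 * string, assuming it is valid UTF-8. Continuation bytes have the
 * form 10xxxxxx, so every other byte starts a new character. */
static size_t utf8_strlen(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For "caf\xc3\xa9" this counts 4 characters where strlen reports 5 bytes; for pure ASCII input the two agree.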

If such a library is then distributed on its own, it is very likely
that many software packages will use it instead of hand-made functions and
conversion tables (the expat XML parser, for instance, or the Unicode-* Perl
modules).

IMHO this is too much work for 3.2, though :-)

Cheers,

-- 
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: Loic@Dachary.org
URL: http://www.senga.org/