Re: How can I use tcl to read files written in GBK or GB18030 encoding?
- From: suchenwi <richard.suchenwirth-bauersachs@xxxxxxxxxxx>
- Date: Thu, 31 Jan 2008 07:50:48 -0800 (PST)
On 31 Jan., 16:26, "Larry W. Virden" <lvir...@xxxxxxxxx> wrote:
> I know that Tcl has quite a large list of encodings that it supports.
> However, I've a request for guidance by someone who needs to read
> files using either GBK or GB18030 (I think these are alternate names
> for the same encoding...).
> Has anyone worked out what one needs to do for this?

From Wikipedia I see that GB18030 has a structure vaguely similar to
UTF-8, but more complicated:
  1st byte 00..7F: ASCII, 1 byte
  1st byte 81..FE:
      2nd byte 40..FE: GB2312 Chinese character, 2 bytes
      2nd byte 30..39: Extended character, 4 bytes
Extended characters have the byte ranges 81..FE 30..39 81..FE 30..39.
They represent all Unicode code points that aren't in ASCII or
GB2312. The relation between Unicode and Extended cannot be computed,
but must come from a lookup table.
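
The 1/2/4-byte decision described above can be sketched as follows. This is only an illustration in Python, not Tcl, and the function name split_gb18030 is made up for this example. It follows the ranges listed above, with one addition: it rejects 7F as a second byte, which the standard excludes from the 40..FE range.

```python
def split_gb18030(data):
    """Split bytes into 1-, 2- and 4-byte sequences by inspecting lead bytes."""
    pieces = []
    i = 0
    while i < len(data):
        b1 = data[i]
        if b1 <= 0x7F:
            n = 1                                   # ASCII
        elif 0x81 <= b1 <= 0xFE and i + 1 < len(data):
            b2 = data[i + 1]
            if 0x30 <= b2 <= 0x39:
                n = 4                               # extended character
            elif 0x40 <= b2 <= 0xFE and b2 != 0x7F:
                n = 2                               # two-byte character
            else:
                raise ValueError("bad second byte at offset %d" % (i + 1))
        else:
            raise ValueError("bad or truncated lead byte at offset %d" % i)
        seq = data[i:i + n]
        if len(seq) < n:
            raise ValueError("truncated sequence at offset %d" % i)
        # A 4-byte sequence must continue with 81..FE and then 30..39.
        if n == 4 and not (0x81 <= seq[2] <= 0xFE and 0x30 <= seq[3] <= 0x39):
            raise ValueError("bad 4-byte sequence at offset %d" % i)
        pieces.append(seq)
        i += n
    return pieces
```

The function only finds the sequence boundaries. Turning each sequence into a Unicode character still needs the mapping tables.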
The difference between GB18030 and GBK is trifling and concerns only the
Euro sign: 0x80 in Microsoft's later versions of GBK, and the two-byte
code A2 E3 in GB18030.
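
One way to check the Euro sign assignment independently is to ask an existing codec implementation. The sketch below uses Python's standard "gb18030" and "gbk" codecs purely as a cross-check (an assumption of this example, since the thread is about Tcl). Python's "gbk" may not implement Microsoft's 0x80 extension, so the helper (a hypothetical name, euro_bytes) reports what it finds rather than assuming a result.

```python
EURO = "\u20ac"  # the Euro sign

def euro_bytes(codec):
    """Return the codec's encoding of the Euro sign, or None if it has none."""
    try:
        return EURO.encode(codec)
    except UnicodeEncodeError:
        return None

for codec in ("gb18030", "gbk"):
    print(codec, euro_bytes(codec))
```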
http://en.wikipedia.org/wiki/GB_18030 has a link to the "authoritative
mapping table".
Whether Tcl's encoding mechanism can deal with this 1/2/4 byte pattern
directly (so that only an .enc file would have to be produced), I
can't tell. As a last resort, one could always implement the decision
mechanism sketched above and use a 2-byte and a 4-byte lookup table.
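
A minimal sketch of that last resort, again in Python for illustration only. The names TWO_BYTE, FOUR_BYTE and decode are hypothetical, and the tables hold just three sample entries. A real decoder would fill both tables from the mapping table linked on the Wikipedia page. Unknown or truncated sequences become U+FFFD here, which is a choice made for this example.

```python
# Tiny excerpts only; real tables would be loaded from the mapping file.
TWO_BYTE = {
    b"\xd6\xd0": "\u4e2d",          # 中
    b"\xce\xc4": "\u6587",          # 文
}
FOUR_BYTE = {
    b"\x81\x30\x81\x30": "\u0080",  # first code of the 4-byte range
}

def decode(data):
    """Decode bytes with the 1/2/4-byte decision and two lookup tables."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] <= 0x7F:
            chars.append(chr(data[i]))      # ASCII maps to itself
            i += 1
            continue
        # A second byte of 30..39 signals a 4-byte sequence, else 2 bytes.
        width = 4 if data[i + 1:i + 2] and 0x30 <= data[i + 1] <= 0x39 else 2
        table = FOUR_BYTE if width == 4 else TWO_BYTE
        chars.append(table.get(data[i:i + width], "\ufffd"))
        i += width
    return "".join(chars)
```

For example, decode(b"Tcl \xd6\xd0\xce\xc4") gives "Tcl 中文" with the sample tables above.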