Re: encoding to and from UTF-8



Chun wrote:
Hello,

I encode a unicode character as utf-8 but how do I convert back to
unicode?

% set a \u0160 << character is Š
¦
% set n [encoding convertto utf-8 $a]
Å
% binary scan $n H* hex
1
% puts $hex
c5a0 << correct

Here you've taken Tcl's internal representation of the string
and produced a byte array containing the UTF-8 representation.
(That array's string representation is a sequence of ISO8859-1
characters corresponding to the bytes). [binary scan] is happy
to convert that to hexadecimal.
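
To make the string versus byte array distinction concrete, here is a small sketch (the variable names are mine, not from the original session) that compares the lengths of the two values and then looks at the byte array as if it were an ordinary string:

```tcl
# A one-character string: U+0160 (S with caron).
set a \u0160

# Ask Tcl for the UTF-8 bytes of that string.
set bytes [encoding convertto utf-8 $a]

# One character in, two bytes out.
puts [string length $a]
puts [string length $bytes]

# The byte array converts cleanly to hexadecimal.
binary scan $bytes H* hex
puts $hex

# Viewed as a string, each byte is one character in the range
# \u0000-\u00ff, i.e. an ISO8859-1 character.
set codes {}
foreach ch [split $bytes {}] {
    scan $ch %c code
    lappend codes [format %#06x $code]
}
puts $codes
```

This should print 1, 2, c5a0 and then "0x00c5 0x00a0".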

% set n [encoding convertto unicode $n]
Å
% binary scan $n H* hex
1
% puts $hex
00c500a0 << Not what I was expecting

Now you've taken the byte array from the previous step
and interpreted it as a string, asking Tcl to convert it
to Unicode. The result is that each of the two bytes
becomes an ISO8859-1 character, and gets encoded as
its Unicode counterpart, so you get two 16-bit characters.
What you probably wanted to do was either to encode
the original string, or decode the byte array back to
a string and then encode it:

% set n [encoding convertto unicode $a]
`
% binary scan $n H* hex
1
% puts $hex
6001

Here you see a single 16-bit character. The bytes are
swapped because I'm on a little-endian machine.

% set n [encoding convertto utf-8 $a]
Å
% set n2 [encoding convertto unicode [encoding convertfrom utf-8 $n]]
`
% binary scan $n2 H* hex
1
% puts $hex
6001
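
As for the literal question of getting back from the UTF-8 bytes to the string, [encoding convertfrom] is the inverse step. A minimal sketch, with variable names chosen only for illustration:

```tcl
# Original string: one character, U+0160.
set a \u0160

# String -> UTF-8 byte array.
set bytes [encoding convertto utf-8 $a]

# UTF-8 byte array -> string again.
set back [encoding convertfrom utf-8 $bytes]

# The round trip gives back the same one-character string.
set same [string equal $a $back]
puts $same

# Confirm the code point with [scan] rather than [binary scan].
scan $back %c code
puts [format %#06x $code]
```

This should print 1 and then 0x0160.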

The other problem is:

% set a \u0160
¦
% binary scan $a H* hex
1
% puts $hex
60 << Not what I was expecting; I expected 0160

Now you're trying to apply [binary scan] to a string that isn't
a byte array. What [binary scan] does in that case is to interpret
each character as a byte and discard the most significant bits.

I suspect that you're working way too hard.

The [encoding convertto] and [encoding convertfrom] commands are
chiefly useful for dealing with strings that need to be embedded
in binary data. If that isn't what you have, you don't need to
use them. For day-to-day use, you just configure channels to have
the needed encoding, and read and write strings on those channels.
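
A rough sketch of the channel approach, assuming Tcl 8.6 or later for [file tempfile]; the temporary file and the variable names are just for illustration. The string is written through a channel configured for UTF-8, and the file is then read back twice: once as raw bytes and once as text.

```tcl
set a \u0160

# Create a scratch file and tell its channel to write UTF-8.
set f [file tempfile path]
fconfigure $f -encoding utf-8
puts -nonewline $f $a
close $f

# Read the file as raw bytes to see what actually landed on disk.
set f [open $path rb]
set raw [read $f]
close $f
binary scan $raw H* hex
puts $hex

# Read it again as UTF-8 text; the channel does the decoding.
set f [open $path r]
fconfigure $f -encoding utf-8
set text [read $f]
close $f
puts [string equal $text $a]

file delete $path
```

This should print c5a0 and then 1. No call to [encoding convertto] or [encoding convertfrom] is needed; the channel configuration does the work.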

If you're simply trying to extract the information of 'what
Unicode code point is this character' or 'what character is this
Unicode code point', it's easier to use [scan] and [format]:

% foreach c [split $a {}] {
scan $c %c n
puts [format %#06x $n]
}
0x0160
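
The other direction, from a code point to a character, is [format %c]. A short sketch (variable names are illustrative only):

```tcl
# Code point -> character.
set ch [format %c 0x0160]

# Character -> code point, to check the result.
scan $ch %c code
puts [format %#06x $code]

# The character is the same one that \u0160 produces.
puts [string equal $ch \u0160]
```

This should print 0x0160 and then 1.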

The subject line suggests that you are trying to encode data in
the Windows code page 1252. cp1252 IS NOT UTF-8. It's ISO8859-1,
with a number of characters in the range \x80-\x9f replaced by
Windows-specific things. Tcl will happily encode things in that
code page; use 'cp1252' in place of 'utf-8' or 'unicode'.
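
For illustration, a sketch of what that looks like for this character, plus one byte from the \x80-\x9f range decoded both ways to show where cp1252 and ISO8859-1 part company. The variable names are mine.

```tcl
set a \u0160

# In cp1252 the character U+0160 is the single byte 0x8a.
binary scan [encoding convertto cp1252 $a] H* hex
puts $hex

# Byte 0x80 is the euro sign (U+20AC) in cp1252 ...
scan [encoding convertfrom cp1252 \x80] %c euro
puts [format %#06x $euro]

# ... but a C1 control character (U+0080) in ISO8859-1.
scan [encoding convertfrom iso8859-1 \x80] %c c1
puts [format %#06x $c1]
```

This should print 8a, 0x20ac and 0x0080.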

You may also be laboring under the misconception that, because
your script was encoded in CP1252, the strings at run time
will be CP1252 too. That's not true. Tcl converted your script to
its internal representation (which happens to be UTF-8, but that's
none of your business unless you're writing C code to deal with
Tcl strings). That leaves you with the simpler problem of
"how do I convert strings to/from a given encoding, given
Tcl's internal representation". That's what [encoding
convert*] does, and that's why there's a 'convertfrom' in
addition to a 'convertto'.

--
73 de ke9tv/2, Kevin


