Re: Unicode literals and byte string interpretation.



On Thu, 27 Oct 2011 20:05:13 -0700, Fletcher Johnson wrote:

> If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> this creation process interpret the bytes in the byte string?

It doesn't, because there is no byte-string. You have created a Unicode
object from a literal string of unicode characters, not bytes. Those
characters are:

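Dec  Hex   Char
130  0x82  (control character, no printable glyph)
177  0xb1  ±
130  0x82  (control character, no printable glyph)
234  0xea  ê
130  0x82  (control character, no printable glyph)
205  0xcd  Í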

Don't be fooled by the fact that all of the characters happen to be in
the range 0-255; that is irrelevant.
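
To make this concrete, here is a small sketch (mine, not taken from the question) that inspects the literal one character at a time. It uses only ord() and the standard unicodedata module, and it is written so that it should run on both Python 2.6+ and Python 3.3+:

```python
import unicodedata

s = u'\x82\xb1\x82\xea\x82\xcd'

# Each \xNN escape in a Unicode literal is one character (code point),
# not one byte.
code_points = [ord(c) for c in s]
print(code_points)

# General category of each character; 'Cc' marks a control character.
categories = [unicodedata.category(c) for c in s]
print(categories)

# \x82 and \u0082 are two spellings of the same character.
same = (u'\x82' == u'\u0082')
print(same)
```

The object holds six characters, and nothing about it records or implies an encoding.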


> Does it
> assume the string represents a utf-16 encoding, a utf-8 encoding,
> etc...?

None of the above. It assumes nothing. It takes a string of characters,
end of story.

> For reference the string is これは in the 'shift-jis' encoding.

No, it is not. The way to get a unicode literal with those characters is
to use a unicode-aware editor or terminal:

>>> s = u'これは'
>>> for c in s:
...     print ord(c), hex(ord(c)), c
...
12371 0x3053 こ
12428 0x308c れ
12399 0x306f は
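
If a unicode-aware editor is not to hand, the same three characters can also be spelled with \u escapes, which keeps the source pure ASCII. A sketch along those lines, with variable names of my own choosing, runnable on Python 2.6+ and 3.3+:

```python
import unicodedata

# The same three characters, written as code point escapes.
s = u'\u3053\u308c\u306f'

# Look up the official Unicode name of each character.
names = [unicodedata.name(c) for c in s]
for c, name in zip(s, names):
    print("U+%04X %s" % (ord(c), name))
```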


You are confusing characters with bytes. I believe that what you are
thinking of is the following: you start with a byte string, and then
decode it into unicode:

>>> bytes = '\x82\xb1\x82\xea\x82\xcd'  # not u'...'
>>> text = bytes.decode('shift-jis')
>>> print text
これは
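
If you have already ended up with the mistaken Unicode object, there is a way back, sketched here on the assumption that every character in it lies in the range 0-255. The latin-1 codec maps code points 0-255 to the byte with the same value, so encoding with it recovers the bytes you meant, and those can then be decoded properly. The names are mine:

```python
wrong = u'\x82\xb1\x82\xea\x82\xcd'   # six characters, not bytes

# latin-1 maps each code point 0-255 to the byte with the same value.
raw = wrong.encode('latin-1')

# Decode the recovered bytes with the encoding they were really in.
text = raw.decode('shift-jis')
print(repr(text))
```

This is a repair for a mistake, not a recommended way to build text; starting from a byte string is the cleaner route.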


If you get the encoding wrong, you will get the wrong characters:

>>> print bytes.decode('utf-16')
놂춂
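
A wrong encoding does not always fail quietly, and which outcome you get depends on the codec. The sketch below is an illustration of my own and tries three codecs on the same bytes. I name 'utf-16-le' explicitly so that the result does not depend on the platform's byte order:

```python
data = b'\x82\xb1\x82\xea\x82\xcd'

# Pairs of bytes are read as little-endian 16-bit code units.
as_utf16 = data.decode('utf-16-le')
print([hex(ord(c)) for c in as_utf16])

# latin-1 never fails: every byte becomes the character with that number.
as_latin1 = data.decode('latin-1')
print([hex(ord(c)) for c in as_latin1])

# UTF-8 rejects the data outright, because 0x82 cannot start a sequence.
try:
    data.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```

Two of the three codecs succeed and return different, equally meaningless text, which is why the bytes alone cannot tell you the encoding.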


If you start with the Unicode characters, you can encode them into
various byte strings:

>>> s = u'これは'
>>> s.encode('shift-jis')
'\x82\xb1\x82\xea\x82\xcd'
>>> s.encode('utf-8')
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
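
The round trip can be checked mechanically. In this sketch (my own names; the choice of codecs is only an example) each encoding produces different bytes, all of which decode back to the same three characters, while a codec that cannot represent them raises an error:

```python
s = u'\u3053\u308c\u306f'   # the three hiragana characters, as escapes

sizes = {}
for codec in ('shift-jis', 'utf-8', 'utf-16-le'):
    encoded = s.encode(codec)
    # Different bytes each time, but decoding with the same codec
    # gives back the original characters.
    assert encoded.decode(codec) == s
    sizes[codec] = len(encoded)
    print("%s: %d bytes" % (codec, sizes[codec]))

# ASCII has no way to represent these characters at all.
try:
    s.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```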






--
Steven