Re: Unicode literals and byte string interpretation.

On Thu, 27 Oct 2011 20:05:13 -0700, Fletcher Johnson wrote:

> If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> this creation process interpret the bytes in the byte string?

It doesn't, because there is no byte-string. You have created a Unicode
object from a literal string of unicode characters, not bytes. Those
characters are:

Dec  Hex   Char
130  0x82  (unprintable C1 control character)
177  0xb1  ±
130  0x82  (unprintable C1 control character)
234  0xea  ê
130  0x82  (unprintable C1 control character)
205  0xcd  Í

Don't be fooled by the fact that all of the code points happen to fall in
the range 0-255; that is irrelevant.
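
You can check this for yourself at the interactive interpreter. A quick
Python 2 session (just a sketch) shows six code points and no decoding
anywhere:

>>> s = u'\x82\xb1\x82\xea\x82\xcd'
>>> len(s)
6
>>> for c in s:
...     print ord(c), hex(ord(c))
...
130 0x82
177 0xb1
130 0x82
234 0xea
130 0x82
205 0xcd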

> Does it assume the string represents a utf-16 encoding, a utf-8 encoding,

None of the above. It assumes nothing. It takes a string of characters,
end of story.
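
If it helps, here is one way to see that no codec is consulted (again
Python 2, as in the rest of this post): the \xNN escapes in a unicode
literal are nothing more than code points, so the literal compares equal
to a string built directly from those code points:

>>> codepoints = (0x82, 0xb1, 0x82, 0xea, 0x82, 0xcd)
>>> u'\x82\xb1\x82\xea\x82\xcd' == u''.join(unichr(n) for n in codepoints)
True
>>> u'\x82\xb1\x82\xea\x82\xcd' == u'\u0082\u00b1\u0082\u00ea\u0082\u00cd'
True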

> For reference the string is これは in the 'shift-jis' encoding.

No it is not. The way to get a unicode literal with those characters is
to use a unicode-aware editor or terminal:

>>> s = u'これは'
>>> for c in s:
...     print ord(c), hex(ord(c)), c
...
12371 0x3053 こ
12428 0x308c れ
12399 0x306f は

You are confusing characters with bytes. I believe that what you are
thinking of is the following: you start with a byte string, and then
decode it into unicode:

bytes = '\x82\xb1\x82\xea\x82\xcd'  # a byte string this time, not u'...'
text = bytes.decode('shift-jis')    # now a unicode object
print text                          # これは, on a terminal that can display it
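
And decoding with shift-jis really does give the text you expect. A quick
round-trip check (a sketch; the shift-jis codec is in the standard library
of any recent Python 2):

>>> text = '\x82\xb1\x82\xea\x82\xcd'.decode('shift-jis')
>>> text == u'\u3053\u308c\u306f'  # the code points for これは shown above
True
>>> text.encode('shift-jis') == '\x82\xb1\x82\xea\x82\xcd'
True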

If you decode with the wrong codec, you will get the wrong characters (or
an outright error, if the bytes are not valid in that codec):

print bytes.decode('utf-16')
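
To make "wrong characters" concrete, here is a sketch with two more wrong
guesses: latin-1 silently hands back the six characters tabulated at the
top of this post, while utf-8 refuses outright because 0x82 cannot start a
UTF-8 sequence:

>>> bytes.decode('latin-1')   # silently wrong
u'\x82\xb1\x82\xea\x82\xcd'
>>> bytes.decode('utf-8')     # fails loudly
Traceback (most recent call last):
  ...
UnicodeDecodeError: ...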

If you start with the Unicode string, you can encode it into various byte
strings:

s = u'これは'
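# For instance (a sketch; shift-jis and utf-8 are just two of many codecs):
print repr(s.encode('shift-jis'))  # '\x82\xb1\x82\xea\x82\xcd'
print repr(s.encode('utf-8'))      # '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'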