Re: Unicode literals and byte string interpretation.



On Thu, 27 Oct 2011 20:05:13 -0700, Fletcher Johnson wrote:

> If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> this creation process interpret the bytes in the byte string?

It doesn't, because there is no byte string. You have created a Unicode
object from a literal string of Unicode characters, not bytes. Those
characters are:

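Dec  Hex   Char
130  0x82  (C1 control character, no printable glyph)
177  0xb1  ±
130  0x82  (C1 control character, no printable glyph)
234  0xea  ê
130  0x82  (C1 control character, no printable glyph)
205  0xcd  Í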

Don't be fooled by the fact that all of the code points happen to be in
the range 0-255; that is irrelevant. They are still characters, not bytes.
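
As a rough illustration (a sketch that should run on Python 2.6+ as well as 3.3+; the variable names are mine), you can ask Python what the literal actually contains. Each \xNN escape inside a u'...' literal names the code point U+00NN:

```python
import unicodedata

# Inside a unicode literal, \x82 means the code point U+0082, not the byte 0x82.
s = u'\x82\xb1\x82\xea\x82\xcd'

code_points = [ord(c) for c in s]
# General categories: 'Cc' is a control character, 'Sm' a math symbol,
# 'Ll' a lowercase letter, 'Lu' an uppercase letter.
categories = [unicodedata.category(c) for c in s]

print(len(s))        # 6 characters
print(code_points)   # [130, 177, 130, 234, 130, 205]
print(categories)    # ['Cc', 'Sm', 'Cc', 'Ll', 'Cc', 'Lu']
```

No encoding is consulted anywhere in that process: the result is simply six code points.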


> Does it
> assume the string represents a utf-16 encoding, a utf-8 encoding,
> etc...?

None of the above. It assumes nothing. It takes a string of characters,
end of story.

> For reference the string is これは in the 'shift-jis' encoding.

No, it is not. The way to get a Unicode literal with those characters is
to use a Unicode-aware editor or terminal:

>>> s = u'これは'
>>> for c in s:
...     print ord(c), hex(ord(c)), c
...
12371 0x3053 こ
12428 0x308c れ
12399 0x306f は
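
If your editor or terminal cannot enter the characters directly, \u escapes should give the same string. A minimal sketch (again intended to run on Python 2.6+ and 3.3+; the names are illustrative):

```python
import unicodedata

# \uXXXX names a code point by its four hex digits.
s = u'\u3053\u308c\u306f'

code_points = [ord(c) for c in s]
names = [unicodedata.name(c) for c in s]

print(code_points)   # [12371, 12428, 12399]
for name in names:
    print(name)      # HIRAGANA LETTER KO / RE / HA
```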


You are confusing characters with bytes. I believe that what you are
thinking of is the following: you start with a byte string, and then
decode it into Unicode:

>>> bytes = '\x82\xb1\x82\xea\x82\xcd'  # not u'...'
>>> text = bytes.decode('shift-jis')
>>> print text
これは


If you get the encoding wrong, you will get the wrong characters:

>>> print bytes.decode('utf-16')
놂춂


If you start with the Unicode characters, you can encode them into various
byte strings:

>>> s = u'これは'
>>> s.encode('shift-jis')
'\x82\xb1\x82\xea\x82\xcd'
>>> s.encode('utf-8')
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
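
A round-trip check along the same lines (a sketch for Python 2.6+ or 3.3+, with illustrative names) shows that different encodings produce different byte strings, often of different lengths, and that each decodes back to the same three characters when the matching codec is used:

```python
# Three characters, three different byte representations.
s = u'\u3053\u308c\u306f'

sjis = s.encode('shift_jis')
utf8 = s.encode('utf-8')
utf16 = s.encode('utf-16-le')

print(len(s), len(sjis), len(utf8), len(utf16))

# Decoding with the codec that produced the bytes recovers the text.
round_trips = [
    sjis.decode('shift_jis') == s,
    utf8.decode('utf-8') == s,
    utf16.decode('utf-16-le') == s,
]
print(all(round_trips))   # True
```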






--
Steven