Re: Unicode literals and byte string interpretation.



On Fri, Oct 28, 2011 at 2:05 PM, Fletcher Johnson <flt.johnson@xxxxxxxxx> wrote:
If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
this creation process interpret the bytes in the byte string? Does it
assume the string represents a utf-16 encoding, a utf-8 encoding,
etc...?

For reference the string is これは in the 'shift-jis' encoding.

Encodings define how characters are represented as bytes. What you
probably want is a byte string containing those hex values, which you
can then decode into a Unicode string:

>>> a = b'\x82\xb1\x82\xea\x82\xcd'
>>> unicode(a, "shift-jis")  # use 'str' instead of 'unicode' in Python 3
u'\u3053\u308c\u306f'

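The u'....' notation is for Unicode strings, which are sequences of
code points rather than encoded bytes. An escape like \x82 inside a
u'....' literal therefore denotes the code point U+0082, not the byte
0x82, so no encoding is assumed at all. The last line of the above is
a valid way of entering that string in your source code, identifying
Unicode characters by their codepoints.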
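
To make the difference concrete, here is a minimal Python 3 sketch (in
Python 3 every plain '...' literal is already a Unicode string). The
variable names are only illustrative and are not from the original
question:

```python
# Python 3 sketch; all names here are illustrative.
raw = b'\x82\xb1\x82\xea\x82\xcd'   # six bytes of Shift-JIS encoded text

# Decoding interprets the bytes according to an explicitly named encoding.
text = raw.decode('shift_jis')
print(len(text), [hex(ord(c)) for c in text])
# 3 ['0x3053', '0x308c', '0x306f']

# In a text literal, \xNN names the code point U+00NN, not a byte.
literal = '\x82\xb1\x82\xea\x82\xcd'
print(len(literal), [hex(ord(c)) for c in literal])
# 6 ['0x82', '0xb1', '0x82', '0xea', '0x82', '0xcd']

# Latin-1 maps U+0000..U+00FF onto the same byte values, so it can turn
# the mistaken literal back into the intended bytes before decoding.
recovered = literal.encode('latin-1').decode('shift_jis')
print(recovered == text)
# True
```

The Latin-1 round trip is only a repair trick for a string that was
entered the wrong way; starting from a byte string is the cleaner route.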

ChrisA