Re: unicode by default



On Wed, May 11, 2011 at 3:37 PM, harrismh777 <harrismh777@xxxxxxxxxxx> wrote:
hi folks,
  I am puzzled by unicode generally, and within the context of python
specifically. For one thing, what do we mean that unicode is used in python
3.x by default. (I know what default means, I mean, what changed?)

The `unicode' class was renamed to `str', and a stripped-down version
of the 2.X `str' class was renamed to `bytes'.
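A small sketch of what that rename looks like from the Python 3 side may help. The variable names are only for illustration:

```python
# Python 3: ordinary string literals are str (Unicode text),
# and b"..." literals are bytes (raw 8-bit values).
text = "caf\u00e9"             # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: a sequence of integers in 0..255

text_type = type(text).__name__
data_type = type(data).__name__

text_len = len(text)           # counts code points
data_len = len(data)           # counts bytes; U+00E9 takes two bytes in UTF-8

first_byte = data[0]           # indexing bytes gives an int, not a 1-char string

# bytes is "stripped down" in the sense that it is no longer a text type:
# for example, it has decode() but no encode(), and str is the reverse.
bytes_has_encode = hasattr(data, "encode")
str_has_decode = hasattr(text, "decode")
```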

  I think part of my problem is that I'm spoiled (American, ascii heritage)
and have been either stuck in ascii knowingly, or UTF-8 without knowing
(just because the code points lined up). I am confused by the implications
for using 3.x, because I am reading that there are significant things to be
aware of... what?

Mainly, Python 3 no longer does implicit conversion between bytes and
unicode, requiring the programmer to be explicit about such
conversions. If you have Python 2 code that is sloppy about this, you
may get some Unicode encode/decode errors when trying to run the same
code in Python 3. The 2to3 tool can help somewhat with this, but it
can't prevent all problems.
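For instance, something like the following (a minimal sketch; try_mix is just a made-up helper name) shows the kind of mixing that Python 2 would let slide and Python 3 rejects:

```python
def try_mix():
    """Attempt to concatenate str and bytes; report what happens."""
    try:
        return "abc" + b"def"
    except TypeError as exc:
        # Python 3 refuses to guess an encoding for you.
        return type(exc).__name__

mixed = try_mix()

# The explicit versions: pick a direction and name the encoding.
joined_text = "abc" + b"def".decode("ascii")    # bytes -> str
joined_bytes = "abc".encode("ascii") + b"def"   # str -> bytes
```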

  On my installation 2.6  sys.maxunicode comes up with 1114111, and my 2.7
and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
default compile option for 2.7 & 3.2 (I didn't change anything) is set for
UCS-2 (UTF-16) or 2 byte unicode(?).   Do I understand this much correctly?

I think that UCS-2 has always been the default unicode width for
CPython, although the exact representation used internally is an
implementation detail.
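If you want to check what a given interpreter was built with, a rough sketch along these lines should do. Note that from Python 3.3 on, the narrow/wide distinction is gone and sys.maxunicode is always 1114111, so the "narrow" branch only applies to older builds:

```python
import sys

max_cp = sys.maxunicode

if max_cp == 0xFFFF:
    # Narrow build: strings are stored as 16-bit units, and code points
    # above U+FFFF are represented as surrogate pairs.
    build = "narrow"
else:
    # Wide (UCS-4) build, or any Python 3.3+ interpreter.
    build = "wide"

# A code point outside the Basic Multilingual Plane shows the difference:
# its length is 2 on a narrow build and 1 otherwise.
astral_len = len("\U00010000")
```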

  The books say that the .py sources are UTF-8 by default... and that 3.x is
either UCS-2 or UCS-4.  If I use the file handling capabilities of Python in
3.x (by default) what encoding will be used, and how will that affect the
output?

If you open a file in binary mode, the result is a non-decoded byte stream.

If you open a file in text mode and do not specify an encoding, then
the result of locale.getpreferredencoding() is used for decoding, and
the result is a unicode stream.
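Here is a short sketch of the two modes, using a throwaway temp file (the file name is arbitrary). Passing encoding= explicitly, as in the last step, avoids depending on whatever the locale happens to be:

```python
import locale
import os
import tempfile

# What open() falls back to in text mode when no encoding is given.
default_encoding = locale.getpreferredencoding()

path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write known bytes so the example does not depend on the locale.
with open(path, "wb") as f:
    f.write("caf\u00e9\n".encode("utf-8"))

# Binary mode: you get the raw, undecoded bytes back.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding: you get decoded str back.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)
```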

  If I do not specify any code points above ascii 0xFF does any of this
matter anyway?

You mean 0x7F (ASCII is 7-bit), and probably yes, because you still
need to encode and decode explicitly.
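To illustrate where the 0x7F boundary bites, a quick sketch: pure-ASCII text comes out as the same bytes under the common encodings, while anything above 0x7F does not.

```python
ascii_text = "hello"

# Below 0x80, ASCII, Latin-1 and UTF-8 all produce identical bytes.
ascii_same = (
    ascii_text.encode("ascii")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("utf-8")
)

# U+00E9 is below 0xFF but above 0x7F, and the encodings disagree.
e_acute = "\u00e9"
as_latin1 = e_acute.encode("latin-1")   # one byte
as_utf8 = e_acute.encode("utf-8")       # two bytes

try:
    e_acute.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    # ASCII simply cannot represent it.
    ascii_error = type(exc).__name__
```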