Re: Use of Unicode in Python 2.5 source code literals



Uncle Bruce wrote:
I'm working with Python 2.5.4 and the NLTK (Natural Language
Toolkit). I'm an experienced programmer, but new to Python.

This question arose when I tried to create a literal in my source code
for a Unicode codepoint greater than 255. (I also posted this
question in the NLTK discussion group).

The Python HELP (at least for version 2.5.4) states:

+++++++
Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-
++++++++++++

Based on some experimenting I've done, I suspect that the support for
Unicode literals in ANY encoding isn't really accurate. What seems to
happen is that there must be an 8-bit mapping between the set of
Unicode literals and what can be used as literals.

Even when I set Options / General / Default Source Encoding to UTF-8,
IDLE won't allow Unicode literals (e.g. characters copied and pasted
from the Windows Character Map program) to be used, even in a quoted
string, if they represent an ord value greater than 255.

I noticed, in researching this question, that Marc Andre Lemburg
stated, back in 2001, "Since Python source code is defined to be
ASCII..."

I'm writing code for linguistics (other than English), so I need
access to lots more characters. Most of the time, the characters come
from files, so no problem. But for some processing tasks, I simply
must be able to use "real" Unicode literals in the source code.
(Writing hex escape sequences in a complex regex would be a
nightmare).

Was this taken care of in the switch from Python 2.X to 3.X?

Is there a way to use more than 255 Unicode characters in source code
literals in Python 2.5.4?

Also, in the Windows version of Python, how can I tell if it was
compiled to support 16 bits of Unicode or 32 bits of Unicode?

Bruce in Toronto

Works for me:

--- snip ---
$ cat snowman.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import unicodedata

snowman = u'☃'

print len(snowman)
print unicodedata.name(snowman)
$ python2.6 snowman.py
1
SNOWMAN
--- snip ---

What did you set the encoding to in the declaration at the top of the
file? The help text you quoted uses latin-1 as an example, an encoding
which, of course, only supports 256 code points. Did you try utf-8 instead?
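
To make the point about the declared encoding concrete, here is a rough
sketch (the helper name `fits` is my own invention, not anything from the
standard library). It checks whether a character can be represented in a
given encoding at all. I use an escape for the snowman so the snippet does
not depend on how the file itself is saved:

```python
# -*- coding: utf-8 -*-

# U+2603 SNOWMAN, written as an escape so the file encoding does not matter.
snowman = u'\u2603'

def fits(text, encoding):
    """Return True if every character in text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

utf8_ok = fits(snowman, 'utf-8')      # UTF-8 covers all of Unicode
latin1_ok = fits(snowman, 'latin-1')  # latin-1 stops at U+00FF
utf8_length = len(snowman.encode('utf-8'))  # number of bytes, not characters
```

If `latin1_ok` comes out False for the characters you need, a latin-1
declaration cannot work for a file containing them, however the editor is
configured.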

The regular expression engine's Unicode support is a different question,
and I do not know the answer.

By the way, Python 2.x only supports non-ASCII characters in source
code inside string literals and comments. Python 3 adds support for
Unicode identifiers (e.g. variable names, function argument names, etc.).
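
For illustration only, a minimal Python 3 sketch of what that looks like.
The names are made up for the example, and the same file would be rejected
with a SyntaxError by Python 2.x:

```python
# Python 3 only: non-ASCII letters are allowed in identifiers.
# Python 3 reads source files as UTF-8 by default, so no coding line is needed.

länge = 3  # variable name containing a non-ASCII letter

def verdopple(größe):
    """Double the value; the argument name itself is non-ASCII."""
    return größe * 2

ergebnis = verdopple(länge)
```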