Re: eval and unicode
- From: Jonathan Gardner <jgardner@xxxxxxxxxxxxxxxxxxx>
- Date: Thu, 20 Mar 2008 15:39:20 -0700 (PDT)
On Mar 20, 2:20 pm, Laszlo Nagy <gand...@xxxxxxxxxxxx> wrote:
>> >>> eval( u'"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' ) == eval( '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
>> True
>> When you feed your unicode data into eval(), it doesn't have any
>> encoding or decoding work to do.
> Yes, but what about
> eval( 'u' + '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
Let's take it apart, bit by bit:
'u' - A byte string with one byte, which is 117
'"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' - A byte string starting with " (34),
but then continuing in an unspecified byte sequence. I don't know what
encoding your terminal/file/whatnot is written in. Assuming it is in
UTF-8 and not UTF-16, then it would be the UTF-8 representation of the
unicode code points that follow.
Before passing anything to eval, you concatenate the two. So now
you have a byte string that starts with u, then ", then bytes with
values above 127.
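A minimal sketch of that byte-level view. It is written for Python 3, where byte strings and text are separate types (the discussion here is about Python 2, where '...' is a byte string), and it assumes the literal was stored as UTF-8. The names prefix, literal and expr are only for illustration.

```python
# Python 3 sketch; assumes the quoted literal was stored as UTF-8.
prefix = b"u"                       # one byte, value 117
literal = '"徹底"'.encode("utf-8")   # '"' (34), the UTF-8 bytes of the text, '"' (34)
expr = prefix + literal             # concatenation happens on bytes, before any eval

print(list(expr[:3]))   # the first three byte values
print(expr[2] >= 128)   # the third byte is outside the ASCII range
```

Nothing in expr says which encoding produced those bytes; that knowledge lives only with whoever created them.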
Now, when you are calling eval, you are passing in that byte string.
This byte string, it is important to emphasize, is not text. It is
text encoded in some format. Here is what my interpreter does (in a
UTF-8 console):
>>> u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"
u'\u5fb9\u5e95\u3057\u305f\u30b3\u30b9\u30c8\u524a\u6e1b \xc1\xcd
\u0170\u0150\xdc\xd6\xda\xd3\xc9 \u0442\u0440\u0438\u0440\u043e
\u0432\u0430'
The first item in the sequence is \u5fb9 -- a unicode code point. It
is NOT a byte.
>>> eval( '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
'\xe5\xbe\xb9\xe5\xba\x95\xe3\x81\x97\xe3\x81\x9f
\xe3\x82\xb3\xe3\x82\xb9\xe3\x83\x88\xe5\x89\x8a\xe6\xb8\x9b
\xc3\x81\xc3\x8d\xc5\xb0\xc5\x90\xc3\x9c\xc3\x96\xc3\x9a
\xc3\x93\xc3\x89 \xd1\x82\xd1\x80\xd0\xb8\xd1\x80\xd0\xbe
\xd0\xb2\xd0\xb0'
The first item in the sequence is \xe5. This IS a byte. This is NOT a
unicode point. It doesn't represent anything except what you want it
to represent.
>>> eval( 'u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
u'\xe5\xbe\xb9\xe5\xba\x95\xe3\x81\x97\xe3\x81\x9f
\xe3\x82\xb3\xe3\x82\xb9\xe3\x83\x88\xe5\x89\x8a\xe6\xb8\x9b
\xc3\x81\xc3\x8d\xc5\xb0\xc5\x90\xc3\x9c\xc3\x96\xc3\x9a
\xc3\x93\xc3\x89 \xd1\x82\xd1\x80\xd0\xb8\xd1\x80\xd0\xbe
\xd0\xb2\xd0\xb0'
The first item in the sequence is \xe5. This is NOT a byte. This is a
unicode point-- LATIN SMALL LETTER A WITH RING ABOVE.
>>> eval( u'u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
u'\u5fb9\u5e95\u3057\u305f\u30b3\u30b9\u30c8\u524a\u6e1b \xc1\xcd
\u0170\u0150\xdc\xd6\xda\xd3\xc9 \u0442\u0440\u0438\u0440\u043e
\u0432\u0430'
The first item in the sequence is \u5fb9, which is a unicode point.
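The gap between the \xe5 result and the \u5fb9 result can be reproduced by decoding the same bytes two ways. This is a Python 3 sketch, not a replay of the Python 2 sessions; it assumes the console bytes were UTF-8 and uses Latin-1 as a stand-in for "treat each byte as the code point with the same value".

```python
import unicodedata

raw = "徹底".encode("utf-8")        # the bytes a UTF-8 console would hand over

as_text = raw.decode("utf-8")      # bytes interpreted with the right encoding
as_latin1 = raw.decode("latin-1")  # each byte mapped to the code point of equal value

print(hex(ord(as_text[0])))            # 0x5fb9
print(hex(ord(as_latin1[0])))          # 0xe5
print(unicodedata.name(as_latin1[0]))  # LATIN SMALL LETTER A WITH RING ABOVE
print(len(as_text), len(as_latin1))    # 2 6
```

Same six bytes in both cases; only the interpretation differs.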
In the Python program file proper, if you have your encoding set up
properly, the expression
u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"
is a perfectly valid expression. What happens is the Python
interpreter reads in that string of bytes between the quotes,
interprets them to unicode based on the encoding you already
specified, and creates a unicode object to represent that.
eval doesn't muck with encodings.
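A hedged sketch of doing the decoding yourself before calling eval. It is written for Python 3 and assumes the incoming bytes are UTF-8; raw_expr is a hypothetical stand-in for data that arrived from a file, socket or terminal.

```python
# Python 3 sketch: decode first, then evaluate text.
raw_expr = '"徹底したコスト削減"'.encode("utf-8")  # hypothetical bytes from outside

expr = raw_expr.decode("utf-8")  # the caller picks the encoding explicitly
value = eval(expr)               # eval receives text, so no guessing is involved

print(value == "徹底したコスト削減")  # True
```

The point of the split is that the encoding decision is made in one visible place, by the code that knows where the bytes came from.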
I'll try to address your points below in the context of what I just
wrote.
> The passed expression is not unicode. It is a "normal" string. A
> sequence of bytes.
Yes.
> It will be evaluated by eval, and eval should know
> how to decode the byte sequence.
You think eval is smarter than it is.
> Same way as the interpreter needs to
> know the encoding of the file when it sees the u"徹底したコスト削減
> ÁÍŰŐÜÖÚÓÉ трирова" byte sequence in a python source file - before
> creating the unicode instance, it needs to be decoded (or not, depending
> on the encoding of the source).
Precisely. And it is. Before it is passed to eval/exec/whatever.
> String passed to eval IS python source, and it SHOULD have an encoding
> specified (well, unless it is already a unicode string, in that case
> this magic is not needed).
If it had an encoding specified, YOU should have decoded it and passed
in the unicode string.
> Consider this:
> exec("""
> import codecs
> s = u'Ű'
> codecs.open("test.txt","w+",encoding="UTF8").write(s)
> """)
> Facts:
> - source passed to exec is a normal string, not unicode
> - the variable "s", created inside the exec() call will be a unicode
> string. However, it may be Û or something else, depending on the
> source encoding. E.g. in ASCII encoding it is invalid and exec() should
> raise a SyntaxError like:
> SyntaxError: Non-ASCII character '\xc5' in file c:\temp\aaa\test.py on
> line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
> Well at least this is what I think. If I'm not right then please explain
> why.
If you want to know what happens, you have to try it. Here's what
happens (again, in my UTF-8 terminal):
>>> exec("""
... import codecs
... s = u'Ű'
... codecs.open("test.txt","w+",encoding="UTF8").write(s)
... """)
>>> s
u'\xc5\xb0'
>>> print s
Ű
>>> file('test.txt').read()
'\xc3\x85\xc2\xb0'
>>> print file('test.txt').read()
Ű
Note that s is a unicode string with 2 unicode code points. Note that
the file has 4 bytes, since it is that 2-code-point sequence encoded in
UTF-8 and neither code point is ASCII.
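The 2-code-point, 4-byte arithmetic can be checked by hand. The following Python 3 sketch assumes, as in my terminal, that the source bytes were UTF-8 and that each byte then became its own code point (modelled here with Latin-1); the variable names are only for illustration.

```python
source_bytes = "Ű".encode("utf-8")   # b'\xc5\xb0', what a UTF-8 terminal sends
s = source_bytes.decode("latin-1")   # two code points: U+00C5 and U+00B0
written = s.encode("utf-8")          # what a UTF-8 writer puts in the file

print(len(s), len(written))  # 2 4
print(written)               # b'\xc3\x85\xc2\xb0'
```

Each of the two non-ASCII code points takes two bytes in UTF-8, which is where the four bytes come from.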
Your problem is, I think, that you think the magic of decoding source
code from the byte sequence into unicode happens in exec or eval. It
doesn't. It happens in between reading the file and passing the
contents of the file to exec or eval.
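That ordering can be sketched as three separate steps: read bytes, decode them, then hand text to exec. This is Python 3 and only an illustration; the temporary file, the namespace dict and the choice of UTF-8 are my assumptions, not something from the exchange above.

```python
import os
import tempfile

# Write a small source file as UTF-8 bytes (hypothetical example file).
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as f:
    f.write("s = 'Ű'\n".encode("utf-8"))
    path = f.name

with open(path, "rb") as f:
    raw = f.read()              # step 1: bytes, no meaning attached yet
os.remove(path)

source = raw.decode("utf-8")    # step 2: the caller decodes with a known encoding

namespace = {}
exec(source, namespace)         # step 3: exec only ever sees text

print(len(namespace["s"]))       # 1
print(hex(ord(namespace["s"])))  # 0x170
```

With the decode done up front, s comes out as the single code point U+0170 instead of two unrelated ones.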