Re: Unicode from Web to MySQL

From: Francis Avila (francisgavila_at_yahoo.com)
Date: 12/21/03


Date: Sat, 20 Dec 2003 22:44:18 -0500

Bill Eldridge wrote in message ...
>Skip Montanaro wrote:
>Encoding for example is a UTF-8 page Vietnamese,
>try:
>
> http://www.rfa.org/service/index.html?service=vie
>or
> http://www.rfa.org/service/article.html?service=vie&encoding=9&id=123655
>
>I've tried grabbing this, doing vietstring.decode(None,'strict')
>gives an error (wants a string, not None), doing
>unicode(data,'unicode','replace') fails,
>unicode(data,'raw-unicode-escape','replace') somewhat works,
>I can then try
>unicode(data,'raw-unicode-escape','replace').encode('utf-8')
>but I get a SQL error at that point.

You still have not understood the crucial lesson: Unicode is *not* *an*
*encoding*. Not an encoding!

Immediate logical ramification: UTF8 (or whatever other encoding you wish to
name) IS NOT UNICODE!
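
A minimal sketch of that point, written for Python 3, where `str` holds code points (the role `unicode` plays in this thread) and `bytes` holds raw bytes (the role of the old `str`). The sample character and variable names are only for illustration:

```python
# One abstract character: a single Unicode code point, U+1EC7
# (LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW).
ch = "\u1ec7"

# Three different byte serializations (encodings) of that same character.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")
as_utf32 = ch.encode("utf-32-be")

print(len(ch))                      # number of code points, not bytes
print(as_utf8, as_utf16, as_utf32)  # three different byte strings
```

The character is the same in all three cases; only the bytes differ, which is why none of the encodings can be "Unicode" itself.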

Let's look at each of your attempts and see why each makes no sense:

>>> vietstring.decode(None, 'strict')
How can a str be decoded from nothing? If it has no encoding, it's just raw
bytes with no meaningful interpretation of those bytes. And now you want a
unicode object to be magically produced?

What you should say is, "Ok, I know that vietstring is utf8 encoded, so to
decode it (to a unicode object), I guess I'll have to tell Python
vietstring.decode('utf8'), meaning 'Decode vietstring from utf8.'"

>>> import urllib
>>> viet = urllib.urlopen('http://www.rfa.org/service/index.html?service=vie')
>>> vietstr = viet.read()
>>> type(vietstr) # Raw bits; no intrinsic meaning
<type 'str'>
>>> vietunicode = vietstr.decode('utf8')
>>> type(vietunicode) # Raw intrinsic meaning; no bits.
<type 'unicode'>
>>>

unicode and str are diametrically opposed views of reality. Unicode is the
rationalist--there's no reality outside of meaning (i.e., no bits). Str is
the empiricist--there's only raw bits, and the only meaning is what you give
them.

>>> unicode(data, 'unicode', 'replace')
You want a unicode object to be produced from data, which you declare as
being in the 'unicode' encoding. But there's no such encoding! Unicode is
*not* an encoding! Unicode is more abstract than bytes. Do not ever think
of bytes and unicode in the same thought.
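
One way to check this for yourself is to ask the codec registry directly. The sketch below is for Python 3, and `is_known_codec` is a hypothetical helper name, not something from the standard library:

```python
import codecs


def is_known_codec(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # codecs.lookup raises LookupError for unknown encoding names.
        return False


print(is_known_codec("utf-8"))
print(is_known_codec("unicode"))
```

Only names of real encodings are found; 'unicode' on its own is not one of them.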

>>> unicode(data, 'raw-unicode-escape', 'replace')

This may seem to work, but really it's exactly the same as ur'<contents of
data>'--it's treating data as though it were a raw unicode literal:

>>> s = '\\u1234'
>>> len(s)
6
>>> us = unicode(s, 'raw-unicode-escape')
>>> us
u'\u1234'
>>> len(us)
1

This is not what you want! So
unicode(data,'raw-unicode-escape','replace').encode('utf-8') is the
utf8-encoded str of what you didn't want in the first place!
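
To see the damage concretely, here is a small Python 3 sketch. In Python 3 the `raw_unicode_escape` codec maps every byte that is not part of a `\uXXXX` escape to the code point with the same number (that is, it reads them as Latin-1). The sample word stands in for the page data and is an assumption for illustration:

```python
# Stand-in for UTF-8 bytes fetched from the web (4 characters, 6 bytes).
raw = "Vi\u1ec7t".encode("utf-8")

right = raw.decode("utf-8")               # decode with the real encoding
wrong = raw.decode("raw_unicode_escape")  # each byte becomes one character

print(len(right), len(wrong))

# Re-encoding the wrong text as UTF-8 does not give back the original bytes.
print(wrong.encode("utf-8") == raw)
```

The wrong decode produces one character per byte, so encoding it again yields longer, doubly encoded data.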

vietstring.decode('utf8') will give you what you want, namely, a unicode
object. Before you feed the unicode object to SQL, encode it to utf8 (a str
object). This part you seem to understand just fine, but you have some
sort of mental block against recognizing that you need to decode the string
you got from the web before you can get a unicode object!

In this particular case, (where it's already utf8) you can put vietstring
straight into the SQL database as you found it, without doing any conversion
at all. But this is only because the raw bits are the same as they would have
been if you had decoded to pure unicode and then encoded to utf8.
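
A quick Python 3 sketch of that round trip, assuming the input really is valid UTF-8 (a strict decode raises an error otherwise). The sample text stands in for the page bytes:

```python
# Stand-in for the bytes read from the page.
raw = "Ti\u1ebfng Vi\u1ec7t".encode("utf-8")

text = raw.decode("utf-8")           # bytes -> abstract text
round_tripped = text.encode("utf-8")  # abstract text -> bytes again

print(round_tripped == raw)
```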

To make sure that all your problems are with Python unicode<->str conversion
confusion, and NOT with SQL, try placing vietstring straight into SQL
without touching it.
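
The test might look roughly like the sketch below. It is written for Python 3 and uses the standard library's `sqlite3` with an in-memory database purely as a stand-in, because the thread does not say which MySQL driver or schema is in use; the table and column names are hypothetical. The idea carries over: pass the bytes as a bound parameter instead of pasting them into the SQL text, then read them back and compare.

```python
import sqlite3

# Stand-in for vietstring: UTF-8 bytes exactly as fetched.
raw = "Vi\u1ec7t".encode("utf-8")

conn = sqlite3.connect(":memory:")              # stand-in database, not MySQL
conn.execute("CREATE TABLE pages (body BLOB)")  # hypothetical table

# Parameter binding hands the bytes to the driver untouched,
# with no hand-written quoting or escaping.
conn.execute("INSERT INTO pages (body) VALUES (?)", (raw,))
stored = conn.execute("SELECT body FROM pages").fetchone()[0]
conn.close()

print(stored == raw)
```

If the bytes come back unchanged, the database side is fine and any remaining trouble is in the str/unicode handling.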

--
Francis Avila

