Re: Dealing with "funny" characters



No, no, that's wrong. MySQL and the Python interface to it understand
Unicode. You don't want to convert data to UTF-8 before putting it in a
database; the database indexing won't work.

I doubt that indexing has anything to do with it whatsoever.

Here's how to do it right.

First, tell MySQL, before you create your MySQL tables, that the tables are
to be stored in Unicode:

ALTER DATABASE yourdatabasename DEFAULT CHARACTER SET utf8;

You can also do this on a table by table basis, or even for single fields,
but you'll probably get confused if you do.

Then, when you connect to the database in Python, use something like this:

db = MySQLdb.connect(host="localhost",
                     use_unicode=True, charset="utf8",
                     user=username, passwd=password, db=database)

That tells MySQLdb to talk to the database in Unicode, and it tells the database
(via "charset") that you're talking Unicode.

You are confusing Unicode with UTF-8 here. And while this may look like nitpicking, it is important enough that you should write this small program and meditate in front of it for the better part of an hour while it runs:

while True:
    print("utf-8 is not unicode")


You continue to make that error below, so I snip that.

The important part is this: Unicode is a standard that aims to provide a codepoint for each and every character that humankind has invented. Python unicode objects can represent all of these codepoints.

However, Unicode as such is an abstraction. Hard disks, network sockets, databases and the like don't deal with abstractions though - they eat bytes. Which makes it necessary to encode unicode objects to byte-strings when serializing them. Thus there are the thingies called encodings: latin1, for example, covers most characters used in Western Europe. But it is limited to 256 characters (actually, even fewer), so Chinese or Russian customers won't be too happy with it.

So some encodings are defined that are capable of encoding _ALL_ Unicode codepoints. Either by using more than one byte for each character, or by providing escape mechanisms. An example of the former is UCS4 (4 bytes per character); the most important member of the latter is UTF-8, which encodes ASCII as single bytes and uses multi-byte escape sequences for all other codepoints.
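To make the difference concrete, here is a small sketch that encodes the same characters with a limited encoding, an escape-based one and a fixed-size one. It assumes Python 3, where str plays the role of the unicode object and bytes the role of the byte-string; the helper name encoded_sizes is made up for this illustration, and "utf-32-be" is used as the fixed 4-bytes-per-character encoding.

```python
# Compare how three encodings serialize the same text.
# Characters are written as escapes so the file's own encoding does not matter.
samples = [u"o", u"\u00f6", u"\u4e2d"]  # ASCII o, o-umlaut, a Chinese character


def encoded_sizes(text):
    """Return the byte length of text per encoding, or None if it cannot be encoded.

    Hypothetical helper, only for illustration.
    """
    sizes = {}
    for name in ("latin-1", "utf-8", "utf-32-be"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            # This encoding has no byte sequence for the character.
            sizes[name] = None
    return sizes


for text in samples:
    print(repr(text), encoded_sizes(text))
```

latin1 gives up on the Chinese character, UTF-8 grows from one to three bytes as needed, and the fixed-size encoding always pays four bytes.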

Now what does that mean in python?

First of all, the coding: declaration: it tells Python which encoding the source file is saved in, and therefore how to decode unicode literals, which are the

u"something"

thingies. If you use a coding of latin1, that means that the text

u"ö"

is expected to be one byte long, with the value that represents the German umlaut o in latin1. Which is 0xf6.

If coding: is set to utf-8, the same string has to consist not of one, but of two bytes: 0xc3 0xb6.
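Those byte values are easy to check at the interpreter. A minimal sketch, assuming Python 3 (str for the unicode object, bytes for the byte-string); the variable names are only for illustration:

```python
# o-umlaut, written as an escape so the file's own encoding does not matter
text = u"\u00f6"

latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

print(latin1_bytes)  # one byte: 0xf6
print(utf8_bytes)    # two bytes: 0xc3 0xb6

# Both byte-strings decode back to the very same unicode object.
print(latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8"))
```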

So, when editing files that are supposed to contain "funny" characters, you have to

- set your editor to save the file in an appropriate encoding

- specify the same encoding in the coding:-declaration
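What happens when the two do not match can be simulated without touching an editor. This is a sketch under the assumption of Python 3, with made-up variable names: the bytes are what an editor set to UTF-8 would write, and the decode step stands in for Python reading them under a latin1 declaration.

```python
# The editor saved an o-umlaut as UTF-8 ...
saved_by_editor = u"\u00f6".encode("utf-8")

# ... but the coding: declaration claimed latin1.
misread = saved_by_editor.decode("latin-1")

# With matching settings the literal comes out as intended.
correct = saved_by_editor.decode("utf-8")

print(len(misread), len(correct))
print(misread == u"\u00c3\u00b6")  # two wrong characters instead of one right one
```

The literal silently turns into two unrelated characters, which is the classic symptom of a declaration that disagrees with the editor.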

Regarding databases: they store bytes. Mostly. Some allow storing Unicode by means of one of the fixed-size encodings, but you pay a storage-size penalty for that.

So - you were right when you said that one can change the encoding a db uses, on several levels even.

But that's not all there is to it. Another thing is the encoding of the CONNECTION: the encoding in which it expects byte-strings to be passed, and which it uses to render returned strings. The conversion from and to the storage encoding is done automagically.

It is for example perfectly legal (and unfortunately happens involuntarily) to have a database that internally uses utf-8 as storage, potentially being able to store all possible codepoints.

But due to e.g. environment settings, opened connections will deliver the contents in e.g. latin1. Which of course will lead to problems if you try to return data from the table with the most common Chinese first names.
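The effect can be imitated in plain Python. The function deliver below is a hypothetical stand-in for the server converting from its storage encoding to the connection encoding, not anything MySQL or MySQLdb provides; substituting "?" for characters the connection encoding cannot express is an assumption about how such a conversion roughly behaves. Python 3 is assumed.

```python
def deliver(stored_utf8, connection_charset):
    """Hypothetical stand-in: convert stored UTF-8 bytes to the connection charset.

    Characters the connection charset cannot express are replaced by "?".
    """
    text = stored_utf8.decode("utf-8")
    return text.encode(connection_charset, errors="replace")


# A two-character Chinese name, stored as UTF-8 in the table.
stored = u"\u738b\u4f1f".encode("utf-8")

print(deliver(stored, "latin-1"))          # the name is lost on a latin1 connection
print(deliver(stored, "utf-8") == stored)  # a UTF-8 connection passes it through
```

The storage was perfectly capable of holding the name; it is the connection encoding that destroys it on the way out.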

So you can alter the encoding the connection delivers and expects byte-strings in. In MySQL, this can be done explicitly using

cursor.execute("set names <encoding>")

Or - as you said - as part of a connection-string.

db = MySQLdb.connect(host="localhost",
                     use_unicode=True, charset="utf8",
                     user=username, passwd=password, db=database)


But there is more to it. If the DB-API module supports it, the API itself will decode the returned strings using the specified encoding, so that the user only deals with "real" unicode objects, greatly reducing the risk of mixing byte-strings with unicode objects. That's what the use_unicode parameter is for: it makes the API accept and deliver unicode objects. But it would do so even if the charset parameter was "latin1". Which makes me repeat the lesson from the beginning:

while True:
    print("utf-8 is not unicode")
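The point about use_unicode being independent of the charset can be simulated without a database. The function deliver_value is a hypothetical stand-in for what a driver does with a fetched column, not part of MySQLdb; Python 3 is assumed, with str as the unicode object and bytes as the byte-string.

```python
def deliver_value(raw_bytes, charset, use_unicode):
    """Hypothetical stand-in for a DB-API driver handing a column to the caller."""
    if use_unicode:
        # The driver decodes with the connection charset for you.
        return raw_bytes.decode(charset)
    # Otherwise you get the byte-string and have to decode it yourself.
    return raw_bytes


text = u"\u00f6"

via_latin1 = deliver_value(text.encode("latin-1"), "latin-1", use_unicode=True)
via_utf8 = deliver_value(text.encode("utf-8"), "utf-8", use_unicode=True)
raw = deliver_value(text.encode("utf-8"), "utf-8", use_unicode=False)

print(via_latin1 == via_utf8 == text)  # True: same unicode object either way
print(raw)                             # still bytes, encoding is your problem
```

Whether the wire carries latin1 or UTF-8, the caller ends up with the same unicode object; the charset only decides which bytes travel, and which characters can travel at all.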


Diez