Re: Psycopg and queries with UTF-8 data

From: Diez B. Roggisch (deetsNOSPAM_at_web.de)
Date: 10/14/04


Date: Thu, 14 Oct 2004 12:57:31 +0200

Alban Hertroys wrote:

> I have a query that inserts data originating from an utf-8 encoded XML
> file. And guess what, it contains utf-8 encoded characters...
> Now my problem is that psycopg will only accept queries of type str, so
> how do I get my utf-8 encoded data into the DB?

This sounds like the usual unicode/utf-8 confusion: unicode is an abstract
specification of characters, while utf-8, latin1 and ascii are encodings of
that specification, each covering a certain set of characters. ascii covers
only the well-known first 128 (code points 0-127), latin1 covers those
needed by some major european languages, and utf-8 defines multi-byte
sequences for all possible characters defined in unicode - with the result
that some characters no longer take one byte each.
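
To make the difference between the three encodings concrete, here is a rough
sketch. It uses Python 3 syntax, where `str` plays the role of the `unicode`
type discussed in this post and `bytes` the role of the old `str`. The sample
characters and the helper name `encodable` are my own choices for
illustration, not anything from the original question.

```python
# Sample characters chosen for illustration: 'A', 'È' (U+00C8), '€' (U+20AC).
samples = ["A", "\xc8", "\u20ac"]


def encodable(ch, codec):
    """Return True if the character can be represented in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for ch in samples:
    # utf-8 can encode all of them, but needs more than one byte for some.
    utf8_len = len(ch.encode("utf-8"))
    print(repr(ch),
          "utf-8 bytes:", utf8_len,
          "ascii:", encodable(ch, "ascii"),
          "latin1:", encodable(ch, "latin1"))
```

The point is only that the same abstract character may be unrepresentable in
one encoding, one byte in another, and several bytes in a third.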

So unicode objects encapsulate an abstract sequence of unicode characters -
how they accomplish that is not your concern. Strings, on the other hand,
are pure byte sequences - and common libs work with them, with the exception
of the usually unicode-aware xml libs. So to get a string from a unicode
object, one has to specify an encoding - like utf-8 or latin1. If that
unicode object contains a character that can't be encoded using the
specified encoding, the attempt will produce an error.
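
A minimal sketch of that relationship, again in Python 3 syntax (`str` for
the abstract characters, `bytes` for the encoded form). The variable names
are mine. It shows that the bytes depend on the encoding you pick, and that
decoding only gives back the original text if you use the same encoding.

```python
text = "\xc8"  # the same character as in the question

# One abstract character, two different byte sequences.
utf8_bytes = text.encode("utf-8")     # two bytes
latin1_bytes = text.encode("latin1")  # one byte

# Decoding with the matching codec gives back the original character.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin1") == text

# Decoding utf-8 bytes as latin1 raises no error, because every byte is a
# valid latin1 character; it silently yields two wrong characters instead.
mojibake = utf8_bytes.decode("latin1")
print(repr(utf8_bytes), repr(latin1_bytes), repr(mojibake))
```

This is why both ends of a connection have to agree on the encoding: a
mismatch often shows up as garbage rather than as an exception.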

Please do read a tutorial on unicode and python - there are several good
ones out there, use google to your advantage.

>
> I can't do query.encode('ascii'), that would be similar to:
> >>> x = u'\xc8'
> >>> print x.encode('ascii')
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xc8' in
> position 0: ordinal not in range(128)

Sure - 0xC8 > 127, so it can't be encoded in ascii. Do this:

>>> x = u'\xc8'
>>> x
u'\xc8'
>>> x.encode('utf-8')
'\xc3\x88'

As you can see, the single character becomes two bytes. The reason is that
utf-8 translates that one unicode character to this 2-byte sequence.

> I also tried setting PostgreSQL's client-encoding by executing "SET
> client_encoding TO 'utf-8'", but psycopg still only accepts str-type
> strings (which is not really surprising).

Confusion again - please repeat:

unicode is not utf-8!!!
unicode is not utf-8!!!
unicode is not utf-8!!!
unicode is not utf-8!!!

Do encode the unicode object in utf-8, and pass that to psycopg. If you
set client_encoding to latin1, you have to encode the unicode object to
that instead.
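
The rule "encode to whatever client_encoding says" could be sketched like
this. Everything here is hypothetical: the helper `encode_for_client`, the
lookup table, and the example query are made up for illustration and are not
part of psycopg. The code is Python 3 syntax, so the returned `bytes` object
corresponds to the plain `str` that the psycopg of this thread expects. No
database call is shown.

```python
# Hypothetical lookup from a client_encoding name to a Python codec name.
# Only the two encodings mentioned in this thread are listed.
PG_TO_PYTHON_CODEC = {
    "UTF8": "utf-8",
    "LATIN1": "latin-1",
}


def encode_for_client(text, client_encoding):
    """Encode text with the codec matching the given client_encoding name."""
    # Normalise spellings such as 'utf-8' or 'latin1' to the table's keys.
    key = client_encoding.replace("-", "").replace("_", "").upper()
    codec = PG_TO_PYTHON_CODEC[key]  # KeyError for encodings not listed
    return text.encode(codec)


# Made-up query text containing the character from the question.
query = "INSERT INTO t (name) VALUES ('\xc8')"
print(repr(encode_for_client(query, "utf-8")))
print(repr(encode_for_client(query, "latin1")))
```

The encoded result is what you would then hand to the driver, with the
client_encoding of the connection set to the same value.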

-- 
Regards,
Diez B. Roggisch

