Re: Python Unicode to String conversion



On Mon, 17 Sep 2007 01:33:14 -0300, Richard Levasseur <richardlev@xxxxxxxxx> wrote:

When dealing with Unicode, I've run into situations where I have
multiple encodings in the same string, usually latin1 and utf8
(latin1 != ascii, and latin1 != utf8, and they don't play nice
together). So, for future readers, if you have problems dealing with
Unicode encode and decode, try using a mix of latin1 and utf8
encodings to figure out what's going on, and which characters are
fubar'ing the process.
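
A minimal sketch of the kind of probing described above, written for Python 3 (where `str` is the Unicode type and `bytes` is the byte string; in the Python 2 of this thread those roles are played by `unicode` and `str`). The helper name `probe` and the sample data are illustrative assumptions, not something from the original post.

```python
def probe(data, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding and report what it yields.

    Returns a dict mapping encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    """
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# The same character (e with acute accent), encoded two different ways.
as_utf8 = "\u00e9".encode("utf-8")      # two bytes: b'\xc3\xa9'
as_latin1 = "\u00e9".encode("latin-1")  # one byte:  b'\xe9'

# latin-1 accepts any byte sequence, so it never fails, but it
# yields two wrong characters when the bytes were really UTF-8.
print(probe(as_utf8))

# A lone 0xE9 byte is not valid UTF-8, so that decode is reported as None.
print(probe(as_latin1))
```

A decode that fails under utf-8 but succeeds under latin-1 (or a latin-1 result showing two odd characters where one accented letter was expected) hints at which encoding the bytes were really in.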

Life is easier if you follow these guidelines:
- Always work internally in Unicode (not byte strings).
- Decode all input data (read from files, coming from an Internet connection, typed by the user...) from byte strings into Unicode as early as possible. (You should know which encoding your data comes in, in each case.)
- Encode all output data (written to files, printed to the screen, etc.) from Unicode into byte strings as late as possible.

This way, unless your input data is garbage, you can never mix strings from different encodings.
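
As a rough illustration of these guidelines, here is a small sketch in Python 3 syntax (where `str` is Unicode and `bytes` is the byte string). The function names `read_text` and `write_text` and the in-memory streams standing in for files are assumptions made for the example; they are not from any particular library.

```python
import io


def read_text(stream, encoding):
    """Decode bytes into Unicode as soon as they enter the program."""
    return stream.read().decode(encoding)


def write_text(stream, text, encoding):
    """Encode Unicode into bytes only at the point of output."""
    stream.write(text.encode(encoding))


# Two inputs in different, known encodings (in-memory stand-ins for files).
latin1_input = io.BytesIO("caf\u00e9 ".encode("latin-1"))
utf8_input = io.BytesIO("na\u00efve".encode("utf-8"))

# Inside the program everything is Unicode, so concatenation is safe
# even though the inputs arrived in different encodings.
combined = read_text(latin1_input, "latin-1") + read_text(utf8_input, "utf-8")

# Encode once, at the boundary, into whatever the destination expects.
output = io.BytesIO()
write_text(output, combined, "utf-8")
print(output.getvalue())
```

The encoding is named only at the two boundaries; nothing in between needs to know how the data was stored.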
For further information, read the Unicode HOWTO <http://www.amk.ca/python/howto/unicode> and this excerpt from the "Python Cookbook", by Alex Martelli <http://www.onlamp.com/pub/a/python/excerpt/pythonckbk_chap1/index.html>

--
Gabriel Genellina
