Re: Unicode drives me crazy...





fowlertrainer@xxxxxxxxxxxx wrote:
> Hi !
>
> I want to get the WMI infos from Windows machines.
> I use Py from HU (iso-8859-2) charset.
>
> Then I wrote some utility for it, because I want to write it to an XML file.
>
> def ToHU(s,NoneStr='-'):
>     if s==None: s=NoneStr
>     if not (type(s) in [type(''),type(u'')]):
>         s=str(s)
>     if type(s)<>type(u''):
>         s=unicode(s)
>     s=s.replace(chr(0),' ');
>     s=s.encode('iso-8859-2')
>     return s
>
> This function works, but I got an error with this value:
> 'Kommunik\xe1ci\xf3s port (COM1)'
>
> This routine demonstrates the problem
>
> s='Kommunik\xe1ci\xf3s port (COM1)'
> print s
> print type(s)
> print type(u'aaa')
> s=unicode(s) # error !
>
> This makes me mad.
> How do I convert every object to a string, and convert (encode) it to
> iso-8859-2 (if needed) ?
>

s is a 'byte string' - a series of characters encoded in bytes. (As is
every string on some level). In order to convert that to a unicode
object, Python needs to know what encoding is used. In other words it
needs to know what character each byte represents.

See this:

>>> t = s.decode('iso-8859-2')
>>> t
u'Kommunik\xe1ci\xf3s port (COM1)'
>>> print t
Kommunikációs port (COM1)
>>> print type(s)
<type 'str'>
>>> print type(t)
<type 'unicode'>

The decode call converts s into a unicode string - where Python
knows what every character is. If you call unicode with no encoding
specified, Python falls back to the system default - which is *probably*
'ascii'. Your string contains bytes which have *no meaning* in the
ascii codec - so it reports an error....
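
To make the difference concrete, here is a small sketch that decodes the
same bytes twice: once with an explicit encoding and once with 'ascii'. It
assumes the bytes really are iso-8859-2, as in the question, and it is
written with a b'' literal so that it runs on Python 2.6+ as well as
Python 3. The names t and failed_at are just for illustration.

```python
# The byte string from the question; the b prefix makes the type explicit.
s = b'Kommunik\xe1ci\xf3s port (COM1)'

# Explicit encoding: Python knows how to map each byte to a character.
t = s.decode('iso-8859-2')

# The 'ascii' codec only covers bytes 0-127, so \xe1 cannot be decoded.
failed_at = None
try:
    s.decode('ascii')
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first undecodable byte

print('ascii decoding failed at byte %d' % failed_at)
```

The failing index points at the first accented character, which is exactly
where unicode(s) gives up when no encoding is passed.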

Does this help ?

Once you 'get unicode', Python support for it is pretty easy. It's a
slightly complicated subject though. Basically you need to *know* what
encoding is being used, and whenever you convert between unicode and
byte-strings you need to specify it.
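
Applied to the ToHU helper from the question, that rule might look like the
sketch below. This is one possible rewrite rather than a definitive one: the
name to_hu, the source_encoding parameter and the text_type alias are my own
additions, and the function assumes that any byte string it receives is
already encoded in iso-8859-2. It is written to run on Python 2.6+ and
Python 3.

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3


def to_hu(value, none_str=u'-', source_encoding='iso-8859-2'):
    """Return value as an iso-8859-2 encoded byte string."""
    if value is None:
        value = none_str
    if isinstance(value, bytes):
        # Byte string: decode with an explicit encoding (an assumption
        # about where the data came from), never the implicit default.
        text = value.decode(source_encoding)
    elif isinstance(value, text_type):
        text = value
    else:
        # Numbers and other objects: convert to text first.
        text = text_type(value)
    text = text.replace(u'\x00', u' ')
    return text.encode('iso-8859-2')


print(repr(to_hu(42)))
```

Every conversion between text and bytes names its encoding, so the system
default never gets a chance to raise an error.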

What can complicate matters is that there are lots of times an
*implicit* conversion can take place. Adding strings to unicode
objects, printing strings, or writing them to a file are the usual
times implicit conversion can happen. If you haven't specified an
encoding, then Python has to use the system default or the file object
default (sys.stdout often has a different default encoding than the one
returned by sys.getdefaultencoding()). It is these implicit conversions
that often cause the 'UnicodeDecodeError's and 'UnicodeEncodeError's.
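
One way to avoid the implicit conversion when writing a file is to open it
with an explicit encoding, so the file object does the encoding for you.
The sketch below uses io.open, which is available on Python 2.6+ and is the
same function as the built-in open on Python 3. The temporary file and the
variable names are only there for illustration.

```python
import io
import os
import tempfile

text = u'Kommunik\xe1ci\xf3s port (COM1)'

# Create a scratch file so the example cleans up after itself.
fd, path = tempfile.mkstemp(suffix='.xml')
os.close(fd)
try:
    # The encoding is stated once, when the file is opened.
    with io.open(path, 'w', encoding='iso-8859-2') as f:
        f.write(text)
    # Read the raw bytes back to see what actually landed on disk.
    with io.open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)

print(len(raw))
```

The bytes on disk are the iso-8859-2 encoding of the text, whatever
sys.getdefaultencoding() or sys.stdout.encoding happen to be.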

HTH

Best Regards,

Fuzzy
http://www.voidspace.org.uk/python

> Please help me !
>
> Thanx for help:
> ft
