Re: Ascii Encoding Error with UTF-8 encoder



Thanks for the thorough explanation.

What I am doing is converting data for processing that will be
tab-delimited (columns) and newline-delimited (rows). Some of the data
contains tabs and newlines, so I have to convert them to something else
to keep the file's integrity intact.

This wasn't my idea, but I've been left with the implementation.
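
For illustration only, here is a minimal sketch of that kind of conversion, written for Python 3. The helper names (escape_field, format_row) are made up, and using U+0088 and U+0085 as the stand-ins is an assumption based on the description quoted below, not the actual production code.

```python
# Hypothetical mapping: tabs and newlines inside a field are replaced by
# the C1 controls HTS (U+0088) and NEL (U+0085), so that real tabs and
# newlines only ever appear as column and row delimiters.
ESCAPE_TABLE = {ord("\t"): "\x88", ord("\n"): "\x85"}


def escape_field(text):
    """Return text with embedded tabs/newlines swapped for stand-ins."""
    return text.translate(ESCAPE_TABLE)


def format_row(fields):
    """Join escaped fields with tabs and terminate the row with a newline."""
    return "\t".join(escape_field(field) for field in fields) + "\n"


line = format_row(["this\thas\ttabs", "and\nline\nbreaks"])

# Encoding happens last, on the Unicode string, never on 8-bit data.
data = line.encode("utf-8")
```

After this, the only tab and newline bytes left in data are the delimiters themselves.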

"John Machin" <sjmachin@xxxxxxxxxxx> wrote in message
news:44a1bbcb$1@xxxxxxxxxxxxxxxxx
On 28/06/2006 7:46 AM, Mike Currie wrote:
Can anyone explain why I'm getting an ascii encoding error when I'm
trying to write out using a UTF-8 encoder?


f = codecs.open('foo.txt', 'wU', 'utf-8')
print filteredLine
thisêhasêàtabsêandêlineàbreaks
f.write(filteredLine)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "C:\Python24\lib\codecs.py", line 501, in write
return self.writer.write(data)
File "C:\Python24\lib\codecs.py", line 178, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4:
ordinal not in range(128)


Your fundamental problem is that you are trying to encode an 8-bit string
as UTF-8. The codec tries to convert your string to Unicode first, using
the default encoding (ascii), which fails.

Get this into your head:
You encode Unicode as ascii, latin1, cp1252, utf8, glagolitic, whatever
into an 8-bit string.
You decode whatever from an 8-bit string into Unicode.
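
A minimal sketch of that rule, written for Python 3 as an assumption (the sessions in this thread are Python 2.4). In Python 3 the Unicode type is str and the 8-bit type is bytes, so the two directions cannot be mixed up silently. The sample text is made up for illustration.

```python
text = u"caf\u00e9"  # a Unicode string, 4 characters

# encode: Unicode -> 8-bit string, one result per target encoding
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin1")

# decode: 8-bit string -> Unicode, using the encoding the bytes were written in
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin1") == text

# An encoding that cannot represent a character fails on encode,
# and this time the error really is an *encode* error.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
```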

Here is a run-down on your problem, using just the encode/decode methods
instead of codecs for illustration purposes:

(1) Equivalent to what you did.
|>> '\x88'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
ordinal not in range(128)

(2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
|>> '\x88'.decode('ascii').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
ordinal not in range(128)

(3) Encoding Unicode as UTF-8 works, as expected.
|>> u'\x88'.encode('utf-8')
'\xc2\x88'

(4) But you need to know what your 8-bit data is supposed to be encoded
in, before you start.
|>> '\x88'.decode('cp1252').encode('utf-8')
'\xcb\x86'
|>> '\x88'.decode('latin1').encode('utf-8')
'\xc2\x88'
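
For readers on Python 3, the same run-down can be sketched as follows. This is an illustration under that assumption, not a transcript. The sample line and the choice of latin1 are stand-ins for whatever the real data is, and codecs.getwriter over an in-memory buffer stands in for codecs.open so that no file is touched.

```python
import codecs
import io

raw = b"\x88"

# (1)/(2): bytes has no encode() in Python 3, so the hidden ASCII decode
# of Python 2 has to be written out explicitly, and it fails the same way.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_failure = str(exc)

# (3): encoding Unicode as UTF-8 works.
unicode_result = u"\x88".encode("utf-8")

# (4): the same byte means different characters in different encodings.
via_cp1252 = raw.decode("cp1252").encode("utf-8")
via_latin1 = raw.decode("latin1").encode("utf-8")

# The repair for the original write: decode the 8-bit line first,
# then hand the Unicode result to the UTF-8 writer.
raw_line = b"this\x88has\x88tabs"
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write(raw_line.decode("latin1"))
written = buffer.getvalue()
```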

I am rather puzzled as to what you are trying to achieve. You appear to
believe that you possess one or more 8-bit strings, encoded in latin1,
which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change LF
to NEL, and NEL to LF and similarly with the other pair. Then you want to
write the result, encoded in UTF-8, to a file. The purpose behind that
baroque/byzantine capering would be .... what?


