Re: Ascii Encoding Error with UTF-8 encoder
- From: "Mike Currie" <dev@xxxxxxxx>
- Date: Tue, 27 Jun 2006 16:44:28 -0700
Thanks for the thorough explanation.
What I am doing is converting data for processing that will be tab-delimited
(for columns) and newline-delimited (for rows). Some of the data contains tabs
and newlines, so I have to convert them to something else to keep the file
structure intact.
Not my idea; I've been left with the implementation, however.
"John Machin" <sjmachin@xxxxxxxxxxx> wrote in message
news:44a1bbcb$1@xxxxxxxxxxxxxxxxx
On 28/06/2006 7:46 AM, Mike Currie wrote:
Can anyone explain why I'm getting an ascii encoding error when I'm
trying to write out using a UTF-8 encoder?
f = codecs.open('foo.txt', 'wU', 'utf-8')
print filteredLine
thisêhasêàtabsêandêlineàbreaks
f.write(filteredLine)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "C:\Python24\lib\codecs.py", line 501, in write
return self.writer.write(data)
File "C:\Python24\lib\codecs.py", line 178, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4:
ordinal not in range(128)
Your fundamental problem is that you are trying to encode an 8-bit string
as UTF-8. The codec first tries to convert your string to Unicode, using
the default encoding (ascii), which fails.
Get this into your head:
You encode Unicode as ascii, latin1, cp1252, utf8, gagolitic, whatever
into an 8-bit string.
You decode whatever from an 8-bit string into Unicode.
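To make the direction of the two operations concrete, here is a minimal sketch. It assumes Python 3, where the "8-bit string" of this thread is the bytes type and Unicode text is str; the variable names are only illustrative.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2.4).
text = u'\x88'               # Unicode text: one code point, U+0088

# Encode: Unicode -> 8-bit string (bytes).
raw = text.encode('utf-8')
print(repr(raw))

# Decode: 8-bit string (bytes) -> Unicode.
back = raw.decode('utf-8')
print(back == text)
```

The round trip only works because the same encoding name is used in both directions.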
Here is a run-down on your problem, using just the encode/decode methods
instead of codecs for illustration purposes:
(1) Equivalent to what you did.
|>> '\x88'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
ordinal not in range(128)
(2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
|>> '\x88'.decode('ascii').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
ordinal not in range(128)
(3) Encoding Unicode as UTF-8 works, as expected.
|>> u'\x88'.encode('utf-8')
'\xc2\x88'
(4) But you need to know what your 8-bit data is supposed to be encoded
in, before you start.
|>> '\x88'.decode('cp1252').encode('utf-8')
'\xcb\x86'
|>> '\x88'.decode('latin1').encode('utf-8')
'\xc2\x88'
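Applied to the original question, the idea in (4) might look like the sketch below: decode the 8-bit data explicitly, then hand Unicode to a file object that does the UTF-8 encoding. This is a Python 3 sketch, not the poster's code; the sample bytes, the temporary path, and the choice of latin1 as the source encoding are all assumptions.

```python
import io
import os
import tempfile

# Hypothetical 8-bit input; the source encoding (latin1) is an assumption.
raw = b'this\x88has\x85breaks'

# Decode first: 8-bit string -> Unicode.
text = raw.decode('latin1')

# Then let the file object do the encoding: Unicode -> UTF-8 bytes.
# newline='' turns off newline translation on write.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with io.open(path, 'w', encoding='utf-8', newline='') as f:
    f.write(text)

# Read the raw bytes back to see what was actually written.
with io.open(path, 'rb') as f:
    written = f.read()
print(repr(written))
```

Each C1 control ends up as a two-byte UTF-8 sequence, matching the latin1 result shown in (4).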
I am rather puzzled as to what you are trying to achieve. You appear to
believe that you possess one or more 8-bit strings, encoded in latin1,
which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change LF
to NEL, and NEL to LF and similarly with the other pair. Then you want to
write the result, encoded in UTF-8, to a file. The purpose behind that
baroque/byzantine capering would be .... what?
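For reference, the swap described above could be sketched as follows, done on Unicode rather than on the 8-bit string. This assumes Python 3 and latin1 input; SWAP and swap_controls are hypothetical names, not anything from the original code.

```python
# Map each code point to its swap partner: HT <-> HTS and LF <-> NEL.
SWAP = {0x09: 0x88, 0x88: 0x09, 0x0A: 0x85, 0x85: 0x0A}

def swap_controls(raw, encoding='latin1'):
    """Decode 8-bit data (encoding is an assumption), then swap the controls."""
    text = raw.decode(encoding)
    return text.translate(SWAP)

swapped = swap_controls(b'a\tb\nc\x88d\x85e')
encoded = swapped.encode('utf-8')   # encode only at the very end
print(repr(encoded))
```

Because the mapping is symmetric, applying it a second time restores the original text.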