Re: utf8 and ftplib
- From: "John Roth" <newsgroups@xxxxxxxxxxxx>
- Date: Thu, 16 Jun 2005 12:06:50 -0600
"Richard Lewis" <richardlewis@xxxxxxxxxxxxxx> wrote in message news:mailman.540.1118935910.10512.python-list@xxxxxxxxxxxxx
Hi there,
I'm having a problem with unicode files and ftplib (using Python 2.3.5).
I've got this code:
xml_source = codecs.open("foo.xml", 'w+b', "utf8")
#xml_source = file("foo.xml", 'w+b')
ftp.retrbinary("RETR foo.xml", xml_source.write)
#ftp.retrlines("RETR foo.xml", xml_source.write)
It opens a new local file using utf8 encoding and then reads from a file on an FTP server (also utf8 encoded) into that local file. It comes up with an error, however, on calling the xml_source.write callback (I think) saying that:
  File "myscript.py", line 75, in get_content
    ftp.retrbinary("RETR foo.xml", xml_source.write)
  File "/usr/lib/python2.3/ftplib.py", line 384, in retrbinary
    callback(data)
  File "/usr/lib/python2.3/codecs.py", line 400, in write
    return self.writer.write(data)
  File "/usr/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 76: ordinal not in range(128)
I've tried using both of the commented-out lines in the above example (i.e. using file() instead of codecs.open(), and retrlines() instead of retrbinary()). retrlines() makes no difference, but if I use file() instead of codecs.open() I can open the file, but the extended characters from the source file (e.g. foreign characters, copyright symbol, etc.) all appear with an extra character in front of them (because of the two char width in utf8?).
Is the xml_source.write callback causing the problem here? Or is it something else? Is there any way that I can correctly retrieve a utf8 encoded file from an FTP server?
It looks like there are at least two problems here. The major one is that you seem to have a misconception about utf-8 encoding.
The _disk_ version of the file is what is encoded in utf-8, and it has to be decoded to unicode on being read later. In other words, what you got is what you should have put on disk without any conversion. As you noted, when you did that, the FTP part of the process worked.
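To make the point concrete, here is a minimal sketch in present-day Python 3 (an assumption; the original post used Python 2.3) of storing the downloaded bytes unchanged. The helper fetch_to_file and the stand-in fake_retrbinary are hypothetical names invented for illustration; with a real connection you would pass ftp.retrbinary instead.

```python
import os
import tempfile

EXPECTED_TEXT = "<a>\u00a9 caf\u00e9</a>"  # sample content, not from the original post


def fake_retrbinary(cmd, callback, blocksize=5):
    """Hypothetical stand-in for ftplib.FTP.retrbinary(cmd, callback).

    It delivers UTF-8 bytes in small chunks, so a multi-byte character
    may be split across two callback calls, as can happen on a real server.
    """
    payload = EXPECTED_TEXT.encode("utf-8")
    for start in range(0, len(payload), blocksize):
        callback(payload[start:start + blocksize])


def fetch_to_file(retrbinary, remote_name, local_path):
    """Write the received bytes to disk with no encoding step at all."""
    with open(local_path, "wb") as out:
        retrbinary("RETR " + remote_name, out.write)


local_path = os.path.join(tempfile.mkdtemp(), "foo.xml")
fetch_to_file(fake_retrbinary, "foo.xml", local_path)

# Decoding happens later, when the file is read as text.
with open(local_path, encoding="utf-8") as handle:
    text_read_back = handle.read()
```

Because the file is opened in plain binary mode, chunk boundaries that fall inside a multi-byte character do not matter: the bytes are simply concatenated on disk.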
Whatever program you are using to read it has to then decode it from utf-8 into unicode. Failure to do this is what is causing the extra characters on output.
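The "extra character" effect can be reproduced in a few lines. This is a sketch in Python 3 (an assumption, since the thread concerns Python 2.3), and it assumes the viewing program treats the file as Latin-1, which the original post does not state.

```python
# The copyright sign is two bytes in UTF-8: 0xC2 0xA9.
raw = "\u00a9".encode("utf-8")

# A reader that assumes Latin-1 shows one character per byte,
# so an extra character appears in front of the copyright sign.
wrong = raw.decode("latin-1")

# A reader that decodes UTF-8 recovers the single intended character.
right = raw.decode("utf-8")

print(repr(raw), repr(wrong), repr(right))
```

Note that the leading byte here is 0xc2, the same byte the traceback complains about.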
The object returned by codecs.open raised an exception because it expected a unicode string on input; it got a character string already encoded in utf-8 format. The internal mechanism is first going to try to decode that into unicode before then encoding it into utf-8. Unfortunately, the default for encoding or decoding (outside of special contexts) is ASCII-7. So everything outside of the ASCII range is invalid.
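Python 3 no longer performs that implicit conversion, but the failing step can be written out explicitly. The following is only an illustration of the ASCII decode that Python 2 attempted behind the scenes, not a reproduction of the original script.

```python
# UTF-8 bytes for the copyright sign, as they would arrive from the server.
utf8_bytes = "\u00a9".encode("utf-8")

try:
    # This is the conversion Python 2 attempted implicitly before encoding.
    utf8_bytes.decode("ascii")
    failure = None
except UnicodeDecodeError as err:
    failure = err

print(failure)
```

The exception names the 'ascii' codec and the offending byte 0xc2, matching the last line of the traceback in the question.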
Amusingly, this would have worked:
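xml_source = codecs.EncodedFile(file("foo.xml", "w+b"), "utf-8", "utf-8")
It is, of course, an expensive way of doing nothing, but it at least has the virtue of being good documentation. (Note that codecs.EncodedFile wraps an already-open file object rather than taking a file name.)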
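A small sketch of the same idea in Python 3 (an assumption; the thread is about Python 2.3), using an in-memory io.BytesIO in place of a real file. It also shows one caveat: as far as I can tell, the recoder decodes each write() call independently, so a chunk that ends in the middle of a multi-byte character is rejected.

```python
import codecs
import io

# Wrap a binary stream: bytes written are decoded from UTF-8,
# then re-encoded to UTF-8 before reaching the underlying stream.
buffer = io.BytesIO()
recoder = codecs.EncodedFile(buffer, "utf-8", "utf-8")
recoder.write("\u00a9".encode("utf-8"))
round_tripped = buffer.getvalue()

# Caveat: a write that contains only the first byte of a two-byte
# sequence cannot be decoded on its own.
split_buffer = io.BytesIO()
split_recoder = codecs.EncodedFile(split_buffer, "utf-8", "utf-8")
try:
    split_recoder.write(b"\xc2")
    split_failed = False
except UnicodeDecodeError:
    split_failed = True
```

Since an FTP transfer delivers data in arbitrary blocks, writing the raw bytes to a file opened in binary mode remains the more robust choice.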
HTH
John Roth
Cheers,
Richard