Re: utf8 and ftplib



"Richard Lewis" <richardlewis@xxxxxxxxxxxxxx> wrote in message news:mailman.540.1118935910.10512.python-list@xxxxxxxxxxxxx
Hi there,

I'm having a problem with unicode files and ftplib (using Python 2.3.5).

I've got this code:

xml_source = codecs.open("foo.xml", 'w+b', "utf8")
#xml_source = file("foo.xml", 'w+b')

ftp.retrbinary("RETR foo.xml", xml_source.write)
#ftp.retrlines("RETR foo.xml", xml_source.write)

It opens a new local file using utf8 encoding and then reads from a file
on an FTP server (also utf8 encoded) into that local file. It comes up
with an error, however, on calling the xml_source.write callback (I
think) saying that:

"File "myscript.py", line 75, in get_content
 ftp.retrbinary("RETR foo.xml", xml_source.write)
File "/usr/lib/python2.3/ftplib.py", line 384, in retrbinary
 callback(data)
File "/usr/lib/python2.3/codecs.py", line 400, in write
 return self.writer.write(data)
File "/usr/lib/python2.3/codecs.py", line 178, in write
 data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 76:
ordinal not in range(128)"

I've tried using both the commented lines of code in the above example
(i.e. using file() instead of codecs.open() and retrlines() instead of
retrbinary()). retrlines() makes no difference, but if I use file()
instead of codecs.open() I can open the file, but the extended
characters from the source file (e.g. foreign characters, copyright
symbol, etc.) all appear with an extra character in front of them
(because of the two char width in utf8?).

Is the xml_source.write callback causing the problem here? Or is it
something else? Is there any way that I can correctly retrieve a utf8
encoded file from an FTP server?

It looks like there are at least two problems here. The major one is that you seem to have a misconception about utf-8 encoding.

The _disk_ version of the file is what is encoded in utf-8, and it has
to be decoded to unicode when it is read back later. In other words,
the bytes you got from the server are already utf-8, and they are
exactly what you should put on disk, without any conversion. As you
noted, when you did that (using file()), the FTP part of the process
worked.
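
A minimal sketch of that idea, written for current Python 3 rather than the 2.3.5 from the question. To keep it runnable without a server, fake_retrbinary is a hypothetical stand-in for ftp.retrbinary (it simply hands raw byte chunks to the callback), and the sample text and temporary path are made up for illustration:

```python
import os
import tempfile


def fake_retrbinary(payload, callback, blocksize=8):
    """Hypothetical stand-in for ftplib.FTP.retrbinary.

    Like the real method, it passes raw byte chunks to the callback.
    """
    for start in range(0, len(payload), blocksize):
        callback(payload[start:start + blocksize])


# Pretend this is the utf-8 encoded file sitting on the FTP server.
original_text = "<note>\u00a9 2005 caf\u00e9</note>"
payload = original_text.encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "foo.xml")

# Plain binary file, no codec wrapper: the bytes go to disk untouched.
with open(path, "wb") as xml_source:
    fake_retrbinary(payload, xml_source.write)

with open(path, "rb") as f:
    stored = f.read()
```

With a real connection the call would presumably stay as in the question, ftp.retrbinary("RETR foo.xml", xml_source.write), with xml_source opened in plain binary mode. Because nothing is decoded on the way through, it does not matter if a chunk boundary falls in the middle of a multi-byte character.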

Whatever program you are using to read it has to then decode
it from utf-8 into unicode. Failure to do this is what is causing
the extra characters on output.
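
To illustrate where the extra character comes from, here is a small sketch in Python 3 syntax. It assumes the reading program treated the bytes as Latin-1, which is only a guess at what happened; any single-byte encoding would produce a similar effect:

```python
# The copyright sign is one character but two bytes in utf-8.
raw = "\u00a9 2005".encode("utf-8")   # b'\xc2\xa9 2005'

# Read as a single-byte encoding, each byte becomes its own character,
# so a stray character shows up in front of the copyright sign.
wrong = raw.decode("latin-1")

# Decoded as utf-8, the two bytes collapse back into one character.
right = raw.decode("utf-8")
```

Note that the leading byte here is 0xc2, the same byte the traceback complains about.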

The object returned by codecs.open raised an exception because its
write method expects a unicode string; it got a byte string already
encoded in utf-8. Before encoding to utf-8, Python first tries to
convert that byte string to unicode. Unfortunately, the default
codec for such implicit conversions (outside of special contexts) is
7-bit ASCII, so every byte outside the ASCII range is invalid. That
is why the traceback reports a UnicodeDecodeError on byte 0xc2 even
though the failing call is an encode.
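
The two steps can be spelled out explicitly. This is only a sketch in Python 3 syntax, where the decode has to be written by hand; in Python 2 the ASCII decode happens implicitly inside the write call. The sample string is made up:

```python
# Bytes as they would arrive from the server: already utf-8 encoded.
data = "caf\u00e9 \u00a9".encode("utf-8")

# Step 1 as the codec wrapper effectively attempted it: decode with ASCII.
try:
    data.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc   # 'ascii' codec can't decode the first non-ASCII byte

# The same two steps with the correct codec: decode, then encode again.
round_trip = data.decode("utf-8").encode("utf-8")
```

With the right codec the round trip yields the original bytes unchanged, which is why writing them straight to a binary file is sufficient.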

Amusingly, this would have worked (note that codecs.EncodedFile wraps
an already-open file object, not a file name):

xml_source = codecs.EncodedFile(open("foo.xml", "wb"), "utf-8", "utf-8")

It is, of course, an expensive way of doing nothing, but
it at least has the virtue of being good documentation.

HTH

John Roth





Cheers,
Richard
