Re: not quite 1252



On 27/04/2006 12:49 AM, Anton Vredegoor wrote:
Fredrik Lundh wrote:

Anton Vredegoor wrote:

I'm trying to import text from an open office document (save as .sxw and
read the data from content.xml inside the sxw-archive using
elementtree and such tools).

The encoding that gives me the least problems seems to be cp1252,
however it's not completely perfect because there are still characters
in it like \93 or \94. Has anyone handled this before?

this might help:

http://effbot.org/zone/unicode-gremlins.htm

Thanks a lot! The code below not only made the strange chars go away, but it also fixed the xml-parsing errors.

What xml-parsing errors were they??

... Maybe it's useful to someone else too, use at own risk though.

Anton

from gremlins import kill_gremlins
from zipfile import ZipFile, ZIP_DEFLATED

def repair(infn,outfn):
    zin = ZipFile(infn, 'r', ZIP_DEFLATED)
    zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
    for x in zin.namelist():
        data = zin.read(x)
        if x == 'contents.xml':

Firstly, this should be 'content.xml', not 'contents.xml'.

Secondly, as pointed out by Sergei, the data is encoded by OOo as UTF-8, e.g. what is '\x94' in cp1252 is \u201d, which is '\xe2\x80\x9d' in UTF-8. The kill_gremlins function is intended to fix Unicode strings that have been obtained by decoding 8-bit strings using 'latin1' instead of 'cp1252'. When you pump '\xe2\x80\x9d' through the kill_gremlins function, it changes the \x80 to a Euro symbol, and leaves the other two alone. Because \x9d is not defined in cp1252, it then causes your code to die in a hole when you attempt to encode it as cp1252:

UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in position 1761: character maps to <undefined>

I don't see how this code repairs anything (quite the contrary!), unless there's some side effect of just read/writestr. Enlightenment, please.
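
For anyone who wants to check the byte values quoted above, here is a small sketch. It is written for Python 3 (the thread itself used Python 2 string types), and the helper name encodes_to_cp1252 is made up for illustration. The string labelled after_gremlins is only what the description above says kill_gremlins would produce; it does not call the real function.

```python
# Python 3 sketch of the byte values discussed above.
right_quote = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

utf8_bytes = right_quote.encode("utf-8")      # three bytes in UTF-8
cp1252_bytes = right_quote.encode("cp1252")   # one byte in cp1252

# Misreading the UTF-8 bytes as latin-1 yields three separate characters:
# U+00E2, U+0080 and U+009D.
misread = utf8_bytes.decode("latin-1")

# What the text above says kill_gremlins leaves behind: U+0080 becomes
# the Euro sign, the other two characters are untouched (assumption,
# built by hand here rather than by calling kill_gremlins).
after_gremlins = "\u00e2\u20ac\u009d"


def encodes_to_cp1252(text):
    """Hypothetical helper: True if text can be encoded as cp1252."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


print(utf8_bytes, cp1252_bytes, encodes_to_cp1252(after_gremlins))
```

The last value printed is False because U+009D has no cp1252 mapping, which is the UnicodeEncodeError reported above.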

            zout.writestr(x,kill_gremlins(data).encode('cp1252'))
        else:
            zout.writestr(x,data)
    zout.close()
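
If the goal is just to get the text out of the archive, a minimal sketch along the following lines may be enough. It assumes, as stated above, that content.xml is UTF-8, and it is written for Python 3. The function name read_content_text is made up for illustration and is not part of elementtree or gremlins.

```python
from zipfile import ZipFile


def read_content_text(archive):
    """Return content.xml from an .sxw archive as a text string.

    archive may be a filename or a binary file-like object.
    Assumes the member is UTF-8 encoded; no gremlin-killing is applied.
    """
    with ZipFile(archive, "r") as zin:
        return zin.read("content.xml").decode("utf-8")
```

Decoding as UTF-8 gives the curly quotes as \u201c and \u201d directly, so no cp1252 round trip is involved.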
