Re: not quite 1252
- From: John Machin <sjmachin@xxxxxxxxxxx>
- Date: Thu, 27 Apr 2006 18:23:04 +1000
On 27/04/2006 12:49 AM, Anton Vredegoor wrote:
> Fredrik Lundh wrote:
>> Anton Vredegoor wrote:
>>> I'm trying to import text from an open office document (save as .sxw and
>>> read the data from content.xml inside the sxw-archive using
>>> elementtree and such tools).
>>>
>>> The encoding that gives me the least problems seems to be cp1252,
>>> however it's not completely perfect because there are still characters
>>> in it like \93 or \94. Has anyone handled this before?
>>
>> this might help:
>> http://effbot.org/zone/unicode-gremlins.htm
>
> Thanks a lot! The code below not only made the strange chars go away, but it also fixed the xml-parsing errors

What xml-parsing errors were they??

> ... Maybe it's useful to someone else too, use at own risk though.
>
> Anton
>
> from gremlins import kill_gremlins
> from zipfile import ZipFile, ZIP_DEFLATED
>
> def repair(infn,outfn):
>     zin = ZipFile(infn, 'r', ZIP_DEFLATED)
>     zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
>     for x in zin.namelist():
>         data = zin.read(x)
>         if x == 'contents.xml':

Firstly, this should be 'content.xml', not 'contents.xml'.

Secondly, as pointed out by Sergei, the data is encoded by OOo as UTF-8, e.g. what is '\x94' in cp1252 is u'\u201d', which is '\xe2\x80\x9d' in UTF-8. The kill_gremlins function is intended to fix Unicode strings that have been obtained by decoding 8-bit strings using 'latin1' instead of 'cp1252'. When you pump '\xe2\x80\x9d' through the kill_gremlins function, it changes the \x80 to a Euro symbol and leaves the other two alone. Because \x9d is not defined in cp1252, your code then dies in a hole when you attempt to encode the result as cp1252:

UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in position 1761: character maps to <undefined>

I don't see how this code repairs anything (quite the contrary!), unless there's some side effect of just read/writestr. Enlightenment, please.

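
The failure described above can be reproduced in a few lines. This is only a sketch, written for Python 3 (the code in this thread is Python 2), and fix_gremlins is a hypothetical stand-in for kill_gremlins: it builds its table from Python's own cp1252 codec instead of the table on the effbot page, so treat it as an approximation of that function rather than a copy.

```python
import re


def build_cp1252_fixups():
    """Map U+0080..U+009F to the characters cp1252 assigns to those bytes."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[chr(byte)] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # byte is undefined in cp1252 (0x9d is one of these)
    return table


CP1252_FIXUPS = build_cp1252_fixups()


def fix_gremlins(text):
    """Hypothetical stand-in for kill_gremlins: remap C1 controls via cp1252."""
    return re.sub(
        "[\x80-\x9f]",
        lambda m: CP1252_FIXUPS.get(m.group(0), m.group(0)),
        text,
    )


def can_encode_cp1252(text):
    """Return True if every character of text exists in cp1252."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


# U+201D (right double quotation mark) as stored in UTF-8: three bytes.
raw = "\u201d".encode("utf-8")

# Treating those bytes as latin1 text and "fixing" them mangles the character:
# \x80 becomes the Euro sign, \xe2 and \x9d are left alone.
mangled = fix_gremlins(raw.decode("latin-1"))

# Decoding with the codec the data was written in needs no repair at all.
correct = raw.decode("utf-8")
```

Here can_encode_cp1252(mangled) is False because of the leftover \x9d, while correct is the single character U+201D, which does encode to cp1252.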

>             zout.writestr(x,kill_gremlins(data).encode('cp1252'))
>         else:
>             zout.writestr(x,data)
>     zout.close()
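
If the goal is simply to get the text out of content.xml, it may be enough to hand the raw bytes to the XML parser and let it do the UTF-8 decoding. The following is a minimal Python 3 sketch under that assumption; read_content_text and make_sample_archive are hypothetical names, and the sample archive is a tiny stand-in built in memory, not a real .sxw file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_content_text(archive):
    """Return all text found in content.xml of a zip archive."""
    with zipfile.ZipFile(archive) as zin:
        data = zin.read("content.xml")  # raw bytes, not yet decoded
    root = ET.fromstring(data)  # the parser decodes the bytes itself
    return "".join(root.itertext())


def make_sample_archive():
    """Build a small in-memory zip for demonstration purposes only."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<doc><p>\u201chello\u201d</p></doc>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zout:
        zout.writestr("content.xml", xml.encode("utf-8"))
    buf.seek(0)
    return buf


text = read_content_text(make_sample_archive())

# Encode to cp1252 only at the point where a cp1252 consumer needs it.
as_cp1252 = text.encode("cp1252")
```

In this sketch the curly quotes come back as U+201C and U+201D, and only the final encode step turns them into the \x93 and \x94 bytes mentioned at the start of the thread.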