Re: not quite 1252
- From: Anton Vredegoor <anton.vredegoor@xxxxxxxxx>
- Date: Wed, 26 Apr 2006 16:49:23 +0200
Fredrik Lundh wrote:
> Anton Vredegoor wrote:
>> I'm trying to import text from an open office document (save as .sxw and
>> read the data from content.xml inside the sxw-archive using
>> elementtree and such tools).
>> The encoding that gives me the least problems seems to be cp1252,
>> however it's not completely perfect because there are still characters
>> in it like \93 or \94. Has anyone handled this before?
>
> this might help:
> http://effbot.org/zone/unicode-gremlins.htm
Thanks a lot! The code below not only made the strange characters go away, it also fixed the XML parsing errors. Maybe it's useful to someone else too; use at your own risk, though.
Anton
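from zipfile import ZipFile, ZIP_DEFLATED

# kill_gremlins: see the effbot page linked above
from gremlins import kill_gremlins

def repair(infn, outfn):
    """Copy the archive infn to outfn, cleaning up content.xml on the way."""
    zin = ZipFile(infn, 'r')
    zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
    for x in zin.namelist():
        data = zin.read(x)
        if x == 'content.xml':
            # replace the stray control characters, then re-encode
            zout.writestr(x, kill_gremlins(data).encode('cp1252'))
        else:
            # copy every other archive member unchanged
            zout.writestr(x, data)
    zout.close()
    zin.close()

def test():
    infn = "xxxx.sxw"
    outfn = 'dg.sxw'
    repair(infn, outfn)

if __name__ == '__main__':
    test()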
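
For anyone reading this without the gremlins module at hand, here is a minimal sketch of the general idea, written for Python 3. It is not the code from the effbot page; the name fix_c1 is made up for illustration. It assumes the stray characters are C1 control characters (U+0080 to U+009F), which is what cp1252 bytes such as \x93 and \x94 turn into when text is decoded as Latin-1, and it maps each one back to the character cp1252 assigns to that byte.

```python
import re

# U+0080..U+009F: the C1 control range, where cp1252 bytes land
# when they are mistakenly decoded as Latin-1.
_C1 = re.compile("[\x80-\x9f]")

def fix_c1(text):
    """Replace C1 control characters with their cp1252 equivalents.

    Hypothetical helper for illustration only.
    """
    def repl(match):
        ch = match.group(0)
        try:
            # Reinterpret the code point as a single cp1252 byte.
            return bytes([ord(ch)]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are undefined in cp1252: keep as-is.
            return ch
    return _C1.sub(repl, text)

if __name__ == "__main__":
    # \x93 and \x94 are the curly double quotes in cp1252.
    print(ascii(fix_c1("\x93quoted\x94")))
```

Characters outside the C1 range pass through untouched, so the function can be applied to text that is already clean.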