Re: not quite 1252



Fredrik Lundh wrote:

Anton Vredegoor wrote:

I'm trying to import text from an OpenOffice document (saved as .sxw,
reading the data from content.xml inside the .sxw archive with
elementtree and similar tools).

The encoding that gives me the fewest problems seems to be cp1252.
It's not perfect, though: characters like \93 or \94 still show up
in the text. Has anyone handled this before?
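
For anyone wondering what \93 and \94 are: in cp1252 those two byte values are the curly double quotes, while the same bytes read as ISO-8859-1 become invisible C1 control characters. A quick check, written for Python 3 (an assumption on my part, the thread itself is older than that); the sample string is made up:

```python
# Bytes 0x93 and 0x94 as they might appear in text that came from Windows.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("cp1252")      # curly quotes U+201C / U+201D
as_latin1 = raw.decode("iso-8859-1")  # C1 control characters U+0093 / U+0094

# Show the code points of the first and last character in each reading.
print([hex(ord(c)) for c in (as_cp1252[0], as_cp1252[-1])])
print([hex(ord(c)) for c in (as_latin1[0], as_latin1[-1])])
```

If the second reading is what ends up in a Unicode string, the quotes are still recoverable, since the code points keep the original byte values.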

this might help:

http://effbot.org/zone/unicode-gremlins.htm
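
As far as I understand it, the idea on that page is to look for characters in the U+0080 to U+009F range and replace each one with the character cp1252 assigns to the same byte value. The following is only a rough Python 3 sketch of that idea, not the code from the page; the function name fix_c1_controls is made up for illustration:

```python
import re

def fix_c1_controls(text):
    """Replace C1 control characters with their cp1252 meaning, if any."""
    def fixup(match):
        ch = match.group(0)
        try:
            # Reinterpret the code point as a single cp1252 byte.
            return bytes([ord(ch)]).decode("cp1252")
        except UnicodeDecodeError:
            # A few byte values (0x81, for example) are undefined in
            # Python's cp1252 codec; leave those characters untouched.
            return ch
    return re.sub("[\x80-\x9f]", fixup, text)

print(ascii(fix_c1_controls("\x93hello\x94")))
```

Text that contains no characters in that range comes back unchanged, so it should be safe to run over an already clean string.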

Thanks a lot! The code below not only made the strange chars go away, it also fixed the XML parsing errors. Maybe it's useful to someone else too; use at your own risk, though.

Anton

from gremlins import kill_gremlins
from zipfile import ZipFile, ZIP_DEFLATED

def repair(infn, outfn):
    # copy the archive member by member, cleaning up only the document body
    zin = ZipFile(infn, 'r', ZIP_DEFLATED)
    zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
    for x in zin.namelist():
        data = zin.read(x)
        if x == 'content.xml':
            # run the text through kill_gremlins, then write it back as cp1252
            zout.writestr(x, kill_gremlins(data).encode('cp1252'))
        else:
            # everything else is copied unchanged
            zout.writestr(x, data)
    zout.close()
    zin.close()

def test():
    infn = "xxxx.sxw"
    outfn = 'dg.sxw'
    repair(infn, outfn)

if __name__ == '__main__':
    test()
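
For readers on Python 3, here is a rough sketch of the same repair that does not need the gremlins module. It is not a port of the code above. It assumes content.xml is stored as UTF-8, which is my assumption and not something stated in this thread, and it writes UTF-8 back instead of cp1252. The helper fix_c1_controls and the member and encoding parameters are made up for illustration:

```python
import re
import zipfile

def fix_c1_controls(text):
    # Map U+0080..U+009F to their cp1252 meaning; leave undefined ones alone.
    def fixup(m):
        try:
            return bytes([ord(m.group(0))]).decode("cp1252")
        except UnicodeDecodeError:
            return m.group(0)
    return re.sub("[\x80-\x9f]", fixup, text)

def repair(infn, outfn, member="content.xml", encoding="utf-8"):
    # Copy every archive member unchanged except `member`, which is
    # decoded, cleaned, and re-encoded with the same encoding.
    with zipfile.ZipFile(infn) as zin, \
         zipfile.ZipFile(outfn, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == member:
                data = fix_c1_controls(data.decode(encoding)).encode(encoding)
            # Passing the ZipInfo keeps each member's name, order, and
            # compression type as they were in the input archive.
            zout.writestr(info, data)
```

Usage would be repair("xxxx.sxw", "dg.sxw"), as in test() above. Passing the ZipInfo object to writestr is deliberate: it keeps the member order and per-member compression of the input archive, which may matter to applications that are picky about the layout of the package.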