cElementTree encoding woes



Hi,

I've got to deal with a pretty huge XML-document, and to do so I use the
cElementTree.iterparse functionality. Working great.

Only trouble: the guys creating that chunk of XML are, let's just say,
"encodingly challenged". They don't produce utf-8 but cp1252, and they
declare it under the name Windows-1252. That name is not part of the
standard codecs module; cp1252 is, of course.

But iterparse won't accept that name. So currently I manually change the
declared encoding to utf-8 and run the input through a stream recoder.
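
For illustration, that workaround might look roughly like the sketch below. It
assumes the standard-library xml.etree.ElementTree in place of cElementTree, and
the class name RecodedXML and the helper item_texts are made up for this example.
It also assumes the XML declaration sits at the very start of the input and fits
into the first chunk that is read.

```python
import codecs
import io
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute of an XML declaration at the start of the text.
_DECL = re.compile(r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2')


class RecodedXML:
    """Hypothetical file-like wrapper: reads cp1252 bytes, hands out UTF-8 bytes."""

    def __init__(self, raw, source_encoding="cp1252"):
        # Decode incrementally instead of loading the whole document.
        self._reader = codecs.getreader(source_encoding)(raw)
        self._first_chunk = True

    def read(self, size=-1):
        text = self._reader.read(size)
        if self._first_chunk:
            self._first_chunk = False
            # Rewrite the declared encoding so it matches what we emit.
            text = _DECL.sub(r'\g<1>"utf-8"', text, count=1)
        return text.encode("utf-8")


def item_texts(raw, tag):
    """Collect the text of every <tag> element, clearing elements as we go."""
    texts = []
    for _event, elem in ET.iterparse(RecodedXML(raw)):
        if elem.tag == tag:
            texts.append(elem.text)
            elem.clear()  # keep memory use flat on huge documents
    return texts
```

The wrapper only needs a read() method, because that is all iterparse calls on
a file-like source.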

However, I was wondering if I could teach cElementTree about that encoding
name. I tried to register cp1252 under the name Windows-1252, but had no
luck - cET won't buy it.
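
For reference, registering an alias with Python's codec registry can be
sketched as follows. The alias name x-vendor-1252 is hypothetical and merely
stands in for the name in the file. This only teaches Python's own codecs
machinery about the name; as said above, it did not make cET accept it.

```python
import codecs

# Hypothetical alias table: extra names that should resolve to cp1252.
_ALIASES = {"x_vendor_1252": "cp1252"}


def _alias_search(name):
    """Codec search function: return CodecInfo for known aliases, else None."""
    # Normalize separators ourselves so the lookup does not depend on how
    # the interpreter normalizes the requested name.
    key = name.lower().replace("-", "_").replace(" ", "_")
    target = _ALIASES.get(key)
    return codecs.lookup(target) if target else None


codecs.register(_alias_search)
```

After registration, bytes.decode and str.encode accept the alias like any
other encoding name.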

Any suggestions?

Diez