Re: recycling internationalized garbage
- From: garabik-news-2005-05@xxxxxxxxxxxxxxxxxxxxxxxx
- Date: Wed, 8 Mar 2006 15:04:59 +0000 (UTC)
Fredrik Lundh <fredrik@xxxxxxxxxxxxxx> wrote:
> "aaronwmail-usenet@xxxxxxxxx" wrote:
>> Question: what is a good strategy for taking an 8bit
>> string of unknown encoding and recovering the largest
>> amount of reasonable information from it (translated to
>> utf8 if needed)? The string might be in any of the
>> myriad encodings that predate unicode. Has anyone
>> done this in Python already? The output must be clean
>> utf8 suitable for arbitrary xml parsers.
>
> some alternatives:
> braindead bruteforce:
> try to do strict decoding as utf-8. if you succeed, you have an utf-8
> string. if not, assume iso-8859-1.
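That was a mistake I made once.
Do not use iso8859-1 as the Python codec: it maps every byte value, so
strict decoding never fails and no later candidate is ever tried.
Instead, create your own codec, called e.g. iso8859-1-ncc, that leaves
the control-character ranges undefined, like this (just a sketch):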
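import codecs
# identity mapping for bytes 32-127 and 160-255 only; all other bytes stay undefined
decoding_map = codecs.make_identity_dict(list(range(32, 128)) + list(range(128+32, 256)))
decoding_map.update({})  # placeholder: add any further mappings here
encoding_map = codecs.make_encoding_map(decoding_map)
# (these maps still have to be wrapped in a codec and registered under the new name)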
and then use:
def try_encodings(s, encodings):
    "try to guess the encoding of string s, testing encodings given in second parameter"
    for enc in encodings:
        try:
            s.decode(enc)  # strict decoding; raises on any undefined byte
            return enc
        except UnicodeDecodeError:
            pass
    return None

guessed_encoding = try_encodings(text, ['utf-8', 'iso8859-1-ncc', 'cp1252', 'macroman'])
It seems to work surprisingly well if you know approximately which
language(s) the text is expected to be in (e.g. for Central European
languages, replace cp1252 with cp1250 and iso8859-1-ncc with iso8859-2-ncc).
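
For readers on Python 3, here is a minimal self-contained sketch of the same idea. It is an illustration under stated assumptions, not the code from the post: the names _NCC_MAP, guess_and_decode and CANDIDATES are made up for this example, and allowing tab, LF and CR in the restricted map is an extra assumption (the sketch above leaves all bytes below 32 undefined).

```python
import codecs

# Latin-1 restricted to bytes 32-127 and 160-255, as in the sketch above.
# Assumption (not in the original post): tab, LF and CR are also allowed,
# so that ordinary multi-line text is not rejected.
_NCC_MAP = codecs.make_identity_dict(
    [0x09, 0x0A, 0x0D] + list(range(32, 128)) + list(range(128 + 32, 256))
)
_NCC_ENCODING_MAP = codecs.make_encoding_map(_NCC_MAP)


def _ncc_decode(data, errors="strict"):
    # Bytes missing from the map are undefined, so strict decoding raises
    # UnicodeDecodeError on them.
    return codecs.charmap_decode(data, errors, _NCC_MAP)


def _ncc_encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _NCC_ENCODING_MAP)


def _search(name):
    # The codec registry normalises names differently across Python versions,
    # so compare with hyphens folded to underscores.
    if name.replace("-", "_") == "iso8859_1_ncc":
        return codecs.CodecInfo(
            name="iso8859-1-ncc", encode=_ncc_encode, decode=_ncc_decode
        )
    return None


codecs.register(_search)

# Order matters: strictest candidates first, most permissive last.
CANDIDATES = ["utf-8", "iso8859-1-ncc", "cp1252", "macroman"]


def guess_and_decode(data, encodings=CANDIDATES):
    """Return (encoding, text) for the first encoding that decodes data
    strictly, or (None, None) if none of them does."""
    for enc in encodings:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None


if __name__ == "__main__":
    enc, text = guess_and_decode(b"caf\xe9")
    if text is not None:
        clean_utf8 = text.encode("utf-8")  # re-encode for the XML parser
        print(enc, clean_utf8)
```

Returning the decoded text together with the encoding name avoids decoding twice. A result of (None, None) means every candidate rejected the input, which the caller has to handle explicitly (for instance by decoding with errors="replace").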
--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!