Re: recycling internationalized garbage



Fredrik Lundh <fredrik@xxxxxxxxxxxxxx> wrote:
"aaronwmail-usenet@xxxxxxxxx" wrote:

Question: what is a good strategy for taking an 8bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
utf8 if needed)? The string might be in any of the
myriad encodings that predate unicode. Has anyone
done this in Python already? The output must be clean
utf8 suitable for arbitrary xml parsers.

some alternatives:

braindead bruteforce:

try to do strict decoding as utf-8. if you succeed, you have a utf-8
string. if not, assume iso-8859-1.
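
A minimal Python 3 sketch of that two-step fallback (the function name
decode_bruteforce is made up for illustration). Note that iso-8859-1
assigns a character to every byte value, so the second step cannot fail:

```python
def decode_bruteforce(data):
    """Decode bytes as strict UTF-8, falling back to ISO-8859-1."""
    try:
        return data.decode("utf-8", "strict")
    except UnicodeDecodeError:
        # ISO-8859-1 maps all 256 byte values, so this branch never raises.
        return data.decode("iso-8859-1")
```

With this fallback, bytes in the range 0x80-0x9f (for example cp1252 curly
quotes) come out as C1 control characters rather than as an error.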

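that was a mistake I made once.
Do not use iso8859-1 as the python codec: it maps every byte value, so
decoding never fails and no later candidate is ever tried. Instead create
your own codec that leaves the control characters undefined, called e.g.
iso8859-1-ncc, like this (just a sketch):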

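import codecs

# identity map for printable bytes only; 0x00-0x1f and 0x80-0x9f stay undefined
decoding_map = codecs.make_identity_dict(list(range(32, 128)) + list(range(128+32, 256)))
decoding_map.update({})  # placeholder for extra mappings (note: tab and newline are excluded above)
encoding_map = codecs.make_encoding_map(decoding_map)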
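
The sketch above only builds the maps; the codec name still has to be
registered before a decode call can find it. One possible way to do that
in Python 3 is shown below. The helper names (_decode, _encode, _search)
are made up here, and mapping tab, newline and carriage return is an
assumption that is not part of the original sketch:

```python
import codecs

# Printable ASCII plus the upper half of Latin-1. The C0 (0x00-0x1f) and
# C1 (0x80-0x9f) control bytes are deliberately left unmapped.
decoding_map = codecs.make_identity_dict(list(range(32, 128)) + list(range(160, 256)))
# Assumption: whitespace controls should still decode.
decoding_map.update({9: 9, 10: 10, 13: 13})
encoding_map = codecs.make_encoding_map(decoding_map)

CODEC_NAME = "iso8859-1-ncc"


def _decode(data, errors="strict"):
    # Bytes missing from decoding_map raise UnicodeDecodeError in strict mode.
    return codecs.charmap_decode(data, errors, decoding_map)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, encoding_map)


def _search(name):
    # Python lowercases the requested name; newer versions also turn
    # hyphens into underscores, so normalize before comparing.
    if name.replace("-", "_") == "iso8859_1_ncc":
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None


codecs.register(_search)
```

After registration, a strict decode with "iso8859-1-ncc" fails on any byte
in 0x80-0x9f, which is what lets a later candidate such as cp1252 get its
turn.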

and then use:

def try_encoding(s, encodings):
    "try to guess the encoding of string s, testing encodings given in second parameter"

    for enc in encodings:
        try:
            s.decode(enc)  # strict decoding; raises if s is not valid in enc
            return enc
        except UnicodeDecodeError:
            pass

    return None


guessed_encoding = try_encoding(text, ['utf-8', 'iso8859-1-ncc', 'cp1252', 'macroman'])
guessed_unicode_text = text.decode(guessed_encoding)


it seems to work surprisingly well, if you know approximately the
language(s) the text is expected to be in (e.g. replace cp1252 with
cp1250, iso8859-1-ncc with iso8859-2-ncc for central european languages)
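
The language swap can be sketched in Python 3 roughly as follows. Instead
of writing each -ncc codec by hand, this sketch derives the restricted
table from the corresponding stock codec. The names ncc_table,
make_decoder, guess_decode, WESTERN and CENTRAL are illustrative, and
mac_latin2 as the central european counterpart of macroman is an
assumption:

```python
import codecs


def ncc_table(base):
    """Map the printable bytes of 8-bit codec `base`; control bytes stay undefined."""
    table = {}
    for byte in list(range(32, 128)) + list(range(160, 256)):
        try:
            table[byte] = ord(bytes([byte]).decode(base))
        except UnicodeDecodeError:
            pass  # byte is undefined in the base codec as well
    return table


def make_decoder(name):
    """Return a strict decoder; names ending in '-ncc' use the restricted table."""
    if name.endswith("-ncc"):
        table = ncc_table(name[:-4])
        return lambda data: codecs.charmap_decode(data, "strict", table)[0]
    return lambda data: data.decode(name, "strict")


def guess_decode(data, encodings):
    """Return (text, encoding) for the first candidate that decodes strictly."""
    for enc in encodings:
        try:
            return make_decoder(enc)(data), enc
        except UnicodeDecodeError:
            continue
    return None, None


# Candidate lists, ordered from most to least restrictive.
WESTERN = ["utf-8", "iso8859-1-ncc", "cp1252", "mac_roman"]
CENTRAL = ["utf-8", "iso8859-2-ncc", "cp1250", "mac_latin2"]
```

The order matters: the Mac codecs define every byte value, so they act as
a catch-all and belong at the end of the list.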

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


