recycling internationalized garbage



Hi folks,

Please help me with international string issues:
I put together an AJAX discography search engine

http://www.xfeedme.com/discs/discography.html

using data from the FreeDB music database

http://www.freedb.org/

Unfortunately FreeDB has a lot of junk in it, including
randomly mixed character encodings for international
strings. As an expedient I decided to just delete all
characters that weren't ASCII, so I could get the thing
running. Now I look through the log files and notice that
a certain category of user immediately homes in on this
and finds it amusing to see how badly I've mangled
the strings :(. I presume they chuckle, make
disparaging remarks about the "united states of ascii",
and then leave, never to return.

Question: what is a good strategy for taking an 8-bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
UTF-8 if needed)? The string might be in any of the
myriad encodings that predate Unicode. Has anyone
done this in Python already? The output must be clean
UTF-8 suitable for arbitrary XML parsers.
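
To make the question concrete, here is a minimal sketch of the
kind of thing I have in mind. It is a guess-and-fall-back
heuristic, not a real detector: the function name, the candidate
list, and the order of the candidates are all assumptions of
mine, and it cannot tell one single-byte encoding from another
(or spot Shift-JIS and friends). It tries strict UTF-8 first,
then cp1252, then falls back to latin-1, which accepts any byte,
and finally drops the characters that XML 1.0 does not allow.

```python
import re

# Encodings to try, in order. The list and its order are assumptions;
# a real data set may need different candidates.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")

# Characters that XML 1.0 forbids: C0 controls other than tab, LF and CR,
# plus the noncharacters U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def to_clean_utf8(raw: bytes) -> bytes:
    """Best-effort conversion of bytes in an unknown encoding to UTF-8."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(encoding)  # strict: raises on invalid input
            break
        except UnicodeDecodeError:
            continue
    else:
        # latin-1 maps every byte to a code point, so this never fails,
        # although the result may not be what the author typed.
        text = raw.decode("latin-1")
    # Remove characters an XML parser would reject.
    text = _XML_ILLEGAL.sub("", text)
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_clean_utf8("Björk".encode("latin-1")).decode("utf-8"))
```

The result still has to be escaped (&, <, >) before it goes
into an XML document; this only makes the characters legal.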

Thanks, -- Aaron Watters

===

As someone once remarked to Schubert
"take me to your leider" (sorry about that).
-- Tom Lehrer
