recycling internationalized garbage
- From: aaronwmail-usenet@xxxxxxxxx
- Date: 8 Mar 2006 06:22:19 -0800
Hi folks,
Please help me with international string issues:
I put together an AJAX discography search engine
http://www.xfeedme.com/discs/discography.html
using data from the FreeDB music database
http://www.freedb.org/
Unfortunately FreeDB has a lot of junk in it, including
randomly mixed character encodings for international
strings. As an expediency I decided to just delete all
characters that weren't ASCII, so I could get the thing
running. Now I look through the log files and notice that
a certain category of user immediately homes in on this
and finds it amusing to see how badly I've mangled
the strings :(. I presume they chuckle and make
disparaging remarks about "united states of ascii"
and then leave never to return.
Question: what is a good strategy for taking an 8-bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
UTF-8 if needed)? The string might be in any of the
myriad encodings that predate Unicode. Has anyone
done this in Python already? The output must be clean
UTF-8 suitable for arbitrary XML parsers.
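
To make the question concrete, here is a rough sketch of one possible
fallback strategy, not a tested solution for FreeDB data. It assumes
modern Python 3 bytes/str semantics, and the candidate list (strict
UTF-8, then cp1252, then latin-1 as a last resort that never fails)
is purely a guess; the names CANDIDATES, recover_text and to_clean_utf8
are made up for illustration. It cannot tell apart encodings that are
all valid byte-for-byte (e.g. latin-1 vs. KOI8-R), so a statistical
detector would be needed for that.

```python
import re

# Encodings to try in order. This ordering is an assumption, not
# something derived from the actual FreeDB data.
CANDIDATES = ("utf-8", "cp1252")

# Characters that XML 1.0 does not allow in a document: C0 controls
# other than tab, LF and CR, plus U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def recover_text(raw):
    """Decode bytes of unknown encoding into a str, keeping as much as possible."""
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)  # strict: raises on invalid byte sequences
            break
        except UnicodeDecodeError:
            continue
    else:
        # latin-1 maps every byte to a code point, so this cannot fail.
        text = raw.decode("latin-1")
    # Drop anything an XML 1.0 parser would reject.
    return _XML_ILLEGAL.sub("", text)


def to_clean_utf8(raw):
    """Return the recovered text re-encoded as UTF-8 bytes."""
    return recover_text(raw).encode("utf-8")
```

Trying strict UTF-8 first matters because valid multi-byte UTF-8 is
very unlikely to occur by accident in a legacy 8-bit string, whereas
cp1252 and latin-1 accept almost anything.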
Thanks, -- Aaron Watters
===
As someone once remarked to Schubert
"take me to your lieder" (sorry about that).
-- Tom Lehrer