Re: locale specific input
- From: "John W. Kennedy" <jwkenne@xxxxxxxxxxxxx>
- Date: Thu, 29 Jun 2006 15:53:34 -0400
Oliver Wong wrote:
"Miguel De Anda" <miguel@xxxxxxxxxxxxx> wrote in message news:44a0e7b9$0$9844$88260bb3@xxxxxxxxxxxxxxxxxxxx
I'm writing a little app that uses rss feeds from a website and I've found
that some feeds contain items in different languages. So far, I've only had
feeds that are in Japanese (eucjp). I've managed to get the feed to save
and display properly by adding the charset to my inputstream (or
inputstreamreader, I forgot). Anyway, the problem I'm having is that I'd
like to be able to read the xml, and then figure out the language that each
item is in. It seems that only a few are in Japanese, and I wouldn't be
surprised if they are sometimes mixed with items from different languages.
I've found the "Java port of Mozilla charset detector" and it works ok, but
it still won't be able to handle what I'm trying to do.
I'm using an rss library to parse the xml and give me simple objects to work
with. I'd hate to parse the xml manually by looking at bytes and then
feeding byte arrays to the charset detection library; this seems like a
dumb way to go (plus it means a lot more work).
Has anybody dealt with this in the past? I can't seem to find any solutions
on the net.
Thanks.
RSS uses XML. My understanding is that XML is by default encoded in UTF-8, and an XML parser should assume it's receiving UTF-8 data until it receives an encoding declaration stating otherwise. In other words, this should all work automatically.
It's a little more complicated.
A) The encoding can be specified externally (e.g., by an HTTP header).
B) If it is not specified, it may be either UTF-8 or UTF-16 without an encoding declaration.
C) And if there /is/ an encoding declaration, it is necessary to make at least an approximate guess in order to read the encoding declaration.
The matter is discussed at <URL:http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing>.
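
As a rough illustration of point C, here is a minimal Java sketch of the kind of first-pass guess that appendix describes: look at the first few bytes for a byte order mark or for the byte pattern of "<?xm", and only then read the encoding declaration as ASCII. The class name XmlEncodingSniffer and the method guess are made up for this example, the EBCDIC and unusual UCS-4 byte orders are left out, and an externally specified encoding (point A) is not considered at all.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: a simplified first guess
// at an XML document's encoding from its leading bytes.
public class XmlEncodingSniffer {
    // Matches encoding="NAME" or encoding='NAME' inside the XML declaration.
    private static final Pattern ENCODING =
        Pattern.compile("encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns an encoding name guessed from the first bytes of the document. */
    public static String guess(byte[] head) {
        // Byte order marks; the 4-byte forms must be tested before the 2-byte ones.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        // No BOM: "<?" encoded in 16-bit units.
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";
        // No BOM: "<?xm" in an ASCII-compatible encoding, so the declaration
        // itself can be read byte-for-byte to find the declared name.
        if (startsWith(head, 0x3C, 0x3F, 0x78, 0x6D)) {
            String text = new String(head, StandardCharsets.ISO_8859_1);
            int end = text.indexOf("?>");
            if (end >= 0) {
                Matcher m = ENCODING.matcher(text.substring(0, end));
                if (m.find()) return m.group(1);
            }
        }
        // Nothing recognizable: fall back to UTF-8.
        return "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] feed = "<?xml version=\"1.0\" encoding=\"EUC-JP\"?><rss/>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(feed)); // prints EUC-JP
    }
}
```

In practice it is usually simpler to hand the raw InputStream (not a Reader) to the XML parser, which performs this detection itself; wrapping the stream in an InputStreamReader with a fixed charset bypasses it.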
--
John W. Kennedy
"The blind rulers of Logres
Nourished the land on a fallacy of rational virtue."
-- Charles Williams. "Taliessin through Logres: Prelude"