Re: locale specific input




"Miguel De Anda" <miguel@xxxxxxxxxxxxx> wrote in message news:44a0e7b9$0$9844$88260bb3@xxxxxxxxxxxxxxxxxxxx
I'm writing a little app that uses RSS feeds from a website, and I've found
that some feeds contain items in different languages. So far, I've only had
feeds that are in Japanese (EUC-JP). I've managed to get the feed to save
and display properly by adding the charset to my InputStream (or
InputStreamReader, I forget which). Anyway, the problem I'm having is that
I'd like to be able to read the XML and then figure out the language that
each item is in. It seems that only a few are in Japanese, and I wouldn't be
surprised if they are sometimes mixed with items in different languages.
I've found the "Java port of Mozilla charset detector" and it works OK, but
it still won't be able to handle what I'm trying to do.

I'm using an RSS library to parse the XML and give me simple objects to work
with. I'd hate to parse the XML manually by looking at bytes and then
feeding byte arrays to the charset detection library. That seems like a
dumb way to go (plus it means a lot more work).

Has anybody dealt with this in the past? I can't seem to find any solutions
on the net.

Thanks.

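RSS is XML. My understanding is that XML is encoded in UTF-8 by default (or UTF-16 if the document starts with a byte order mark), and an XML parser should assume that encoding unless the document's encoding declaration states otherwise. In other words, this should all work automatically, as long as the parser is handed the raw bytes of the feed rather than text that has already been decoded with some charset.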

Possible reasons why it might not be working:

(1) The RSS library is buggy.
(2) The author of the RSS feed set their encoding declaration incorrectly.
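
As a rough illustration of the "should work automatically" point, here is a minimal sketch using the JDK's built-in DOM parser rather than whatever RSS library is actually in use (the class name FeedEncodingDemo, the helper names, and the sample feed are made up for this example). It assumes the runtime supports the EUC-JP charset. The parser is given a raw InputStream, so it reads the encoding declaration itself and no charset is passed in by the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class FeedEncodingDemo {

    // Hypothetical sample feed: a tiny RSS document whose XML declaration
    // names the same encoding that is used to produce the bytes.
    static byte[] sampleFeed(String encoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>"
                + "<rss version=\"2.0\"><channel><item>"
                + "<title>\u65e5\u672c\u8a9e</title>"
                + "</item></channel></rss>";
        return xml.getBytes(Charset.forName(encoding));
    }

    // Parse from raw bytes (not from a Reader) so that the parser, not the
    // caller, decides how to decode them based on the encoding declaration.
    static Document parse(byte[] feedBytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(feedBytes));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse feed", e);
        }
    }

    // Text of the first <title> element in the document.
    static String firstTitle(byte[] feedBytes) {
        Document doc = parse(feedBytes);
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) {
        byte[] bytes = sampleFeed("EUC-JP");
        // getXmlEncoding() reports what the XML declaration claimed, which
        // helps when checking whether a feed author declared it correctly.
        System.out.println("declared encoding: " + parse(bytes).getXmlEncoding());
        System.out.println("first title: " + firstTitle(bytes));
    }
}
```

If the title comes back garbled with this approach, that points toward possibility (2): the bytes do not match the declared encoding. Wrapping the stream in an InputStreamReader with a hand-picked charset bypasses the declaration entirely, which can hide or cause the same symptom.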

- Oliver
