Re: utf-8 encoding

From: Dale King (KingD_at_tmicha.net)
Date: 11/05/03


Date: Wed, 5 Nov 2003 13:16:13 -0500


"Sascha Obermüller" <sing-sing@gmx.net> wrote in message
news:bo9gut$7tc$07$1@news.t-online.com...
> I'm building a Crawler that chop different nationalities websites' text into
> segments and terms.
> My Problem: I have to transform all used encodings (e.g.: 8849-1 etc.) of
> sites to utf-8 format. How can i do that?

Not difficult at all. When reading the text you transform the bytes read
into Unicode characters. That is done with an InputStreamReader (the JDK 1.4
NIO APIs offer other ways as well) constructed with the name of the page's
encoding, so that it decodes the bytes correctly. The list of supported
encodings is here:
http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html
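
As a rough sketch of the reading side, something like the following should
work. The class name DecodeExample and the method decode are made up for
illustration, and the charset name is assumed to come from wherever the
crawler learns the page's encoding (an HTTP header or meta tag, say).
Charset.isSupported is part of the NIO charset API and lets you reject an
encoding the JVM does not know before you start reading.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical helper class, named here only for illustration.
class DecodeExample {
    // Decodes all bytes from 'in' into a String using the named charset.
    static String decode(InputStream in, String charsetName) throws IOException {
        // Fail early if this JVM has no decoder for the requested charset.
        if (!Charset.isSupported(charsetName)) {
            throw new UnsupportedEncodingException(charsetName);
        }
        Reader reader = new InputStreamReader(in, charsetName);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        // read() returns -1 at end of stream.
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

The caller is still responsible for closing the stream it passed in.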

Then when outputting you use an OutputStreamWriter with the encoding set
to UTF-8.
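
Putting the reader and the writer together, a minimal transcoding sketch
might look like this. Transcoder and toUtf8 are hypothetical names, and the
source charset is again assumed to be known for each page. The characters
pass through Java's internal Unicode representation on the way from one
byte encoding to the other.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper class, named here only for illustration.
class Transcoder {
    // Copies 'in' to 'out', decoding from sourceCharset and encoding as UTF-8.
    static void toUtf8(InputStream in, String sourceCharset, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, sourceCharset);
        Writer writer = new OutputStreamWriter(out, "UTF-8");
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        // Push any buffered characters through the encoder to 'out'.
        writer.flush();
    }
}
```

For example, the single ISO-8859-1 byte 0xFC (the "ü" in Obermüller) should
come out as the two UTF-8 bytes 0xC3 0xBC.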

For more information you might want to see the internationalization trail of
the tutorial:
http://java.sun.com/docs/books/tutorial/i18n/index.html

And this section in particular:
http://java.sun.com/docs/books/tutorial/i18n/text/convertintro.html

--
 Dale King

