Re: converting unicode to UTF-8

From: Chris Uppal (chris.uppal_at_metagnostic.REMOVE-THIS.org)
Date: 11/20/04


Date: Sat, 20 Nov 2004 12:59:12 -0000

peter10 wrote:

> ByteArrayOutputStream out = new ByteArrayOutputStream();
> DataOutputStream dataOut = new DataOutputStream(out);
> dataOut.writeUTF(text_input);

The first problem here is that writeUTF() does /NOT/ write UTF-8. It's an
incredibly, unbelievably, stupidly, misleadingly-named method. What it does is
write a two-byte length (the number of encoded bytes that follow, as Steve has
already mentioned) followed by the string in Java's "modified UTF-8", a format
that is closely related to, but not interchangeable with, real UTF-8: U+0000 is
written as the two bytes C0 80, and characters outside the Basic Multilingual
Plane are written as two separately encoded surrogates.
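
To make the difference concrete, here is a minimal sketch that encodes the same
String both ways and compares the results. The class and method names
(WriteUtfDemo, viaWriteUTF, viaUtf8) are made up for illustration, and the
checked exceptions are wrapped only to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfDemo {
    // Bytes produced by DataOutputStream.writeUTF():
    // a 2-byte length prefix followed by "modified UTF-8".
    static byte[] viaWriteUTF(String s) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream dataOut = new DataOutputStream(out);
            dataOut.writeUTF(s);
            dataOut.flush();
            return out.toByteArray();
        } catch (IOException e) {
            // Not expected for an in-memory stream and a short string.
            throw new IllegalStateException(e);
        }
    }

    // Bytes produced by the real UTF-8 charset.
    static byte[] viaUtf8(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (IOException e) {
            // UnsupportedEncodingException is an IOException; UTF-8 is always present.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String s = "A\0"; // 'A' followed by U+0000
        // writeUTF: 00 03 (length) 41 ('A') C0 80 (U+0000) -> 5 bytes
        System.out.println(viaWriteUTF(s).length);
        // UTF-8:    41 ('A') 00 (U+0000)                   -> 2 bytes
        System.out.println(viaUtf8(s).length);
    }
}
```

Even for plain ASCII text the two outputs differ, because of the length prefix
that writeUTF() puts in front.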

UTF-8 is a way of taking a stream/string of Unicode characters (and Java
Strings can be viewed as such, although the correspondence is not as close as
it looks), and representing them as bytes in a binary stream or similar. In
Java that conversion is ultimately provided by a "charset", specifically the
one named "UTF-8". Probably the easiest way for you to use that would be
either to ask your String for its UTF-8 bytes:
    aString.getBytes("UTF-8");
or to use an OutputStreamWriter constructed with a 'charsetName' of "UTF-8".
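
A minimal sketch of both options follows. The class and method names
(Utf8Encode, withGetBytes, withWriter) are hypothetical, the
ByteArrayOutputStream stands in for whatever stream you are really writing to,
and the try-with-resources form assumes Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utf8Encode {
    // Option 1: ask the String for its UTF-8 bytes directly.
    static byte[] withGetBytes(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (IOException e) {
            // UnsupportedEncodingException is an IOException; UTF-8 is always present.
            throw new IllegalStateException(e);
        }
    }

    // Option 2: wrap an OutputStream in a Writer that encodes as UTF-8.
    static byte[] withWriter(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, "UTF-8")) {
            writer.write(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
        // The writer has been closed (and so flushed) by this point.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // 4 characters, the last one is U+00E9
        System.out.println(withGetBytes(text).length); // 5: U+00E9 becomes C3 A9
        System.out.println(withWriter(text).length);   // 5: same bytes, no length prefix
    }
}
```

getBytes() is the simpler choice when the whole String is already in memory;
the Writer is the better fit when text is being streamed out piece by piece.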

    -- chris


