Re: HELP: Unicode in Java 1.3.1 vs 1.4.2

From: John C. Bollinger (jobollin_at_indiana.edu)
Date: 02/15/05


Date: Tue, 15 Feb 2005 10:15:07 -0500

modest wrote:

> according to
> http://java.sun.com/docs/books/tutorial/i18n/text/string.html:
>
> "If a byte array contains non-Unicode text, you can convert the text to
> Unicode with one of the String constructor methods. Conversely, you can
> convert a String object into a byte array of non-Unicode characters
> with the String.getBytes method. When invoking either of these methods,
> you specify the encoding identifier as one of the parameters."
>
> It works fine in Java 1.3.1
>
> ------------------------------------------------------------------
> // Convert ASCII to Unicode
> str_uni = new String(str_ascii.getBytes(), "ISO8859_8");
>
> // Convert Unicode to ASCII
> str_ascii = new String(str_uni.getBytes("ISO8859_8"));
> ------------------------------------------------------------------
>
> In Java 1.4.2 it returns question marks only.
>
> What is the difference and how it can be fixed?

You are not using the canonical name of the charset, which is
"ISO-8859-8". Which charsets are available and how they are configured
depends on your Java installation. On my Sun JDK 1.4.2_05 installation,
the charset in question has no defined aliases and therefore can only be
referred to by its canonical name. I don't know why you are getting
anything at all in this case (you would get an
UnsupportedEncodingException if the charset name were unknown).
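
One way to see what a particular installation accepts is to ask
java.nio.charset.Charset (available since 1.4) directly. The following is
only a sketch: the class name CharsetNames and the helper canonicalName
are made up for illustration, and what it prints depends entirely on the
JRE it runs on.

```java
import java.nio.charset.Charset;
import java.util.TreeSet;

class CharsetNames {
    /**
     * Returns the canonical name for a charset name or alias,
     * or null if this JRE does not support it (hypothetical helper).
     */
    static String canonicalName(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalArgumentException e) {
            // Thrown when the name is null or is not a syntactically legal charset name.
            return null;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {"ISO-8859-8", "ISO8859_8"};
        for (String candidate : candidates) {
            String canonical = canonicalName(candidate);
            if (canonical == null) {
                System.out.println(candidate + ": not supported here");
            } else {
                // Sort the aliases so the listing is stable between runs.
                System.out.println(candidate + " -> " + canonical + ", aliases: "
                        + new TreeSet<String>(Charset.forName(canonical).aliases()));
            }
        }
    }
}
```

If a name is reported as unsupported, the String constructor and
String.getBytes overloads that take a charset name throw
UnsupportedEncodingException for it.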

That said, your code is deeply flawed. If you have data in a Java
String then it is already Unicode, *that is a fundamental characteristic
of Java Strings*. It does not make sense to talk about changing the
encoding / charset of a String -- the concept just doesn't apply (and
the i18n tutorial you refer to doesn't suggest otherwise). If you have
taken a byte sequence and created a String from it without accounting
for the bytes' charset then you are already hosed. This may be your
real problem, and it has not changed from 1.3 to 1.4 (or 1.5).
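
To make the distinction concrete, here is a minimal sketch of the usual
pattern: name the charset at the point where bytes become a String, and
again where a String becomes bytes. The class CharsetRoundTrip and its
methods are hypothetical, and the sample bytes assume the input really is
Hebrew text encoded as ISO-8859-8. The Charset-taking overloads used here
exist since Java 6; on 1.4 the equivalent calls are new String(bytes,
"ISO-8859-8") and getBytes("ISO-8859-8"), which declare
UnsupportedEncodingException.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class CharsetRoundTrip {
    /** Decodes raw bytes into a (Unicode) String using the named charset. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    /** Encodes a String into bytes using the named charset. */
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Four Hebrew letters as ISO-8859-8 bytes (assumed sample input).
        byte[] raw = {(byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED};

        // Bytes -> String: this is the only place the source charset matters.
        String text = decode(raw, "ISO-8859-8");
        System.out.println(text.equals("\u05E9\u05DC\u05D5\u05DD"));

        // String -> bytes: pick whatever charset the consumer expects.
        byte[] back = encode(text, "ISO-8859-8");
        System.out.println(Arrays.equals(raw, back));
    }
}
```

Note that nothing here "converts" a String from one charset to another.
The no-argument getBytes() and the String(byte[]) constructor use the
platform default charset, so mixing them with an explicit charset, as in
the quoted code, decodes or encodes with two different charsets unless the
default happens to match.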

In addition, it might be relevant to you that ASCII, Unicode, and all
the ISO-8859 nationalized charsets all assign the same codes to the
characters covered by ASCII. The UTF-8 encoding of Unicode likewise
encodes each ASCII character as a single byte whose value is the
character's own code.
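
The overlap is easy to check. In this sketch the class AsciiOverlap and
the helper sameBytes are made up for the example; StandardCharsets needs
Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiOverlap {
    /** True if the text encodes to identical bytes in both charsets. */
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        String ascii = "Hello, world";
        // Pure ASCII text: the encodings agree byte for byte.
        System.out.println(sameBytes(ascii, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));
        System.out.println(sameBytes(ascii, StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1));

        // A Hebrew letter is outside ASCII, so the encodings diverge.
        String shin = "\u05E9";
        if (Charset.isSupported("ISO-8859-8")) {
            System.out.println(sameBytes(shin, StandardCharsets.UTF_8, Charset.forName("ISO-8859-8")));
        }
    }
}
```

So for text that is purely ASCII the choice among these charsets makes no
difference to the bytes; it only starts to matter for characters outside
that range, such as Hebrew letters.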

-- 
John Bollinger
jobollin@indiana.edu

