Re: java string encoding
From: Jon A. Cruz (jon_at_joncruz.org)
Date: 02/10/04
- Next message: Steve W. Jackson: "Re: Efficient method for drawing many rectangles."
- Previous message: Markus Brosch: "unique identifier for an object?"
- In reply to: Sender: "java string encoding"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 10 Feb 2004 09:24:42 -0800
Sender wrote:
> I have string of Chinese characters encoded with "big5".
No, you don't.
> I convert it to
> unicode and use a printBytes method to display the hex value to System.out.
No "conversion to unicode" is actually going on, since Java chars are
*always* Unicode.
Also... System.out is usually broken with regard to printing non-ASCII
characters (that means characters that don't fall in the range from 0
through 127).
> Here is the coding:
>
>
> String cname ...... //this is the big5 string
How do you get that cname?
At the point at which you have a Java String, it has 16-bit Unicode for
its contents. By definition.
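To make that concrete, here is a minimal sketch of getting from raw Big5 bytes to a String with the encoding named explicitly. The class and method names (DecodeBig5, decodeBig5, charsToHex) are made up for illustration, and it assumes your JVM ships the "Big5" charset (standard JDKs do). It uses the Charset overloads rather than the String-name ones only to avoid the checked exception.

```java
import java.nio.charset.Charset;

class DecodeBig5 {
    // Decode raw Big5 bytes into a Java String, naming the encoding explicitly.
    static String decodeBig5(byte[] raw) {
        return new String(raw, Charset.forName("Big5"));
    }

    // Render each 16-bit char of the String as hex, separated by spaces.
    static String charsToHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < s.length(); k++) {
            if (k > 0) sb.append(' ');
            sb.append("0x").append(Integer.toHexString(s.charAt(k)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The last four bytes from the CName output quoted in this thread.
        byte[] raw = { (byte) 0xa4, (byte) 0xa4, (byte) 0xb0, (byte) 0xea };
        String s = decodeBig5(raw);
        // Four bytes in, two chars out: the String holds Unicode, not Big5.
        System.out.println(s.length() + " chars: " + charsToHex(s));
    }
}
```

Once the bytes are decoded like this, nothing "big5" is left inside the String.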
> if (cname != null) {
> printBytes(cname.getBytes(), "CName ");
This says "take the sequence of 16-bit Unicode characters living in the
String 'cname' and convert it to a byte array using the default platform
conversion"
> ch_name = new String(cname.getBytes(), "BIG5");
This says "take the sequence of 16-bit Unicode characters living in the
String 'cname', convert it to a byte array using the default platform
conversion, then decode those bytes as 'BIG5' into a new String of
16-bit Unicode characters"
> printBytes(ch_name.getBytes(), "ch_name ");
> String bb = new String(ch_name.getBytes("BIG5"));
Ok. There's a *huge* problem.
You just converted a String from 16-Bit Unicode chars to BIG5 bytes, and
then converted it *back* to a String of 16-bit Unicode chars *but* using
the default platform encoding for it.
That default encoding varies between platforms, locales, and user
settings. Never count on it.
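The whole sequence can be reproduced deterministically. This is only a sketch under assumptions: ISO-8859-1 stands in for "the default platform encoding" (yours may well differ), the input bytes are the ones from the quoted output, and the names MismatchDemo, PLATFORM and reproduce are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchDemo {
    static final Charset BIG5 = Charset.forName("Big5");
    // Assumption: stand-in for whatever the platform default happens to be.
    static final Charset PLATFORM = StandardCharsets.ISO_8859_1;

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Returns the hex dumps for cname, ch_name and bb, in that order.
    static String[] reproduce(byte[] raw) {
        // Big5 bytes wrongly decoded with the platform charset: 8 junk chars.
        String cname = new String(raw, PLATFORM);
        // Mirrors: new String(cname.getBytes(), "BIG5")
        String chName = new String(cname.getBytes(PLATFORM), BIG5);
        // Mirrors: new String(ch_name.getBytes("BIG5"))
        String bb = new String(chName.getBytes(BIG5), PLATFORM);
        // Mirrors printBytes(x.getBytes(), ...): encode with the platform charset.
        return new String[] {
            hex(cname.getBytes(PLATFORM)),
            hex(chName.getBytes(PLATFORM)), // real Chinese chars cannot be encoded here
            hex(bb.getBytes(PLATFORM))
        };
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xb5, (byte) 0xd8, (byte) 0xb1, (byte) 0xe1,
                       (byte) 0xa4, (byte) 0xa4, (byte) 0xb0, (byte) 0xea };
        for (String line : reproduce(raw)) {
            System.out.println(line);
        }
    }
}
```

Under those assumptions ch_name is the only String that actually holds the Chinese text, which is exactly why dumping it through the platform encoding yields question marks.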
> public static void printBytes(byte[] array, String name) {
> System.out.print(name + " = ");
> for (int k = 0; k < array.length; k++) {
> System.out.print("0x" + Common.byteToHex(array[k]) + " ");
> }
> System.out.println();
> }
That's decent. *However*, you need a printChars also:
public static void printChars(String str, String name) {
    System.out.print(name + " = ");
    for (int k = 0; k < str.length(); k++) {
        // Mask to 16 bits so each char prints as an unsigned hex value.
        System.out.print("0x" + Integer.toHexString(0x0ffff & str.charAt(k)) + " ");
    }
    System.out.println();
}
>
> And here is the output of printBytes:
>
> CName = 0xb5 0xd8 0xb1 0xe1 0xa4 0xa4 0xb0 0xea
> ch_name = 0x3f 0x3f 0x3f 0x3f
> bb = 0xb5 0xd8 0xb1 0xe1 0xa4 0xa4 0xb0 0xea
>
> As you can see, ch_name became "????". But the last 2 lines of code can
> convert it back to the original big5 string. Why?
Because at one point you said "convert this Java char sequence to bytes
using the default local character encoding" yet at another you said
"convert this Java char sequence to bytes using 'BIG5' to do the
conversion".
I would draw the conclusion that you misunderstand Strings in Java.
They do *not* store things in bytes.
They *do* store things in 16-bit unsigned Unicode characters. Always.
String.getBytes() and String.getBytes(String) *convert* the contents to a
byte array. They do *not* 'access' some internal byte array.
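A quick way to convince yourself of that: ask the same one-character String for its bytes in three encodings and you get three different arrays. This is a small sketch with hypothetical names (GetBytesConverts, encodeHex), again assuming the "Big5" charset is available in your JVM.

```java
import java.nio.charset.Charset;

class GetBytesConverts {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode the String with the named charset and return a hex dump.
    static String encodeHex(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // one char, U+4E2D
        System.out.println("UTF-8    = " + encodeHex(s, "UTF-8"));    // 0xe4 0xb8 0xad
        System.out.println("UTF-16BE = " + encodeHex(s, "UTF-16BE")); // 0x4e 0x2d
        System.out.println("Big5     = " + encodeHex(s, "Big5"));     // 0xa4 0xa4
    }
}
```

The String never changed. Only the conversion you asked for did.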
> Why? In fact, what I wanted to do
> is to convert the big5 string to unicode and store it as a varchar column in
> MySql. While I can store the ch_name, it only stored as "????" and
> retrieving it cannot be displayed in ShowString correctly. In other words,
> if convert-and-display, it works, if convert-store-display, it doesn't work.
Yes.
Java Strings are *always* Unicode.
Look to where you go in and out of Strings. Always use explicit
encodings. Never use String.getBytes() or new String(byte[]). Instead
use String.getBytes(String) and new String(byte[], String) exclusively.
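As a sketch of what "explicit on the way in, explicit on the way out" looks like, here is Big5 bytes turned into UTF-8 bytes by going through a String. The names (Transcode, big5ToUtf8) are hypothetical, UTF-8 as the target is just an example choice, and the Charset overloads shown are the newer equivalents of the String-name overloads mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    static final Charset BIG5 = Charset.forName("Big5");

    // bytes (Big5) -> chars (Unicode) -> bytes (UTF-8), never the platform default.
    static byte[] big5ToUtf8(byte[] big5Bytes) {
        String text = new String(big5Bytes, BIG5);    // decode: explicit charset
        return text.getBytes(StandardCharsets.UTF_8); // encode: explicit charset
    }

    public static void main(String[] args) {
        byte[] big5 = { (byte) 0xa4, (byte) 0xa4 };
        byte[] utf8 = big5ToUtf8(big5);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append(String.format("0x%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim());
    }
}
```

Because neither step consults the default encoding, the result is the same on every machine.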
Oh, and on all modern MS Windows systems, the user can change the local
encoding with a click on the taskbar. So don't trust it to stay fixed
even on a single machine.