Re: UNICODE in Java Help

From: Chris Smith (cdsmith_at_twu.net)
Date: 05/27/04


Date: Thu, 27 May 2004 13:35:29 -0600

Nicholas Pappas wrote:
> This is the block of code in my loader which reads the strings
> from the file:

[...]

> /** read in the 40 byte buffer */
> in.read(bmpPath);
>
> /** trim buffer to length and store */
> for (len=0; len < 40; len++) {
> if (bmpPath[len] == 0)
> break;
> }
> textures[i] = new String(bmpPath, 0, len);

[...]

> Does anyone have any suggestions on how I fix this so I can read the
> Korean text in both Windows and Linux (and other OSs)?

You need a basic understanding of the relationship between bytes and
characters, and of the concept of character encodings. And you need
more information about your input file; specifically, what character
encoding it uses. There are a number of potential problems here:

1. (Actually not related to character encodings) Your call to in.read is
flawed. Take a look at the API documentation for that method.
Specifically, the method is not guaranteed to read the entire array. It
is only specified to read at least one byte but not more than the length
of the array, and to return the number of bytes that it has read. If you
want to read the entire byte array, you'll need to write a loop; sorta
like this:

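    int pos = 0;
    while (pos < bmpPath.length)
    {
        // read() may deliver fewer bytes than requested, so keep asking
        int len = in.read(bmpPath, pos, bmpPath.length - pos);

        // -1 means end of stream; the handler must not return normally,
        // or this loop would never finish
        if (len == -1) handlePrematureEOF();
        else pos += len;
    }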

Of course, handlePrematureEOF() should be replaced with appropriate
error-handling code, such as throwing an exception indicating the bad
file format.
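
If you'd rather not write that loop by hand, the standard library has
DataInputStream.readFully, which does the same job and throws
EOFException when the stream ends early. Here is a minimal sketch; the
class name ReadFullyDemo and the helper readRecord are made up for
illustration, and the 40-byte size is taken from your loader.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class ReadFullyDemo {
    // Hypothetical helper: returns exactly `size` bytes from the stream,
    // or throws EOFException if the stream ends before that.
    static byte[] readRecord(InputStream in, int size) throws IOException {
        byte[] buf = new byte[size];
        // DataInputStream does no buffering of its own, so wrapping the
        // stream here does not swallow bytes beyond the record.
        new DataInputStream(in).readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // A complete 40-byte record is read in full.
        byte[] rec = readRecord(new ByteArrayInputStream(new byte[40]), 40);
        System.out.println(rec.length);

        // A truncated file is reported instead of being silently accepted.
        try {
            readRecord(new ByteArrayInputStream(new byte[10]), 40);
        } catch (EOFException e) {
            System.out.println("premature EOF");
        }
    }
}
```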

2. You don't specify an encoding when you convert the data in the byte
array to text. That data was encoded in some specific encoding when
the file was written. The code you've written will work only if you get
lucky and the platform-default character encoding happens to match the
encoding in the file. To make this work reliably in a cross-platform
way, you need to discover what encoding was used in the file, and
specify that in a separate parameter, for example:

    textures[i] = new String(bmpPath, 0, len, "UTF-8");

(That gets you UTF-8 encoding, which is probably a decent guess; but you
need to find out the real encoding to be sure this will work. It should
be documented with the file format spec.)
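
To see why the encoding has to be named explicitly, here is a small
sketch that decodes the same bytes with two different encodings. The
class DecodeDemo and its decode helper are invented for this example,
and it uses the StandardCharsets constants (Java 7 and later) rather
than an encoding name string; UTF-8 is still only a guess at what your
file actually contains.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Hypothetical helper: decode the first `len` bytes with an explicit charset.
    static String decode(byte[] bytes, int len, Charset cs) {
        return new String(bytes, 0, len, cs);
    }

    public static void main(String[] args) {
        // Two Hangul syllables, written as escapes so that the source
        // file's own encoding cannot interfere with the demo.
        String original = "\uD55C\uAE00";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the encoding the bytes were written in round-trips.
        String right = decode(utf8, utf8.length, StandardCharsets.UTF_8);
        System.out.println(right.equals(original)); // true

        // Decoding with a different encoding yields different characters.
        String wrong = decode(utf8, utf8.length, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original)); // false
    }
}
```

The second case is what happens when the platform default doesn't match
the file: no exception, just the wrong text.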

3. This is a bit of a subtle one, actually. The test for bytes to equal
zero, which you use to determine the end of the String, will not work
reliably across character encodings. In some multi-byte character
encodings (UTF-16, for example), a zero byte can be embedded inside a
character whose character code is itself non-zero. (UTF-8 happens to be
safe in this respect, but you don't yet know that your file uses it.)
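
A quick way to convince yourself of the zero-byte problem is to encode a
short string and look for zero bytes. This is only a sketch; the class
ZeroByteDemo and the helper firstZeroByte are made up for the purpose,
and UTF-16BE is just one example of an encoding that behaves this way.

```java
import java.nio.charset.StandardCharsets;

class ZeroByteDemo {
    // Hypothetical helper: index of the first zero byte, or -1 if there is none.
    static int firstZeroByte(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 0) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // 'A' followed by one Hangul syllable; no null character anywhere.
        String name = "A\uD55C";

        // In UTF-16BE, 'A' is the byte pair 00 41, so a byte-level scan
        // would cut the string off before its first character.
        System.out.println(firstZeroByte(name.getBytes(StandardCharsets.UTF_16BE))); // 0

        // In UTF-8 the same text contains no zero byte at all.
        System.out.println(firstZeroByte(name.getBytes(StandardCharsets.UTF_8))); // -1
    }
}
```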

To work around this, you need to swap the order. If your strings are
null-terminated, then convert your byte array to characters first, then
look for a null character (i.e., Unicode value zero), rather than a zero
byte. That looks like this:

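    // Named "reader" so it doesn't clash with your input stream "in"
    InputStreamReader reader = new InputStreamReader(
        new ByteArrayInputStream(bmpPath), "UTF-8");
    StringWriter sw = new StringWriter();

    // read() returns -1 at end of input and 0 for the null terminator;
    // either one ends the loop
    int c;
    while ((c = reader.read()) > 0) sw.write((char) c);

    textures[i] = sw.toString();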

This is an alternative to the String constructor you used to convert to
characters, and notice that you still need to know the proper character
encoding.
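
The same decode-first idea can also be written with the String
constructor plus indexOf, if you find that easier to read. A rough
sketch follows, assuming the whole fixed-size buffer can be decoded in
one go; NulTerminatedDemo and decodeNulTerminated are hypothetical
names, and UTF-16LE is used only because it shows the zero-byte problem
clearly, not because your file necessarily uses it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NulTerminatedDemo {
    // Hypothetical helper: decode the whole buffer first, then cut the
    // result at the first null CHARACTER rather than the first zero byte.
    static String decodeNulTerminated(byte[] buf, Charset cs) {
        String all = new String(buf, cs);
        int end = all.indexOf('\0');
        return end < 0 ? all : all.substring(0, end);
    }

    public static void main(String[] args) {
        // Build a 40-byte, zero-padded record like the one in the loader.
        String name = "tex\uD55C.bmp";
        byte[] text = name.getBytes(StandardCharsets.UTF_16LE);
        byte[] buf = new byte[40];
        System.arraycopy(text, 0, buf, 0, text.length);

        // Every ASCII letter carries a zero byte in UTF-16LE, yet the
        // full name survives because the cut happens after decoding.
        String decoded = decodeNulTerminated(buf, StandardCharsets.UTF_16LE);
        System.out.println(decoded.equals(name)); // true
    }
}
```

Any bytes after the terminator get decoded too and are then thrown
away, which is harmless for a 40-byte buffer.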

Hope that gets you started,

-- 
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

