Re: JNI / localization / filenames
From: Chris Uppal (chris.uppal_at_metagnostic.REMOVE-THIS.org)
Date: 01/21/04
- In reply to: Jon A. Cruz: "Re: JNI / localization / filenames"
Date: Wed, 21 Jan 2004 11:14:22 -0000
Jon A. Cruz wrote:
> > OTOH. There is no such thing as UTF-16 in JNI. Just the option of
> > using 16-bit quantities to represent Java 16-bit 'char's directly --
> > but that's not UTF-16.
>
> Sun experts in the area would disagree with you on that. They definitely
> consider it UTF-16 and not UCS-2, and are continually adding more
> support for UTF-16 details in implementations, including more support of
> surrogate pairs, etc. JSR 204 has more info.
I don't think so. Looking at JSR 204 (thanks for the pointer, I hadn't seen it
before), I get the impression that they are attempting to come to terms with the
fact that Java and Unicode don't match.
--- the intro to JSR 204 ---
Version 3.1 of the Unicode standard is the first one to define characters that
cannot be described by single 16-bit code points and thus the standard breaks a
fundamental assumption of the Java programming language and APIs. This JSR
defines the necessary adjustments to the Java APIs to enable support for such
characters and enables the Java platform to continue to track the Unicode
standard.
----------
Unfortunately the rest of the JSR paper doesn't seem to provide much
information.
However it seems clear that they are following the Unicode standard's very
unfortunate wording and thinking of characters with code points >= 2**16 as
somehow "additional", maybe not "real Unicode characters".
Unicode cannot be represented by 16-bit characters.
Sequences of Unicode characters (up to 21-bit) can be represented as sequences
of 16-bit quantities using UTF-16. However neither Java Strings, nor the
arrays of jchar manipulated by JNI are in this encoding. Java/JNI use a direct
encoding of Java's 16-bit characters as (probably) "unsigned short" in JNI.
That isn't UTF-16. Java's encoding is neither upward nor downward compatible
with UTF-16 (though there are many sequences of characters that are encoded the
same way in both.)
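
To make the distinction concrete, here is a minimal sketch of the UTF-16
arithmetic written out by hand. The class and method names (Utf16Sketch,
encode) are made up for illustration and are not part of any Java or JNI API.
It only shows that a code point at or above 2**16 becomes two 16-bit
quantities, neither of which is that character on its own.

```java
// Sketch only: hand-written UTF-16 encoding of a single code point.
// Utf16Sketch and encode are hypothetical names, not an existing API.
final class Utf16Sketch {
    /** Encodes one Unicode code point as one or two 16-bit code units. */
    static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException(
                    "not an encodable code point: " + codePoint);
        }
        if (codePoint < 0x10000) {
            // Fits in 16 bits: here UTF-16 and a plain Java char agree.
            return new char[] { (char) codePoint };
        }
        int v = codePoint - 0x10000;                // 20 bits remain
        char high = (char) (0xD800 + (v >> 10));    // top 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] units = encode(0x1D11E);  // MUSICAL SYMBOL G CLEF
        System.out.println(units.length);                    // 2
        System.out.println(Integer.toHexString(units[0])
                + " " + Integer.toHexString(units[1]));      // d834 dd1e
    }
}
```

A char[] holding those two units has length 2, but it contains one Unicode
character.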
Granted, a Java String could be used to hold a UTF-16 sequence (but then so
could a char[], a short[], or a byte[], or -- hell -- even a double[] since
it's only a string of bytes). But the Java "char"s in such a sequence are not
the same as the Unicode "character"s in the same collection of bits.
That's why I say that Java doesn't support Unicode, and that Java Strings, and
char[]s, are *not* UTF-16.
The good people working on JSR 204 will have to find a way to work this out.
I'd guess that they'll introduce new APIs for using Strings and char[]s to hold
UTF-16-encoded data, and have int-returning methods that (e.g.) know how to do
the decoding to find the (say) 8th Unicode character in a String. The use of
the "char" primitive datatype will start to look very dodgy indeed. With luck
they'll also define a few UnicodeString classes which separate the interface
from the representation, and (internally) encode the Unicode data in
programmer-selectable ways.
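
As a rough idea of what such an int-returning method might look like, here is
a sketch that treats a char[] as UTF-16 and returns the n-th Unicode character
as an int. The names (CodePointLookupSketch, codePointAtIndex) are invented for
this example and do not describe whatever API JSR 204 ends up defining. It
assumes an unpaired surrogate should simply be returned as-is.

```java
// Sketch only: find the n-th Unicode character in a char[] read as UTF-16.
// CodePointLookupSketch and codePointAtIndex are hypothetical names.
final class CodePointLookupSketch {
    /** Returns the n-th (0-based) code point in units, decoding surrogate pairs. */
    static int codePointAtIndex(char[] units, int n) {
        int seen = 0;
        for (int i = 0; i < units.length; i++) {
            int cp = units[i];
            // A high surrogate followed by a low surrogate is one character.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < units.length
                    && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
                i++;  // skip the low surrogate we just consumed
            }
            // Assumption: an unpaired surrogate is returned unchanged.
            if (seen == n) {
                return cp;
            }
            seen++;
        }
        throw new IndexOutOfBoundsException("no code point at index " + n);
    }
}
```

For the array { 'a', 0xD834, 0xDD1E, 'b' }, index 1 yields 0x1D11E and index 2
yields 'b', although the char at position 2 of the array is 0xDD1E. The lookup
is also a linear scan, not the constant-time indexing that char[] gives you.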
However none of that has happened yet. When it does it will introduce another
boat-load of complexity into the Java programmer's life, and invalidate
(partially) a load of text handling code that already exists. They are going
to have a very hard time trying to sell this stuff to the community, and their
job won't be made easier by the fact that Sun has traditionally blurred the
differences between the Java APIs and real Unicode -- such as the many APIs
that falsely claim to talk UTF-8.
-- chris