Re: Is anything easier to do in java than in lisp?

RobertMaas_at_YahooGroups.Com
Date: 06/12/04


Date: Sat, 12 Jun 2004 09:33:47 -0700


> From: Antony Sequeira <usemyfullname@hotmail.com>
> Please see
> http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

Ah, thanks for posting that URL! The document is very enlightening.

> What I understand from the above is -
> Java chars are now just like C chars, only they are fixed to 16 bit
> width, they are not unicode chars.

One slight difference: 8 bits per character could theoretically allow
using the first 128 values as-is and the last 128 values only as parts
of larger representations. But many vendors have already used those
upper 128 values for special purposes, such as pseudo-graphics and
other special characters, so there's no chance of keeping only the
first 128 as directly represented and using the rest for encoding
multi-byte characters. The Unicode Consortium, by contrast, managed to
get one block of 16-bit values reserved for parts of larger character
codes before anybody started to use them. So whereas 8-bit C chars and
the various codings using them are ambiguous, the 16-bit Unicode
representation is unambiguous.

The section that describes this, unfortunately, is worded misleadingly
at the start:
   UTF-16 uses sequences of one or two unsigned 16-bit code units to
   encode Unicode code points. Values U+0000 to U+FFFF are encoded in one
   16-bit unit with the same value.
Not quite correct. Truth is: Values U+0000 to U+D7FF, and U+E000 to
U+FFFF, are encoded in one 16-bit unit with the same value. There are
not, and never will be, any characters assigned to code points in the
range U+D800 to U+DFFF, which are reserved for the use described below:
                                     Supplementary characters are encoded
   in two code units, the first from the high-surrogates range (U+D800 to
   U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF).
   This may seem similar in concept to multi-byte encodings, but there is
   an important difference: The values U+D800 to U+DFFF are reserved for
   use in UTF-16; no characters are assigned to them as code points. This
   means, software can tell for each individual code unit in a string
   whether it represents a one-unit character or whether it is the first
   or second unit of a two-unit character.
Yes, this contradicts the earlier misleading wording and corrects it.
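
To make the reserved-range point concrete, here is a minimal sketch of
how a code unit can be classified by value alone, and how a
supplementary code point splits into two units. The class and method
names are my own for illustration, not anything from the article; from
1.5 on, Character.toChars should do the same encoding job.

```java
public class SurrogateSketch {
    /** Classifies one UTF-16 code unit purely by its numeric value. */
    static String classify(char unit) {
        if (unit >= 0xD800 && unit <= 0xDBFF) return "high surrogate";
        if (unit >= 0xDC00 && unit <= 0xDFFF) return "low surrogate";
        return "one-unit character";
    }

    /** Encodes a code point as one or two UTF-16 code units. */
    static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (codePoint < 0x10000) {
            return new char[] { (char) codePoint };
        }
        int offset = codePoint - 0x10000;                // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        for (char unit : encode(0x20000)) {
            System.out.println(
                Integer.toHexString(unit).toUpperCase() + " -> " + classify(unit));
        }
    }
}
```

For U+20000 this prints D840 as a high surrogate and DC00 as a low
surrogate. No lookahead or lookbehind is needed to classify a unit,
which is exactly what the ambiguous 8-bit encodings can't offer.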

Still, the claim that a Java char is a Unicode character is not
correct, in particular whenever any character outside the 16-bit range
occurs. So any Java software that is to be of general use in handling
characters must watch for any surrogate in any incoming UTF-16 stream,
and must watch in any single-character input for any Unicode code
point too large for 16 bits (which requires generating two 16-bit
values internally) or any code point in the surrogate range (an error
in the input device). Any Java software counting characters in a
UTF-16 string must likewise count each pair of surrogate codes as a
single character. Non-general-purpose software can simply abort
whenever it sees any such problem.
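
A rough sketch of that counting rule, assuming the 1.5 methods
Character.isHighSurrogate and Character.isLowSurrogate. The class name
and the choice to throw on an unpaired surrogate are mine; the throw is
the "simply abort" option.

```java
public class CodePointCounter {
    /** Counts characters, treating each surrogate pair as one character. */
    static int countCharacters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // consume the low half of the pair
                } else {
                    throw new IllegalArgumentException(
                        "unpaired high surrogate at index " + i);
                }
            } else if (Character.isLowSurrogate(c)) {
                throw new IllegalArgumentException(
                    "unpaired low surrogate at index " + i);
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD840\uDC00b";
        System.out.println(s.length());                      // code units
        System.out.println(countCharacters(s));              // characters
        System.out.println(s.codePointCount(0, s.length())); // library count
    }
}
```

This prints 4, 3 and 3. One difference to keep in mind:
String.codePointCount counts an unpaired surrogate as one code point
instead of complaining, so the two agree only on well-formed input.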

As to the 8-bit encoding of Unicode:
   UTF-8 uses sequences of one to four bytes to encode Unicode code
   points. U+0000 to U+007F are encoded in one byte, U+0080 to U+07FF in
   two bytes, U+0800 to U+FFFF in three bytes, and U+10000 to U+10FFFF in
   four bytes.
Technically the three-byte encoding covers only U+0800 to U+D7FF and
U+E000 to U+FFFF, because there are no characters assigned to code
points in the range U+D800 to U+DFFF. Software should probably signal
an error (exception) if it sees any violation of that.
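
The byte-count ranges, with the surrogate hole treated as an error, can
be sketched like this. The class and method names are made up for
illustration.

```java
public class Utf8Length {
    /** Returns how many bytes standard UTF-8 needs for one code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("outside the Unicode range");
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            // Reserved for UTF-16 surrogates; never a character.
            throw new IllegalArgumentException("surrogate code point");
        }
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4;
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x20000 };
        for (int cp : samples) {
            System.out.println(
                "U+" + Integer.toHexString(cp).toUpperCase() + ": " + utf8Length(cp));
        }
    }
}
```

The four samples come out as 1, 2, 3 and 4 bytes, while anything in
U+D800 to U+DFFF raises an exception instead of quietly getting a
three-byte encoding.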

   The main decision the JSR-204 expert group had to make was how to
   represent supplementary characters in Java APIs, both for individual
   characters and for character sequences in all forms. A number of
   approaches were considered and rejected by the expert group:
This is where the fun starts. If you're curious, read this part of the
document! All your favorite ideas were rejected!

Here are the main points of the decision:
   In the end, the decision was for a tiered approach:
     * Use the primitive type int to represent code points in low-level
       APIs, such as the static methods of the Character class.
     * Interpret char sequences in all forms as UTF-16 sequences, and
       promote their use in higher-level APIs.
     * Provide APIs to easily convert between various char and code
       point-based representations.
..
   With this approach, a char represents a UTF-16 code unit, which is not
   always sufficient to represent a code point. ...
Note that code points will be represented as ints, not as chars, in all
the single-character low-level API methods. But note that
Character.toUpperCase can't work in general, because sometimes the
uppercase of a single character is two characters!
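
The German sharp s (U+00DF) is the usual example of this. A small
sketch, with class and helper names of my own choosing, comparing the
single-character method with the String method:

```java
import java.util.Locale;

public class UpperCaseSketch {
    /** Single-character mapping: can only ever return one char. */
    static char upperOfChar(char c) {
        return Character.toUpperCase(c);
    }

    /** String mapping: free to return more characters than it was given. */
    static String upperOfString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        char sharpS = '\u00DF';
        System.out.println(upperOfChar(sharpS) == sharpS);
        System.out.println(upperOfString(String.valueOf(sharpS)));
    }
}
```

This prints true and then SS: the char-level method hands the sharp s
back unchanged because it has no single-character uppercase, while the
String-level method expands it to two characters.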

Regarding source code:
   For example, the character U+20000 is written as "\uD840\uDC00".
Ugly!

Then of course there's modified UTF-8, which is incompatible with
standard UTF-8 but is used by the JVM! Not just ugly, but disgusting!
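
To see the incompatibility, one can compare the bytes from
DataOutputStream.writeUTF (which writes modified UTF-8 after a two-byte
length prefix) with standard UTF-8 from String.getBytes. This is only a
sketch; the class and helper names are mine, and it uses
StandardCharsets, which arrived well after 1.5.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Sketch {
    /** Returns the modified UTF-8 bytes of s, without the length prefix. */
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);
            out.flush();
            byte[] withPrefix = buffer.toByteArray();
            // writeUTF starts with a two-byte length; skip it.
            return Arrays.copyOfRange(withPrefix, 2, withPrefix.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String supplementary = "\uD840\uDC00"; // U+20000
        System.out.println(hex(supplementary.getBytes(StandardCharsets.UTF_8)));
        System.out.println(hex(modifiedUtf8(supplementary)));
        String nul = "\u0000";
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));
        System.out.println(hex(modifiedUtf8(nul)));
    }
}
```

For U+20000 standard UTF-8 gives F0 A0 80 80, but the modified form
gives ED A1 80 ED B0 80, that is, each surrogate encoded separately in
three bytes. U+0000 comes out as 00 in standard UTF-8 and as C0 80 in
the modified form.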

   the Java 2 SDK provides a code point input method
   which accepts strings of the form "\Uxxxxxx", where the uppercase "U"
   indicates that the escape sequence contains six hexadecimal digits,
   thus allowing for supplementary characters.
Fun!

As to which version of java gets these changes:
   The enhancements are part of version 1.5 of the Java 2 Platform,
   Standard Edition (J2SE).
Here on my own ISP, we have jdk1.2.2, which seems to be much earlier,
or am I confused? (I'm looking at the directory reported by the
whereis command.)


