Re: Is anything easier to do in java than in lisp?
RobertMaas_at_YahooGroups.Com
Date: 06/12/04
- Next message: Rahul Jain: "Re: Generators in Lisp"
- Previous message: Joe Marshall: "Re: Was not making tail recursion elmination a mistake?"
- Maybe in reply to: RobertMaas_at_YahooGroups.Com: "Re: Is anything easier to do in java than in lisp?"
- Next in thread: RobertMaas_at_YahooGroups.Com: "Re: Is anything easier to do in java than in lisp?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Sat, 12 Jun 2004 09:33:47 -0700
> From: Antony Sequeira <usemyfullname@hotmail.com>
> Please see
> http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
Ah, thanks for posting that URL! The document is very enlightening.
> What I understand from the above is -
> Java chars are now just like C chars, only they are fixed to 16 bit
> width, they are not unicode chars.
One slight difference: 8 bits per character could theoretically allow
using the first 128 characters as-is and the last 128 characters only
as parts of larger representations. But already many vendors have used
those second 128 characters for special purposes, such as
pseudo-graphics characters, and special characters, so there's no
chance of discarding all but the first 128 as directly represented and
using the rest for encoding multi-byte characters. But the Unicode
Consortium has managed to get one block of 16-bit values reserved for
parts of larger character codes before anybody started to use them. So
whereas 8-bit C chars and various codings using them are ambiguous,
16-bit Unicode representation is unambiguous.
The section that describes this, unfortunately, is worded misleadingly
at the start:
    UTF-16 uses sequences of one or two unsigned 16-bit code units to
    encode Unicode code points. Values U+0000 to U+FFFF are encoded in one
    16-bit unit with the same value.
Not quite correct. Truth is: Values U+0000 to U+D7FF, and U+E000 to
U+FFFF, are encoded in one 16-bit unit with the same value. There are
not, and never will be, any characters assigned to code points in the
range U+D800 to U+DFFF, which are reserved for the use described below:
    Supplementary characters are encoded
    in two code units, the first from the high-surrogates range (U+D800 to
    U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF).
    This may seem similar in concept to multi-byte encodings, but there is
    an important difference: The values U+D800 to U+DFFF are reserved for
    use in UTF-16; no characters are assigned to them as code points. This
    means, software can tell for each individual code unit in a string
    whether it represents a one-unit character or whether it is the first
    or second unit of a two-unit character.
Yes, contradicting the mis-wording earlier, correcting the mistake.
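To make that last quoted point concrete, here is a minimal sketch of classifying a code unit in isolation. It assumes a JDK with the 1.5 additions (Character.isHighSurrogate and Character.isLowSurrogate); the class and method names are my own, not anything from the article.

```java
class CodeUnitKind {
    /** Classifies one UTF-16 code unit without looking at its neighbours. */
    static String classify(char unit) {
        if (Character.isHighSurrogate(unit)) {
            return "first unit of a pair";   // U+D800 to U+DBFF
        }
        if (Character.isLowSurrogate(unit)) {
            return "second unit of a pair";  // U+DC00 to U+DFFF
        }
        return "one-unit character";         // everything else
    }

    public static void main(String[] args) {
        String s = "A\uD840\uDC00";  // 'A' followed by U+20000
        for (int i = 0; i < s.length(); i++) {
            char unit = s.charAt(i);
            System.out.printf("%04X: %s%n", (int) unit, classify(unit));
        }
    }
}
```

No such test is possible with the old 8-bit multi-byte codings, where a given byte value can be either a whole character or a piece of one depending on what came before it.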
Still, the claim that a java character is a Unicode character is not
correct, in particular whenever any character outside the 16-bit range
occurs. So any java software that is to be of general use in handling
characters must watch for appearance of any surrogate in any UTF-16
stream coming in, and must watch in any single-character input for
appearance of any code point above U+FFFF (requires generating two
16-bit values internally) or any code point in the surrogate range (an
error in the input device). Any java software counting characters in a
UTF-16 string must likewise count pairs of surrogate codes as a single
character. Non-general-purpose software can simply abort whenever it
sees any such problem.
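A rough sketch of what that bookkeeping might look like, again assuming the 1.5 Character methods. The names countCodePoints and toUtf16 are hypothetical, and the choice to count an unpaired surrogate as one character (rather than abort) is my own assumption.

```java
class CodePointCounting {
    /** Counts code points, treating each well-formed surrogate pair as one. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i++;  // skip the low surrogate that completes the pair
            }
            count++;  // an unpaired surrogate is counted as one (assumption)
        }
        return count;
    }

    /** Converts one incoming code point to UTF-16, rejecting bad input. */
    static char[] toUtf16(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("outside U+0000..U+10FFFF");
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            throw new IllegalArgumentException("surrogate is not a character");
        }
        return Character.toChars(codePoint);  // one or two code units
    }
}
```

The 1.5 API also has String.codePointCount(begin, end), which should give the same count as the loop above, so the hand-written version is only there to show the logic.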
As to the 8-bit encoding of Unicode:
    UTF-8 uses sequences of one to four bytes to encode Unicode code
    points. U+0000 to U+007F are encoded in one byte, U+0080 to U+07FF in
    two bytes, U+0800 to U+FFFF in three bytes, and U+10000 to U+10FFFF in
    four bytes.
Technically the three-byte encoding covers only U+0800 to U+D7FF and
U+E000 to U+FFFF, because there are no characters assigned to code
points in the range U+D800 to U+DFFF. Software should probably signal
an error (exception) if it sees any violation of that.
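The ranges, with the surrogate hole treated as an error, can be sketched like this. The helper name utf8Length is hypothetical; it only computes the byte count and does not do the actual encoding.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    /** Returns how many bytes UTF-8 needs for one code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point");
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            // Reserved for UTF-16 surrogates, never a character.
            throw new IllegalArgumentException("surrogate code point");
        }
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4;
    }

    public static void main(String[] args) {
        int cp = 0x20000;
        // Cross-check against the library's own UTF-8 encoder.
        String s = new String(Character.toChars(cp));
        int actual = s.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(utf8Length(cp) + " " + actual);
    }
}
```

StandardCharsets is from a much later JDK than the one under discussion; on an older one the charset name "UTF-8" passed as a string would do the same job.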
    The main decision the JSR-204 expert group had to make was how to
    represent supplementary characters in Java APIs, both for individual
    characters and for character sequences in all forms. A number of
    approaches were considered and rejected by the expert group:
This is where the fun starts. If you're curious, read this part of the
document! All your favorite ideas were rejected!
Here are the main points of the decision:
    In the end, the decision was for a tiered approach:
    * Use the primitive type int to represent code points in low-level
      APIs, such as the static methods of the Character class.
    * Interpret char sequences in all forms as UTF-16 sequences, and
      promote their use in higher-level APIs.
    * Provide APIs to easily convert between various char and code
      point-based representations.
    ..
    With this approach, a char represents a UTF-16 code unit, which is not
    always sufficient to represent a code point. ...
Note that code points will be represented as ints, not as chars, in all
the single-character low-level API methods. But note that
Character.toUpperCase can't work in general, because sometimes the
uppercase of a single character is two characters!
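The standard example is the German sharp s (U+00DF), whose full uppercase is "SS". A small sketch, assuming the int overload of Character.toUpperCase from 1.5; the wrapper names simpleUpper and fullUpper are mine.

```java
import java.util.Locale;

class UpperCaseDemo {
    /** One-to-one mapping: one code point in, one code point out. */
    static int simpleUpper(int codePoint) {
        return Character.toUpperCase(codePoint);
    }

    /** String-level mapping, which is allowed to change the length. */
    static String fullUpper(int codePoint) {
        return new String(Character.toChars(codePoint)).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        int sharpS = 0x00DF;  // LATIN SMALL LETTER SHARP S
        // The single-character method has nowhere to put a second letter,
        // so it hands the input back unchanged.
        System.out.printf("%04X%n", simpleUpper(sharpS));
        // The String method can expand one character into two.
        System.out.println(fullUpper(sharpS));
    }
}
```

So the int-based method is at least able to take any code point as input, but only String.toUpperCase can return the two-character result.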
Regarding source code:
    For example, the character U+20000 is written as "\uD840\uDC00".
Ugly!
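For anyone wondering where D840 and DC00 come from, the arithmetic is: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch with hypothetical names encode and decode (the library equivalents in 1.5 are Character.toChars and Character.toCodePoint):

```java
class SurrogateMath {
    /** Splits a supplementary code point into its surrogate pair. */
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    /** Recombines a surrogate pair into the code point it stands for. */
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x20000);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        System.out.printf("%X%n", decode(pair[0], pair[1]));
    }
}
```

For U+20000 the offset is 0x10000, whose top 10 bits are 0x40 and bottom 10 bits are 0, which gives D840 and DC00 as in the article.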
Then of course there's modified UTF-8 which is incompatible with UTF-8,
but is used by the jvm! Not just ugly, but disgusting!
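The incompatibility can be seen from ordinary code, because DataOutputStream.writeUTF is documented to write modified UTF-8. A sketch comparing byte counts; the helper names are hypothetical, and StandardCharsets and UncheckedIOException come from later JDKs than the one the article is about.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    /** Bytes written by writeUTF, minus its two-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Bytes of the same string in standard UTF-8. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String supplementary = "\uD840\uDC00";  // U+20000
        // Modified UTF-8 encodes each surrogate separately (3 bytes each);
        // standard UTF-8 encodes the code point once (4 bytes).
        System.out.println(modifiedUtf8(supplementary).length);
        System.out.println(standardUtf8(supplementary).length);
        // Modified UTF-8 also avoids a zero byte for U+0000.
        System.out.println(modifiedUtf8("\0").length);
        System.out.println(standardUtf8("\0").length);
    }
}
```

So a standard UTF-8 decoder handed the modified form of U+20000 sees two encoded surrogates, which it is supposed to reject.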
    the Java 2 SDK provides a code point input method
    which accepts strings of the form "\Uxxxxxx", where the uppercase "U"
    indicates that the escape sequence contains six hexadecimal digits,
    thus allowing for supplementary characters.
Fun!
As to which version of java gets these changes:
    The enhancements are part of version 1.5 of the Java 2 Platform,
    Standard Edition (J2SE).
Here on my own ISP, we have jdk1.2.2, which seems to be much earlier,
or am I confused? (I'm looking at directory from whereis command.)