Unicode and ANSI Common Lisp
From: Adam Warner (usenet_at_consulting.net.nz)
Date: 12/16/04
Date: Thu, 16 Dec 2004 12:19:07 +1300
Hi all,
I hope you've been following the interesting thread entitled "CLisp case
sensitivity". There is debate over how to count Unicode characters, which
affects the length of a string, individual character access and the
algorithms required to compute both.
This is a brilliant summary of the choices available to implementors:
<http://www.unicode.org/faq/char_combmark.html#7>
There appears to be growing consensus that #2, implementing strings as
sequences of Unicode code points, is the internal format that best
corresponds with the ANSI Common Lisp standard. Anything less exposes the
internal encoding of code points. Anything more (grapheme clusters) is
better left to a higher layer above ANSI Common Lisp.
What I hope people appreciate is that this really matters. Agreeing upon
how characters are counted is the only way for LENGTH to return the same
values across implementations for the same external strings. And this has
profound implications for how CHAR and SETF CHAR operate. It essentially
imposes a minimum internal format upon implementations: strings as
arrays of at least 21-bit values.
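To make the counting question concrete, here is a small sketch in Python rather than Lisp, chosen only because Python 3 strings already behave as sequences of code points. The helper name `utf16_code_units` and the sample string are my own illustration, not anything from the thread.

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the 16-bit range), 'b'
text = "a\U0001D11Eb"

code_points = len(text)              # counted as code points: 3
code_units = utf16_code_units(text)  # counted as UTF-16 code units: 4

# The largest Unicode code point needs 21 bits, hence "at least 21-bit values".
bits_for_max_code_point = (0x10FFFF).bit_length()

print(code_points, code_units, bits_for_max_code_point)
```

Two implementations that disagree on which of these counts LENGTH reports will give different answers for the same external string.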
In discussing UTF-16 Peter Seibel commented:
> Yes, I'm with you so far. That just means that LENGTH has to be
> implemented in a smarter way--it has to scan the array of code-points
> looking for surrogate pairs in order to determine how many characters
> are in the string. (That Java's String.length() method doesn't do this
> will no doubt cause no end of problems down the line.)
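The scan Peter describes could look roughly like the following Python sketch. The function name `count_code_points` is hypothetical, the input is assumed to be a list of 16-bit integers, and treating an unpaired surrogate as one character is my own assumption rather than anything the thread settled.

```python
def count_code_points(units):
    """Count characters in a sequence of UTF-16 code units.

    A high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF)
    counts as one character. An unpaired surrogate counts as one too.
    """
    count = 0
    i = 0
    n = len(units)
    while i < n:
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < n
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count

# 'a', U+1D11E as the surrogate pair D834 DD1E, 'b'
sample = [0x0061, 0xD834, 0xDD1E, 0x0062]
print(len(sample), count_code_points(sample))
```

Note that this makes LENGTH linear in the size of the string instead of constant time.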
This isn't the biggest issue. The biggest is that ANSI Common Lisp strings
are mutable and there is no way to store a surrogate pair in a single
16-bit position. Java has immutable strings, which already have to be
converted to mutable int arrays in order to modify code points.
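A sketch of why mutation is the harder problem, again in Python and with a plain list standing in for a 16-bit string buffer. The helper `encode_utf16` is my own illustration of the standard surrogate-pair arithmetic, not any implementation's API.

```python
def encode_utf16(code_point):
    """Return the UTF-16 code units (one or two) for a code point."""
    if code_point < 0x10000:
        return [code_point]
    v = code_point - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

units = [0x0061, 0x0062, 0x0063]      # "abc" in a 16-bit buffer
replacement = encode_utf16(0x1D11E)   # needs two units: D834 DD1E

# The analogue of (setf (char s 1) ...) cannot overwrite one slot in place.
# The buffer has to grow and everything after index 1 shifts right.
units[1:2] = replacement

print([hex(u) for u in units], len(units))
```

With storage of at least 21 bits per element the same assignment is a single in-place write and the length never changes.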
A big issue with grapheme clusters is that CHAR-CODE-LIMIT would be
essentially unbounded. You can't put a sensible number upon it. You could
try to limit it to 63 bits (three 21-bit code points) but there's still the
possibility that someone comes up with a sensible grapheme cluster that
exceeds a sequence of three code points.
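As a rough illustration, the Python sketch below builds a contrived but well-formed cluster of one base letter plus four combining marks and packs it at 21 bits per code point. The particular marks and the packing scheme are assumptions of mine, chosen only to show the 63-bit budget being exceeded.

```python
import unicodedata

# 'e' followed by combining acute, dot below, diaeresis and macron
cluster = "e\u0301\u0323\u0308\u0304"

# Every code point after the base has a non-zero combining class.
marks = [c for c in cluster[1:] if unicodedata.combining(c) != 0]

# Pack the cluster into one integer, 21 bits per code point.
packed = 0
for c in cluster:
    packed = (packed << 21) | ord(c)

bits_budgeted = 21 * len(cluster)
print(len(cluster), len(marks), bits_budgeted, packed.bit_length())
```

Nothing in Unicode caps the number of combining marks that may follow a base character, so any fixed limit can be exceeded the same way.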
It seems likely that agreement will settle on a CHAR-CODE-LIMIT of
#x110000 (2^16+2^20) as the most sensible way to implement Unicode 4.0.1+
in ANSI Common Lisp. As the UTF-16 encoding cuts off the possibility of any
higher code point values (surrogate pairs only provide for a maximum of
2^20 extra values) this limit should be sufficient for generations.
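The arithmetic behind that limit can be checked in a few lines of Python. The variable names are mine; the surrogate ranges are the standard UTF-16 ones.

```python
bmp_values = 2 ** 16                      # code points 0000..FFFF
high_surrogates = 0xDBFF - 0xD800 + 1     # 1024 lead values
low_surrogates = 0xDFFF - 0xDC00 + 1      # 1024 trail values
supplementary_values = high_surrogates * low_surrogates

char_code_limit = bmp_values + supplementary_values

# The largest surrogate pair, DBFF DFFF, decodes to the last code point.
largest = 0x10000 + ((0xDBFF - 0xD800) << 10) + (0xDFFF - 0xDC00)

print(hex(char_code_limit), hex(largest))
```

So the exclusive upper bound is #x110000 and the highest valid code is #x10FFFF.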
The ultimate implication is there are only two conforming ANSI Common Lisp
implementations with respect to Unicode 4.0.1+: CLISP and SBCL. The other
ones return incorrect values for LENGTH, CHAR, CHAR-CODE, etc. for
particular /external/ Unicode character sequences.
Another implication is that an ANSI Common Lisp implementation with 16-bit
characters is only conforming with respect to Unicode 3.0. Unicode 3.1
assigned the first supplementary characters, which cannot be mutated
correctly within a 16-bit character Lisp implementation and cannot be
created via a code point argument supplied to CODE-CHAR.
Regards,
Adam