Unicode and ANSI Common Lisp

From: Adam Warner (usenet_at_consulting.net.nz)
Date: 12/16/04


Date: Thu, 16 Dec 2004 12:19:07 +1300

Hi all,

I hope you've been following the interesting thread entitled "CLisp case
sensitivity". There is debate over how to count Unicode characters, which
affects the length of a string, access to individual characters, and the
algorithms required to compute both.

This is a brilliant summary of the choices available to implementors:
<http://www.unicode.org/faq/char_combmark.html#7>

There appears to be growing consensus that option #2 in that FAQ,
implementing strings as sequences of Unicode code points, is the internal
format that best corresponds with the ANSI Common Lisp standard. Anything
less exposes the internal encoding of code points. Anything more (grapheme
clusters) is better left to a higher layer above ANSI Common Lisp.

What I hope people appreciate is that this really matters. Agreeing upon
how characters are counted is the only way for LENGTH to return the same
values across implementations for the same external strings. And this has
profound implications for how CHAR and SETF CHAR operate. It essentially
imposes a minimum internal format upon implementations: strings as
arrays of at least 21-bit values, since the largest code point, #x10FFFF,
needs 21 bits.
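
To make the counting problem concrete, here is a minimal sketch in Python (chosen only because its strings are already sequences of code points; the helper name count_units is made up for this illustration). It measures the same three-character string in three different units:

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
text = "a\U0001D11Eb"

def count_units(s):
    """Return the length of `s` measured in three different units."""
    return {
        "code_points": len(s),                           # Python 3 str indexes by code point
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf8_bytes": len(s.encode("utf-8")),            # 8-bit code units
    }

counts = count_units(text)
print(counts)
```

An implementation whose LENGTH reports code units rather than code points would give 4 or 6 here instead of 3, depending on its internal encoding.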

In discussing UTF-16 Peter Seibel commented:

   Yes, I'm with you so far. That just means that LENGTH has to be
   implemented in a smarter way--it has to scan the array of code-points
   looking for surrogate pairs in order to determine how many characters
   are in the string. (That Java's String.length() method doesn't do this
   will no doubt cause no end of problems down the line.)
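
The scan Peter describes might look roughly like the following Python sketch. The function names are hypothetical, and treating an unpaired surrogate as one character is an assumption made for this illustration, not something the quoted text specifies:

```python
def utf16_units(s):
    """Encode `s` as a list of 16-bit code units (UTF-16, no BOM)."""
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def code_point_length(units):
    """Count code points in a list of UTF-16 code units.

    A high surrogate (D800-DBFF) immediately followed by a low surrogate
    (DC00-DFFF) counts as one character. Every other unit, including an
    unpaired surrogate, counts as one character on its own.
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count

units = utf16_units("a\U0001D11Eb")
print(len(units), code_point_length(units))
```

Note that this makes LENGTH linear in the size of the string rather than constant time, and CHAR needs the same kind of scan to find the n-th character.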

This isn't the biggest issue. The biggest is that ANSI Common Lisp strings
are mutable, and a surrogate pair cannot fit in a single 16-bit position,
so (SETF CHAR) cannot store a supplementary character in place. Java's
strings are immutable, and already have to be converted to mutable int
arrays before code points can be modified.
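
A small Python sketch of the mutation problem follows. The names surrogate_pair and set_char are invented for this example, and set_char assumes the unit being replaced is a single BMP code unit; it only models what a 16-bit implementation would have to do:

```python
def surrogate_pair(code_point):
    """Split a supplementary code point (>= 0x10000) into (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def set_char(units, index, code_point):
    """Model of (SETF CHAR) on UTF-16 storage; returns the new unit list.

    Assumes units[index] is a single BMP code unit. A supplementary
    character needs two units, so the storage has to grow by one.
    """
    if code_point < 0x10000:
        new = [code_point]
    else:
        new = list(surrogate_pair(code_point))
    return units[:index] + new + units[index + 1:]

storage = [ord(c) for c in "abc"]           # three 16-bit positions
updated = set_char(storage, 1, 0x1D11E)     # replace "b" with U+1D11E
print([hex(u) for u in updated])
print(len(storage), len(updated))
```

The replacement cannot happen in place: every later unit shifts by one position, which is not what callers of SETF CHAR on a fixed-length string expect.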

A big issue with grapheme clusters is that CHAR-CODE-LIMIT becomes
essentially unbounded. You can't put a sensible number upon it. You could
try to limit it to 63 bits (three 21-bit code points), but there's still
the possibility that someone comes up with a sensible grapheme cluster
that is longer than three code points.
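
As a rough Python illustration of why no fixed width works: a base letter followed by any number of combining marks still forms one grapheme cluster, so the encoding itself sets no upper bound. The repeated accent below is deliberately artificial, and the helper name stacked is made up for this sketch:

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

def stacked(base, mark, n):
    """Return `base` followed by `n` copies of the combining character `mark`."""
    return base + mark * n

for n in (1, 3, 10):
    cluster = stacked("e", ACUTE, n)
    # Every character after the base has a non-zero combining class.
    assert all(unicodedata.combining(ch) for ch in cluster[1:])
    # Columns: marks added, code points in the cluster, bits at 21 bits each.
    print(n, len(cluster), len(cluster) * 21)
```

With three marks the cluster already has four code points, which exceeds the 63-bit limit suggested above.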

Agreement seems likely that a CHAR-CODE-LIMIT of #x110000 (2^16 + 2^20)
is the most sensible way to implement Unicode 4.0.1+ in ANSI Common Lisp.
Since the UTF-16 encoding rules out any higher code point values
(surrogate pairs provide at most 2^20 extra values), this limit should be
sufficient for generations.
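
The arithmetic behind that figure can be checked with a few lines of Python (the variable names are just for this sketch):

```python
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1   # 1024 possible high surrogates
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1    # 1024 possible low surrogates

bmp = 2 ** 16                                      # directly encoded 16-bit values
supplementary = HIGH_SURROGATES * LOW_SURROGATES   # pairs: 1024 * 1024 = 2^20
char_code_limit = bmp + supplementary

print(hex(char_code_limit))                 # 0x110000
print((char_code_limit - 1).bit_length())   # 21
```

The largest code point is therefore #x10FFFF, and 21 bits are enough to hold it.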

The ultimate implication is that only two ANSI Common Lisp implementations
conform with respect to Unicode 4.0.1+: CLISP and SBCL. The others return
incorrect values for LENGTH, CHAR, CHAR-CODE, etc. for particular
/external/ Unicode character sequences.

Another implication is that an ANSI Common Lisp implementation with 16-bit
characters conforms only with respect to Unicode 3.0. Unicode 3.1 assigned
the first supplementary characters, which cannot be mutated correctly
within a 16-bit character Lisp implementation and cannot be created by
passing their code point to CODE-CHAR.

Regards,
Adam


