Unicode LISP??

From: Ray Dillinger (bear_at_sonic.net)
Date: 09/04/04


Date: Sat, 04 Sep 2004 16:58:29 GMT

Marcin 'Qrczak' Kowalczyk wrote:
> Ray Dillinger <bear@sonic.net> writes:
>
>
>>I think the "right thing" here is actually beyond 21-bit unicode.
>>Unicode codepoints, in many cases, are not characters. I think that
>>the "right thing" with unicode is to allow characters that are a
>>unicode base codepoint followed by any nondefective sequence of
>>combining codepoints.
>
>
> I've seen lots of people saying this.
>
> All of them said something like "I think it would be the right thing
> to do". They didn't yet detail how to manipulate parts of characters
> themselves, how code can talk about a combining character in isolation.
>
> But all designs I've seen actually implemented use either code points,
> or UTF-16 units, or UTF-8 units as elements of string representation.
> Seems much simpler (in decreasing order of simplicity), and easier for
> interoperability.
>
> I think the right thing is to express more interfaces in terms of
> strings rather than characters. Then characters can be code points
> without too many problems.

Okay... If you were designing a LISP from the ground up to be
a fully Unicode-aware language that "Does Unicode Right," what would
you do?

Things I've considered, in various combinations (I know,
some of these are mutually exclusive choices):

1) Combining codepoints in isolation are members of the
    character datatype, but, like control characters and
    characters with buckybits in CLtL2, they aren't
    string-characters; you can't put them into strings as
    independent characters. There are calls that add or
    remove combining codepoints to or from any character,
    making a string-char if the character they're attached
    to is a string-char. There are also compose-char and
    decompose-char functions that make and return the
    codepoint lists of any character.

2) Strings have multiple simultaneous indices; they
    are normally indexed in terms of grapheme clusters,
    but with an optional argument you can specify that
    you want to use codepoint indexing instead. There
    is a function that returns the grapheme-cluster
    index in which a particular codepoint index is found,
    and a function that returns the codepoint index at
    which a particular grapheme-cluster index begins.

3) You could abandon any pretense at enforcing
    "legitimate" unicode structure on your strings.
    Let any character be a sequence of one or more
    codepoints of any kind, and strings be a vector
    of characters, and leave it to the programmer
    to say exactly what he means and keep the string
    and grapheme structure straight. This implementation
    would come with an extensive collection of procedures
    to check for and identify problems with character
    and string "well-formed-ness" WRT unicode, but
    absolutely every manipulation would be done by
    explicit character and codepoint manipulation, and
    you only get an error if you try to _output_
    something that is ill-formed. This gives the
    users the tools they need to build higher-level
    string and char libraries that behave in
    sensible ways without locking them in.

4) Because case-folding is a real bugger in Unicode,
    it might be more practical and in more graceful accord
    with the principle of least surprise to make such a
    LISP case-sensitive. Case-sensitivity in character
    names also allows the established HTML4 names for
    the common set of international characters. (For
    example, #\Agrave and #\agrave could be different
    characters).

5) Strings stop being a subtype of "array." A new
    type, "text", includes strings, graphemes, and
    codepoints.
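
To make option 1 concrete, here is a minimal sketch in Python (not Lisp, purely for illustration). The names `decompose_char`, `compose_char`, `add_combining`, `remove_combining`, and `is_string_char` are hypothetical, and two things are assumed: that a "combining codepoint" is any codepoint whose general category starts with M, and that composing and decomposing can be approximated by NFC/NFD normalization. A real design might draw either line differently.

```python
import unicodedata


def is_combining(cp):
    # Assumption: "combining codepoint" means general category Mn, Mc or Me.
    return unicodedata.category(cp).startswith("M")


def decompose_char(char):
    """Return the codepoint list of a character (canonical decomposition)."""
    return list(unicodedata.normalize("NFD", char))


def compose_char(codepoints):
    """Build a character from a base codepoint plus combining codepoints."""
    return unicodedata.normalize("NFC", "".join(codepoints))


def is_string_char(char):
    """True if char is one base codepoint followed only by combining codepoints."""
    cps = decompose_char(char)
    return (bool(cps)
            and not is_combining(cps[0])
            and all(is_combining(c) for c in cps[1:]))


def add_combining(char, mark):
    """Attach a combining codepoint to an existing character."""
    if not is_combining(mark):
        raise ValueError("not a combining codepoint")
    return compose_char(decompose_char(char) + [mark])


def remove_combining(char, mark):
    """Detach one occurrence of a combining codepoint from a character."""
    cps = decompose_char(char)
    del cps[cps.index(mark, 1)]  # raises ValueError if the mark is absent
    return compose_char(cps)
```

Under these assumptions an isolated U+0301 is a character but not a string-char, while "e" with U+0301 attached is a string-char.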

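Option 2 can be sketched the same way, again in Python for illustration only. `ClusterString` and its method names are hypothetical. The cluster rule is the simple one described above (a base codepoint followed by combining codepoints), which is an assumption and is coarser than the full grapheme-cluster rules in the Unicode standard.

```python
import unicodedata
from bisect import bisect_right


def is_combining(cp):
    # Assumption: "combining codepoint" means general category Mn, Mc or Me.
    return unicodedata.category(cp).startswith("M")


class ClusterString:
    """Hypothetical string indexed by grapheme cluster or by codepoint."""

    def __init__(self, s):
        self.codepoints = s
        # starts[i] is the codepoint index at which cluster i begins.
        self.starts = [i for i, cp in enumerate(s)
                       if i == 0 or not is_combining(cp)]

    def cluster_count(self):
        return len(self.starts)

    def ref(self, index, by_codepoint=False):
        """Index by cluster (default) or, optionally, by codepoint."""
        if by_codepoint:
            if not 0 <= index < len(self.codepoints):
                raise IndexError(index)
            return self.codepoints[index]
        if not 0 <= index < len(self.starts):
            raise IndexError(index)
        start = self.starts[index]
        if index + 1 < len(self.starts):
            end = self.starts[index + 1]
        else:
            end = len(self.codepoints)
        return self.codepoints[start:end]

    def cluster_of_codepoint(self, cp_index):
        """Cluster index in which a given codepoint index is found."""
        if not 0 <= cp_index < len(self.codepoints):
            raise IndexError(cp_index)
        return bisect_right(self.starts, cp_index) - 1

    def codepoint_of_cluster(self, cluster_index):
        """Codepoint index at which a given cluster begins."""
        if not 0 <= cluster_index < len(self.starts):
            raise IndexError(cluster_index)
        return self.starts[cluster_index]
```

Keeping a sorted table of cluster start positions makes the cluster-to-codepoint lookup a direct index and the codepoint-to-cluster lookup a binary search. The cost is that the table has to be rebuilt or patched whenever the string is mutated.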

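For option 4, a few cases show why case-folding is awkward. This is a small Python illustration using only the built-in string methods; the helper names `case_changes_length` and `survives_round_trip` are made up for the example.

```python
def case_changes_length(s):
    """True if upcasing or downcasing changes the number of codepoints."""
    return len(s.upper()) != len(s) or len(s.lower()) != len(s)


def survives_round_trip(s):
    """True if upcasing and then downcasing gives back the lowercase form."""
    return s.upper().lower() == s.lower()


sharp_s = "\u00df"      # LATIN SMALL LETTER SHARP S
dotted_I = "\u0130"     # LATIN CAPITAL LETTER I WITH DOT ABOVE
fi_ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# One codepoint becomes two when the case changes.
assert sharp_s.upper() == "SS"
assert dotted_I.lower() == "i\u0307"
assert fi_ligature.upper() == "FI"

# The mapping is not reversible: "SS" lowercases to "ss", not to sharp s.
assert not survives_round_trip(sharp_s)
assert survives_round_trip("a")
```

A case-insensitive reader would have to decide what to do with symbols whose length or identity changes under folding. A case-sensitive one avoids the question entirely.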
