Unicode LISP??
From: Ray Dillinger (bear_at_sonic.net)
Date: 09/04/04
- Next message: Arthur Clune: "Re: LIsp on windows"
- Previous message: David R. Sky: "Re: Looking for Nyquist list"
- In reply to: Marcin 'Qrczak' Kowalczyk: "Re: Lisp's other than CLISP that support full Unicode character repertoire?"
- Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Unicode LISP??"
- Reply: Marcin 'Qrczak' Kowalczyk: "Re: Unicode LISP??"
- Reply: Bruno Haible: "Re: Unicode LISP??"
Date: Sat, 04 Sep 2004 16:58:29 GMT
Marcin 'Qrczak' Kowalczyk wrote:
> Ray Dillinger <bear@sonic.net> writes:
>
>
>>I think the "right thing" here is actually beyond 21-bit unicode.
>>Unicode codepoints, in many cases, are not characters. I think that
>>the "right thing" with unicode is to allow characters that are a
>>unicode base codepoint followed by any nondefective sequence of
>>combining codepoints.
>
>
> I've seen lots of people saying this.
>
> All of them said something like "I think it would be the right thing
> to do". They didn't yet detail how to manipulate parts of characters
> themselves, how code can talk about a combining character in isolation.
>
> But all designs I've seen actually implemented use either code points,
> or UTF-16 units, or UTF-8 units as elements of string representation.
> Seems much simpler (in decreasing order of simplicity), and easier for
> interoperability.
>
> I think the right thing is to express more interfaces in terms of
> strings rather than characters. Then characters can be code points
> without too many problems.
Okay... If you were designing a LISP, from the ground up, to be
a fully Unicode-aware language that "Does Unicode Right," what would
you do?
Things I've considered, in various combinations (I know,
some of these are mutually exclusive choices):
1) Combining codepoints in isolation are members of the
character datatype, but, like control characters and
characters with buckybits in CLTL2, they aren't
string-characters; you can't put them into strings as
independent characters. There are calls that add or
remove combining codepoints to or from any character,
making a string-char if the character they're attached
to is a string-char. There are also compose-char and
decompose-char functions that make and return the
codepoint lists of any character.
2) Strings have multiple simultaneous indices; they
are normally indexed in terms of grapheme clusters,
but with an optional argument you can specify that
you want to use codepoint indexing instead. There
is a function that returns the grapheme-cluster
index in which a particular codepoint index is found,
and a function that returns the codepoint index at
which a particular grapheme-cluster index begins.
3) You could abandon any pretense at enforcing
"legitimate" unicode structure on your strings.
Let any character be a sequence of one or more
codepoints of any kind, and strings be a vector
of characters, and leave it to the programmer
to say exactly what he means and keep the string
and grapheme structure straight. This implementation
would come with an extensive collection of procedures
to check for and identify problems with character
and string "well-formed-ness" WRT unicode, but
absolutely every manipulation would be done by
explicit character and codepoint manipulation, and
you only get an error if you try to _output_
something that is ill-formed. This gives the
users the tools they need to build higher-level
string and char libraries that behave in
sensible ways without locking them in.
4) Because case-folding is a real bugger in Unicode,
it might be more practical and in more graceful accord
with the principle of least surprise to make such a
LISP case-sensitive. Case-sensitivity in character
names also allows the established HTML4 names for
the common set of international characters. (For
example, #\Agrave and #\agrave could be different
characters).
5) Strings stop being a subtype of "array." A new
type, "text", includes strings, graphemes, and
codepoints.
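
To make options 1 and 2 a little more concrete, here is a rough sketch in Python (chosen only because its standard unicodedata module is handy; nothing here is LISP-specific). It treats a "character" as a base codepoint followed by combining codepoints, which is the definition quoted at the top of this message, and it is deliberately simpler than the full grapheme-cluster rules in UAX #29. The names Text, clusters, cluster_of, codepoint_start, compose_char and decompose_char are hypothetical, invented for illustration; they are not an existing API.

```python
import unicodedata
from bisect import bisect_right


def is_combining(cp):
    """Treat any codepoint in a Mark category (Mn, Mc, Me) as combining."""
    return unicodedata.category(cp).startswith("M")


def clusters(text):
    """Split text into base-plus-combining-marks units.

    Simplification: a leading run of marks with no base (a "defective"
    sequence) is kept as a unit of its own rather than rejected.
    """
    out = []
    for cp in text:
        if out and is_combining(cp):
            out[-1] += cp          # attach the mark to the preceding unit
        else:
            out.append(cp)         # start a new unit
    return out


def decompose_char(unit):
    """Option 1: return the codepoint list of a character."""
    return list(unit)


def compose_char(codepoints):
    """Option 1: build a character, rejecting defective sequences."""
    if not codepoints or is_combining(codepoints[0]):
        raise ValueError("a character must start with a base codepoint")
    if not all(is_combining(cp) for cp in codepoints[1:]):
        raise ValueError("only combining codepoints may follow the base")
    return "".join(codepoints)


class Text:
    """Option 2: one string, indexable by cluster or by codepoint."""

    def __init__(self, s):
        self.units = clusters(s)
        self.starts = []           # starts[i] = codepoint index where unit i begins
        pos = 0
        for unit in self.units:
            self.starts.append(pos)
            pos += len(unit)       # len() of a Python str counts codepoints
        self.n_codepoints = pos

    def __len__(self):
        return len(self.units)     # default length is in clusters

    def cluster_of(self, cp_index):
        """Cluster index in which a given codepoint index is found."""
        if not 0 <= cp_index < self.n_codepoints:
            raise IndexError(cp_index)
        return bisect_right(self.starts, cp_index) - 1

    def codepoint_start(self, cluster_index):
        """Codepoint index at which a given cluster begins."""
        return self.starts[cluster_index]

    def char_at(self, i, by="cluster"):
        if by == "cluster":
            return self.units[i]
        if by == "codepoint":
            c = self.cluster_of(i)
            return self.units[c][i - self.starts[c]]
        raise ValueError("by must be 'cluster' or 'codepoint'")


# e + acute, a + diaeresis + dot below, z: three characters, six codepoints
t = Text("e\u0301a\u0308\u0323z")
```

With this, len(t) counts characters, t.cluster_of(3) says which character codepoint 3 belongs to, and t.codepoint_start(1) maps back the other way. Keeping a table of cluster start offsets is one way to make both lookups cheap; whether that extra memory is acceptable for every string is exactly the kind of tradeoff the options above are about.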