Re: Lisp's other than CLISP that support full Unicode character repertoire?
From: Marcin 'Qrczak' Kowalczyk (qrczak_at_knm.org.pl)
Date: 09/02/04
- Next message: Anne & Lynn Wheeler: "Re: Xah Lee's Unixism"
- Previous message: Andre Majorel: "Re: Xah Lee's Unixism"
- In reply to: Ray Dillinger: "Re: Lisp's other than CLISP that support full Unicode character repertoire?"
- Next in thread: Ray Dillinger: "Unicode LISP??"
- Reply: Ray Dillinger: "Unicode LISP??"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Thu, 02 Sep 2004 23:55:49 +0200
Ray Dillinger <bear@sonic.net> writes:
> I think the "right thing" here is actually beyond 21-bit unicode.
> Unicode codepoints, in many cases, are not characters. I think that
> the "right thing" with unicode is to allow characters that are a
> unicode base codepoint followed by any nondefective sequence of
> combining codepoints.
I've seen lots of people say this.
All of them said something like "I think it would be the right thing
to do". None of them has yet detailed how to manipulate the parts of
such characters, or how code can talk about a combining character in isolation.
But all the designs I've seen actually implemented use code points,
UTF-16 units, or UTF-8 units as the elements of the string representation.
Any of these seems much simpler (listed in decreasing order of simplicity),
and easier for interoperability.
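To make the three candidate element types concrete, here is a small Python sketch. The sample string is my own illustration, and Python is used only because its str type happens to be indexed by code point; nothing here is specific to any Lisp.

```python
# Illustrative sample (an assumption, not from the thread):
# "e" + COMBINING ACUTE ACCENT, followed by U+1D11E, which lies outside the BMP.
s = "e\u0301\U0001D11E"

code_points = len(s)                           # Python str is indexed by code point
utf16_units = len(s.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes
utf8_units = len(s.encode("utf-8"))            # each UTF-8 code unit is 1 byte

print(code_points, utf16_units, utf8_units)    # 3 4 7
```

The same text has three different lengths depending on the element type, and none of them is the two "characters" a reader would count.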
I think the right thing is to express more interfaces in terms of
strings rather than characters. Then characters can be code points
without too many problems.
There is no universal character boundary. Some algorithms need, e.g.,
grapheme cluster boundaries, which are defined differently. You haven't
considered what to do with ZWJ and ZWNJ.
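A rough Python sketch of why the "base plus combining marks" rule does not settle the boundary question. The function naive_clusters is a hypothetical helper of mine that implements only that simple rule; it is deliberately not the grapheme cluster algorithm from UAX #29.

```python
import unicodedata

def naive_clusters(s):
    """Attach each code point with a nonzero combining class to the
    preceding element. A simplistic rule, not the UAX #29 algorithm."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

accented = "e\u0301"                    # e + COMBINING ACUTE ACCENT
joined = "\U0001F468\u200D\U0001F469"   # MAN + ZERO WIDTH JOINER + WOMAN

print(len(naive_clusters(accented)))    # 1
print(len(naive_clusters(joined)))      # 3
```

ZWJ has combining class 0, so the simple rule treats it as a base character of its own and splits a sequence that is meant to be joined.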
Code points are the lowest common denominator. Algorithms like case
mapping and collation are defined in terms of code points. I guess
most Lisps aren't capable of representing code point strings natively
yet. It's hard enough for them to accept that 256 characters are not
enough for everyone.
> If we can do that, then there's only about a dozen ligatures and
> sharp-s that change the string length on a case change.
Don't forget that some case mappings are contextual (e.g. sigma), so
even ignoring ß, string-downcase in Unicode is *not* equivalent to
mapping each character through char-downcase, or Greeks will be upset.
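The sigma case can be shown in a few lines of Python, used here as a stand-in for string-downcase and char-downcase. This assumes an interpreter whose str.lower() applies the Unicode final-sigma rule, as CPython 3 does; the sample words are my own.

```python
word = "ΟΔΟΣ"   # Greek capitals ending in SIGMA

whole = word.lower()                            # string-level mapping, sees context
per_char = "".join(ch.lower() for ch in word)   # one character at a time, no context

print(whole)              # οδος  (ends in final sigma, U+03C2)
print(per_char)           # οδοσ  (ends in ordinary sigma, U+03C3)
print(whole == per_char)  # False

# The length-changing case from the quoted text, for comparison:
print(len("Straße"), len("Straße".upper()))     # 6 7
```

The per-character result is wrong even though no mapping changed the length, which is why the operation has to be defined on strings.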
> NFC is a vastly larger character repertoire than (and a proper
> superset of) NFKC. How does a Lisp work properly with both?
Hint: it's much easier to apply a transformation when needed than to
undo a transformation which has been forced automatically.
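A minimal Python sketch of that hint: keep the string as received and normalize only at the point of use. The helper same_text and the sample strings are hypothetical, chosen for illustration.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings under a normalization form chosen by the caller."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

stored = "\uFB01le"   # starts with LATIN SMALL LIGATURE FI, kept exactly as received
query = "file"

print(same_text(stored, query, "NFC"))    # False: NFC keeps the ligature
print(same_text(stored, query, "NFKC"))   # True: NFKC folds it to "fi"

# Had NFKC been forced on input, the ligature could not be recovered:
forced = unicodedata.normalize("NFKC", stored)
print(forced == query)                    # True, the distinction is gone
```

Each caller picks the form it needs, and the original code points stay available to callers that need neither.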
--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/