Re: Unicode LISP??
From: Marcin 'Qrczak' Kowalczyk (qrczak_at_knm.org.pl)
Date: 09/04/04
Date: Sat, 04 Sep 2004 21:55:11 +0200
Ray Dillinger <bear@sonic.net> writes:
> Okay... If you were designing a LISP, from the ground up, to be
> a fully unicode-aware language that "Does Unicode Right" what would
> you do?
I'm not experienced with the Common Lisp library, so it's hard for me
to tell where it's incompatible with Unicode.
One thing that I noted previously: case mapping should be defined in
terms of strings rather than characters.
Unfortunately this causes problems for the deeply hardwired case
insensitiveness, because ignoring case is no longer such a simple
thing. For example it would no longer be true that
(string= (string-downcase s) (string-downcase (string-upcase s)))
which fails for strings containing "ß", final small sigma, dotless i,
apostrophe-n, long s, Greek iota under letter, ligatures like "fi"
or other weird characters.
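The failure can be reproduced outside Lisp. The following is a minimal
sketch in Python, chosen only because its str.upper() and str.lower()
apply the Unicode full (string-level) case mappings; the helper name
roundtrip_stable and the sample table are illustrative, not part of any
library.

```python
def roundtrip_stable(s: str) -> bool:
    """Python analogue of (string= (string-downcase s)
    (string-downcase (string-upcase s)))."""
    return s.lower() == s.upper().lower()


# Hypothetical sample table: one string per character class named above.
samples = {
    "sharp s": "\u00df",
    "final small sigma": "\u03c2",
    "dotless i": "\u0131",
    "apostrophe-n": "\u0149",
    "long s": "\u017f",
    "fi ligature": "\ufb01",
}

for name, s in samples.items():
    up = s.upper()
    print(f"{name}: {s!r} upper={up!r} lower(upper)={up.lower()!r} "
          f"stable={roundtrip_stable(s)}")
```

Plain ASCII text such as "hello" passes the check; every sample in the
table does not.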
Unicode defines text mappings:
- upcasing
- downcasing
- titlecasing
- case folding
where case folding is the important one for case insensitive
comparison. If two strings can be brought into the same string by
other case operations, they case-fold to the same string. Case folding
is often the same as lowercasing, but it differs from it in various
special cases like the above.
Neither lowercasing alone nor uppercasing alone is sufficient to fold
all case differences. Lowercasing alone fails for characters like those
mentioned above. Uppercasing alone fails for capital I with dot above,
Greek capital theta symbol and some compatibility variants of capital
letters which don't have unique lowercase equivalents.
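As a rough illustration of why neither mapping suffices, here is a
small Python sketch. It assumes Python's str.casefold() as a stand-in
for Unicode case folding; the helper equal_under and the chosen pairs
are made up for this example. The Kelvin sign (U+212A) is used as one
instance of a compatibility variant of a capital letter.

```python
def equal_under(mapping, a: str, b: str) -> bool:
    """Compare two strings after applying the same case mapping to both."""
    return mapping(a) == mapping(b)


# Pairs that differ only in case, in the Unicode sense.
pairs = [
    ("\u00df", "SS"),      # sharp s vs. its uppercase form
    ("\u03f4", "\u03b8"),  # Greek capital theta symbol vs. small theta
    ("\u212a", "k"),       # Kelvin sign vs. small k
]

for a, b in pairs:
    print(repr(a), repr(b),
          "lower:", equal_under(str.lower, a, b),
          "upper:", equal_under(str.upper, a, b),
          "casefold:", equal_under(str.casefold, a, b))
```

For the first pair only lowercasing misses the match, for the other two
only uppercasing does; case folding matches all three.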
For me case sensitivity in a programming language would be a good
choice, but the Lisp tradition is to be case insensitive.
String representation is not obvious. Let's assume for now that from
the programmer's point of view strings consist of code points.
If they are represented in UTF-8 or UTF-16, string indexing is not
O(1).
If they are represented in UTF-32, ASCII strings take 4 times more
space than byte-packed ASCII would take.
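Both trade-offs can be made concrete with a short Python sketch. The
function utf8_index is a hypothetical helper written for this example:
it finds the n-th code point of a UTF-8 byte string by scanning from
the start, which is what makes indexing linear rather than O(1).

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 `data` by scanning from byte 0."""
    seen = -1
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte,
            seen += 1                    # so a new code point starts here
            if seen == n:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[pos:end].decode("utf-8")
    raise IndexError(n)


text = "a\u03bb\u20ac\U0001F600z"        # 1-, 2-, 3-, 4- and 1-byte code points
data = text.encode("utf-8")
print(utf8_index(data, 3) == text[3])

# Space cost of a fixed-width encoding for pure ASCII text.
ascii_text = "hello"
print(len(ascii_text.encode("ascii")), len(ascii_text.encode("utf-32-le")))
```

An implementation could cache byte offsets to speed up repeated
indexing, but that adds bookkeeping of its own.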
If they are represented in UTF-32 or ISO-8859-1, depending on whether
they contain some character above U+00FF, then strings may need to
have their representation upgraded if they are updated in place.
Some languages don't have this problem by making strings immutable and
using some other type for mutable strings (e.g. Python, Java, C#).
It's fine for me, but again Lisp tradition is to have mutable strings.
Anyway, if they need to be upgraded, there are two ways. Either
a string physically contains a pointer to characters instead of
characters themselves, or they require some garbage collector tricks
to be able to extend an object in place, perhaps by physically moving
it elsewhere and updating pointers pointing to it. The latter is what
CLISP does AFAIK, and it uses 3 string representations depending on
which characters are present: 8-, 16-, or 32-bit.
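The first approach (the string object holds a pointer to separately
allocated characters) might look roughly like the Python sketch below.
Everything here is an assumption made for illustration: the class name
UpgradableString, the little-endian packing, and the 1/2/4-byte widths.
It is not a description of how CLISP or any other implementation
actually stores strings.

```python
def width_for(cp: int) -> int:
    """Smallest supported element width in bytes for a code point."""
    if cp <= 0xFF:
        return 1
    if cp <= 0xFFFF:
        return 2
    return 4


class UpgradableString:
    """Mutable string of code points; the buffer is swapped when widened."""

    def __init__(self, text: str = ""):
        cps = [ord(c) for c in text]
        self._width = max((width_for(cp) for cp in cps), default=1)
        self._buf = self._pack(cps, self._width)

    @staticmethod
    def _pack(cps, width):
        return bytearray(b"".join(cp.to_bytes(width, "little") for cp in cps))

    def __len__(self):
        return len(self._buf) // self._width

    def _get(self, i):
        w = self._width
        return int.from_bytes(self._buf[i * w:(i + 1) * w], "little")

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return chr(self._get(i))

    def __setitem__(self, i, ch):
        if not 0 <= i < len(self):
            raise IndexError(i)
        cp = ord(ch)
        need = width_for(cp)
        if need > self._width:
            # Upgrade: repack every element into a new, wider buffer.
            cps = [self._get(k) for k in range(len(self))]
            self._buf = self._pack(cps, need)
            self._width = need
        w = self._width
        self._buf[i * w:(i + 1) * w] = cp.to_bytes(w, "little")

    def __str__(self):
        return "".join(self[i] for i in range(len(self)))


s = UpgradableString("abc")
s[1] = "\u0142"          # needs 16 bits: buffer is repacked
s[2] = "\U0001F600"      # needs 32 bits: repacked again
print(str(s), len(s))
```

The object referenced by s stays the same across both upgrades; only
the buffer it points to is replaced, at the price of one extra
indirection on every access.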
As I said, I don't believe that a "more abstract" representation than
a string of code points is feasible.
> 1) Combining codepoints in isolation are members of the
> character datatype, but, like control characters and
> characters with buckybits in CLTL2, they aren't
> string-characters; you can't put them into strings as
> independent characters.
If strings are not isomorphic to sequences of characters (whatever
exactly "characters" means), I predict confusion and breakage. In about
any language which has characters as a distinct type from strings,
strings are sequences of characters.
Programs usually work on strings consisting of "well-behaved"
"regular" characters, so bugs in this area would often be left
undetected until someone feeds the program a text containing
rare characters in an unusual combination.
For example, assume that an HTML file contains
s&#x0323;
and the program resolves numeric character references to actual
characters, "combining dot below" in this case. A straightforward
implementation would try to put it as a character in a string.
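A small Python sketch of such a straightforward implementation, using
the standard html.unescape and unicodedata modules. The input string is
an assumption: it writes the reference in hexadecimal as &#x0323;,
which may not be the exact form the original example used.

```python
import html
import unicodedata

source = "s&#x0323;"                 # assumed form of the HTML fragment
resolved = html.unescape(source)     # resolves the numeric character reference

# The result is an ordinary string of two code points; the second one is
# the combining mark, stored as an independent element.
for ch in resolved:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          "combining class", unicodedata.combining(ch))

# Canonical composition folds the pair into one precomposed code point.
composed = unicodedata.normalize("NFC", resolved)
print(len(resolved), len(composed))
```

Under a rule that forbids combining code points as independent string
elements, the resolver would need extra logic at exactly this step.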
--
__("< Marcin Kowalczyk
\__/ qrczak@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/