Re: Unicode LISP??

From: Marcin 'Qrczak' Kowalczyk (qrczak_at_knm.org.pl)
Date: 09/04/04


Date: Sat, 04 Sep 2004 21:55:11 +0200

Ray Dillinger <bear@sonic.net> writes:

> Okay... If you were designing a LISP, from the ground up, to be
> a fully unicode-aware language that "Does Unicode Right" what would
> you do?

I'm not experienced with the Common Lisp library, so it's hard for me
to tell where it's incompatible with Unicode.

One thing that I noted previously: case mapping should be defined in
terms of strings rather than characters.

Unfortunately this causes problems for the deeply hardwired case
insensitivity, because ignoring case is no longer such a simple
thing. For example it would no longer be true that
   (string= (string-downcase s) (string-downcase (string-upcase s)))
which fails for strings containing "ß", final small sigma, dotless i,
n preceded by apostrophe, long s, Greek letters with iota subscript,
ligatures like "fi" or other weird characters.
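
To see the failure concretely, here is a minimal sketch in Python
rather than Lisp, chosen only because its str.upper and str.lower
apply the full Unicode string mappings. The helper name
roundtrip_stable is made up for this illustration.

```python
def roundtrip_stable(s: str) -> bool:
    """True if downcasing s gives the same result as downcasing the upcased s."""
    return s.lower() == s.upper().lower()


samples = [
    "hello",      # plain ASCII
    "Straße",     # sharp s upcases to "SS"
    "\u017F",     # LATIN SMALL LETTER LONG S
    "\u0131",     # LATIN SMALL LETTER DOTLESS I
    "\uFB01",     # LATIN SMALL LIGATURE FI
]

for s in samples:
    print(repr(s), roundtrip_stable(s))
```

Only the plain ASCII sample should survive the round trip; the others
come back as a different string.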

Unicode defines text mappings:
- upcasing
- downcasing
- titlecasing
- case folding
where case folding is the important one for case-insensitive
comparison. If two strings can be turned into the same string by the
other case operations, they case-fold to the same string. Case folding
is often the same as lowercasing, but it differs from it for various
special cases like the above.

Neither lowercasing alone nor uppercasing alone is sufficient to fold
all case differences. Lowercasing alone fails for characters like
those mentioned above. Uppercasing alone fails for capital I with dot
above, Greek capital theta symbol and some compatibility variants of
capital letters which don't have unique lowercase equivalents.
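
A small sketch of the difference, again in Python because its
str.lower, str.upper and str.casefold follow the Unicode mappings.
The pairs and the helper same_under are my own choice for
illustration.

```python
pairs = [
    ("Straße", "STRASSE"),   # sharp s versus SS
    ("\u212A", "K"),         # KELVIN SIGN versus LATIN CAPITAL LETTER K
    ("\u03F4", "\u0398"),    # GREEK CAPITAL THETA SYMBOL versus capital theta
]


def same_under(key, a: str, b: str) -> bool:
    """Compare two strings after applying the same case operation to both."""
    return key(a) == key(b)


for a, b in pairs:
    print(repr(a), repr(b),
          "lower:", same_under(str.lower, a, b),
          "upper:", same_under(str.upper, a, b),
          "casefold:", same_under(str.casefold, a, b))
```

The first pair is missed by lowercasing, the other two by uppercasing;
case folding equates all three.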

For me case sensitivity in a programming language would be a good
choice, but the Lisp tradition is to be case insensitive.

String representation is not obvious. Let's assume for now that from
the programmer's point of view strings consist of code points.

If they are represented in UTF-8 or UTF-16, string indexing is not
O(1).

If they are represented in UTF-32, ASCII strings take 4 times more
space than byte-packed ASCII would take.
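
Both costs can be sketched in a few lines of Python. The function
utf8_index is a hypothetical helper, not a library routine: it finds
the n-th code point of UTF-8 data by scanning from the start, which is
why indexing is linear in the position rather than O(1).

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of well-formed UTF-8 data by linear scan."""
    count = -1
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a code point starts here
            count += 1
            if count == n:
                j = i + 1
                while j < len(data) and data[j] & 0xC0 == 0x80:
                    j += 1               # swallow the continuation bytes
                return data[i:j].decode("utf-8")
    raise IndexError(n)


ascii_text = "hello"
size_bytes = len(ascii_text.encode("ascii"))       # one byte per character
size_utf32 = len(ascii_text.encode("utf-32-le"))   # four bytes per character, no BOM

mixed = "a\u00DF\u20AC\U0001D11Ez".encode("utf-8") # 1-, 2-, 3- and 4-byte sequences
print(size_bytes, size_utf32, utf8_index(mixed, 3))
```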

If they are represented in UTF-32 or ISO-8859-1, depending on whether
they contain some character above U+00FF, then strings may need to
have their representation upgraded if they are updated in place.

Some languages avoid this problem by making strings immutable and
using some other type for mutable strings (e.g. Python, Java, C#).
That's fine for me, but again the Lisp tradition is to have mutable
strings.

Anyway, if strings need to be upgraded, there are two ways. Either
a string physically contains a pointer to its characters instead of
the characters themselves, or the implementation needs some garbage
collector tricks to extend an object in place, perhaps by physically
moving it elsewhere and updating the pointers pointing to it. The
latter is what CLISP does AFAIK, and it uses 3 string representations
depending on which characters are present: 8-, 16-, or 32-bit.
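
The first way (a header pointing to separately allocated characters)
can be sketched as follows. This is only an illustration in Python
under my own assumptions, not how CLISP or any other implementation
does it; the class UpgradableString and its fields are hypothetical.

```python
from array import array

# (logical width in bits, array typecode, largest code point that fits).
# "L" is guaranteed to hold at least 32 bits; its real size is platform dependent.
_LAYOUTS = ((8, "B", 0xFF), (16, "H", 0xFFFF), (32, "L", 0x10FFFF))


def _layout_for(code_point):
    """Pick the narrowest layout able to store code_point."""
    for bits, typecode, limit in _LAYOUTS:
        if code_point <= limit:
            return bits, typecode
    raise ValueError("not a Unicode code point")


class UpgradableString:
    """Mutable string: a stable header pointing at replaceable storage."""

    def __init__(self, text=""):
        points = [ord(ch) for ch in text]
        self.bits, typecode = _layout_for(max(points, default=0))
        self._units = array(typecode, points)

    def __len__(self):
        return len(self._units)

    def __getitem__(self, index):
        return chr(self._units[index])

    def __setitem__(self, index, char):
        bits, typecode = _layout_for(ord(char))
        if bits > self.bits:
            # Upgrade: copy into wider storage and repoint the header.
            # Existing references to this object keep working.
            self._units = array(typecode, self._units.tolist())
            self.bits = bits
        self._units[index] = ord(char)

    def __str__(self):
        return "".join(map(chr, self._units))


s = UpgradableString("abc")
alias = s                 # a second reference to the same string
s[1] = "\u20AC"           # EURO SIGN forces an upgrade to 16 bits
print(s.bits, str(alias))
```

The price of this design is an extra indirection on every character
access, which is what the garbage collector approach avoids.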

As I said, I don't believe that a "more abstract" representation than
a string of code points is feasible.

> 1) Combining codepoints in isolation are members of the
> character datatype, but, like control characters and
> characters with buckybits in CLTL2, they aren't
> string-characters; you can't put them into strings as
> independent characters.

If strings are not isomorphic to sequences of characters (whatever
exactly "characters" means), I predict confusion and breakage. In
about any language which has characters as a distinct type from
strings, strings are sequences of characters.

Programs usually work on strings consisting of "well-behaved"
"regular" characters, so bugs in this area would often go undetected
until someone feeds the program a text containing rare characters in
an unusual combination.

For example, assume that an HTML file contains
   s&#803;
and the program resolves numeric character references to actual
characters, "combining dot below" in this case. A straightforward
implementation would try to put it as a character in a string.
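
As an illustration of what such a straightforward implementation
produces, here is a sketch using Python's standard html and
unicodedata modules (Python strings are sequences of code points, so
nothing stops the combining character from becoming an element of
its own):

```python
import html
import unicodedata

# Resolve the numeric character reference from the example above.
text = html.unescape("s&#803;")

# The result has two elements; the second is a combining mark standing
# in the string as an independent character.
second = text[1]
print(len(text), unicodedata.name(second), unicodedata.combining(second))
```

A nonzero value from unicodedata.combining marks the character as a
combining one.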

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

