Re: Quieter glyphs than parentheses
From: Tom Lord (lord_at_emf.emf.net)
Date: 02/08/04
- In reply to: Steven M. Haflich: "Re: Quieter glyphs than parentheses"
Date: Sun, 08 Feb 2004 04:21:34 -0000
Ray Dillinger:
>> Moving a Lisp to Unicode is quite an undertaking if you intend to
>> do it right, because source code and data are the same 'language'
>> so you've got an entire toolchain that mostly has to be built
>> from scratch.
Steven M. Haflich:
> I must disagree. When Franz converted Allegro CL from an 8-bit
> character system to one that could be configured (albeit globally)
> either with 8-bit ASCII or 16-bit Unicode characters, it did not
> require rewriting the entire [toolchain]. There were issues here
> and there, but most things just work. There is some nonobvious
> hair internally supporting Unicode (e.g. handling the
> correspondence between upper- and lower-case chars while
> preserving speed of the char/string comparison and predicate
> functions), but once the compiler understands both formats of chars
> and strings, most code works automatically.
~ Supporting only 16-bit codepoints is, at best, imperfect and
partial support for Unicode. Unicode codepoints range up to U+10FFFF,
which takes 21 bits.
~ Simply doubling the size-per-character of a string is a poor way to
support Unicode strings. For many common uses it is inefficient.
And if the entire codepoint space is to be supported, 16-bit elements
force a variable-length encoding for codepoints anyway (surrogate
pairs). In other words, good Unicode support requires trickier
changes to the representation of strings than simply doubling the
size of character elements (and trickier than just "use UTF-8", too).
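To make the 16-bit point concrete, here is a minimal sketch in Python (chosen only for illustration; nothing here is specific to any Lisp or Scheme implementation). The sample characters and the helper names `code_units_16` and `surrogate_pair` are assumptions of this example, not anything from the thread. It shows that a codepoint above U+FFFF does not fit in one 16-bit element and has to be split in two.

```python
# Illustrative sample characters (an assumption of this sketch).
samples = {
    "LATIN CAPITAL LETTER A (U+0041)": "A",
    "HIRAGANA LETTER A (U+3042)": "\u3042",
    "MUSICAL SYMBOL G CLEF (U+1D11E)": "\U0001D11E",
}

def code_units_16(ch):
    """Number of 16-bit code units UTF-16 needs for one codepoint."""
    return len(ch.encode("utf-16-le")) // 2

def surrogate_pair(ch):
    """Split a codepoint above U+FFFF into its UTF-16 (high, low) surrogates."""
    offset = ord(ch) - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

for name, ch in samples.items():
    print(name, "-> 16-bit units:", code_units_16(ch))

# The largest codepoint, U+10FFFF, needs 21 bits.
print((0x10FFFF).bit_length())
```

So a string type built from 16-bit elements is either limited to the first 65,536 codepoints or is already a variable-length encoding.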
~ Wouldn't you agree that good Unicode support will include allowing
users to write identifier names using Unicode characters other than
those it has in common with ASCII? Yet if it does so, there is an
intricate problem of deciding when two differently encoded
identifiers are "the same" and of producing, from an identifier, the
name of the symbol it can denote. These issues affect not only
readers and run-time systems, but also compilers and linkers -- in
short, "an entire toolchain".
Moreover, it is desirable to solve read/write equivalence for symbol
literals in an implementation-independent way so that
Unicode-supporting Schemes can exchange data. If Scheme were more
mature at this point in history, and thus had a binary standard,
we'd be facing an analogous problem for "mangled" identifiers fed to
tools such as system linkers.
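A small sketch of the "differently encoded but the same" problem, again in Python for illustration only. Comparing identifiers after NFC normalization is an assumption of this example; the post does not say which equivalence a Scheme reader should use, and `same_identifier` is a hypothetical helper.

```python
import unicodedata

def same_identifier(a, b):
    """Treat two spellings as one identifier if their NFC forms match.

    NFC is an assumption of this sketch, not a rule taken from the post.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "caf\u00e9"   # e-acute as a single codepoint, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))         # different codepoint counts
print(precomposed == decomposed)                 # raw comparison
print(same_identifier(precomposed, decomposed))  # comparison after normalization
```

Both spellings render identically, yet a reader that interns symbols by raw codepoint sequence would create two distinct symbols. Every tool that handles identifiers has to agree on the same rule.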
~ You'd think that R5RS, since it _does_ take some care to be
"character set agnostic", could easily be extended to support
Unicode. Unfortunately, that's not the case. In spite of its
agnosticism, it imposes requirements that a Unicode-supporting
Scheme would not want to satisfy. It's not just the entire
toolchain, but the standard that defines its requirements, too.
> Some additional work comes in supporting all the external formats needed
> by non ISO8859 language scripts.
It's funny, but _that_ problem is the one I would consider
(tedious but) fairly trivial.
> While Unicode represents pretty much everything in a nice, flat,
> 16-bit code (ignoring that Unicode has actually recently
> overflowed 16 bits -- sigh!)
Why should we ignore that? 16 bits is clearly not enough.
> and UTF-8, which is a simple space-saving encoding of Unicode,
"Space-saving" (for some scripts but not others) and "time-consuming".
Schemes should not use that encoding internally unless fast
string processing and unbiased internationalization are non-goals.
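Both halves of that remark can be sketched in a few lines of Python (illustration only). The sample strings and the helpers `encoded_sizes` and `nth_code_point_utf8` are assumptions of this example. The first helper compares encoded sizes for an ASCII string and a Japanese one; the second shows why indexing by codepoint in UTF-8 means scanning from the start of the string.

```python
def encoded_sizes(text):
    """Byte length of the same text in three Unicode encoding forms."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

def nth_code_point_utf8(data, n):
    """Return the n-th codepoint (0-based) of UTF-8 bytes by linear scan."""
    index = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a new codepoint starts here
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")  # the n-th codepoint was the last one

print(encoded_sizes("hello"))
print(encoded_sizes("\u3053\u3093\u306b\u3061\u306f"))  # "konnichiwa" in hiragana
print(nth_code_point_utf8("a\u3042b".encode("utf-8"), 1))
```

For the ASCII string UTF-8 is the smallest of the three; for the hiragana string it is larger than UTF-16. Finding the n-th codepoint costs time proportional to n rather than a constant.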
> most of the difficult languages have one or more different
> variable-length encodings. For example, Japanese has three
> popular non-Unicode-based encodings, and Lisp applications may
> ultimately need to deal with each. (For example, my Japanese wife
> regularly receives email in _four_ different encodings.)
> In addition to having Lisp understand all these obscure encodings,
I may be over-reading you but it sounds as though you are saying by
implication that Scheme standards for internationalization of the
language should be designed in such a way that not only Unicode but
also all of those non-Unicode character sets are equally well
supported.
To a very limited extent that's probably true. For example, I believe
that R6RS can and should retain that level of "character set
agnosticism".
But in a broader sense it misses something important: while the
Revised Report may have originated in an environment in which
networking was in its childhood years, today it is a dominant
consideration. While R6RS can remain agnostic about character sets,
nevertheless, standards for interoperability covering extended
character sets will be necessary for communicating implementations.
Those interoperability standards, supporting at the very least the
reliable exchange of S-expressions between implementations, cannot
afford to be agnostic about character sets. Were they to go that
route, the Scheme community would be, in effect, taking on the task of
"Out-Unicoding the Unicode Consortium" -- defining a union character
set where the consortium has declined to do so.
-t