Re: Unicode LISP??

From: Ray Dillinger (bear_at_sonic.net)
Date: 09/06/04


Date: Mon, 06 Sep 2004 05:47:50 GMT

Marcin 'Qrczak' Kowalczyk wrote:
> Ray Dillinger <bear@sonic.net> writes:

>>1) Combining codepoints in isolation are members of the
>> character datatype, but, like control characters and
>> characters with buckybits in CLTL2, they aren't
>> string-characters; you can't put them into strings as
>> independent characters.
>
>
> If strings are not isomorphic to sequences of characters (whatever
> exactly "characters" means), I predict confusion and breakage. In just
> about any language which has characters as a type distinct from
> strings, strings are sequences of characters.

Well, the consideration in CLTL was that the "character"
datatype actually represented two different things: characters
and keystrokes. Alt-J is a keystroke. Uppercase J is a
character. It's entirely reasonable to collect characters in
strings, but it's not reasonable to have "strings" of
keystrokes.

So CLTL had this distinction: "characters" as a datatype
included keystrokes, but only true characters (not keystrokes)
were supposed to be string-characters. CLTL2 contained a
reference to this, but the committee decision was to let
buckybits, font bits, and anything else that could make
something a non-string-character exist only as
"implementation-defined attributes", and to strike the
specification of that behavior from the standard.

In a grapheme-based system, a combining codepoint by itself,
similarly, is an entity you might have to work with at times,
but it isn't a true character; it doesn't make sense to stick
it into strings by itself without a base character to modify.
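
To make that concrete, here is a rough Python sketch of a
string constructor that refuses a combining codepoint with no
base character. The names is_combining and to_characters are
made up for illustration; unicodedata.combining is the real
standard-library call. Treating "nonzero canonical combining
class" as the test for a combining codepoint is a
simplification I'm assuming here, and real grapheme
segmentation (UAX #29) covers more cases than this does.

```python
import unicodedata
from typing import List


def is_combining(cp: str) -> bool:
    """True if the single codepoint has a nonzero canonical combining class."""
    return unicodedata.combining(cp) != 0


def to_characters(text: str) -> List[str]:
    """Group codepoints into base-plus-combining-marks units.

    Raises ValueError for a combining codepoint that has no base
    character to modify (assumed policy, for illustration only).
    """
    chars: List[str] = []
    for cp in text:
        if is_combining(cp):
            if not chars:
                raise ValueError(
                    f"combining codepoint U+{ord(cp):04X} has no base character"
                )
            chars[-1] += cp  # attach the mark to the preceding base
        else:
            chars.append(cp)  # start a new character
    return chars
```

With this, "e" followed by U+0301 followed by "a" comes back as
two characters, while U+0301 on its own is rejected.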

Anyway, it was just one of many ideas. I actually think I
prefer the system where the language _primitives_ allow one or
more codepoints per character and enforce absolutely nothing
about which codepoints they may be. All of that would be
handled in library code for the UNICODE character set, and with
different libraries, you could work with the UNI-21 character
set where there is one codepoint per character, or the UNI-16
character set where there is one codepoint per character and
it's restricted to sixteen bits, or the LATIN-1 character set
where there's one codepoint per character and it's restricted
to 8 bits.
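
As a rough sketch of that split, in Python: the primitive
"character" is just a tuple of one or more codepoints with no
checks at all, and each character-set library is a separate
validator. The Repertoire class and its fields are hypothetical
names, and the numeric limits are my reading of the set names
(21 bits, 16 bits, 8 bits), not anything specified above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Primitive level: a character is one or more codepoints, unchecked.
Char = Tuple[int, ...]


@dataclass(frozen=True)
class Repertoire:
    """A character-set library: the only place restrictions are enforced."""
    name: str
    max_codepoint: int                # assumed limit on each codepoint
    max_per_char: Optional[int]       # None means any number of codepoints

    def accepts(self, char: Char) -> bool:
        if len(char) < 1:
            return False
        if self.max_per_char is not None and len(char) > self.max_per_char:
            return False
        return all(0 <= cp <= self.max_codepoint for cp in char)


UNICODE = Repertoire("UNICODE", 0x10FFFF, None)      # multi-codepoint characters
UNI_21 = Repertoire("UNI-21", (1 << 21) - 1, 1)      # one codepoint, 21 bits
UNI_16 = Repertoire("UNI-16", (1 << 16) - 1, 1)      # one codepoint, 16 bits
LATIN_1 = Repertoire("LATIN-1", 0xFF, 1)             # one codepoint, 8 bits
```

Under these assumptions the character (0x65, 0x301) is accepted
by UNICODE and rejected by the other three, while (0xE9,) is
accepted by all four.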

                                Bear


