Re: CLisp case sensitivity

From: Pascal Bourguignon (spam_at_mouse-potato.com)
Date: 12/16/04


Date: 16 Dec 2004 01:50:22 +0100

Adam Warner <usenet@consulting.net.nz> writes:

> Hi Duane Rettig,
>
> >> I made a simple claim Barry: Since ANSI Common Lisp doesn't define the
> >> size of a character the length of an arbitrary string will be
> >> implementation specific.
> >
> > This claim is false, by definition, since length is specified in terms
> > of a count, and not in terms of widths in some other units of measure.
>
> Here is an arbitrary string encoded in UTF-8: "𐀀" [You
> may generate it in CLISP using (string (code-char #x10000))]. It
> consists of a single code point.

No. You have to specify an external format; you cannot generate it just
with (string (code-char #x10000)). For example, in my case, it gives
this error:

Oops, that was with -E utf-16...

Rather try:

    (with-open-file (out "test.utf-8" :direction :output
                        :if-does-not-exist :create
                        :if-exists :supersede
                        :external-format charset:utf-8)
        (princ (string (code-char #x10000)) out))
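
To check what actually landed in the file, one can reopen it as a
stream of octets rather than characters. A minimal sketch in portable
Common Lisp; file-octet-count is a name made up here, and the usage
line assumes "test.utf-8" was written by the form above:

```lisp
;; Hypothetical helper: count the 8-bit bytes stored in a file.
(defun file-octet-count (pathname)
  "Return the number of octets in the file PATHNAME."
  (with-open-file (in pathname :direction :input
                               :element-type '(unsigned-byte 8))
    ;; On a binary stream, FILE-LENGTH is measured in stream elements,
    ;; which here are octets.
    (file-length in)))

;; Assumes the file exists; U+10000 takes four octets in UTF-8,
;; so one would expect 4 here.
(file-octet-count "test.utf-8")
```

The file holds four octets while the string that was printed holds one
character; the external format is what maps between the two.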

 
> I expect (cl:length "𐀀") will NOT return 1 in a 16-bit
> character Allegro yet it will return 1 in CLISP and SBCL. I expect:

Not exactly. In any single-byte 8-bit encoding in clisp (ISO-8859-1,
say), this string, read one byte at a time (#xF0 #x90 #x80 #x80):
     "𐀀"
has a length of 4 characters.

In encodings with < 8 bits, it contains invalid characters:

$ /usr/local/bin/clisp -ansi -norc -q -E ascii
[1]> "𐀀"

*** - invalid byte #xF0 in CHARSET:ASCII conversion
Break 1 [2]>

Now, even when you're using a 7-bit encoding as the default external
format for files, terminal, etc., a string containing the Unicode
character of code #x10000 is always a string of one character:

[3]> (length (string (code-char #x10000)))
1

[4]> (string (code-char #x10000))

*** - Character #\u00010000 cannot be represented in the character set CHARSET:ASCII
Break 1 [5]>
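
The same distinction can be sketched without touching any external
format: count the characters with LENGTH, and separately compute how
many octets a UTF-8 encoder would emit. The two helper names below are
invented for illustration, and the code assumes CHAR-CODE returns
Unicode code points, as it does in clisp:

```lisp
;; Hypothetical helper: octets UTF-8 needs for one code point.
(defun utf-8-octets-for-code (code)
  "Number of octets UTF-8 uses for the code point CODE."
  (cond ((< code #x80) 1)
        ((< code #x800) 2)
        ((< code #x10000) 3)
        (t 4)))

;; Hypothetical helper: octets UTF-8 needs for a whole string.
;; Assumption: CHAR-CODE yields Unicode code points.
(defun utf-8-octet-length (string)
  "Sum of the UTF-8 octet counts of the characters of STRING."
  (reduce #'+ string
          :key (lambda (ch) (utf-8-octets-for-code (char-code ch)))))

;; Guard on CHAR-CODE-LIMIT: an implementation with 16-bit characters
;; has no character with code #x10000, and this form returns NIL there.
(when (< #x10000 char-code-limit)
  (let ((s (string (code-char #x10000))))
    ;; A list of the character count and the UTF-8 octet count.
    (list (length s) (utf-8-octet-length s))))
```

Only numbers are printed, so this works even under -E ascii, where
printing the string itself fails.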

> (let ((s (copy-seq "𐀀")))
>   (setf (char s 0) #\A)
>   s)

You are abusing strings, using them to store _codes_ instead of
characters. This cannot be portable Common Lisp.

This whole subject is silly; it's like asking that (length "SGVsbG8=")
return 5 because (to-base64 "Hello") returns "SGVsbG8=".
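
To make the analogy concrete, here is a rough Base64 encoder.
octets-to-base64 is an illustrative stand-in for the to-base64 named
above, not a standard function, and turning the string into octets with
CHAR-CODE assumes an ASCII-compatible character set:

```lisp
(defparameter *base64-alphabet*
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")

;; Hypothetical stand-in for to-base64, operating on a vector of octets.
(defun octets-to-base64 (octets)
  "Encode the vector OCTETS (integers 0..255) as a Base64 string."
  (with-output-to-string (out)
    (loop for i from 0 below (length octets) by 3
          do (let ((n (min 3 (- (length octets) i)))) ; octets in this group
               (flet ((octet (j)
                        ;; Missing octets in the last group count as zero.
                        (if (< j n) (aref octets (+ i j)) 0)))
                 ;; Pack the group into 24 bits, then emit four 6-bit digits,
                 ;; replacing the digits past the real data with padding.
                 (let ((group (+ (ash (octet 0) 16)
                                 (ash (octet 1) 8)
                                 (octet 2))))
                   (dotimes (k 4)
                     (write-char
                      (if (<= k n)
                          (char *base64-alphabet*
                                (ldb (byte 6 (- 18 (* 6 k))) group))
                          #\=)
                      out))))))))

;; Assumption: CHAR-CODE gives ASCII codes for these characters.
(let ((encoded (octets-to-base64 (map 'vector #'char-code "Hello"))))
  ;; The encoded text has 8 characters; the original string has 5.
  (list (length encoded) (length "Hello")))
```

The encoded text and the string it encodes are different objects with
different lengths, and LENGTH answers for whichever one you hand it.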

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Cats meow out of angst
"Thumbs! If only we had thumbs!
We could break so much!"

