Re: Invariant with DIGIT-CHAR-P and the reader.



> * josephoswaldgg@xxxxxxxxxxx <wbfrcubfjnyq@xxxxxxxxx> [2005-05-17
> 09:56:14 -0700]:
>
>> what should
>>
>> (read-from-string
>>  (concatenate 'string
>>               (string #\ARABIC-INDIC_DIGIT_ONE)
>>               (string #\DEVANAGARI_DIGIT_TWO)
>>               (string #\BENGALI_DIGIT_THREE)
>>               (string #\GUJARATI_DIGIT_FOUR)
>>               (string #\TAMIL_DIGIT_FIVE)))
>>
>> return?
>> Do you seriously expect such a string to mean 12345?
>>
>
> Why not? How else would the producer of such a string mean it to be
> interpreted? Is it not just as unreasonable for the producer to
> deliberately intend such a Lisp symbol? You've deliberately chosen an
> edge case, so who would reasonably *rely* on behavior either way?

The CL reader parses as numbers things that "look like a number".
No one will look at the string above and say "yeah, that's a number".

> "What is a word" and "What is a number" is an application- or
> domain-specific question, which cannot be answered in a language spec
> or by an implementation.

This is precisely why the CL reader should _not_ interpret the above
string as a number: the CL reader operates in the CL domain, where
the above is not a number as per the CL syntax.

OTOH, (DIGIT-CHAR-P #\DEVANAGARI_DIGIT_TWO) returning 2 is useful
because this is a domain-neutral issue of the nature of the Unicode
character in question.
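
To make the distinction concrete, here is a small sketch in portable CL.
The helper name DIGIT-VS-READER is made up for illustration, and what it
returns for non-ASCII digits is implementation-dependent, so the last
form shows what one would check, not a guaranteed result.

```lisp
;; Hypothetical helper: compare what DIGIT-CHAR-P says about CHAR with
;; whether the reader parses the one-character string as a number.
(defun digit-vs-reader (char)
  "Return two values: (DIGIT-CHAR-P CHAR), and T if READ sees a number."
  (let ((*read-eval* nil))
    (values (digit-char-p char)
            (numberp (read-from-string (string char))))))

;; ASCII digit: the two views agree.
(digit-vs-reader #\2)

;; U+0968 is DEVANAGARI DIGIT TWO.  Guard the call, since an
;; implementation without Unicode support may not have this character.
(let ((char (and (< #x0968 char-code-limit) (code-char #x0968))))
  (when char
    (digit-vs-reader char)))
```

With the behavior argued for here, the last form would give 2 and NIL:
the character has digit weight 2, but the reader does not see a number.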

> Anyway, in the original context, once digit-char-p starts declaring
> things "numeric" there is always a danger such characters will get
> treated in ways that a naive Lisp program might be trying to mimic
> another program, and that other program may use the Lisp reader or
> something similar.

The Lisp reader is for Lisp data (including Lisp code).
[yes, it is extensible, but it is extensible to incorporate "Lisp-like"
data, not "every natural syntax you can imagine"; it is relatively easy
to make READ parse XML (CLOCC/CLLIB/xml.lisp), but not C]

There is no way to tell the CL printer to
print 2 as (string #\DEVANAGARI_DIGIT_TWO),
thus there is no reason for the reader to read 2 from
(string #\DEVANAGARI_DIGIT_TWO).

I hope we all agree on this.

> Instead of requiring every user of digit-char-p to sterilize his data,

What do you mean?
If your data contains Unicode characters, you should know about Unicode.

In Unicode, #\DEVANAGARI_DIGIT_TWO is a digit, and its weight is 2.
[it's not like CLISP is searching for substrings "TWO" in character
names :-)]
This is a statement on the same level as "in CL, (CAR NIL) returns NIL".
If you do not like what the Unicode international standard says, don't
use Unicode, use ASCII (yes, you can build CLISP in ASCII mode).
If you do not like (CAR NIL) ==> NIL, don't use CL, use Scheme.
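
For a program that really wants only the ten ASCII digits, whatever the
implementation knows about Unicode, a sketch follows.
ASCII-DIGIT-WEIGHT is a made-up name, not something from CLISP or the
standard.

```lisp
;; Hypothetical helper: an ASCII-only counterpart to DIGIT-CHAR-P.
(defun ascii-digit-weight (char)
  "Return the weight 0..9 if CHAR is one of the ten ASCII digits, else NIL."
  (position char "0123456789"))

(ascii-digit-weight #\7)   ; => 7
(ascii-digit-weight #\a)   ; => NIL
```

This keeps the choice of "which digits count" in the application, where
it belongs, and leaves DIGIT-CHAR-P free to follow Unicode.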



--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.iris.org.il> <http://www.dhimmi.com/> <http://www.camera.org>
<http://www.memri.org/> <http://www.palestinefacts.org/>
Those who value Life above Freedom are destined to lose both.