Re: UTF of Java strings?
From: Mike Schilling (mscottschilling_at_hotmail.com)
Date: 10/11/04
- Previous message: blmblm_at_myrealbox.com: "Re: emacs Vs Eclipse?"
- In reply to: Chris Uppal: "Re: UTF of Java strings?"
- Next in thread: Chris Uppal: "Re: UTF of Java strings?"
- Reply: Chris Uppal: "Re: UTF of Java strings?"
- Reply: bugbear: "Re: UTF of Java strings?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Mon, 11 Oct 2004 07:31:01 GMT
"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote in message
news:g_WdnXypFJmpkPTcRVn-vw@nildram.net...
> Mark Thornton wrote:
>
>> I also remember quite a few early benchmarks that proved little more
>> than that the 'C' version used 1 byte for characters while Java was
>> always using 2.
>
> <grim chuckle/>
>
>
>> > One effect, of course, is that any code that uses the 'char' primitive
>> > data-type, or indexes in char[] data, however indirectly[*], is now
>> > automatically broken (at least for internationalised purposes). For
>> > instance String.charAt() and String.substring() are now /extremely/
>> > dubious -- as are most of the other methods of String.
>>
>> Fortunately many real uses of String/char will continue to work even if
>> only by accident.
>
> And there you have, I think, put your finger on the real problem. If the new
> "rules" were such that a mistaken dependence on 16-bitness were immediately
> obvious (perhaps caught by the type checker, perhaps just obvious at runtime
> because /every/ erroneous use resulted in garbage) then there'd be less of a
> problem. As it is, it's likely that there will be all kinds of bugs lurking
> that only show up when you hit a system with Unicode data that isn't
> expressible as 16-bit.
>
> It's that insecurity and fragility, more than anything, that makes me label
> the situation a "mess".
>
> As an aside: ironically enough, you can think of this mess as a direct result
> of the Java features -- static typing, final classes -- that are meant to
> provide protection against lurking bugs. In fact they made the system too
> rigid to cope with what (in another design) would have been a comparatively
> simple evolutionary step. Just personal opinion, of course.
It seems to me that the problem is that "char" is defined as a 16-bit,
more-or-less integer type, e.g. the guarantee that
    char c = 'x';                    // any char value
    short s = (short) c;
    boolean same = (c == (char) s);  // always true
If the size of char were undefined, and char were not convertible to any
integer type, I don't see why the Unicode change would have caused any
incompatibilities. Internally, char would have increased from 2 to 3 bytes,
but that wouldn't be visible. A String would still be convertible to and
from an array of chars; under any encoding, a String could still be
converted to and from a byte array; UTF-8 and UTF-16 would still behave
identically for 16-bit characters, and would also behave sensibly once
non-16-bit characters were introduced, etc. Code that pawed through UTF-8
or UTF-16 byte arrays might need to be reworked, but that's unavoidable: the
definitions of the encodings have changed (e.g. UTF-8 now uses 1-4 bytes per
character instead of 1-3). Code that stayed in the char/String world would
continue to work unchanged.
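
For anyone who wants to see the effect being discussed, here is a minimal
sketch, not anything from the thread itself. It assumes a JDK with the
code-point methods on String and Character (added in 5.0) and with
java.nio.charset.StandardCharsets (which came later, in Java 7). The class
name, the helper names and the choice of U+1D11E as the sample character are
all made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryCharDemo {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range,
    // so a Java String holds it as a surrogate pair of two chars.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Number of chars (UTF-16 code units); this is what length() reports.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Size in bytes when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size in bytes when encoded as big-endian UTF-16 without a byte-order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String s = "a" + CLEF + "b";

        System.out.println(charCount(s));      // 4 chars
        System.out.println(codePointCount(s)); // 3 characters

        // charAt(1) returns only the first half of the surrogate pair.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true

        // substring(0, 2) cuts the pair in two and leaves a dangling surrogate.
        String cut = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(cut.charAt(1))); // true

        System.out.println(utf8Length("\u20AC")); // 3 bytes: U+20AC fits in 16 bits
        System.out.println(utf8Length(CLEF));     // 4 bytes
        System.out.println(utf16Length(CLEF));    // 4 bytes: two 16-bit units
    }
}
```

Code that only ever sees 16-bit data gets the same answer from charCount and
codePointCount, which is why the dependence on 16-bitness tends to go
unnoticed until supplementary characters turn up.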