Re: UTF of Java strings?

From: Mike Schilling (mscottschilling_at_hotmail.com)
Date: 10/11/04

  • Next message: Thomas Weidenfeller: "Re: Recommend a book?"
    Date: Mon, 11 Oct 2004 07:31:01 GMT
    
    

    "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote in message
    news:g_WdnXypFJmpkPTcRVn-vw@nildram.net...
    > Mark Thornton wrote:
    >
    >> I also remember quite a few early benchmarks that proved little more
    >> than that the 'C' version used 1 byte for characters while Java was
    >> always using 2.
    >
    > <grim chuckle/>
    >
    >
    >> > One effect, of course, is that any code that uses the 'char' primitive
    >> > data-type, or indexes in char[] data, however indirectly[*], is now
    >> > automatically broken (at least for internationalised purposes). For
    >> > instance String.charAt() and String.substring() are now /extremely/
    >> > dubious -- as are most of the other methods of String.
    >>
    >> Fortunately many real uses of String/char will continue to work even if
    >> only by accident.
    >
    > And there you have, I think, put your finger on the real problem. If the
    > new "rules" were such that a mistaken dependence on 16-bitness were
    > immediately obvious (perhaps caught by the type checker, perhaps just
    > obvious at runtime because /every/ erroneous use resulted in garbage)
    > then there'd be less of a problem. As it is, it's likely that there will
    > be all kinds of bugs lurking that only show up when you hit a system with
    > Unicode data that isn't expressible as 16-bit.
    >
    > It's that insecurity and fragility, more than anything, that makes me
    > label the situation a "mess".
    >
    > As an aside: ironically enough, you can think of this mess as a direct
    > result of the Java features -- static typing, final classes -- that are
    > meant to provide protection against lurking bugs. In fact they made the
    > system too rigid to cope with what (in another design) would have been a
    > comparatively simple evolutionary step. Just personal opinion, of course.

    It seems to me that the problem is the "char" type being defined as a
    16-bit, more or less integer type, e.g. the guarantee that

        char c = 'x';                 // any char value works here
        short s = (short) c;          // narrowing cast keeps all 16 bits
        boolean same = c == (char) s; // always true

    If the size of char were undefined, and char were not convertible to any
    integer type, I don't see why the Unicode change would have caused any
    incompatibilities. Internally, char would have increased from 2 to 3 bytes,
    but that wouldn't be visible. A String would still be convertible to and
    from an array of chars; under any encoding, a string could still be
    converted to and from a byte array; UTF-8 and UTF-16 would still behave
    exactly as before for 16-bit characters, and would also behave sensibly
    once non-16-bit characters were introduced, etc. Code that pawed through
    UTF-8 or UTF-16 byte arrays might need to be reworked, but that's
    unavoidable: the encoded sequences have become longer (e.g. UTF-8 now needs
    up to 4 bytes per character instead of the 1-3 that suffice for 16-bit
    characters). Code that stayed in the char/String world would continue to
    work unchanged.
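
To make the 16-bit dependence concrete, here is a small sketch of what happens with a character outside the 16-bit range. It assumes a reasonably modern JDK (String.codePointCount and StandardCharsets are newer than the release discussed in this thread). The class name, the helper names and the sample character U+1D50A are chosen only for illustration and are not from the thread.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryCharDemo {
    // U+1D50A lies outside the 16-bit range, so Java stores it as a
    // surrogate pair, i.e. two char values. (Illustrative choice of character.)
    static final String SAMPLE = "a" + new String(Character.toChars(0x1D50A));

    // Number of 16-bit char units; this is what String.length() reports.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding of the string.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // The char <-> short round trip that the post cites as guaranteed.
    static boolean roundTrips(char c) {
        short s = (short) c;
        return c == (char) s;
    }

    public static void main(String[] args) {
        System.out.println(charUnits(SAMPLE));    // 3 char units
        System.out.println(codePoints(SAMPLE));   // 2 characters
        System.out.println(utf8Bytes(SAMPLE));    // 1 + 4 = 5 bytes
        // charAt(1) returns half of a character, not a character.
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(1))); // true
        System.out.println(roundTrips('\uFFFF')); // true
    }
}
```

Code that indexes by char (charAt, substring, length) sees three units for a two-character string, which is the kind of silent breakage discussed above. Code that only converts whole Strings to and from byte arrays is unaffected.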

