Re: How to display German Umlauts correctly on Mac and Unix

From: Jon A. Cruz (jon_at_joncruz.org)
Date: 03/25/04

  • Next message: Jonathan Fuerth: "Re: How to display German Umlauts correctly on Mac and Unix"
    Date: Thu, 25 Mar 2004 10:12:58 -0800
    
    

    Jonathan Fuerth wrote:
    >
    > Try "jikes -help" and notice the "-encoding" option. Yours may not have
    > it because you have to enable it at compile time (it makes the binary
    > larger due to translation tables).

    For many years, the default jikes builds that most users run have not
    included that option.

    Also try "jikes -help" and you'll see an option for -O to enable
    optimization. Again, this does nothing and has done nothing.

    If you have to custom compile build tools to get a feature, it's not
    safe to assume that feature is around when you start porting.

    > As for the other compilers that don't support UTF-8, this is possible
    > but unlikely.

    No.

    It's been very likely, as it has been the case for years. Things are
    changing, but many machines don't get the latest and greatest.

    > The JLS chapter 3 clearly states that "Java programs are
    > written in Unicode ..." and specifies a set of three lexical translation
    > steps, of which steps 1 and 2 refer to the source file as a "Unicode
    > stream." The UTF-8 encoding is not mandated, but it's relatively easy
    > to implement compared to Latin-N -> Unicode.

    I'm very familiar with the JLS. And with the Unicode translation steps.

    It's saying that in *concept* Java programs are streams of Unicode
    characters. And that before going through the compiler, whatever the
    input is has to become Unicode. Notice that it starts by saying they
    "are written using the Unicode character set". The "character set" is
    something different than encoding. The former refers to the high-level
    textual unit, while the latter refers to how those are represented using
    numbers and bytes and byte sequences.

    Remember, "character set" != "encoding" when talking of Unicode in this
    context.
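
    To make that distinction concrete, here is a minimal sketch. The class
    and method names (CharsetVsEncoding, hex) are made up for illustration.
    It takes one character from the Unicode character set, U+00FC, and
    prints the bytes three different encodings use to represent it:

```java
import java.nio.charset.StandardCharsets;

public class CharsetVsEncoding {
    // Render bytes as space-separated uppercase hex, e.g. "C3 BC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One character: U+00FC (u with diaeresis), written as an escape
        // so that this file itself stays pure ASCII.
        String s = "\u00FC";
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // FC
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 BC
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 FC
    }
}
```

    Same character, three different byte sequences. A compiler only sees
    the bytes, so it has to be told (or has to guess) which encoding
    produced them.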

    Also, your assertion that it's relatively easy has no bearing on whether
    or not marketing decisions, product management and development budgets
    have decided to implement this non-critical feature. Far more often than
    not they have chosen not to do so.

    >
    >> Also, it makes sources compile correctly no matter where they are
    >> compiled. Going cross-platform this can be a major issue. Going to
    >> non-US locales and data also hits this.
    >
    >
    > No, because UTF-8 is a Unicode transformation encoding and Unicode is
    > not locale-specific.

    However...

    Whatever compiler you're looking at using on those boxes has to have
    decided to follow your feeling and implement UTF-8 support. Most do not.

    >
    >> Otherwise, code can be broken by transfers, copying from one computer
    >> to another, etc. FTP can break it.
    >
    >
    > No. RFC 2277 (IETF Policy on Character Sets and Languages) clearly
    > states at the top of page 3, "Protocols MUST be able to use the UTF-8
    > charset ..." but even without explicit support, FTP in ASCII or binary
    > mode would not break UTF-8. CR - CRLF - LF conversions won't break
    > UTF-8. Other charset conversions (Latin-N to Latin-M) would.

    No... but broken FTP implementations (servers and clients), when sending
    in text mode (aka the "ASCII" misnomer), try to be helpful and "fix"
    things for you.

    >
    >> Network file shares can break it.
    >
    >
    > Which ones, and how?

    NFS and Samba are two of these (they are the ones I know off the top
    of my head, as they are the ones I've worked with). Some of the default
    options for Samba deal with character set and encoding.

    >
    >> CVS can break it.
    >
    >
    > Not likely, given the nature of UTF-8. Give me an example.

    Check in and out with different character sets and things get mangled.
    I've seen it happen at work. Much depends on your server, your specific
    clients and how "helpful" they are.
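
    As a rough illustration of that kind of mangling, the sketch below
    simulates a tool that wrongly assumes a UTF-8 file is ISO-8859-1. It is
    not a recreation of any particular CVS, FTP or Samba setup, and the
    names Mangle and mangle are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class Mangle {
    // Encode the text as UTF-8, then decode those bytes as if they were
    // ISO-8859-1, which is what a mistaken charset "conversion" does.
    static String mangle(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "M\u00FCller";    // 6 characters
        String seen = mangle(original);     // 7 characters
        // The single U+00FC became two characters, U+00C3 and U+00BC.
        System.out.println(original.length() + " -> " + seen.length());
        System.out.println(seen.equals("M\u00C3\u00BCller")); // true
    }
}
```

    Once a file has been re-saved in that state, the damage is baked into
    the bytes, and every later checkout gets the mangled version.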

    >
    >> Sneaker-net can break it.
    >
    >
    > Only if your floppydisk's filesystem munges the character set.

    Many do. Not the floppy itself, but the going in and out of "foreign"
    file systems stuff.

    Again, I've seen this going from PC to Mac all the time. Since I've been
    working on software for the non-US market on and off for about 1% years
    now, I've seen it happen and know to go looking for it.

    >
    >> And having worked in multi-platform teams, I've seen a lot of those
    >> first-hand.
    >
    >
    > So it should be a snap for you to give me concrete examples of each of
    > your claims.

    Remember, I said "can" in all these, not "will" or "most likely to". I
    don't have time at the moment to recreate all those situations (since I
    do have to work for a living in addition to the open source projects I
    help with).

    Just to be clear, I am a proponent of using UTF-8 for data, and do so
    mainly for XML, but for Java sources it's still not as safe as we
    would like it to be.
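
    One way to sidestep the source-encoding question is to keep the .java
    files pure ASCII and write everything else as \uXXXX escapes, which the
    JLS lexical translation steps handle in any conforming compiler. A
    minimal sketch of such a conversion follows. The class and method names
    are hypothetical, and it assumes the input has already been decoded
    correctly into a String:

```java
public class AsciiEscaper {
    // Replace every UTF-16 code unit above 0x7F with a six-character
    // Unicode escape (backslash, 'u', four hex digits). ASCII passes through.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A source line containing u-umlaut and sharp s.
        String line = "String greeting = \"Gr\u00FC\u00DFe\";";
        // Prints the same line with both characters turned into escapes.
        System.out.println(escapeNonAscii(line));
    }
}
```

    The escaped output is uglier to read, but it survives any transfer or
    tool that leaves plain ASCII alone.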

