Re: How to display German Umlauts correctly on Mac and Unix
From: Jon A. Cruz (jon_at_joncruz.org)
Date: 03/25/04
- Previous message: Jonathan Fuerth: "Re: JSplitPane moving alone"
- In reply to: Jonathan Fuerth: "Re: How to display German Umlauts correctly on Mac and Unix"
- Next in thread: Jonathan Fuerth: "Re: How to display German Umlauts correctly on Mac and Unix"
- Reply: Jonathan Fuerth: "Re: How to display German Umlauts correctly on Mac and Unix"
Date: Thu, 25 Mar 2004 10:12:58 -0800
Jonathan Fuerth wrote:
>
> Try "jikes -help" and notice the "-encoding" option. Yours may not have
> it because you have to enable it at compile time (it makes the binary
> larger due to translation tables).
By default, over many years, jikes has not included that for the
majority of users.
Also try "jikes -help" and you'll see an option for -O to enable
optimization. Again, this does nothing and has done nothing.
If you have to custom-compile your build tools to get a feature, it's not
safe to assume that feature will be around when you start porting.
> As for the other compilers that don't support UTF-8, this is possible
> but unlikely.
No.
It's been very likely, as it has been the case for years. Things are
changing, but many machines don't get the latest and greatest.
> The JLS chapter 3 clearly states that "Java programs are
> written in Unicode ..." and specifies a set of three lexical translation
> steps, of which steps 1 and 2 refer to the source file as a "Unicode
> stream." The UTF-8 encoding is not mandated, but it's relatively easy
> to implement compared to Latin-N -> Unicode.
I'm very familiar with the JLS. And with the Unicode translation steps.
It's saying that in *concept* Java programs are streams of Unicode
characters. And that before going through the compiler, whatever the
input is has to become Unicode. Notice that it starts by saying they
"are written using the Unicode character set". The "character set" is
something different than encoding. The former refers to the high-level
textual unit, while the latter refers to how those are represented using
numbers and bytes and byte sequences.
Remember, "character set" != "encoding" when talking of Unicode in this
context.
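To make the distinction concrete, here is a minimal sketch (the class and
method names are made up for illustration, and it assumes a JDK recent
enough to have java.nio.charset.StandardCharsets). It takes one character
from the Unicode character set, U+00FC, and shows that each encoding turns
it into a different byte sequence:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetVsEncoding {
    // Render bytes as space-separated uppercase hex, e.g. "C3 BC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the same text with the given encoding and return a hex dump.
    static String encode(String text, Charset encoding) {
        return hex(text.getBytes(encoding));
    }

    public static void main(String[] args) {
        // One character: U+00FC, LATIN SMALL LETTER U WITH DIAERESIS.
        String u = "\u00fc";
        System.out.println("ISO-8859-1: " + encode(u, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + encode(u, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + encode(u, StandardCharsets.UTF_16BE));
    }
}
```

Same character in all three cases; only the bytes on disk differ, and
those bytes are what a compiler actually has to read.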
Also, your assertion that it's relatively easy has no bearing on whether
or not marketing decisions, product management and development budgets
have decided to implement this non-critical feature. Far more often than
not they have chosen not to do so.
>
>> Also, it makes sources compile correctly no matter where they are
>> compiled. Going cross-platform this can be a major issue. Going to
>> non-US locales and data also hits this.
>
>
> No, because UTF-8 is a Unicode transformation encoding and Unicode is
> not locale-specific.
However...
Whatever compiler you're looking at using on those boxes has to have
decided to follow your feeling and implemented UTF-8 support. Most have not.
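One way to avoid depending on the compiler's encoding support at all is
to keep the source file pure ASCII and write the umlauts as Unicode
escapes, which are handled in the first lexical translation step. A rough
sketch follows; the class, constant and helper names are hypothetical, and
the helper only approximates what the JDK's native2ascii tool does:

```java
public class AsciiSafeUmlauts {
    // Written with Unicode escapes, so this source file holds only ASCII bytes
    // and reads the same under any ASCII-compatible source encoding.
    static final String GREETING = "Gr\u00fc\u00dfe aus K\u00f6ln";

    // Replace every char above 0x7F with a six-character escape sequence.
    static String toAsciiEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The string has 14 characters no matter how the file was stored.
        System.out.println(GREETING.length());
        // Shows the ASCII-only form you would keep in the source file.
        System.out.println(toAsciiEscapes(GREETING));
    }
}
```

The escapes are ugly to read, but they survive any transfer that leaves
plain ASCII alone.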
>
>> Otherwise, code can be broken by transfers, copying from one computer
>> to another, etc. FTP can break it.
>
>
> No. RFC 2277 (IETF Policy on Character Sets and Languages) clearly
> states at the top of page 3, "Protocols MUST be able to use the UTF-8
> charset ..." but even without explicit support, FTP in ASCII or binary
> mode would not break UTF-8. CR - CRLF - LF conversions won't break
> UTF-8. Other charset conversions (Latin-N to Latin-M) would.
No... but broken FTP implementations (servers and clients), when sending
in text mode (the misnamed "ASCII" mode), try to be helpful and "fix"
things for you.
>
>> Network file shares can break it.
>
>
> Which ones, and how?
NFS and Samba are two of these (they are the ones I know off the top
of my head, as they are the ones I've worked with). Some of the default
options for Samba deal with character set and encoding.
>
>> CVS can break it.
>
>
> Not likely, given the nature of UTF-8. Give me an example.
Check in and out with different character sets and things get mangled.
I've seen it happen at work. Much depends on your server, your specific
clients and how "helpful" they are.
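As an illustration of the kind of damage I mean, here is a small sketch.
It is not a reproduction of any particular CVS or FTP setup; it only
simulates a hypothetical tool that reads UTF-8 bytes as if they were
ISO-8859-1 and then writes them back out as UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class MangledRoundTrip {
    // Simulate a "helpful" tool: misread UTF-8 bytes as ISO-8859-1,
    // then re-encode the misread text as UTF-8.
    static byte[] misconvert(byte[] utf8Bytes) {
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00FC encoded as UTF-8 is the two bytes C3 BC.
        byte[] original = "\u00fc".getBytes(StandardCharsets.UTF_8);
        // After the bad conversion it is the four bytes C3 83 C2 BC.
        byte[] mangled = misconvert(original);
        String seen = new String(mangled, StandardCharsets.UTF_8);

        System.out.println(original.length + " bytes in, " + mangled.length + " bytes out");
        System.out.println("still the same text: " + seen.equals("\u00fc"));
    }
}
```

The result is still valid UTF-8, which is why this sort of breakage can
go unnoticed until someone looks at the strings on screen.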
>
>> Sneaker-net can break it.
>
>
> Only if your floppydisk's filesystem munges the character set.
Many do. Not the floppy itself, but the round trip in and out of
"foreign" file systems.
Again, I've seen this going from PC to Mac all the time. Since I've been
working on software for the non-US market on and off for about 1% years
now, I've seen it happen and know to go looking for it.
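For what "go looking for it" can mean in practice, here is one possible
check, sketched with hypothetical names: decode the bytes strictly as
UTF-8 and treat any malformed sequence as a sign that the file picked up
another encoding along the way.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // True if the bytes are well-formed UTF-8. Malformed input is
    // reported as an exception instead of being silently replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+00FC as UTF-8 (C3 BC) between two ASCII letters.
        byte[] utf8 = {'a', (byte) 0xC3, (byte) 0xBC, 'b'};
        // The same character as a lone ISO-8859-1 byte (FC).
        byte[] latin1 = {'a', (byte) 0xFC, 'b'};

        System.out.println("utf8 ok:   " + isValidUtf8(utf8));
        System.out.println("latin1 ok: " + isValidUtf8(latin1));
    }
}
```

This only catches files that came back in a single-byte encoding. Text
that was converted twice is still well-formed UTF-8, so it passes this
check and has to be spotted by eye.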
>
>> And having worked in multi-platform teams, I've seen a lot of those
>> first-hand.
>
>
> So it should be a snap for you to give me concrete examples of each of
> your claims.
Remember, I said "can" in all these, not "will" or "most likely to". I
don't have time at the moment to recreate all those situations (since I
do have to work for a living in addition to the open source projects I
help with).
Just to be sure, I am a proponent of using UTF-8 for data, and do so
mainly for XML, but for Java sources that's still not 100% safe as we
would like it to be.