Re: Java Newbie Question: Character Sets, Unicode, et al
From: Michael Borgwardt (brazil_at_brazils-animeland.de)
Date: 10/18/03
- Next message: Roedy Green: "Re: Outsourcing to India and China"
- Previous message: David Zimmerman: "Re: Question on default & protected member access"
- In reply to: BLG: "Java Newbie Question: Character Sets, Unicode, et al"
- Next in thread: Roedy Green: "Re: Java Newbie Question: Character Sets, Unicode, et al"
- Reply: Roedy Green: "Re: Java Newbie Question: Character Sets, Unicode, et al"
- Reply: John C. Bollinger: "Re: Java Newbie Question: Character Sets, Unicode, et al"
Date: Sat, 18 Oct 2003 01:08:14 +0200
BLG wrote:
> I understand that Unicode is a 16-bit character set and that true
> ASCII is a 7-bit representation.
Actually, Unicode is not really a character set in the way ASCII is,
and it is not restricted to 16 bits.
Unicode is a standard that assigns numeric codes (code points) to
abstract characters. How these codes are concretely represented as
bytes is what an encoding or charset specifies, which is what ASCII is.
There are encodings where the number of bits used varies from character
to character, like UTF-8. There are even stateful encodings.
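To make the "same codes, different bytes" point concrete, here is a small sketch. The class and method names are made up for illustration; the only library calls are String.getBytes(Charset) and Charset.forName. It encodes one string under three encodings and reports how many bytes each needs.

```java
import java.nio.charset.Charset;

class EncodingLengths {
    // Returns the number of bytes the given text occupies in the named charset.
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, written with an escape
        // so that this source file itself stays pure ASCII.
        String text = "caf\u00e9";
        String[] names = {"ISO-8859-1", "UTF-8", "UTF-16BE"};
        for (String name : names) {
            System.out.println(name + ": " + encodedLength(text, name) + " bytes");
        }
    }
}
```

The four characters take 4 bytes in ISO-8859-1, 5 in UTF-8 (the accented letter needs two) and 8 in UTF-16BE.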
> When I look at my source files in a
> hex editor, they appear to be extended ASCII 8-bit format (which I
> assume is the Windows default for a text file).
Namely Windows Codepage 1252, which is nearly the same as
ISO-8859-1, aka Latin 1, the most common encoding for western
European languages.
> OK - I assume then
> that the JRE uses Unicode character sets, but javac uses some 8-bit
> character set. Is this correct?
Nearly. How the JRE internally represents Strings is not really
specified, but the usual way is to use 16 bits per character
in a straightforward way. javac, on the other hand, uses the
platform default encoding (unless otherwise specified on the
command line), with an additional capability to use Unicode
escape sequences (\uxxxx, with a lowercase u), when reading in
source files. The class files contain Strings encoded as a
slightly modified form of UTF-8.
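A minimal sketch of the escape mechanism, with an illustrative class name of my own: the source below is pure ASCII, so it compiles the same way whatever encoding javac assumes. (The command line switch for overriding the source encoding is javac -encoding.)

```java
class UnicodeEscapeDemo {
    // The escape in the literal below is translated by the compiler before
    // the literal is parsed, so the string holds one character, not six.
    static final String E_ACUTE = "\u00e9";

    // Returns the numeric value of the first 16-bit char of the string.
    static int firstCodeUnit(String s) {
        return s.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println("length = " + E_ACUTE.length());
        System.out.println("code   = 0x" + Integer.toHexString(firstCodeUnit(E_ACUTE)));
    }
}
```

One caveat: the translation happens everywhere in the source file, including inside comments, so a malformed escape in a comment is still a compile error.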
> But beyond that, should I even care what the character set is?
> Assuming, of course, internationalization is not a priority for me.
Yes, it still is important when writing text out to or reading from
a file or network socket. It's quite likely that at some point
you'll use *some* non-ASCII character, and in fact it is not even
guaranteed that all encodings represent even pure ASCII text
identically.
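As a rough illustration of why this matters for I/O, the sketch below writes text through an OutputStreamWriter with an explicit charset and reads it back through an InputStreamReader. In-memory byte arrays stand in for the file or socket, and the class and helper names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class ExplicitCharsetIo {
    // Encodes text to bytes the way a file or socket writer would.
    static byte[] write(String text, String charsetName) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(out, Charset.forName(charsetName));
            w.write(text);
            w.close(); // flushes the encoder into the byte array
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes bytes back to text, assuming they are in the named charset.
    static String read(byte[] data, String charsetName) {
        try {
            Reader r = new InputStreamReader(
                    new ByteArrayInputStream(data), Charset.forName(charsetName));
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            r.close();
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = write("caf\u00e9", "UTF-8");
        // Same charset on both sides: the text survives.
        System.out.println(read(utf8, "UTF-8").equals("caf\u00e9"));
        // Wrong charset on the reading side: the accented letter turns into two characters.
        System.out.println(read(utf8, "ISO-8859-1").length());
        // Even pure ASCII text is not byte-identical in every encoding.
        System.out.println(write("abc", "US-ASCII").length + " vs " + write("abc", "UTF-16BE").length);
    }
}
```

Naming the charset on both sides avoids depending on whatever the platform default happens to be on the machine that reads the data.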
> Also, how do I determine what character set Windows is using?
More recent Windows versions (since 2000 I think) also use Unicode
internally as far as possible, but older applications that can't
handle Unicode use a "traditional encoding" that differs between
languages. This is the platform default encoding.
In Java, it's available as a system property, file.encoding.
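A quick way to check this from Java, as a sketch (the class name is made up): it prints the file.encoding system property next to Charset.defaultCharset(), which was added to the API after this thread was written. The output depends on the machine and the JDK, and recent JDKs may report UTF-8 regardless of the Windows setting.

```java
import java.nio.charset.Charset;

class DefaultEncoding {
    // Canonical name of the charset the JVM uses when none is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```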
> How do I change character sets in Windows?
There's an option in the country & language settings somewhere that
changes the default encoding used for older apps.
> And lastly, what is the relationship between a character set and a
> font?
An encoding defines relationships between numeric codes or byte
representations thereof and characters. A font defines how the
characters are drawn on the screen, i.e. their glyphs. Different
abstract characters can be (and sometimes are) assigned the same
glyph in a font, and nearly all fonts contain glyphs for only a
subset of the characters defined in Unicode.