Re: Ada, Gnat and Unicode

From: Robert I. Eachus (rieachus_at_comcast.net)
Date: 10/23/03


Date: Thu, 23 Oct 2003 15:49:34 GMT

Jano wrote:

> I'm thinking about the best procedure to internationalize some Ada
> program and I have some doubts. Please shed some light if you can.

Okay.

> AFAIK, the Ada Character type is the 256 first values from ISO 10646
> (Latin1). In the same fashion, Wide_Character are the 2**16 values of
> that same ISO. The ARM furthermore says that an implementation can
> provide alternate representations conforming to local conventions, but
> later it states that said representation should be a proper subset of
> these two. I'm not very sure about what that implies.

First, that is correct. By default Standard.Character is Latin1. Some
compilers, such as GNAT, allow other mappings.

Second, the Implementation Advice means just what it says. It is a
"nice to have" feature: if you choose, say, Latin2, there is a defined
mapping from Character to Wide_Character. If you choose some other
character set that is not in the BMP, such a mapping may not be possible.
(For example Klingon, or Japanese Shift-JIS. ;-) All the advice says is:
vendors, please, if the mapping makes sense, provide it. And in fact the
GNAT RM does document, under Implementation Advice, that the JIS and EUC
Japanese encodings do not follow it, because for these two encodings it
doesn't make sense to do so.
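
To make the "defined mapping" point concrete, here is a small sketch in Python (not Ada, simply because its standard codecs make the tables easy to inspect). It assumes Python's `latin-1` and `iso8859_2` codecs as stand-ins for the Character sets discussed above; the function names are made up for illustration and nothing here is GNAT's own mechanism.

```python
def latin1_is_identity() -> bool:
    """Latin-1 byte values equal the code points they map to (0..255)."""
    return all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))


def latin2_code_point(byte: int) -> int:
    """Code point that a single Latin-2 (ISO 8859-2) byte maps to."""
    return ord(bytes([byte]).decode("iso8859_2"))


print(latin1_is_identity())          # True
print(hex(latin2_code_point(0xB1)))  # 0x105, "a with ogonek"
```

Latin-2 needs a lookup table for its upper half, but every result still fits in 16 bits, which is why a mapping to Wide_Character is well defined for it.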

> Some old discussion suggest that 10646 and Unicode are equivalent, but
> it seems that later they dissociated. In any case Unicode is more than
> the 2**16 values that Wide_character can hold so I'm not sure that
> Wide_character is useful at all (?)

The best way to describe the relationship between ISO 10646-1 and
Unicode is that the BMP (and some other planes of ISO 10646-1) map
exactly onto Unicode, and vice versa. Each standard includes some things
that the other does not, but for the most part you can ignore the areas
where they differ. For example, the ISO 10646 definition of UTF-8 allows
any (4 octet, 32-bit) character to be represented in UTF-8, while the
Unicode standard only covers the encoding of Unicode characters.
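
As a rough illustration of that difference, the sketch below counts UTF-8 octets under both views. It is Python, and the two function names are hypothetical. It assumes the original ISO 10646 scheme of up to six octets for 31-bit values, and the Unicode limit of U+10FFFF. Surrogate code points are not rejected here, to keep the sketch short.

```python
def utf8_octets_iso10646(code_point: int) -> int:
    """Octet count under the original ISO 10646 UTF-8 scheme (31-bit range)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Exclusive upper bounds for 1, 2, 3, 4, 5 and 6 octet sequences.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for octets, limit in enumerate(limits, start=1):
        if code_point < limit:
            return octets
    raise ValueError("value does not fit in 31 bits")


def utf8_octets_unicode(code_point: int) -> int:
    """Same count, restricted to the Unicode range (at most four octets)."""
    if code_point > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return utf8_octets_iso10646(code_point)


print(utf8_octets_unicode(0x20AC))       # 3 (a BMP character)
print(utf8_octets_unicode(0x10FFFF))     # 4
print(utf8_octets_iso10646(0x7FFFFFFF))  # 6
```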

The practical effect of this is that characters outside the BMP but in
Unicode have at least two potential representations. But if you get
that far, you have already had to deal with the alternate
representations of characters in the BMP through composition. (For
example adding a cedilla to a "c".) Also, Unicode is stricter in
determining which encodings should and should not be used.
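
The cedilla case can be sketched as follows, in Python with the standard `unicodedata` module. The helper name `canonically_equal` is made up for illustration, and NFC is just one reasonable choice of canonical form, not something the standards force on you.

```python
import unicodedata

precomposed = "\u00e7"  # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"  # "c" followed by COMBINING CEDILLA


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)                   # False
print(canonically_equal(precomposed, decomposed))  # True
```

Both strings display the same character, yet a plain comparison of code points says they differ, and both representations sit entirely inside the BMP.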

If you use UTF-8 for source input in GNAT, be aware that it supports
UTF-8 only for BMP characters; full UTF-8, including 6 octet encodings,
is not supported. (Note that all Unicode characters are effectively
supported in GNAT, although you will have to write the two 16-bit
surrogate values as three octet sequences, giving a six octet encoding...)
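
Here is my reading of that six octet form as a Python sketch: split the character into a surrogate pair, then write each 16-bit half with the ordinary three octet UTF-8 bit pattern. The function names are hypothetical, and I have not checked this against what GNAT actually accepts, so treat it as an illustration of the arithmetic only.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def three_octets(value: int) -> bytes:
    """Three octet UTF-8 bit pattern for a 16-bit value in 0x800..0xFFFF."""
    return bytes([
        0xE0 | (value >> 12),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


def six_octet_form(code_point: int) -> bytes:
    """Each surrogate written as its own three octet sequence."""
    high, low = surrogate_pair(code_point)
    return three_octets(high) + three_octets(low)


print(six_octet_form(0x10400).hex())       # eda081edb080
print(chr(0x10400).encode("utf-8").hex())  # f0909080 (standard UTF-8)
```

The same character therefore has a four octet standard UTF-8 form and a six octet surrogate-based form, which is one source of the "two potential representations" mentioned earlier.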

> Anyhow, I was thinking of using UTF8 encoding. That's convenient as it
> can hold anything in the Unicode world, is space efficient, provides
> good interoperability with other languages/Packages (GtkAda, Java,
> ...).
>
> My doubt principally comes from behavior when you're not using a
> Latin1 OS, for example a Chinese Windows. When you do some I/O, for
> example a read from console with Text_IO.Get (Wide_Text_IO?). Or when
> using Gnat.Directory_Operations to enumerate files.
>
> I don't find information in the Gnat UG/RM about these things.

Look again in the GNAT User's Guide, under "Foreign Language Representation."

> What will these functions return? It's specified somewhere, or will they
> pass the bytes from the underlying OS calls inside a String so I can't
> know in advance what to expect?

The real problems are in interpreting Strings and Wide_Strings and
deciding when two Strings or Wide_Strings should compare equal. As long
as the canonicalization of the representations happens outside your
application, great. (For example, the OS probably provides a call for
converting a Unicode string to a canonical representation.) Unless you
really want to get deeply into writing Unicode (or ISO 10646-1) support,
use whatever internationalization facilities the OS provides. Doing a
better (or worse) job than the OS will get you no thanks, and neither
will implementing exactly the same rules if the OS is later updated.

-- 
                                                     Robert I. Eachus
"Quality is the Buddha. Quality is scientific reality. Quality is the 
goal of Art. It remains to work these concepts into a practical, 
down-to-earth context, and for this there is nothing more practical or 
down-to-earth than what I have been talking about all along...the repair 
of an old motorcycle."  -- from Zen and the Art of Motorcycle 
Maintenance by Robert Pirsig

