Re: Attention: European C/C++/C#/Java Programmers-Call for Input



Paul K. McKneely wrote:
> This language has an extended character set and, although
> all of the key words will (still) be in English, identifiers
> (i.e. names of things) can use additional European
> characters (such as those with accents, diaeresis, cedilla etc).

Like they already can in Java, C, and C++?

Support for Unicode characters in identifiers is in the C and C++
standards, but many compilers don't implement it. This may give you a
hint of how many people want it. Being German, I am satisfied if I can
use my fünny chäracters in cömments and strings. But even there, any
code that has a remote chance of being shared with anyone else gets
English comments. When a Finn came along wanting to help with a program
I wrote, I had quite some work explaining my German comments (as an
excuse, however, some of them were over 10 years old by then, written at
a time when I was not so fluent in English). But I would definitely not
switch programming languages just to use my funny characters.

> For efficiency, a 254-character subset of them are
> going to be used in creating a character space
> that encodes them into a single byte. These will
> not only be automatically byte-endian independent
> but will also be in alphabetic order so that sorting
> can take place directly on their numeric values.

Why ignore Unicode and invent yet another incompatible encoding? How
should people edit their source code? Remember, you'd have to build a
whole toolchain supporting your new character set. When my embedded
programs produce serial output in German, they use the Latin
transcription ("ae" for "ä" and so on), because terminal programs don't
even agree on whether to use Latin-1 or Codepage 437/850.
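
To make the terminal problem concrete, here is a minimal Python sketch
(Python is only my choice for illustration; nothing in this thread uses
it). It decodes one and the same byte under the three code pages
mentioned above and shows a hypothetical `transcribe` helper for the
Latin transcription. The helper name and its mapping table are
assumptions for this example, not code from any real firmware.

```python
# Sketch only: `transcribe` and TRANSCRIPTION are illustrative names,
# not taken from any real program.

# German letters and their conventional ASCII transcription.
TRANSCRIPTION = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def transcribe(text: str) -> str:
    """Replace German umlauts and sharp s by their ASCII transcription."""
    return "".join(TRANSCRIPTION.get(ch, ch) for ch in text)


# The byte a Latin-1 sender emits for "ä" ...
byte = "ä".encode("latin-1")

# ... and what a terminal shows for it, depending on its code page.
shown_as = {cp: byte.decode(cp) for cp in ("latin-1", "cp437", "cp850")}

if __name__ == "__main__":
    print(shown_as)
    print(transcribe("Füllstand über Maß"))
```

Only the terminal that happens to use the sender's code page shows "ä";
the transcribed text is plain ASCII and therefore looks the same on all
three.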

Automatic alphabetic sorting is not a useful goal for a character
encoding, because it's not possible in general, and it doesn't save you
any work if you want to do it right for your problem.

- In German telephone books, "ä" sorts as "ae" (the official Latin
transcription). In German dictionaries, "ä" sorts as "a". In Finnish,
it sorts after "z".

- Almost everywhere, "ß" sorts as "ss". It also doesn't have a
widespread capital equivalent (although a Unicode codepoint, U+1E9E,
has been allocated for it recently).

- In Turkish, the capital letter of "i" is "İ" (U+0130), and the
lower-case letter of the thing you know as a capital "I" is "ı"
(U+0131).
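
The three cases above can be sketched in a few lines of Python (again
only an illustration language). The key functions `phonebook_key`,
`dictionary_key` and `finnish_key` and the helper `turkish_upper` are
hypothetical, heavily simplified names made up for this example; they
are not complete collation or casing algorithms. The point is merely
that the same word list needs a different order per use case, so no
single numeric ordering of the character codes can be right for all of
them.

```python
# Sketch only: simplified, hypothetical helpers illustrating the rules
# listed above. Real programs would use a proper collation library.


def phonebook_key(word: str) -> str:
    """German telephone-book order: umlauts expand to vowel + 'e'."""
    w = word.lower()
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        w = w.replace(src, dst)
    return w


def dictionary_key(word: str) -> str:
    """German dictionary order: umlauts sort like the base vowel."""
    w = word.lower()
    for src, dst in (("ä", "a"), ("ö", "o"), ("ü", "u"), ("ß", "ss")):
        w = w.replace(src, dst)
    return w


def finnish_key(word: str) -> str:
    """Finnish order: å, ä, ö come after z (only these are handled)."""
    # Map å, ä, ö to the ASCII characters right after 'z', so that a
    # plain string comparison puts them at the end of the alphabet.
    return word.lower().translate(str.maketrans("åäö", "{|}"))


def turkish_upper(text: str) -> str:
    """Upper-case with the Turkish dotted/dotless i pairs."""
    return text.replace("i", "İ").replace("ı", "I").upper()


words = ["Ärger", "Adler", "Afrika", "Zebra"]

by_codepoint = sorted(words)                       # raw numeric values
by_phonebook = sorted(words, key=phonebook_key)
by_dictionary = sorted(words, key=dictionary_key)
by_finnish = sorted(words, key=finnish_key)

# Python's str.upper() is locale-independent and applies the default
# Unicode case mapping, which is not the Turkish one.
default_upper = {ch: ch.upper() for ch in "iıß"}

if __name__ == "__main__":
    for name, order in (("code point", by_codepoint),
                        ("phone book", by_phonebook),
                        ("dictionary", by_dictionary),
                        ("Finnish", by_finnish)):
        print(f"{name:10}: {order}")
    print(default_upper, turkish_upper("izmir"))
```

Note that "ß" becomes two letters when upper-cased by the default
mapping, so even the length of a string is not stable under case
conversion.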

Even though it might be possible to fit most Western and Central
European languages plus the standard ASCII repertoire into a common
8-bit character set, you'll probably have to ignore Cyrillic and Greek,
and still tweak a bit. Latin-1 and Latin-2 taken together have about 280
characters, not counting control characters.

One attempt at such a character set is the EBU character set used in
RDS/RBDS, e.g. ftp://ftp.rds.org.uk/pub/acrobat/rbds1998.pdf page 92; I
haven't checked how complete it is. However, it was probably designed
with the intent to implement it on 8-bit micros :-)

> A reference on the subject of European character sets would be
> much appreciated.

"The Unicode Standard, Version 5.0". Plus Wikipedia.


Stefan
