Re: Attention: European C/C++/C#/Java Programmers-Call for Input
- From: Stefan Reuther <stefan.news@xxxxxxxx>
- Date: Tue, 27 Jan 2009 19:54:46 +0100
Paul K. McKneely wrote:
> This language has an extended character set and, although
> all of the key words will (still) be in English, identifiers
> (i.e. names of things) can use additional European
> characters (such as those with accents, diaeresis, cedilla etc).
Like they already can in Java, C, and C++?
Support for Unicode characters is in the C and C++ standards, but many
compilers don't implement it. This may give you a hint how many people
want it. Being German, I am satisfied if I can use my fünny chäracters
in cömments and strings. But even there, any code that has a remote
chance of being shared with anyone else gets English comments. When a
Finn came along wanting to help with a program I wrote, I had quite some
work explaining my German comments (in my defense, some of them were
over 10 years old at that point, written at a time when I was not so
fluent in English). But I would definitely not switch programming
languages just to use my funny characters.
> For efficiency, a 254-character subset of them are
> going to be used in creating a character space
> that encodes them into a single byte. These will
> not only be automatically byte-endian independent
> but will also be in alphabetic order so that sorting
> can take place directly on their numeric values.
Why ignore Unicode and invent yet another incompatible encoding? How
should people edit their source code? Remember, you'd have to build a
whole toolchain supporting your new character set. When my embedded
programs produce serial output in German, they use the Latin
transcription ("ae" for "ä" and so on), because terminal programs don't
even agree on whether to use Latin-1 or code page 437/850.
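To make the disagreement concrete, here is a minimal Python sketch using only the standard codecs. It shows that the same letter ends up as a different byte depending on the code page, and adds a small fallback to the Latin transcription. The names `TRANSCRIPTION`, `to_latin_transcription` and `encodings_of` are made up for this illustration, and the mapping covers only the German special characters.

```python
# Hypothetical helpers for illustration only; the table covers just the
# German special characters discussed in this thread.
TRANSCRIPTION = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def to_latin_transcription(text: str) -> str:
    """Replace German umlauts and sharp s with their ASCII transcription."""
    return "".join(TRANSCRIPTION.get(ch, ch) for ch in text)


def encodings_of(ch: str) -> dict:
    """Return the bytes a character encodes to in three common code pages."""
    return {name: ch.encode(name) for name in ("latin-1", "cp437", "cp850")}


if __name__ == "__main__":
    # The same letter is a different byte in Latin-1 than in CP437/CP850.
    print(encodings_of("ä"))
    # A terminal set to CP437 shows something other than "ä" for the Latin-1 byte.
    print(b"\xe4".decode("cp437"))
    # The transcription is plain ASCII, so every terminal shows it the same way.
    print(to_latin_transcription("Größe"))
```

Since the transcribed text is pure ASCII, it does not matter which of the three code pages the receiving terminal assumes.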
Automatic alphabetic sorting is not a useful goal for a character
encoding: it is not possible in general, and it doesn't save you any
work if you want to sort correctly for your problem.
- In German telephone books, "ä" sorts as "ae" (the official Latin
transcription). In German dictionaries, "ä" sorts as "a". In Finnish,
it sorts after "z".
- Almost everywhere, "ß" sorts as "ss". It also doesn't have a
widespread capital equivalent (although a Unicode code point, U+1E9E,
has been allocated for it recently).
- In Turkish, the capital letter of "i" is "İ" (U+0130), and the
lower-case letter of the thing you know as a capital "I" is "ı"
(U+0131).
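The three points above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the key functions and `turkish_upper` are hypothetical helpers written for this example, they handle only the letters mentioned here, and real software would use a proper collation library instead.

```python
def phone_book_key(word: str) -> str:
    """German phone-book order: umlauts expand to base letter plus 'e'."""
    table = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
    return "".join(table.get(ch, ch) for ch in word.lower())


def dictionary_key(word: str) -> str:
    """German dictionary order: umlauts sort like their base letter."""
    table = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}
    return "".join(table.get(ch, ch) for ch in word.lower())


def turkish_upper(text: str) -> str:
    """Upper-case with the Turkish dotted/dotless i rule applied first."""
    return text.replace("i", "İ").replace("ı", "I").upper()


words = ["Zahn", "Ärger", "Afrika", "Adler"]

# Sorting on raw code points puts "Ä" after "Z".
by_code_point = sorted(words)
# Phone-book rule: "Ärger" is compared as "aerger".
by_phone_book = sorted(words, key=phone_book_key)
# Dictionary rule: "Ärger" is compared as "arger".
by_dictionary = sorted(words, key=dictionary_key)

if __name__ == "__main__":
    print(by_code_point)
    print(by_phone_book)
    print(by_dictionary)
    # The default case mapping expands the sharp s to two letters.
    print("straße".upper())
    # The default mapping ignores Turkish rules; the helper applies them.
    print("istanbul".upper(), turkish_upper("istanbul"))
```

The same four words come out in three different orders, and none of the rules can be expressed by picking clever numeric values for single characters, because "ä" and "ß" have to compare as two letters in some of them.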
Even though it might be possible to fit most Western and Central
European languages plus the standard ASCII repertoire into a common
8-bit character set, you'll probably have to ignore Cyrillic and Greek,
and still tweak a bit. Latin-1 and Latin-2 taken together have about 280
characters, not counting control characters.
One attempt at such a character set is the EBU character set used in
RDS/RBDS, e.g. ftp://ftp.rds.org.uk/pub/acrobat/rbds1998.pdf page 92; I
haven't checked how complete it is. However, it was probably designed
with the intent of implementing it on 8-bit micros :-)
> A reference on the subject of European character sets would be
> much appreciated.
"The Unicode Standard, Version 5.0". Plus Wikipedia.
Stefan