Re: Attention: European C/C++/C#/Java Programmers-Call for Input



On Thu, 29 Jan 2009 09:28:09 -0600, "Paul K. McKneely"
<pkmckneely@xxxxxxxxxxxxx> wrote:


Did you miss the key point? *UNICODE*. They very specifically choose a
*standard* for their encodings, not something incompatible and
proprietary. In particular, it's very useful to be able to write comments
and strings in Unicode - many modern languages allow it. If you had
suggested using Unicode, or Latin-1, or listened to the idea when it was
suggested, then you'd have got far more support - it's the idea of having a
proprietary half-baked encoding that is incompatible with every other tool
that is "incredibly stupid".

My fault for phrasing my original question badly. I should
never have mentioned the words "character set". Forget that
there is an internal encoding method that is used in the compiler
tools for this new language whose codes will never be seen by its users.
The programming language supports only a subset of the complete
UNICODE character set regarding the Western European
alphabetics. The language only recognizes a maximum of 254
alphanumerics (Basic Greek and Cyrillic are included) for variable
names etc. including the underscore which is regarded as alphabetic
but ordinally precedes all others. If Western European
programmers had to choose a subset of these for language
support, which ones would they be?

I still do not understand why you want to use your own internal
representation instead of, e.g., UTF-8. For any language using a Latin
script for identifiers, the UTF-8 encoded length in bytes is 1.0x, or in
rare cases 1.1x, the length of the identifier in characters. For Cyrillic
or Greek, the ratio is 2.0.

So the extra memory consumption, e.g. in compiler symbol tables, is
negligible.
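
To put rough numbers on those ratios, here is a small Python sketch that
compares the character count of an identifier with its UTF-8 byte count.
The function name utf8_ratio and the sample identifiers are made up for
illustration; they are not from any real code base.

```python
def utf8_ratio(identifier: str) -> float:
    """Return the UTF-8 byte length divided by the number of code points."""
    return len(identifier.encode("utf-8")) / len(identifier)


# Hypothetical sample identifiers, one per script.
samples = {
    "ascii": "buffer_size",
    "latin_accented": "caf\u00e9_count",                 # one accented letter
    "greek": "\u03bc\u03b5\u03b3\u03b5\u03b8\u03bf\u03c2",  # Greek, no accents
    "cyrillic": "\u0440\u0430\u0437\u043c\u0435\u0440",     # Cyrillic
}

for label, name in samples.items():
    chars = len(name)
    octets = len(name.encode("utf-8"))
    print(f"{label:15} chars={chars:2} bytes={octets:2} ratio={utf8_ratio(name):.2f}")
```

Pure ASCII identifiers stay at one byte per character, an occasional
accented Latin letter adds one byte, and Greek or Cyrillic letters take two
bytes each.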

Regarding linkers, UTF-8 global symbol names should not be a problem,
unless the object language uses the 8th bit for some kind of signaling
(such as end of string) or otherwise limits the valid bit
combinations.
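
As a quick way to see what such an object format would have to tolerate,
the following Python sketch checks which bytes a UTF-8 encoded symbol name
actually contains. The helper names are hypothetical. It relies on two
properties of UTF-8: every byte of a multi-byte sequence has the 8th bit
set, and a zero byte only ever appears as the encoding of U+0000 itself.

```python
def uses_high_bit(identifier: str) -> bool:
    """True if any byte of the UTF-8 encoding has the 8th bit (0x80) set."""
    return any(b & 0x80 for b in identifier.encode("utf-8"))


def safe_for_nul_terminated(identifier: str) -> bool:
    """True if the UTF-8 encoding contains no zero byte."""
    return 0 not in identifier.encode("utf-8")


for name in ("size", "\u03b1\u03b2\u03b3", "\u0440\u0430\u0437\u043c\u0435\u0440"):
    print(name, uses_high_bit(name), safe_for_nul_terminated(name))
```

So a format that stores NUL-terminated names is fine, while one that
reuses the 8th bit as an end-of-string marker would misread any non-ASCII
name.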

Of course the UTF-8 encoding may increase the identifier length in bytes,
but at least for a linker that examines only a fixed number of bytes, such
as 32, the only risk is that two identifiers are not unique within those
32 bytes, i.e. within 16 characters in Greek or Cyrillic (2 bytes each) or
10 characters in some East Asian script (3 bytes each).
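
A minimal sketch of that risk, assuming a hypothetical linker that compares
only the first 32 bytes of each symbol name (the limit and the function
names are assumptions for illustration, not the behaviour of any particular
linker):

```python
LINKER_SIGNIFICANT_BYTES = 32  # assumed limit, taken from the example above


def linker_key(identifier: str, limit: int = LINKER_SIGNIFICANT_BYTES) -> bytes:
    """Return the leading bytes such a linker would actually compare."""
    return identifier.encode("utf-8")[:limit]


def collide(a: str, b: str, limit: int = LINKER_SIGNIFICANT_BYTES) -> bool:
    """True if two different identifiers look identical to that linker."""
    return a != b and linker_key(a, limit) == linker_key(b, limit)


# Two Cyrillic names that differ only in their 17th character.
prefix = "\u044f" * 16            # 16 letters, 32 bytes in UTF-8
name_a = prefix + "\u0430"
name_b = prefix + "\u0431"

# Two ASCII names that differ only in their 17th character.
ascii_a = "x" * 16 + "a"
ascii_b = "x" * 16 + "b"

print(collide(name_a, name_b))    # Cyrillic pair
print(collide(ascii_a, ascii_b))  # ASCII pair
print(LINKER_SIGNIFICANT_BYTES // 2, LINKER_SIGNIFICANT_BYTES // 3)
```

The Cyrillic pair collides because the difference falls outside the first
32 bytes, while the ASCII pair of the same character length does not.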

Paul
