Re: Attention: European C/C++/C#/Java Programmers-Call for Input



I still do not understand why you want to use your own internal
representation instead of e.g. UTF-8. For any language using a Latin
script for identifiers, the UTF-8 byte length is 1.0x or, in rare
cases, 1.1x the number of characters in the identifier. For Cyrillic
or Greek, the ratio is 2.0.
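
For anyone who wants to check those ratios, here is a minimal sketch in Python. The helper name utf8_ratio and the sample identifiers are made up for illustration; the only thing it relies on is the standard str.encode call.

```python
def utf8_ratio(identifier):
    """UTF-8 byte length divided by the number of characters."""
    return len(identifier.encode("utf-8")) / len(identifier)


# Hypothetical identifiers, chosen only to illustrate the ratios.
samples = {
    "ASCII": "buffer_size",
    "Latin with one accent": "compteur_d\u00e9bit",  # e-acute takes 2 bytes
    "Cyrillic": "\u0440\u0430\u0437\u043c\u0435\u0440",
    "Greek": "\u03b1\u03c1\u03b9\u03b8\u03bc\u03bf\u03c2",
}

for script, ident in samples.items():
    print(f"{script}: {utf8_ratio(ident):.2f}")
```

The ASCII identifier comes out at exactly 1.0, the accented Latin one a little above 1.0, and the Cyrillic and Greek ones at 2.0, since every letter in those blocks takes two bytes in UTF-8.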

Simply encoding a kazillion different characters
is not the whole picture. As Boudewijn Dijkstra
pointed out, trying to alphabetize all of the potential
UNICODE variables is impossible. (Those are his
words, not mine, and the ramifications go far beyond
just this issue.) So how do you alphabetize, search
and list on an unwieldy character set for purposes
such as showing a listing to the programmer
in his tool chain?

That is not to mention that 21 bits
(or 32 bits) are already used up in just the character's
code. The new programming language supports fonts,
color (foreground and background), attributes, size etc.
Do you think it is a good idea
to have to expand these basic character codes to
64, 96, 128 or even 256 bits in width just to cram it all in?
The web people would want to encode it all in ASCII
HTML-style tags, which I think is a really bad idea.

The overwhelming consensus among responders to these
threads is that they are not going to use
anything beyond ASCII anyway. And with all of
this text stuff, you haven't even begun to talk about
how you are going to achieve all of the very advanced
(and very difficult) stuff in the programming language
(much of which hasn't ever been done before)
while carrying this huge load of excess baggage
on your back.

I needed to define some additional
characters that weren't in ASCII (and aren't in UNICODE)
for the purposes of the programming language (which
predates UNICODE and UTF-8, BTW). Citing the additional
characters in APL as the downfall of that
language is not well founded in light of the fact that,
when it came out, you had to put out a couple of
thousand dollars for a hard-wired specialized
terminal just to program in that language. That is
besides the fact that APL was not designed for the
kinds of things that I want to do (such as
writing operating systems and device drivers).
Do you see my point(s)?
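
On the alphabetizing question, one thing that can be shown concretely is what plain code point ordering gives you. The sketch below (Python, with made-up identifiers and hypothetical helper names) sorts a symbol list once as strings and once as raw UTF-8 bytes. It is not a proposal for how the tool chain should sort; it only illustrates that the two orders agree and that neither is a dictionary order.

```python
def codepoint_sorted(names):
    """Sort by Unicode code point (Python's default string ordering)."""
    return sorted(names)


def utf8_byte_sorted(names):
    """Sort the UTF-8 encodings bytewise, then decode back to strings."""
    encoded = sorted(name.encode("utf-8") for name in names)
    return [raw.decode("utf-8") for raw in encoded]


# Hypothetical symbol names: two ASCII, one Latin-1 letter, one Greek letter.
symbols = ["zeta", "\u03a9mega", "alpha", "\u00c4rger"]

print(codepoint_sorted(symbols))
print(utf8_byte_sorted(symbols))
```

Both calls give alpha, zeta, Ärger, Ωmega. The order is stable and cheap to compute, but "Ärger" lands after "zeta", which is not what a German speaker would call alphabetical. Whether that is good enough for a listing in a tool chain is exactly the judgment call being argued about here.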

Simple, lean and mean, but more powerful
than anything we have now. That is what I am
shooting for. If symbols need to be
converted to some other format when object
files are produced, that is where the necessary
conversions will be done.
This will keep the core of the tools much simpler
(and smaller and faster) so that the whole project
won't collapse when I try to do the really difficult
things that were the primary goals that I started
out to accomplish in the first place.

So the extra memory consumption, e.g. in compiler symbol tables, is
negligible.

Regarding linkers, UTF-8 global symbol names should not be a problem,
unless the object language uses the 8th bit for some kind of signaling
(such as end of string) or otherwise limits the valid bit
combinations.
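
The 8th-bit caveat can be checked mechanically. The sketch below is Python with hypothetical helper names, and it does not model any particular object format. It reports which UTF-8 bytes of a symbol have bit 7 set and whether a NUL byte can appear inside the name.

```python
def bytes_with_high_bit(identifier):
    """Hex values of the UTF-8 bytes that have the 8th bit (0x80) set."""
    return [hex(b) for b in identifier.encode("utf-8") if b & 0x80]


def survives_7bit_format(identifier):
    """True if a format that reserves the 8th bit could store this name."""
    return all(b < 0x80 for b in identifier.encode("utf-8"))


def contains_nul(identifier):
    """True if the UTF-8 encoding contains a zero byte."""
    return 0 in identifier.encode("utf-8")


for name in ["main", "\u03c0_value"]:  # second name starts with Greek pi
    print(name, bytes_with_high_bit(name), survives_7bit_format(name))
```

Every byte of a multi-byte UTF-8 sequence has the 8th bit set, so a format that uses that bit as an end-of-string flag would misread such names. On the other hand, UTF-8 never produces a zero byte for anything except U+0000 itself, so NUL-terminated symbol strings are unaffected.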

Of course the UTF-8 encoding may increase the identifier length, but
at least for a linker that usually examines only a specific number of
bytes, such as 32, the only risk is that two identifiers are not
unique within 32 bytes i.e. 16 characters in Greek or Cyrillic or 10
graphs in some East-Asian script.
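
A small Python sketch of that truncation risk follows. The 32-byte limit is the figure used above, not a property of any specific linker, and the function names are made up for illustration.

```python
LINKER_SIGNIFICANT_BYTES = 32  # figure from the post; real linkers vary


def significant_prefix(identifier, limit=LINKER_SIGNIFICANT_BYTES):
    """The bytes a length-limited linker would actually compare."""
    return identifier.encode("utf-8")[:limit]


def collides(a, b, limit=LINKER_SIGNIFICANT_BYTES):
    """True if two different names look identical to such a linker."""
    return a != b and significant_prefix(a, limit) == significant_prefix(b, limit)


def guaranteed_chars(bytes_per_char, limit=LINKER_SIGNIFICANT_BYTES):
    """Whole characters that are certain to fit inside the limit."""
    return limit // bytes_per_char


# Two Cyrillic names that differ only in their 17th character.
name_a = "\u044f" * 16 + "\u0430"
name_b = "\u044f" * 16 + "\u0431"
print(guaranteed_chars(2), guaranteed_chars(3), collides(name_a, name_b))
```

With 2 bytes per character the limit covers 16 whole characters, and with 3 bytes per character it covers 10, which matches the numbers above. The two sample names differ only past byte 32, so a linker that stops there would treat them as the same symbol.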

Paul


I do want you to know that I very much
appreciate your input. This issue about object
formats supporting UNICODE is going to be
a real help when it comes time to generate
machine code.

