Re: Attention: European C/C++/C#/Java Programmers-Call for Input



Paul K. McKneely wrote:
I still do not understand why you want to use your own internal
representation instead of e.g. UTF-8. For any language using a Latin
script for identifiers, the effective string length is 1.0x or, in rare
cases, 1.1x the length of the identifier. For Cyrillic or Greek,
the ratio is 2.0.
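
Those ratios are easy to check. The following is a minimal Python sketch; the sample identifiers and the helper name `utf8_ratio` are made up for illustration and are not taken from any real code base.

```python
# Compare character count with UTF-8 byte count for a few sample identifiers.
# The identifiers below are hypothetical examples chosen for illustration.
samples = {
    "ASCII": "counter",
    "Latin with diacritics": "zähler",
    "Greek": "αριθμος",
    "Cyrillic": "счетчик",
}

def utf8_ratio(name):
    """Return the number of UTF-8 bytes per character for an identifier."""
    return len(name.encode("utf-8")) / len(name)

for script, name in samples.items():
    print(f"{script:22} {name:10} {utf8_ratio(name):.2f}")
```

Pure ASCII identifiers come out at exactly 1.0 and the Greek and Cyrillic ones at 2.0; an identifier with an occasional accented Latin letter lands a little above 1.0.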


I would suggest you start by giving up on all your thoughts of specific character sets. Simply make a straight decision now - you will use UTF-8. No other encodings - no Latin-1, no UTF-16, no home-made character sets, no extra fonts. Take it as a fixed decision and work with it for a few days to see how it fits your needs. Look at existing tools and source code that support UTF-8, and see how it can make your work easier and give a result that users might actually be able to *use*. If you really put in this effort and find that UTF-8 does not fit your needs, what have you lost? A couple of days' work here is a drop in the ocean compared to the man-years it will take to work with your home-made encoding, and you will at least have the benefit of a better understanding of your problem. You might even be able to explain it to other people in a way that makes sense.

Simply encoding a kazillion different characters
is not the whole picture. As Boudewijn Dijkstra
pointed out, trying to alphabetize all of the potential
UNICODE variables is impossible. (Those are his
words, not mine and the ramifications go far beyond
just this issue). So how do you alphabetize, search
and list on an unwieldy character set for many
purposes such as showing a listing to the programmer

If you need to alphabetize, there should be no shortage of existing library routines for sorting in UTF-8. It's not easy - differences in locales can cause endless troubles, so you might not get a perfect solution. But you'll find something that does a reasonable job and *will* work perfectly for most programmers who stick to ASCII identifiers.
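
As a rough illustration of the difference between raw code-point order and something closer to alphabetical order, here is a small Python sketch. The key function `rough_sort_key` is my own approximation (strip combining accents, fold case); it is not the Unicode Collation Algorithm and it ignores locale rules entirely, which is exactly where a proper library such as ICU would take over.

```python
import unicodedata

def rough_sort_key(identifier):
    """Approximate, locale-independent sort key.

    Compares base letters first (accents stripped, case folded) and uses
    the original string only as a tie-breaker.
    """
    decomposed = unicodedata.normalize("NFD", identifier)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), identifier)

# Hypothetical identifier list for illustration.
names = ["zeta", "Äpfel", "apple", "élan", "echo"]

print(sorted(names))                      # plain code-point order
print(sorted(names, key=rough_sort_key))  # accents and case ignored
```

With plain code-point order the accented names end up after "zeta"; with the rough key they sort next to their unaccented neighbours. For ASCII-only identifiers the two orders differ only in how upper and lower case are treated.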

A related problem arises if you make identifiers case-insensitive - it's hard to figure out case for non-ASCII characters. So stick to case-sensitive identifiers.
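
A few lines of Python show why case is awkward outside ASCII. This is only a sketch using the standard string methods; the helper `same_ignoring_case` is a hypothetical name, and the sample words are my own.

```python
def same_ignoring_case(a, b):
    """Compare two identifiers using Unicode case folding."""
    return a.casefold() == b.casefold()

# The German sharp s becomes two letters in upper case, so the length
# changes and the round trip does not give the original back.
print("straße".upper())
print("ß".upper().lower() == "ß")

# Case folding treats the two spellings as equal, which may or may not
# be what a language designer wants for identifiers.
print(same_ignoring_case("straße", "STRASSE"))

# The Turkish capital dotted I lower-cases to two code points.
print(len("İ".lower()))
```

Case mapping here is not one-to-one, not always reversible, and in real text it also depends on the locale. A compiler that compares identifiers byte for byte avoids all of it.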

in his tool chain? That is not to mention that 21 bits
(or 32 bits) are already used up in just the character's
code.

I have no clue as to what you are talking about here.

The new programming language supports fonts,
color (foreground and background), attributes, size etc.
Do you think it is a good idea
to have to expand these basic character codes to
64/96/128 or even 256 bits in width just to cram it all in?
The web people would want to encode it all in ASCII
HTML-style tags which I think is a really bad idea.

Are you suggesting that you are including font, colour, etc., directly in the source code? And here was me thinking that a proprietary character encoding was an "amazingly stupid idea".

The overwhelming consensus among responders to these
threads is that they are not going to use
anything beyond ASCII anyway. And with all of
this text stuff, you haven't even begun to talk about
how you are going to achieve all of the very advanced
(and very difficult) stuff in the programming language,
(much of which hasn't ever been done before)
while carrying this huge load of excess baggage

Who is "you" who are going to achieve all this? Do you mean the developers of the tools (i.e., you and your colleagues), or do you mean your users? And if it is us potential users, what is this "very advanced stuff" you are talking about? If we knew the specific aims of your language - what it is that makes it better than existing alternatives - it would be easier to advise you.

on your back. I needed to define some additional
characters that weren't in ASCII (and aren't in UNICODE)
for the purposes of the programming language (which
predates UNICODE and UTF-8 BTW). Additional

First off, you do *not* need to define additional characters. It's conceivable that your tools might *benefit* from additional characters (although, as I said, we know nothing about your tools). But they don't *need* them.

Secondly, Unicode has openings for additional domain-specific characters - you can add them without losing all the other benefits of Unicode (of course, you'll have to provide a suitable font).
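
The "openings" I have in mind are Unicode's Private Use Areas, for example U+E000 to U+F8FF in the Basic Multilingual Plane. As a minimal sketch, assuming a language-specific operator is assigned to U+E000 (that assignment and the names below are purely hypothetical), such a character passes through UTF-8 like any other:

```python
import unicodedata

# Hypothetical assignment: one language-specific operator placed at the
# first code point of the BMP Private Use Area.
CUSTOM_OPERATOR = "\uE000"

def is_private_use(ch):
    """True if the character is in a Unicode private-use category ("Co")."""
    return unicodedata.category(ch) == "Co"

encoded = CUSTOM_OPERATOR.encode("utf-8")

print(is_private_use(CUSTOM_OPERATOR))              # classified as private use
print(len(encoded))                                 # bytes needed in UTF-8
print(encoded.decode("utf-8") == CUSTOM_OPERATOR)   # survives a round trip
```

Existing UTF-8 tools will store, copy and compare such a character without complaint; the only thing they cannot do is display it sensibly without your font.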

characters in APL being cited as the downfall for that
language is not well founded in light of the fact that,
when it came out, you had to put out a couple of
thousand dollars for a hard-wired specialized
terminal just to program in that language. That is
besides the fact that it was not designed for the
kinds of things that I want to do with it (such as
writing operating systems and device drivers)
Do you see my point(s)?


No, I don't see your point at all. It reads as though you are saying APL's lack of popularity was not that it had extra characters, but that it needed an expensive specialised terminal (which was solely because of its special characters).

The main reason for APL's lack of popularity *is* the special characters. Even though you don't need special hardware (you use a specialised keyboard map and extra fonts), the characters make it impossible for the non-expert to read and understand, and extremely slow to enter expressions. It is *vastly* easier to write, for example, "range(R)" than "ιR" because you don't have to find the special character. It is also *vastly* easier to read, pronounce and understand "range(R)" than "ιR", even if you have never used the language in question (Python). To take an example from Wikipedia's APL page, here is an expression to give a list of prime numbers up to R:

(∼R∈R∘.×R)/R←1↓ιR

The direct Python translation would be:

[p for p in range(2, R+1)
 if p not in [x*y for x in range(2, R+1) for y in range(2, R+1)]]

The APL version is certainly shorter - but nevertheless is slower and harder to write. APL's power and conciseness come from the power of its built-in functions, not the fact that most have a single weird symbol instead of a multi-character name.
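
For anyone who wants to run the Python version, here it is wrapped in a function. The function name `primes_up_to` and the intermediate variable names are my additions, and I have used a set for the products rather than a list, which changes only the lookup speed and not the result.

```python
def primes_up_to(R):
    """Return the primes up to R, mirroring the structure of the APL one-liner.

    Take the numbers 2..R, build their full multiplication table, and keep
    every number that does not appear in that table.
    """
    candidates = range(2, R + 1)
    products = {x * y for x in candidates for y in candidates}
    return [p for p in candidates if p not in products]

print(primes_up_to(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Every step has a name you can say out loud, which is the whole point.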

Simple, lean and mean, but more powerful
than anything we have now. That is what I am
shooting for. When symbols need to be
converted to whatever format when object
files are produced, that's where the necessary
conversions will be done.
This will keep the core of the tools much simpler
(and smaller and run faster) so that the whole project
won't collapse when I try to do the really difficult
things that were the primary goals that I started
out to accomplish in the first place.

So the extra memory consumption, e.g. in compiler symbol tables, is
negligible.

Regarding linkers, UTF-8 global symbol names should not be a problem,
unless the object language uses the 8th bit for some kind of signaling
(such as end of string) or otherwise limits the valid bit
combinations.

Of course the UTF-8 encoding may increase the identifier length, but
at least for a linker that usually examines only a specific number of
bytes, such as 32, the only risk is that two identifiers are not
unique within 32 bytes i.e. 16 characters in Greek or Cyrillic or 10
graphs in some East-Asian script.
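
The uniqueness risk can be sketched in a few lines of Python. The 32-byte limit is just the figure used above, not the behaviour of any particular linker, and the names `linker_key` and `collide` as well as the sample identifiers are hypothetical.

```python
LINKER_SIGNIFICANT_BYTES = 32  # example limit from the text, not a real linker's

def linker_key(identifier, limit=LINKER_SIGNIFICANT_BYTES):
    """Return the UTF-8 prefix that such a linker would actually compare."""
    return identifier.encode("utf-8")[:limit]

def collide(a, b):
    """True if two distinct identifiers look identical to the linker."""
    return a != b and linker_key(a) == linker_key(b)

# Two pairs of identifiers that differ only after the 20th character.
ascii_a = "a" * 20 + "_first"
ascii_b = "a" * 20 + "_second"
greek_a = "α" * 20 + "_first"
greek_b = "α" * 20 + "_second"

print(collide(ascii_a, ascii_b))  # False: the difference is within 32 bytes
print(collide(greek_a, greek_b))  # True: 16 Greek letters already fill 32 bytes
```

The ASCII pair stays distinct because the differing suffix starts at byte 21, while the Greek pair collides because the first 16 letters already use up all 32 significant bytes.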

Paul


I do want you to know that I do very much
appreciate your input. This issue about object
formats supporting UNICODE is going to be
a real help when it comes time to generating
machine code.




