Re: Attention: European C/C++/C#/Java Programmers-Call for Input
- From: David Brown <david.brown@xxxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Thu, 29 Jan 2009 22:48:13 +0100
Paul K. McKneely wrote:
I still do not understand why you want to use your own internal
representation instead of, e.g., UTF-8. For any language using a Latin
script for identifiers, the encoded string length is 1.0x or, in rare
cases, 1.1x the length of the identifier. For Cyrillic or Greek,
the ratio is 2.0.
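As a rough check of those ratios, the following Python sketch compares the UTF-8 byte length of an identifier with its character count. The helper name `utf8_ratio` and the sample identifiers are made up for illustration only.

```python
def utf8_ratio(identifier: str) -> float:
    """Return the number of UTF-8 bytes per character for an identifier."""
    return len(identifier.encode("utf-8")) / len(identifier)


# Illustrative identifiers: ASCII, Latin with one umlaut, Cyrillic, Greek.
samples = ["counter", "z\u00e4hler", "\u0438\u043c\u044f", "\u03bb\u03bf\u03b3\u03bf\u03c2"]
for name in samples:
    print(f"{name}: {utf8_ratio(name):.2f} bytes per character")
```

A short Latin-script name with an accented letter can exceed 1.1x, but across a whole symbol table the average should stay close to 1.0.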
I would suggest you start by giving up on all your thoughts of specific character sets. Simply make a straight decision now - you will use UTF-8. No other encodings - no Latin-1, no UTF-16, no home-made character sets, no extra fonts. Take it as a fixed decision and work with it for a few days to see how it fits your needs. Look at existing tools and source code that support UTF-8, and see how it can make your work easier and give a result that users might actually be able to *use*. If you really put in this effort and find that UTF-8 does not fit your needs, what have you lost? A couple of days' work here is a drop in the ocean compared to the man-years it will take to work with your home-made encoding, and you will at least have the benefit of a better understanding of your problem. You might even be able to explain it to other people in a way that makes sense.
Simply encoding a kazillion different characters
is not the whole picture. As Boudewijn Dijkstra
pointed out, trying to alphabetize all of the potential
UNICODE variables is impossible. (Those are his
words, not mine and the ramifications go far beyond
just this issue). So how do you alphabetize, search
and list on an unwieldy character set for many
purposes such as showing a listing to the programmer
If you need to alphabetize, there should be no shortage of existing library routines for sorting in UTF-8. It's not easy - differences in locales can cause endless troubles, so you might not get a perfect solution. But you'll find something that does a reasonable job and *will* work perfectly for most programmers who stick to ASCII identifiers.
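A minimal sketch of such a routine in Python, using only the standard library. It is an approximation, not a locale-aware collation: the key strips combining marks after NFKD decomposition so that accented letters sort next to their base letters. `sort_key` is a made-up helper; proper collation would rely on something like `locale.strxfrm` or an ICU binding.

```python
import unicodedata


def sort_key(identifier: str):
    """Approximate, locale-independent sort key (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", identifier)
    # Drop combining marks so that an accented letter is ordered like its base letter.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep the original string as a tie-breaker so distinct names never compare equal.
    return (base.casefold(), identifier)


names = ["zeta", "\u00e9cole", "alpha", "Epsilon"]
print(sorted(names))                # plain code point order
print(sorted(names, key=sort_key))  # accent- and case-insensitive order
```

For pure ASCII identifiers the two orders differ only in how upper and lower case are interleaved.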
A related problem arises if you make identifiers case-insensitive - it's hard to work out case mappings for non-ASCII characters. So stick to case-sensitive identifiers.
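Two well-known cases show why case mapping is awkward outside ASCII. The snippet below is only an illustration using Python's built-in string methods; the variable names are arbitrary.

```python
# Case mappings in Unicode are not always one-to-one.
sharp_s = "stra\u00dfe"                  # German word containing U+00DF (sharp s)
upper = sharp_s.upper()                  # sharp s expands to two letters
print(upper, len(sharp_s), len(upper))
print(upper.lower() == sharp_s)          # the round trip does not restore the original

dotted_capital_i = "\u0130"              # Turkish capital I with dot above
lowered = dotted_capital_i.lower()       # becomes 'i' plus a combining dot above
print(len(dotted_capital_i), len(lowered))
```

A case-insensitive symbol table would have to decide whether such pairs name the same identifier, which is exactly the decision that case-sensitive identifiers avoid.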
in his tool chain? That is not to mention that 21 bits
(or 32 bits) are already used up in just the character's
code.
I have no clue as to what you are talking about here.
The new programming language supports fonts,
color (foreground and background), attributes, size etc.
Do you think it is a good idea
to have to expand these basic character codes to
64/96/128 or even 256 bits in width just to cram it all in?
The web people would want to encode it all in ASCII
HTML-style tags which I think is a really bad idea.
Are you suggesting that you are including font, colour, etc., directly in the source code? And here was me thinking that a proprietary character encoding was an "amazingly stupid idea".
The overwhelming consensus among responders to these
threads is that they are not going to use
anything beyond ASCII anyway. And with all of
this text stuff, you haven't even begun to talk about
how you are going to achieve all of the very advanced
(and very difficult) stuff in the programming language,
(much of which hasn't ever been done before)
while carrying this huge load of excess baggage
Who is "you" who are going to achieve all this? Do you mean the developers of the tools (i.e., you and your colleagues), or do you mean your users? And if it is us potential users, what is this "very advanced stuff" you are talking about? If we knew the specific aims of your language - what it is that makes it better than existing alternatives - it would be easier to advise you.
on your back. I needed to define some additional
characters that weren't in ASCII (and aren't in UNICODE)
for the purposes of the programming language (which
predates UNICODE and UTF-8 BTW). Additional
First off, you do *not* need to define additional characters. It's conceivable that your tools might *benefit* from additional characters (although, as I said, we know nothing about your tools). But they don't *need* them.
Secondly, Unicode has openings for additional domain-specific characters - you can add them without losing all the other benefits of Unicode (of course, you'll have to provide a suitable font).
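One such opening is presumably the Private Use Area (U+E000 to U+F8FF in the Basic Multilingual Plane), which Unicode reserves for application-defined characters. The sketch below assumes a hypothetical language-specific operator assigned to U+E000; the assignment and the name `CUSTOM_OPERATOR` are illustrative, not anything from the tools under discussion.

```python
import unicodedata

# Hypothetical assignment: a language-specific operator placed at U+E000,
# the first code point of the Basic Multilingual Plane's Private Use Area.
CUSTOM_OPERATOR = "\ue000"

category = unicodedata.category(CUSTOM_OPERATOR)  # "Co" marks a private-use code point
encoded = CUSTOM_OPERATOR.encode("utf-8")         # an ordinary three-byte UTF-8 sequence
print(category, encoded)

# It round-trips through UTF-8 like any other character.
print(encoded.decode("utf-8") == CUSTOM_OPERATOR)
```

Any UTF-8 aware editor, diff tool or version control system handles such a character without changes; only the glyph needs a custom font.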
characters in APL being cited as the downfall for that
language is not well founded in light of the fact that,
when it came out, you had to put out a couple of
thousand dollars for a hard-wired specialized
terminal just to program in that language. That is
besides the fact that it was not designed for the
kinds of things that I want to do with it (such as
writing operating systems and device drivers)
Do you see my point(s)?
No, I don't see your point at all. It reads as though you are saying APL's lack of popularity was not because it had extra characters, but because it needed an expensive specialised terminal (which was solely because of its special characters).
The main reason for APL's lack of popularity *is* the special characters. Even though you don't need special hardware (you use a specialised keyboard map and extra fonts), the characters make the code impossible for the non-expert to read and understand, and expressions extremely slow to enter. It is *vastly* easier to write, for example, "range(R)" than "ιR" because you don't have to find the special character. It is also *vastly* easier to read, pronounce and understand "range(R)" than "ιR", even if you have never used the language in question (Python). To take an example from Wikipedia's APL page, here is an expression to give a list of prime numbers up to R:
(∼R∈R°.×R)/R←1↓ιR
The direct Python translation would be:
[p for p in range(2, R+1)
 if p not in [x*y for x in range(2, R+1) for y in range(2, R+1)]]
The APL version is certainly shorter - but it is nevertheless slower and harder to write. APL's power and conciseness come from the power of its built-in functions, not from the fact that most have a single weird symbol instead of a multi-character name.
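For anyone who wants to try the Python translation, here is a runnable sketch of the same idea. The function name `primes_up_to` is invented for this example, and a set replaces the inner list purely to make the membership test cheaper; the logic is otherwise the same outer-product approach as the APL expression.

```python
def primes_up_to(r: int) -> list:
    """Return the primes in 2..r using the same outer-product idea as the APL one-liner."""
    candidates = range(2, r + 1)
    # Every product x*y with x, y >= 2 is composite (APL: R outer-product R).
    products = {x * y for x in candidates for y in candidates}
    # Keep the candidates that never appear as such a product.
    return [p for p in candidates if p not in products]


print(primes_up_to(20))
```

Like the original, this takes time quadratic in R, so it illustrates the notation rather than a practical sieve.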
Simple, lean and mean, but more powerful
than anything we have now. That is what I am
shooting for. When symbols need to be
converted to whatever format when object
files are produced, that's where the necessary
conversions will be done.
This will keep the core of the tools much simpler
(and smaller and run faster) so that the whole project
won't collapse when I try to do the really difficult
things that were the primary goals that I started
out to accomplish in the first place.
So the extra memory consumption, e.g. in compiler symbol tables, is
negligible.
Regarding linkers, UTF-8 global symbol names should not be a problem,
unless the object language uses the 8th bit for some kind of signaling
(such as end of string) or otherwise limits the valid bit
combinations.
Of course the UTF-8 encoding may increase the identifier length, but
at least for a linker that usually examines only a specific number of
bytes, such as 32, the only risk is that two identifiers are not
unique within 32 bytes i.e. 16 characters in Greek or Cyrillic or 10
graphs in some East-Asian script.
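The truncation risk can be illustrated with a short Python sketch. It models a linker that compares only the first 32 bytes of each symbol; `significant_bytes` and the sample identifiers are hypothetical and do not describe any particular object format.

```python
def significant_bytes(symbol: str, limit: int = 32) -> bytes:
    """Bytes that a length-limited linker would compare (hypothetical model)."""
    return symbol.encode("utf-8")[:limit]


# Two Greek identifiers of 17 letters that differ only in the last letter.
first = "\u03b1" * 16 + "\u03b2"
second = "\u03b1" * 16 + "\u03b3"
greek_collide = significant_bytes(first) == significant_bytes(second)

# The same pattern in ASCII stays distinct, since 17 letters need only 17 bytes.
ascii_collide = significant_bytes("a" * 16 + "b") == significant_bytes("a" * 16 + "c")

# Every byte of a non-ASCII character has the high bit set, so none can be NUL.
high_bits_set = all(byte & 0x80 for byte in "\u03b1\u0436".encode("utf-8"))

print(greek_collide, ascii_collide, high_bits_set)
```

The last check is the property that matters for object formats that reserve the 8th bit: such a format cannot carry UTF-8 names unchanged, whereas NUL-terminated formats are unaffected.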
Paul
I do want you to know that I do very much
appreciate your input. This issue about object
formats supporting UNICODE is going to be
a real help when it comes time to generate
machine code.