Re: Attention: European C/C++/C#/Java Programmers-Call for Input



"David Brown" <david.brown@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:O_6dnYV7E52xuR_UnZ2dnUVZ8u2dnZ2d@xxxxxxxxxxx
I would suggest you start by giving up on all your thoughts of specific
character sets. Simply make a straight decision now - you will use UTF-8.
No other encodings - no Latin-1, no UTF-16, no home-made character sets,
no extra fonts. Take it as a fixed decision and work with it for a few
days to see how it fits your needs. Look at existing tools and source
code that supports UTF-8, and see how it can make your work easier and
give a result that users might actually be able to *use*. If you really
put in this effort and find that UTF-8 does not fit your needs, what have
you lost? A couple of days work here is a drop in the ocean compared to
the man-years it will take to work with your home-made encoding, and you
will at least have the benefit of a better understanding of your problem.
You might even be able to explain it to other people in a way that makes
sense.

I want you to know that I lost most of a good night's
sleep over this post. In my anguish my brain mulled
it over and I came up with a plan. First, I will give
you some background (and a great deal of credit
for my suffering :). Original conception for ?Text
was circa 1985. Actual development began in 1988.
It is basically a superset of ASCII. The ASCII part,
as you well know, is not proprietary. But the key
point is that ?Text began in 1985 as a byte-endian
independent streaming format (as well as a flat 32-bit
character format) much like UTF-8 which itself
was in flux until late 2003. Both ?Text and UTF-8
use the high-order bit to determine what comes next
as escape bytes in their byte stream encoding.
Although streaming ?Text is much more all-inclusive
than UTF-8, its symbol set is not as large (which is
really all there is to UNICODE anyway). So I am
really not just starting out as you might have the
impression. I have probably 10 full man years
already into this. I just started working on the
5th generation of the ?Text editor since about
August and have been working on the 2nd
generation compiler since about a year ago.
It is early enough, I could make the 5th generation
editor change its course but I will have to start
completely over on the compiler (3rd generation).

I really have a lot more than you probably think
at stake. My business partner (in a medical
networking device communications business) keeps
urging me that I need to think about retiring in about
10 years. If I were to abandon what I have
already done, the whole thing would collapse and
I would have little more than UNICODE left.
Rather than do this, I would just give up and
do something completely different.

Want to hear the music album that I wrote from
2004 to 2007? You can get free excerpts at:
http://cdbaby.com/cd/pkmckneely
The theme music (with full synthetic orchestral
sound) is based on a science-fiction trilogy that I
am writing. To research the stories, my wife and
I spent nearly 3 weeks in South Africa to do
background research, get experience and go
birdwatching. I have over 4 hours of high-definition
video from the trip. Yes, I am a videographer as well.
Anyway, this is neither here nor there. I was just
letting you know that I have plenty of
other things to do during my retirement.


Sooooooooooo.......
I am not going to start (completely) over.

But listen to my plan....
The next generation system will become a binary
superset of the UNICODE/HTML suite *instead*
of ASCII. Keep in mind that you HAVE to put
wrappers around UNICODE to do anything at all.
That is data AND program wrappers. Even the
UNICODE standard states that it only gives you
raw character codes and does not tell you how to
process them. The next version of ?Text will be
that wrapper. So I have hashed out a way to
merge the two and the combination will still be
called ?Text which will be its "internal" format.
From ?Edit, you will be able to import either
plain old raw UNICODE via UTF-8 or UTF-8/HTML
with all of the visual properties of HTML. Or you
will be able to *load* and *save* from/to native ?Text.
The ?Text files should be considerably smaller
and easier to parse than their HTML equivalents.

But the compiler will require streaming ?Text files
for input because it is far more efficient and much
easier to parse than HTML. You can run a converter
as the first step in the tool chain if you can tolerate the
bloated HTML files as your main source code format.
Straight UTF-8 files will be smaller but they will lack
any visual enhancements. Either way, you can
use your favorite generic UNICODE or HTML editor.
But ?Edit will be much more useful and much
easier to use for programming in the ? programming
language. Plus the saved source files will be much
smaller and much easier to parse. A lexical analyzer
might be next in the tool chain which will accept only
the ?Text format. Following that is the parser and
then the code generator which will target the specific
processor architecture. Intermediate code optimizers
can be placed after the parser, or generic UNICODE-aware
but architecture-specific
optimizers can be placed after the code generator
(or as part of the code generator). Generic assemblers
can be used in the case that the output of the code
generator is assembly language.

Standard UNICODE-compatible linkers and
standard downloaders can be used off-the-shelf.

I am the holder of the domain name <phisystem.net>
which will be the central repository for all information
(including coding standards) for the ?System. I will
need some time to get the website up
and running but that is my plan. I have spent literally
many years writing embedded operating systems and
this whole thing is intended to go in the direction of
development of the ?OS operating system which will
be sort of a demonstration for how to write operating
systems, device drivers and applications in the new
language. There will be many parts to this system
so contributors will be welcome. I have to make
money somehow (my wife mostly pays the bills and
I get a contract job every now and then designing
micro-controller based circuit boards along with
embedded applications and doing industrial training
videos using computer animation) so my company's
website will probably be selling "how to" books and
training videos to aid developers besides the front-end
development tools. I have been a computer
animator and musical composer for about 5 years
so I will be able to offer quite a few products
in support of this system. What I would REALLY
like to do is make <phisystem.net> a clearing house
for independent software developers (that support
the ?System) and give them a 90% royalty of the
software that the organization sells for them (much
like what CDBaby does for independent musicians,
see the CDBABY link I gave above).

in his tool chain? That is not to mention that 21-bits
(or 32-bits) are already used up in just the character's
code.
I have no clue as to what you are talking about here.

If you look at UTF-8 more closely, it encodes a
series of 21-bit "flat characters" (which is the current
implementation of UNICODE). In other words,
the escape sequences have to be expanded to
21 bits to obtain the flat version of the character
in its full implementation. UTF-8 is just a way
to stream them out (as to a file) so that the
format is no longer byte-endian dependent. It is
also inherently very efficient for raw ASCII
character storage (just as in ?Text except that
it allocates no code space for anything but raw
character codes).
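
To make the "expand the escape sequence to a flat 21-bit value" step concrete, here is a minimal sketch in Python. The function names utf8_encode and utf8_decode_first are made up for illustration, and the decoder deliberately skips checks for truncated input and overlong forms, so treat it as a picture of the bit layout rather than a conforming decoder.

```python
def utf8_encode(cp):
    """Encode one Unicode code point (0..0x10FFFF) as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # up to 7 bits: plain ASCII, one byte, high bit clear
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_decode_first(data):
    """Return (code point, bytes consumed) for the first character in data.

    Simplified: does not reject truncated sequences or overlong forms.
    """
    lead = data[0]
    if lead < 0x80:                  # high bit clear: ASCII
        return lead, 1
    if lead >> 5 == 0b110:           # two-byte sequence
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:        # three-byte sequence
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:       # four-byte sequence
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    for b in data[1:length]:
        if b >> 6 != 0b10:           # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # append six payload bits
    return cp, length
```

Because the byte order is fixed by the encoding itself, the same byte stream decodes identically on big-endian and little-endian machines, and ASCII text passes through unchanged at one byte per character.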

According to Wikipedia, UTF-8 was an outgrowth
of ISO 10646 which was a 32-bit flat format.
UNICODE may grow to 32-bits (given the current
trend continues into the future) seeing how the original
16-bit version was found to be wanting.

The new programming language supports fonts,
color (foreground and background), attributes, size etc.
Do you think it is a good idea
to have to expand these basic character codes to
64/96/128 or even 256 bits in width just to cram it all in?
The web people would want to encode it all in ASCII
HTML-style tags which I think is a really bad idea.

Are you suggesting that you are including font, colour, etc., directly in
the source code? And here was me thinking that a proprietary character
encoding was an "amazingly stupid idea".

The overwhelming consensus among responders to these
threads is that they are not going to use
anything beyond ASCII anyway. And with all of
this text stuff, you haven't even begun to talk about
how you are going to achieve all of the very advanced
(and very difficult) stuff in the programming language,
(much of which hasn't ever been done before)
while carrying this huge load of excess baggage

Who is "you" who are going to achieve all this? Do you mean the
developers of the tools (i.e., you and your colleagues), or do you mean
your users? And if it is us potential users, what is this "very advanced
stuff" you are talking about? If we knew the specific aims of your
language - what it is that makes it better than existing alternatives - it
would be easier to advise you.

on your back. I needed to define some additional
characters that weren't in ASCII (and aren't in UNICODE)
for the purposes of the programming language (which
predates UNICODE and UTF-8 BTW). Additional

First off, you do *not* need to define additional characters. It's
conceivable that your tools might *benefit* from additional characters
(although, as I said, we know nothing about your tools). But they don't
*need* them.

Secondly, Unicode has openings for additional domain-specific characters -
you can add them without losing all the other benefits of Unicode (of
course, you'll have to provide a suitable font).
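
As a rough illustration of those "openings": Unicode reserves Private Use Areas (for example U+E000 to U+F8FF) that ordinary UTF-8 tools pass through untouched. The sketch below assumes a made-up operator assigned to U+E000; the name CUSTOM_OPERATOR and the sample source line are hypothetical and not part of any real language.

```python
import unicodedata

# Hypothetical language-specific operator placed in the Private Use Area.
CUSTOM_OPERATOR = "\uE000"


def is_private_use(ch):
    """True if ch is a private-use code point (general category 'Co')."""
    return unicodedata.category(ch) == "Co"


# A made-up source line that mixes ASCII with the custom character.
source_line = "result = a " + CUSTOM_OPERATOR + " b"

# Standard UTF-8 tooling round-trips the private-use character unchanged.
encoded = source_line.encode("utf-8")
decoded = encoded.decode("utf-8")
```

The character survives editors, diff tools and version control like any other UTF-8 text; what it looks like on screen still depends on a font that supplies a glyph for it.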

characters in APL being cited as the downfall for that
language is not well founded in light of the fact that,
when it came out, you had to put out a couple of
thousand dollars for a hard-wired specialized
terminal just to program in that language. That is
besides the fact that it was not designed for the
kinds of things that I want to do with it (such as
writing operating systems and device drivers).
Do you see my point(s)?


No, I don't see your point at all. It reads as though you are saying
APL's lack of popularity was not that it had extra characters, but that it
needed an expensive specialised terminal (which was solely because of its
special characters).

The main reason for APL's lack of popularity *is* the special characters.
Even though you don't need special hardware (you use a specialised
keyboard map and extra fonts), the characters make it impossible to read
and understand for the non-expert, and extremely slow to enter
expressions. It is *vastly* easier to write for example "range(R)" than
"⍳R" because you don't have to find the special character. It is also
*vastly* easier to read and pronounce, and to understand "range(R)" than
"⍳R" even if you have never used the language in question (Python). To
take an example from wikipedia's APL page, here is an expression to give a
list of prime numbers up to R:

(~R∊R∘.×R)/R←1↓⍳R

The direct Python translation would be:

[p for p in range(2, R+1)
 if p not in [x*y for x in range(2, R+1) for y in range(2, R+1)]]

The APL version is certainly shorter - but nevertheless is slower and
harder to write. APL's power and conciseness comes from the power of its
built-in functions, not the fact that most have a single weird symbol
instead of a multi-character name.

Simple, lean and mean, but more powerful
than anything we have now. That is what I am
shooting for. When symbols need to be
converted to whatever format when object
files are produced, that's where the necessary
conversions will be done.
This will keep the core of the tools much simpler
(and smaller and run faster) so that the whole project
won't collapse when I try to do the really difficult
things that were the primary goals that I started
out to accomplish in the first place.

So the extra memory consumption e.g. in compiler symbol tables is
negligible.

Regarding linkers, UTF-8 global symbol names should not be a problem,
unless the object language uses the 8th bit for some kind of signaling
(such as end of string) or otherwise limits the valid bit
combinations.

Of course the UTF-8 encoding may increase the identifier length, but
at least for a linker that usually examines only a specific number of
bytes, such as 32, the only risk is that two identifiers are not
unique within 32 bytes i.e. 16 characters in Greek or Cyrillic or 10
graphs in some East-Asian script.
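
The arithmetic above can be checked with a small sketch. It assumes a linker that compares only the first 32 bytes of each symbol, as in the example figure; the helper names significant_prefix and collide are invented for illustration and do not correspond to any real linker's behaviour.

```python
def significant_prefix(name, limit=32):
    """Bytes a hypothetical linker would compare: the first `limit` UTF-8 bytes."""
    return name.encode("utf-8")[:limit]


def collide(a, b, limit=32):
    """True if two distinct identifiers look identical within `limit` bytes."""
    return a != b and significant_prefix(a, limit) == significant_prefix(b, limit)


# ASCII: one byte per character, so 32 characters are significant.
ascii_a = "a" * 16 + "b"
ascii_b = "a" * 16 + "c"

# Greek: two bytes per character, so only 16 characters are significant.
greek_a = "\u03b1" * 16 + "\u03b2"   # sixteen alphas, then beta
greek_b = "\u03b1" * 16 + "\u03b3"   # sixteen alphas, then gamma

# CJK: three bytes per character, so only 10 whole characters fit in 32 bytes.
cjk_a = "\u5b57" * 10 + "\u4e00"
cjk_b = "\u5b57" * 10 + "\u4e8c"
```

Here collide(ascii_a, ascii_b) is False because the names differ at byte 17, while collide(greek_a, greek_b) is True because the differing character starts at byte 33. The CJK pair stays distinct only because the 11th character happens to differ within its first two bytes, which still fall inside the 32-byte window.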

Paul


I do want you to know that I do very much
appreciate your input. This issue about object
formats supporting UNICODE is going to be
a real help when it comes time to generating
machine code.




