Re: Enhanced Unicode support for "Go" tools

From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 05/19/04


Date: Wed, 19 May 2004 13:05:32 +0100

Frank Kotler wrote:
> jorgon wrote:
> > If you are interested in Windows programming in assembler,
> > there are now various enhancements to GoAsm (free assembler)
> > and GoLink (free linker). In particular, GoAsm will now
> > assemble Unicode files (both UTF-8 and UTF-16).
>
> He's beat us to it, Beth! Nice work, Jeremy!

I _told_ you it was a good idea, didn't I? ;)

More impressive than what I proposed, though, as UTF-16 is
included and I was only talking about the most minor change in
supporting UTF-8...which, for everyone else reading - such as
maybe Rene and Randy to note, perhaps - is an "ASCII compatible"
version of UNICODE...in fact, for strict 7-bit ASCII, UTF-8 and
ASCII are _identical_...it just makes use of that undefined 8th
bit to encode variable-length UNICODE characters (all bytes in
such sequences, though, have the 8th bit set so it's quite
simple to just look at the 8th bit to work out what is normal
ASCII and what's non-ASCII :)...UTF-8 also tends to be the
typical default character encoding on many Linux distributions -
Red Hat's distro has almost abandoned the idea of supporting much
of anything else, making UTF-8 the default whatever your
locale - which is why we at LuxAsm are heading down this
path...but I can see why GoAsm has UTF-16 because, of course,
that's how Windows does things with 16-bit wide UNICODE
characters so, being on Windows, that option makes great sense
(though, unless you're typing lots of Kanji characters, the
UTF-8 will produce equal or smaller file sizes normally...most
certainly for Latin-based ASCII-like stuff...and everything in
the BMP (the first 16 bits' worth of code points, where the
majority of the supported languages are stored) fits in at most
3 bytes: 1 for ASCII, 2 or 3 for the rest)...
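
For anyone who wants to check the two claims above (the 8th-bit
test and the file-size comparison), here is a minimal sketch in
Python. The function names and sample strings are made up for
illustration and are not taken from GoAsm or LuxAsm; the sizes
are counted without any byte-order mark.

```python
def is_ascii_byte(b):
    """True when the 8th (top) bit is clear, i.e. plain 7-bit ASCII."""
    return (b & 0x80) == 0


def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, with no byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical sample strings: ASCII source text, Cyrillic, Kanji.
samples = {
    "latin": "mov eax, 1",
    "cyrillic": "Привет",
    "kanji": "漢字",
}

for name, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(name, utf8_size, utf16_size)

# Every byte of a multi-byte UTF-8 sequence has the top bit set,
# so a single bit test separates ASCII from non-ASCII.
print(all(not is_ascii_byte(b) for b in "Привет".encode("utf-8")))
```

The ASCII line comes out at half the size in UTF-8, the Cyrillic
one is a tie, and only the Kanji sample is smaller in UTF-16.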

If you're wondering, the basic idea I had in mind is very simple
(and could be applicable to other tools that currently only read
in ASCII files...as always, I don't patent or even really expect
credit, so long as someone is reasonable in not trying to
_steal_ credit...just stay honest about things and I won't mind
at all ;)...LuxAsm would read in UTF-8 source files...BUT, no,
no suggestion of having Japanese Kanji identifiers (not just far
too complex to support in all the UNICODE languages but also
would be difficult or impossible for people who don't speak
Japanese to read and follow)...the actual source code is still
just ordinary ASCII...identifiers follow the same typical rules
of A..Z, 0..9, underscore and so forth...the exception is that
the assembler would simply _pass through_ any non-ASCII
characters in character strings "as is" (and, for comments, they
are always ignored by any tool so it doesn't matter if there are
non-ASCII characters in comments either...must think this one
over some more, though, I think :)...
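
The "pass through" idea can be sketched roughly like this, in
Python rather than assembler just to keep it short. Everything
here is an assumption for illustration: the function name,
double-quoted strings with no escape sequences, and ";" as the
comment character. It is not LuxAsm's actual scanner, only one
way the rule could look.

```python
def scan_line(line, comment_char=b";"):
    """Return the string literals in one source line as raw bytes.

    Bytes with the 8th bit set are copied untouched when they sit
    inside a double-quoted string, and ignored after the comment
    character. Anywhere else they raise ValueError.
    """
    strings = []
    i = 0
    while i < len(line):
        current = line[i:i + 1]
        if current == b'"':
            # Looking for the closing quote byte is safe in UTF-8:
            # every byte of a multi-byte sequence has the top bit
            # set, so 0x22 can never appear inside one.
            end = line.find(b'"', i + 1)
            if end == -1:
                raise ValueError("unterminated string")
            strings.append(line[i + 1:end])
            i = end + 1
        elif current == comment_char:
            break  # the rest of the line is a comment
        elif line[i] & 0x80:
            raise ValueError("non-ASCII byte outside string or comment "
                             "at offset %d" % i)
        else:
            i += 1
    return strings


source = 'msg db "Привет", 0 ; комментарий'.encode("utf-8")
print(scan_line(source))
```

The scanner never decodes anything; it only looks at the top
bit, which is why so small a change would be enough for a tool
that currently reads plain ASCII.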

A minor little change but one that should hopefully help out
those who need to write applications where Latin and English
aren't the target...so, the Russian coder can tap out Cyrillic
character strings directly into their source code...or the
Israeli coder Hebraic strings...Thai programmers Thai
strings...the rest of it is _necessarily_ still plain old ASCII,
though...it wouldn't generally be of benefit, as well as the
fact that it would be insanely difficult to draft a language
that actually works in all languages equally...so it's a matter
of "standardisation" - that every developer can read the actual
source code (hey, we are "open source", after all ;) - before
"political correctness", so to speak...but, for character
strings to be displayed to users and such, they necessarily have
to be in the program's targeted language (or languages, if
multi-lingual)...

Comments will also probably allow non-ASCII too...though there's
a worry from this that it might become unreadable to other
programmers, it's not really a tool's place to dictate what is
really a social decision...and, anyway, we've already seen code
posted to this very group with German and Italian and such
comments attached...so, programmers will do this regardless
(there are many ways to ASCII-fy other languages to strict ASCII
developed when there was no support...forcing to strict ASCII
doesn't directly equate to forcing the English language...so, it
would not actually work to do so...hence, it should be a _social
agreement_ to standardise things like comments being in English,
not something that the tool should enforce...though I'm waiting to
see Rene's implementation of the idea that spell checks for
French words and rejects anything not written in French...well,
English is probably "anti-assembly" because it is the language
spoken by the "Monsanto Nazis" who follow Bush and so on and so
forth ;)...so, I'd say, let it through in comments too...

The rest, though, can stay ASCII because there are the matters of
implementation and standardisation...and, anyway, it would be
just silly to, say, support the same directive written in a
hundred different languages...that would do _no-one_ any
favours, really...

Frank? Have you seen my "unified model" post yet? What do you
think about it? Other than, of course, the obvious "oh my
goodness! She's totally certifiably insane, for sure!" reaction
I always garner from dropping the most radically "leftfield"
ideas out of nowhere on people without warning! ;)...

C? Have you seen _anything_ I've posted? Because I've not gotten
any reaction to even the UTF-8 idea from you yet, let alone the
newer stuff...did you not get it or are you still thinking about
it or am I just so insane that you're ignoring me, hoping I'll
just disappear??? ;)

Is the mailing-list stuff playing up again, like it was doing
before?

Oh, yeah...I also second Frank: Nice work, Jeremy!

Beth :)