Re: Enhanced Unicode support for "Go" tools

From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 05/24/04


Date: Mon, 24 May 2004 13:35:38 +0100

C wrote:
> Beth wrote:
> [snip: UTF-8]
>
> Good idea, maybe when we start the translation from NASM to
> Luxasm native source. Maybe worth a look at Digital Mars' D
> language -- its parser is open source and works for both
> UTF-16 and UTF-8

Cool; you can add on the other "UTFs", of course...but, under
Linux, UTF-8 is the most important because that's what the
kernel and most UNICODE-equipped tools (like Xterm) use...

> > ...LuxAsm would read in UTF-8 source files...BUT, no,
> > no suggestion of having Japanese Kanji identifiers
>
> ??????????? :-)
>
> I have thought of a way to make this possible (at least for
> labels). You just need a way to be able to inform the
> assembler what is part of a label and what is not --
> currently it is
> [a-zA-Z._@$][a-zA-Z0-9._@$]
>
> Defined by a lookup table in scan.asm -- this table could be
> extended to match ranges of non-ASCII characters with little
> difficulty other than the selections of the ranges themselves.
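
A rough sketch of what such an extended table might look like, in Python rather than assembly for brevity. It reads the pattern above as one start character followed by any number of continuation characters, which is an assumption. The names (EXTRA_RANGES, is_label and so on) and the particular code point ranges are hypothetical and chosen only for illustration; they are not taken from scan.asm.

```python
from bisect import bisect_right

# ASCII part, matching [a-zA-Z._@$] for the first character
# and [a-zA-Z0-9._@$] for the rest.
START_ASCII = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ._@$")
CONT_ASCII = START_ASCII | set("0123456789")

# Hypothetical non-ASCII ranges (inclusive), sorted by start.
# Picking the real set of ranges is the hard part; these are examples only.
EXTRA_RANGES = [
    (0x00C0, 0x00D6),  # accented Latin letters (skips U+00D7, the times sign)
    (0x00D8, 0x00F6),  # more accented Latin (skips U+00F7, the division sign)
    (0x00F8, 0x024F),  # rest of Latin-1 letters, Latin Extended-A/B
    (0x3040, 0x30FF),  # Hiragana and Katakana blocks
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]
RANGE_STARTS = [lo for lo, _ in EXTRA_RANGES]


def in_extra_ranges(cp: int) -> bool:
    """Binary search for the range whose start is <= cp, then check its end."""
    idx = bisect_right(RANGE_STARTS, cp) - 1
    return idx >= 0 and cp <= EXTRA_RANGES[idx][1]


def is_label_char(ch: str, first: bool) -> bool:
    cp = ord(ch)
    if cp < 0x80:
        return ch in (START_ASCII if first else CONT_ASCII)
    return in_extra_ranges(cp)


def is_label(text: str) -> bool:
    """True if every character is allowed at its position in a label."""
    return bool(text) and all(
        is_label_char(ch, i == 0) for i, ch in enumerate(text)
    )
```

With these ranges, is_label("loop1") and is_label("漢字_1") both hold, while "1loop" and "a×b" are refused. The mechanism really is small; deciding which ranges belong in the table is where the effort goes.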

Oh, indeed; but, from a practical point of view, do we want
Kanji or Arabic identifiers? Note that it's a complex
issue...for example, Arabic identifiers would need to be
right-to-left...with the rest of the LuxAsm source
left-to-right?? If you want "case insensitivity" then that's a
complex issue with some languages...or a Japanese programmer
releases his source code and all the identifiers are in Kanji
characters...non-Japanese speaking programmers, though, will
find this next to impossible to read...many characters look the
same and only differ in a few strokes here and there...the
Japanese eye is trained to see the differences but if there were
many, many identifiers that looked superficially all the same
and then there was a "jmp" instruction, how easy would it
practically be for a non-Japanese programmer to work out which
label it's jumping to?

I would have no problems with its inclusion from an
"internationalisation" stand-point but there's a whole arsenal
of _practical problems_...think of it from a _practical_
"standardisation" view-point...something everyone can deal
with...why English? Simple...it's one of the simpler ones
around, most programmers already have to deal with it and the
rest of the language - mnemonics, directives, etc. - are going
to be English-based, anyway...although, indeed, this could be
partially expanded to more like "Latin" and some of the accented
Latin characters in European languages could be included...

But if you're going to start allowing multiple scripts - which
are often so different from one another that coping with all
their various attributes side-by-side isn't a small
issue - then you are, on a practical note, likely to _piss off_
more people than you please...it's not a "xenophobic" thing - on
that note, I'd be 100% supportive and some Japanese source code
may help along my currently aborted attempts at learning some
Japanese (after all, I'm the one suggesting the UTF-8 support
;) - but simply a _standardisation_ thing...a _practical_
thing...is this a case of "political correctness" before common
sense, so to speak? The readability and usability of the source
code comes first...

And programming languages are NOT natural languages...they may
superficially "borrow" scripts and a certain mnemonic bias from
a particular natural language but they are (synthetic) languages
of their own accord...hand some English-based NASM source code
to a non-programmer ordinary English speaker and it might as
well be written in Japanese for all the lack of comprehension
they will have of what they see...

"Remember Oppenheimer", so to speak...the "possibility" of being
able to do something does not automatically mean it's actually a
wise move to pursue it...

> > The rest, though, can stay ASCII because there are the
> > matters of implementation and standardisation...and, anyway,
> > it would be just silly to, say, support the same directive
> > written in a hundred different languages...that would do
> > _no-one_ any favours, really...
>
> Especially with the extension to the size of the executable.

Exactly; that's why I suggest that LuxAsm syntax itself stays as
simple "just ASCII"...character strings (and comments to a
degree) are either passed on "as is" or ignored by the assembler
so, indeed, no sense to disallow anything there...but for simple
_practical_ implementation and usage reasons, keeping the actual
source code as "mostly ASCII", so to speak, makes sense...as I
say, "political correctness" before common sense would, in fact,
tend to do _no-one_ any favours...

Perhaps, if you insisted (though this would introduce a number
of complicated implementation issues...as, for example, the
possibility of right-to-left mixed with left-to-right), then it
could simply accept non-ASCII characters in user identifiers and
merely do a character-for-character string compare...you _could_
even support "case insensitivity" on those identifiers by
encoding in the standard UNICODE "case" information (this would
require a big table and a number of "lower -> UPPER" routines or
something to support...UNICODE also has "Title Case"...and some
languages have no concept of "case" whatsoever...so, there's an
extra two possibilities to throw on top of simple "lower" and
"UPPER" that need to be supported)...all of this is, of course,
possible...but you've got to ask yourself if the "cost vs.
benefit" balance here really is making sense?
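
To make the two options concrete, here is a small Python sketch. The function name same_label and its case_sensitive flag are made up for illustration. The case-insensitive branch leans on Python's built-in str.casefold(), which carries the Unicode case folding data for us; an assembler written in assembly would have to ship an equivalent table itself, and that is exactly the "big table" cost. The sketch deliberately ignores Unicode normalisation.

```python
def same_label(a: str, b: str, case_sensitive: bool = True) -> bool:
    """Compare two identifiers.

    case_sensitive=True  -> plain character-for-character compare.
    case_sensitive=False -> compare after full Unicode case folding,
                            which also covers title case letters and
                            leaves caseless scripts untouched.
    """
    if case_sensitive:
        return a == b
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(same_label("Loop", "loop"))                        # exact compare
    print(same_label("Loop", "loop", case_sensitive=False))  # folded compare
    # U+01C4 (upper), U+01C5 (title) and U+01C6 (lower) are one letter
    # in three cases; folding maps all of them to U+01C6.
    print(same_label("\u01c4", "\u01c5", case_sensitive=False))
    # Kanji has no case at all, so folding changes nothing.
    print(same_label("漢字", "漢字", case_sensitive=False))
```

Note that folding is not always one character to one character: "straße" folds to "strasse", so a simple per-character "lower -> UPPER" routine is not enough for every script.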

At the very least, I'd suggest that we ignore such things at
first...just "leave room" for the possibility...character
strings and comments are "as is" or ignored so those can be very
easily dealt with straight away...but then leave the other
considerations as a "future extension"...I have the basic
details available here as to how to go about it so that's a
possibility...but, in a sense, it's _because_ I've read those
details that it should be "reserved for later"...with character
strings and comments - because of what they are - you can simply
treat "as is" or completely ignore...but when you start
including the non-ASCII characters into the actual "code stream"
itself then to treat all those languages and scripts and
"locales" properly as a coder in those languages / scripts would
expect adds on a big chunk of extra work...

UNICODE actually defines, you see, "levels of compliance"...but
if you start allowing it anywhere, then it begins to have to
consider things from the highest "international word processor"
angle...that's possible, of course, but it's a lot of
work...what I was proposing at first here was something
_practically useful_ to people who use other scripts that's a
doddle to add...as UTF-8 is "ASCII compatible" then, really,
it's much the same but for recognising non-ASCII and allowing it
to pass "as is" in character strings or just totally ignoring it
inside comments...oh - and for "compliance" and "security" - to
reject the so-called "overlong" encodings...that is, it's
possible to encode an ASCII character with the non-ASCII
extended encoding and, for security reasons (and sensible and
simple processing algorithms), UNICODE defines that these are
"invalid" and should be _rejected_ (they are, in a sense,
possible but defined - for security and implementation reasons -
to be "invalid UTF-8" :)...but they are all in a particular
range of values so it's
easy to reject with a bunch of "if"-like constructs...
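
As an illustration of those "if"-like constructs, here is a minimal Python sketch of a UTF-8 checker. The function name first_utf8_error is hypothetical. Besides the overlong forms it also refuses surrogates and values above U+10FFFF, since they fall out of the same range checks; that goes slightly beyond what is described above.

```python
def first_utf8_error(data: bytes) -> int:
    """Return the index of the first invalid sequence, or -1 if all is well."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                 # plain ASCII, one byte
            i += 1
            continue
        # need = number of continuation bytes; lo..hi = allowed range
        # for the FIRST continuation byte (this is where overlongs die).
        if 0xC2 <= b0 <= 0xDF:        # 0xC0/0xC1 would be overlong ASCII
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:              # second byte < 0xA0 is overlong
            need, lo, hi = 2, 0xA0, 0xBF
        elif 0xE1 <= b0 <= 0xEF:      # 0xED 0xA0.. would be a surrogate
            need, lo, hi = 2, 0x80, (0x9F if b0 == 0xED else 0xBF)
        elif b0 == 0xF0:              # second byte < 0x90 is overlong
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b0 <= 0xF4:      # 0xF4 0x90.. is above U+10FFFF
            need, lo, hi = 3, 0x80, (0x8F if b0 == 0xF4 else 0xBF)
        else:                         # stray continuation, 0xC0, 0xC1, 0xF5+
            return i
        if i + need >= n:             # sequence runs off the end of input
            return i
        if not lo <= data[i + 1] <= hi:
            return i
        for k in range(2, need + 1):  # remaining bytes must be 10xxxxxx
            if not 0x80 <= data[i + k] <= 0xBF:
                return i
        i += need + 1
    return -1


if __name__ == "__main__":
    print(first_utf8_error("db '漢字'".encode("utf-8")))  # valid input
    print(first_utf8_error(b"\xc0\xaf"))                  # overlong "/"
```

An assembler could run a check like this once over the whole source file, then treat bytes of 0x80 and above as opaque inside character strings and comments.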

> > C? Have you seen _anything_ I've posted?
>
> Nope, the archive has not updated since 2004-03-30 and my
> email is very unreliable and drops a lot of posts. (Too much
> spam coming through -- I dread to think what the one I use
> for news groups gets, not that I ever check that one.)

Ah, that explains an awful lot...I was beginning to wonder if
the silence was a means of "protest" or something...but, yeah,
if you couldn't see the stuff, then, obviously, you couldn't
reply to it...it's because Frank was perfectly able to see them
that I was presuming you could too...but, as always,
"presumption is the root of all evil" ;)...

> > Because I've not gotten any reaction to even the UTF-8 idea
> > from you yet, let alone the newer stuff...did you not get it
> > or are you still thinking about it or am I just so insane
> > that you're ignoring me, hoping I'll just disappear??? ;)
>
> Not getting the posts (that is why I am posting stuff here
> instead of srcfrog). Anyway I have been a little too busy
> over the past week for coding, maybe I will get some time
> soon.

No, that's cool...just a case of getting worried when I write a
whole bunch of posts and then get no reaction at all...but,
indeed, if they weren't actually getting delivered to you in any
way, then that kind of explains the lack of reaction...

Beth :)