Re: Enhanced Unicode support for "Go" tools

From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 05/19/04


Date: Wed, 19 May 2004 18:38:28 +0100

Betov wrote:
> Wannabee wrote:
> > Hi. Can someone explain why unicode is needed at all? What I
> > have "gotten" sofar, is that it makes possible using languages
> > with more than the normal amount of chars. But does it have
> > other important applications than this ?
>
> Well, first, it seems that Unicode is the default for the
> Api Functions A/W forms. Not that important, but, as i
> understand Jeremy Doc, when you call a "A" Type function,
> the OS 'translates' the concerned parameters, in order to
> call for the "W" Type form. Not that important IMO, at a
> timing point of view, but, having the "W" default, at
> least, indicates that we should not consider this a no
> use _added_ overhead... :)

Yes; In other words, Windows itself is fully UNICODE (NT-based
kernels, anyway...9x kernels are ANSI and don't handle the "W"
API)...the "A" functions are provided for "compatibility" with
the "obsolete" ANSI code page encodings...
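
A toy model of that "A" to "W" forwarding, just to make the idea concrete. This is NOT real Windows code: `set_title_a` and `set_title_w` are made-up names, and the choice of code page 1252 is an assumption for illustration only.

```python
def set_title_w(title: str) -> bytes:
    """Stand-in for a "W" entry point: the text ends up as UTF-16 (little-endian)."""
    return title.encode("utf-16-le")


def set_title_a(title: bytes, codepage: str = "cp1252") -> bytes:
    """Stand-in for the matching "A" entry point.

    It decodes the caller's bytes from the (assumed) ANSI code page and
    forwards the result to the "W" form, which does the real work.
    """
    return set_title_w(title.decode(codepage))


# The same text reaches the "W" routine either way; the "A" path just
# pays for one extra conversion first.
print(set_title_a(b"caf\xe9") == set_title_w("caf\u00e9"))
```

So the "A" call costs one extra translation step, which is the small overhead Betov mentions.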

> The real point is that Unicode means something for all
> people on earth with an oriental language.

That's such a gross over-simplification, it's potentially
seriously offensive...

How are Greeks or Cherokees or Russians or Thai or Africans or
Indians or Eastern Europeans or...or...how are all these people
"orientals"?

UNICODE 4.0 covers:

Basic Latin (ASCII), Latin-1 supplement (so you can type your
Frenchy letters ;), Latin extended (A and B), IPA extensions,
spacing modifier marks, combining diacritical marks, Greek and
Coptic, Cyrillic, Komi, Armenian, Hebrew, Arabic, Syriac,
Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan,
Myanmar, Georgian, Hangul Jamo, Ethiopic, Cherokee, unified
Canadian Aboriginal syllabics (yes, note "unified"...there's many
of these but they've put them all together for simplicity),
Ogham, Runic, Tagalog, Hanunoo, Buhid, Tagbanwa, Khmer,
Mongolian, Limbu, Tai Le, Khmer symbols, phonetic extensions,
Latin extended additional, Greek extended, general
(linguistically common) punctuation, superscript and subscript
digits and symbols, currency symbols, combining diacritical
marks for symbols, letter-like symbols (such as the trademark
symbol or "No." written as one character), number forms
(including "vulgar fractions" and Roman Numeral characters),
arrows (112 of them, to be exact! ;), mathematical operators,
miscellaneous technical symbols (such as APL), control pictures
(visible versions of the ASCII control characters for display
purposes...in a text editor, for example), OCR symbols, enclosed
alphanumerics (letters and numbers inside circles and so forth),
"box drawing characters" (Yay! Much more diverse and versatile
than IBM's versions too ;), block elements (those "teletext"
block characters and "left five eighths block" filled in),
geometric shapes, miscellaneous symbols (e.g. weather symbols,
astrological / astronomical symbols, signs of the zodiac, hands
with fingers pointing to the compass directions, skull and
crossbones (in the "poison" meaning rather than "pirates ahoy!"
meaning ;), radioactive sign, "biohazard" sign, "caduceus",
ankh, religious and political symbols (cross, orthodox cross,
crescent, hammer and sickle, CND peace sign, yin yang), smileys
(and a "frownie" too ;), trigrams, chess pieces, suits from a
deck of cards, simple musical symbols, plastic recycling
categories, general recycling symbol (in black or white), dice
faces, warning and high voltage symbols...yes, this section is
in itself an impressive array of symbols), "dingbats" (won't
list them but the typical "dingbats" stuff you'd expect,
UNICODE's postscript dingbat compatible, anyway), more
miscellaneous symbols, more "supplemental arrows", Braille
patterns (yes, _Braille_...one presumes for _printing_ purposes
with a special Braille printer, as it would be quite useless to
print Braille on a monitor and expect blind people to run their
fingers over it or whatever ;), even more "supplemental arrows"
as if we haven't had enough already, miscellaneous mathematical
symbols and supplemental mathematical operators (that they
forgot to put into the other block, I guess...stuff like
"mappings", "union", "intersection" and "approximately greater
than" and so forth ;), even more "miscellaneous symbols and
arrows"...

...and only now do we reach anything "oriental" with Hiragana
and Katakana and Kangxi and "unified CJK (Chinese, Japanese and
Korean) ideographs" and stuff...before:

Yijing hexagram symbols, Yi syllables / radicals, Hangul,
alphabetic presentation forms, Arabic presentation forms (A and
B), variation selectors, combining half marks, CJK compatibility
forms, small form variants, half-width and full-width forms,
"Specials", Linear B syllabary / ideograms (ancient language,
called "Linear B" presumably because no-one knows what its name
actually was, so old 'tis ;), Aegean numbers, Old Italic, Gothic
(those last two are actual scripts...though "italic" and
"gothic" are also double used as ways to write other
scripts...different things...though, undoubtedly, the "style" of
one influenced the name of the other), Ugaritic, Deseret,
Shavian, Osmanya, Cypriot syllabary, Byzantine musical symbols,
(traditional) musical symbols, Tai Xuan Jing symbols,
mathematical alphanumeric symbols (you know, the mathematical
"bold" and "italic" letters used to represent things like "real
numbers" and so forth)...

..and then, yes, the rest is CJK "oriental" ideographs (of which
there's thousands of the things! ;)...

Plus, proposals to include historical scripts like Ancient
Egyptian Hieroglyphics are on the cards, ready for
5.0...Tolkien's fictitious scripts were proposed but, apparently,
rejected (but then if you look at "Runic" then Tolkien's script
(one of the Elven scripts? I know there's supposed to be "high
elf" and "low elf" languages...for the "posh" and "common" elves
;) was basically very much based on other historical
scripts...but they took the idea of including them seriously, as
linguists may, indeed, look at Tolkien's scripts with academic
interest because they were _serious_ languages and a major
linguistic exercise, even if only applied for some fiction
books...in fact, Tolkien wrote the Lord of the Rings and so
forth because he believed that a language could not be divorced
from the culture that spawned it...hence, it - amazingly - works
backwards...he wrote the stories as "history" for his languages,
not the other way around...which is why they have such
credibility to them because they really are valid
languages...just not languages anyone actually spoke...made-up
languages, something like "Esperanto" but with more poetry to it
than that bland "uniform" joke of a language ;)...

Thankfully, I'm able to report that the inclusion of Klingon was
_REJECTED_...much to the annoyance of Trekkies...but to the
eternal delight and gratitude of everyone else! Although,
there's actually room for _millions_ more characters that they
should have included it just for the hell of it, I reckon...then
the Trekkies can have their own "Klingon only" newsgroups and
only annoy each other rather than the rest of us! hehehe ;)...

> I have no
> idea about how these guys can (or can not) input Unicode
> Chars, say, in a [MyUString: U$ '.......', 0] String,
> with RosAsm Sources Editor.

They can't because you've not "UNICODE enabled" your editor...

As horrible as using a "rich edit" control may be, Microsoft
have included the required support (including interfacing to
their special keyboard input methods and stuff) for "rich edit"
to handle this all...though, it doesn't completely excuse a
program from not "UNICODE enabling" itself...in fact, those
editors that just use "rich edit" blindly to handle text? That's
the perfect test of them: Drop some Thai poetry into them and try
"save"...see if they really are working properly ;)...

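The "drop some Thai in and hit save" test can be automated as a round trip. A minimal sketch, assuming the editor's save/load boils down to writing and reading text in one fixed encoding; `survives_save_and_load` and the sample string are invented for illustration.

```python
import os
import tempfile


def survives_save_and_load(text: str, encoding: str) -> bool:
    """Write text to a temporary file in `encoding`, read it back, compare."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        try:
            with open(path, "w", encoding=encoding, newline="") as f:
                f.write(text)
        except UnicodeEncodeError:
            return False  # this encoding cannot represent the text at all
        with open(path, "r", encoding=encoding, newline="") as f:
            return f.read() == text
    finally:
        os.remove(path)


THAI_SAMPLE = "สวัสดี"  # Thai for "hello"; any Thai text will do

for enc in ("utf-8", "utf-16", "cp1252"):
    print(enc, survives_save_and_load(THAI_SAMPLE, enc))
```

An editor that is really code-page-only behaves like the "cp1252" case here: the Thai never makes it to disk intact.
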
And, for the LuxAsm editor, I intend to make full use of my
"UNICODE 4.0" book to "enable" it as far as makes sense (well,
it is a fixed-width plain text source code editor, not a
full-blown word processor so it only needs to be implemented to
one of the lower UNICODE "compliance levels" ;)...Linux and
XFree86 themselves are already UTF-8 enabled, though...just a
question of using them properly to get all that
"internationalisation" (US: internationalization ;) working in
our favour, rather than against us...

If you want to allow this kind of thing into your editor, Rene,
then you've got to _program_ the support in...that's what I was
talking to Frank and C about on our mailing-list...at least
"preparing" the assembler to have this kind of thing added, if
not sticking all the support in from the beginning (it's NOT
that difficult, if you're always keeping one eye on making sure
it's all "compatible" from the very start...to think "UTF-8"
rather than "ASCII", for example ;)...

> We had a chineese user, in the past, who implemented
> the Uchar holding in the MainWindowProc, but, i do
> not even know if this works or not, and what conflict
> this could produce in the Assembler Parsers.

Yeah, it could "conflict" if you've not written RosAsm with this
in mind...but don't feel bad...other than Jeremy Gordon with his
announcement here, finding "UNICODE enabled" programming tools
is more rare than it really should be...you're in the same boat
as many tools...HLA has "made room" with a "wide character" data
type but that's about it (no "standard library" routines for it
or reading anything but ASCII, as far as I'm aware...HLA v2.0
any different, Randy? I could help you out there, Randy, in that
I have the big book and accompanying CD with all the needed
information...perhaps just a case of "make room" and I could
look at that later...licences prevent any direct "borrowing" of
anything I may manage with LuxAsm in this direction...but I
could "redo" it from scratch, so to speak...in the sense of
coding up UNICODE versions of your ANSI standard library
functions...this wouldn't be the same as anything in LuxAsm,
anyway, because we wouldn't have your library...and the
_knowledge_ of how to do it in my own brain is not under
copyright or GPL licence...so long as I start again - after all,
only applying knowledge of _UNICODE's_ standard, which isn't
LuxAsm-related at all - then it's not a licence breach)...

Do we surmise from the "had" - past tense - that perhaps not
supporting this stuff properly might have made your Chinese user
look elsewhere (like, perhaps, Jeremy's GoTools...which are
"real assembly" too, after all, and on your "recommendation
list" ;)?

> Best
> would be with writing a complete unicode Source
> Editor, but i have problems with this: 1) I can not
> do this work with a french keyboard,

Oh yes, you can...this is where Windows actually excels, for
once (well, "internationalisation" can also read as "world
domination" from a different perspective...Microsoft put the
effort in here because to rule all the world's languages is to
rule the world...*cue Sir Bill's "evil maniac" laughter echoing
into the distance* ;)...

[ The following is off XP...but if some other NT version, then
it's probably there but perhaps not quite in the same
places...near enough, though, that you should be able to work it
out yourself ;) ]

Click "Start" -> "Control Panel" -> "Regional and Language
Options" (sorry, don't know what that would be in French but it
should be easy to work out...it's the same thing but in the
French language and, knowing both English and French, the
translation shouldn't be hard for you, where it would be
difficult for me ;)...

Right, a whole bunch of tabs...but look to the "languages"
tab...first, you'll see something like "install files for
complex script and right-to-left languages (including Thai)" and
"install files for East Asian languages"...switch both on...yes,
I know, you can't speak Thai or Japanese...but the point is to get
the fonts and input support and such installed...you don't have
to actually type _logical_ or _meaningful_ Thai or Japanese just
to see if your editor can accept it and load / save and edit the
characters properly...just bang at the keys to put in some
random characters and then test editing and saving and stuff
(the rest will have to be down to any Thai or Japanese users
Emailing you with a "bug report" for the stuff that requires
some _knowledge_ about how to read and write these languages
:)...note that installing the files only "prepares" your machine
for drawing the fonts and writing text "backwards" in windows
(mind you, one interesting side effect is that, when switched
on, you'll now find that the "spam" sent from China and Japan
appears in Outlook in its original format...yes, of no use
whatsoever but kind of funny and interesting, anyway ;)...

At the top is some "text services and input languages"...now the
fun begins...click on the "details..." button and up pops the
language and "input method" dialogue...leave your "default input
language" alone unless you want to permanently type in Tamil or
whatever rather than French...and here's the cool bits for you:

At the bottom is a "language bar" button...click this and turn
on the various options but, most importantly, the "show the
language bar on the desktop" button...right, a little window
appears...and you can drag that onto your taskbar (I have it
next to the "Start" button ;)...this little utility is exactly
what you need because then you can switch languages and
keyboards with a single click (it also has keyboard shortcuts
for changing too ;)...

Right, now you have the means to change languages and keyboards
and stuff in software with a single click...

Go back to the "Text Services and Input Languages"
window...you'll see "installed services" with an entry for each
installed language with the different keyboard types listed
underneath for that language...click on "add..." and then up
pops a dialogue where you can select a language and the
corresponding keyboard types to go with it...so, for instance,
select "German (Germany)" and "keyboard German" in the
dialogues...

By the way, no need to panic here at all...it does NOT replace
your current language or keyboard...they all exist _at the same
time_ and you can change back and forth between them using that
"language bar" you added to your taskbar earlier...so, really,
feel free to add on as many as you like (I have UK, US (QWERTY
and Dvorak layout), German, Japanese (I was interested in
learning Japanese before) and Russian (I added it when talking
to Maxim about Cyrillic on the OS development group and, well,
just left it there...no, have no idea how to speak Russian at
all ;)...

Just remember, of course, to make sure that the "default"
setting at the top is your normal setting ("French (France)" or
whatever, presumably ;)...so you don't get stuck with a Thai
keyboard or something...

Then, you're all set to test out your editor or any other
application in a whole bunch of different languages...and, no,
no need to pop out and buy a different keyboard (that would be
useful only from the perspective that the keys would actually
have pictures of what the keys are on them rather than
QWERTY...but, hey, we don't actually speak Greek, do we? We just
want to be able to type in the characters to see if the editor
will cope with it properly, is all :)...Windows will handle it
all in software, you see...okay, it's not a 100% perfect
emulation because you don't have the right keyboard (the
Japanese thing uses an "IME" input method thingy which is a
standard for using a QWERTY keyboard to type in Japanese things
:)...but it's good enough for testing any "UNICODE enabled"
RosAsm editor you might want to work on...

To test it out, you'll find that good old "Notepad" is hiding
mysteries you never knew it had...yup, Notepad has been fully
"UNICODE enabled" in the XP version (at least) of Windows...so
you can switch to the Notepad window, click on a different
language / keyboard on your new handy "language bar"
switcher...and then just bang the keys to see what
appears...yup, up popped some Russian Cyrillics when I just did
it...I've also got a text file around here which has the
Japanese Hiragana and Katakana in a table with the English next
to it...put it together when I started my whole "let's learn
Japanese" thing...haven't looked at it since, mind you...the
idea was supposed to be that I'd learn all the characters and
stuff by looking at the text file :)...

So, under Windows, you can do it with your French keyboard, no
problemo...

Also, of course, there's the slower and less convenient - but
just as possible - thing of loading up "character map" and
clicking randomly on some Chinese characters, copying it into
the clipboard and then try pasting that into your editor or
something...

> 2) I do not see
> any reason for doubling the size of all Sources,

Ah, well...that's Microsoft and Windows for you...or should that
be: That's ReactOS for you, copying Microsoft's implementation
100%, however good or bad it is? Whatever...as I try to
constantly point out, it doesn't particularly matter who's using
the stupid implementation while it still remains a stupid
implementation...

UTF-8 - the Linux method - is the same as ASCII for any ASCII
characters...so it's the same size as any ASCII file..._except_
if there are some Hebrew or Arabic characters in the source
(presumably in the character strings or comments or whatever
:)...those, though, never go over 3 bytes for any character in
the BMP (which is where most of the supported languages are
housed, the everyday "oriental" ideographs included :)...and for
Hebrew, Arabic, Greek or Cyrillic it's the same 2 bytes that
Windows takes anyway...
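
To put numbers on that, here is a small sketch that measures the same text in UTF-8 and in 16-bit UNICODE (UTF-16, no byte order mark). The sample strings and the name `encoded_sizes` are invented for illustration.

```python
SAMPLES = {
    "ASCII": "mov eax, 1",
    "Hebrew": "שלום",
    "Russian": "привет",
    "Thai": "สวัสดี",
    "Japanese": "こんにちは",
}


def encoded_sizes(text: str) -> tuple:
    """Return (bytes as UTF-8, bytes as UTF-16 without a byte order mark)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in SAMPLES.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{name:8} chars={len(text):2} utf-8={utf8_size:2} utf-16={utf16_size:2}")
```

The ASCII line costs 10 bytes in UTF-8 against 20 in UTF-16, the four Hebrew letters cost 8 bytes either way, and the five Hiragana characters cost 15 bytes in UTF-8 against 10 in UTF-16. For a source file that is mostly ASCII with a few foreign strings, UTF-8 stays close to the plain ASCII size.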

So, you could load and save in UTF-8 and then you'd find next to
no difference in file sizes...BUT the problem here is that
Windows uses 16-bit wide UNICODE...it doesn't do UTF-8
itself...so, you'd have to add in your own ANSI -> UTF-8 ->
UNICODE conversion routines and such...it's not as difficult as
it sounds, if you _really_ wanted to add the support
there...really, you are "open source" and I've repeatedly said I'd
be happy to help you out, if _you_ could drop the silly "you
talk to Randy, so you are evil!" nonsense...but, if that still
remains a problem for you then there are some very good websites
about UNICODE and UTF-8...here's one that I reckon is good, as
it tries to be "comprehensive" (it's focussed on using UNICODE
and UTF-8 in _Linux_ - admittedly - but, of course, the
non-Linux bits apply to _ANY_ OS...you must pardon the Linux
bias but, well, LuxAsm is for Linux so you can guess why I
bookmarked this website rather than a Windows-specific one ;)...

http://www.cl.cam.ac.uk/~mgk25/unicode.html
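
For a feel of how small those conversion routines are, here is a hedged sketch of the bit-shuffling involved. The function names are invented, code page 1252 is only an assumed example of an "ANSI" code page, and the code page lookup itself is borrowed from Python's codec tables rather than written out by hand.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes).

    No validation: surrogate values and out-of-range input are not rejected.
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp: int) -> list:
    """Encode one code point as one or two 16-bit UTF-16 code units."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


def ansi_to_utf8(data: bytes, codepage: str = "cp1252") -> bytes:
    """ANSI bytes -> code points (via the codec table) -> UTF-8 bytes."""
    return b"".join(utf8_encode(ord(ch)) for ch in data.decode(codepage))


print(ansi_to_utf8(b"caf\xe9"))
print([hex(unit) for unit in utf16_units(0x1D11E)])  # musical G clef, outside the BMP
```

A production version would also need the reverse direction and proper rejection of malformed input, which this sketch leaves out.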

> 3) all Open Sources should be written exclusively in
> english (so the unicode, in RosAsm should be only
> for Ustrings Declarations, IMO...)

This is a good point; Source code should be readable by any
other "open source" developer without problem...

But, think about it...you're not going to change the RosAsm
directives into Thai or Japanese, are you? And if you have some
rule about "identifiers can only have A..Z, a..z, 0..9,
underscore in them" then this doesn't change...

That is, just because you're accepting Hebraic characters
doesn't mean you have to translate RosAsm itself into Hebrew
keywords...a _PROGRAMMING LANGUAGE_ might be "influenced" by,
say, English (like most are...such as BASIC, Pascal, C, etc. as
well as Intel's standard "mnemonics" :)...but they _aren't_
actually English...the "language" of RosAsm is "RosAsm"...that
doesn't have to be translated at all...hence, all the source
code can remain strict ASCII and use English-based keywords and
so forth...

Where the UNICODE support would be used and useful is in
allowing, for instance, a Russian programmer making a program
with a Russian user interface to type in the _character strings_
(just "raw data" from the perspective of a program...passed
through from the source into the output file "as is" ;) in the
Russian language...

Now, of course, this is a touch inconvenient for non-Russian
speakers but, then, what else can you do? Insist that not only
all developers must speak English but that all users must speak
English and use only programs with English user interfaces? A
well-written "international" application would, of course, be
written in such a way that the program doesn't depend on what
the character strings actually are so they could be changed
easily...

But, then, RosAsm doesn't support "conditional assembly" yet,
does it? Perhaps you should concentrate on that first...because
this is where such a thing could be _very useful_...for example:

-----------------------------------------------

#ifdef FRENCH

    [ UserPrompt: "Ca va?", 0 ]

#elseifdef GERMAN

    [ UserPrompt: "Wie geht's?", 0 ]

#else // default: English

    [ UserPrompt: "How are you?", 0 ]

#endif

-----------------------------------------------

And then you can simply "#define FRENCH" or "#define GERMAN" at
the top of the source code (leave this out and it "defaults" to
English ;)...and then re-compile and the "conditional assembly"
will select out the right character string for that
language...so, then, you can use a single source file but
include character strings for lots of different languages...and
then it's as easy as "#define SPANISH" or "#define HEBREW" at
the top of the source code to switch between languages (command
line assemblers, of course, usually support some "define" option
on the command line that does a "#define" automatically while
compiling that you could select it from the command-line...but
as RosAsm works from the GUI only then you could have some
"defines" listbox or the programmer puts them manually in at the
top of the file (which isn't too difficult, anyway ;)...you
can't do what I tend to do with something like this, which is to put
it all in a batch file and then invoke a batch of compiles with
each different option selected so that you can generate each
language's EXE with a single batch file...but that's one of the
things about going GUI without command-line support...it's no
big deal, though, really :)...
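
To show how little machinery the selection step needs, here is a sketch of a one-level "#ifdef" filter in Python. The directive names are the ones from the illustration above and are NOT real RosAsm syntax; `preprocess` and `SOURCE` are invented names, and nesting is deliberately not handled.

```python
def preprocess(source: str, defines: set) -> str:
    """Keep only the lines selected by #ifdef / #elseifdef / #else / #endif."""
    out = []
    in_block = False   # inside an #ifdef ... #endif group
    taken = False      # some branch of the current group already matched
    emitting = True    # lines are currently copied to the output
    for line in source.splitlines():
        words = line.split()
        head = words[0] if words else ""
        if head == "#ifdef":
            in_block = True
            taken = emitting = words[1] in defines
        elif head == "#elseifdef" and in_block:
            emitting = (not taken) and words[1] in defines
            taken = taken or emitting
        elif head == "#else" and in_block:
            emitting = not taken
            taken = True
        elif head == "#endif" and in_block:
            in_block, taken, emitting = False, False, True
        elif emitting:
            out.append(line)
    return "\n".join(out)


SOURCE = """\
#ifdef FRENCH
    [ UserPrompt: "Ca va?", 0 ]
#elseifdef GERMAN
    [ UserPrompt: "Wie geht's?", 0 ]
#else // default: English
    [ UserPrompt: "How are you?", 0 ]
#endif
"""

# One pass per language, like a batch file running one build per define.
for language in ("FRENCH", "GERMAN", None):
    defines = {language} if language else set()
    print(language or "default", "->", preprocess(SOURCE, defines).strip())
```

The loop at the end plays the part of the batch file: one pass per define, one output per language.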

This is the kind of thing where "conditional assembly" could be
really useful to have perhaps _before_ multi-lingual
support...but, of course, I'm just talking this through...do
whatever you feel is right, of course...but, well, you know my
mantra: "I believe in choice...but a choice is not a choice
unless it's _INFORMED_"...merely providing the information so
you can choose what makes the best sense for you :)...

> So, my only poor plan about this would rather be to
> develop a Tool (for The [Tool] Menu), based on an
> Edit Control for translating Oriental Strings to flows
> of Words, to be pasted to a dWords Declaration instead
> of "U$"... Pathetic, as this would not enable oriental
> users with Unicode Comments, but simple and secure.
> (very temporary idea, anyway... :)

Yeah, well..."temporary support (to be continued)" is better
than nothing at all, right? :)

Anyway, allow me to refresh my memory here...Windows resources
are stored in the PE file in _UNICODE_, aren't they? And, in
Windows, if you were supporting many languages then the
"Microsoft approved" way is to create a "STRINGTABLE" for each
language and mark it with the right "language identifier" (then
Windows _automatically_ grabs the "STRINGTABLE" with the same
"locale id" as the machine is running or chooses "neutral /
neutral" if there's no strings that match in the resources :)...

So, really, what might make the most sense is to improve the
"string" resource editor to allow for a "language identifier" to
be attached to the strings...then, the same string ID is given
to more than one string (in different languages) but with
different "language IDs"...Windows will understand this to mean
that you have the same resource translated into all the
different languages and even handles automatically pulling out
the right version of the string by looking for matches of
"language ID"...this is the way Microsoft "approve" of doing the
"Multi-lingual internationalisation" thing, which is really why
they have the "string" resource to separate out the strings from
the program that they can be easily translated to other
languages to create "localised" versions of the same program for
users with different languages...
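
The lookup described here can be modelled in a few lines. This is only a sketch of the "exact match, else neutral" rule as stated above; the real Windows resource loader has more fallback steps than this. `IDS_PROMPT`, `STRINGS` and `load_string` are invented names; the language identifiers are the standard Windows values for those locales.

```python
LANG_NEUTRAL = 0x0000
LANG_EN_US = 0x0409   # English (United States)
LANG_DE_DE = 0x0407   # German (Germany)
LANG_FR_FR = 0x040C   # French (France)
LANG_JA_JP = 0x0411   # Japanese (Japan)

IDS_PROMPT = 1  # hypothetical string ID shared by every translation

# One entry per (string ID, language ID) pair, like one STRINGTABLE per language.
STRINGS = {
    (IDS_PROMPT, LANG_NEUTRAL): "How are you?",
    (IDS_PROMPT, LANG_FR_FR): "Ca va?",
    (IDS_PROMPT, LANG_DE_DE): "Wie geht's?",
}


def load_string(string_id: int, lang_id: int) -> str:
    """Return the string for lang_id, falling back to the neutral entry."""
    try:
        return STRINGS[(string_id, lang_id)]
    except KeyError:
        return STRINGS[(string_id, LANG_NEUTRAL)]


print(load_string(IDS_PROMPT, LANG_FR_FR))   # a French machine gets the French text
print(load_string(IDS_PROMPT, LANG_JA_JP))   # no Japanese entry, so the neutral one
```

The program only ever asks for `IDS_PROMPT`; which translation comes back is decided by the table, not by the code.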

Mind you, if you took the "improve the string resource editor"
method here, then you'd need to re-write "B_U_ASM", which says:

"Strings in Resources are rarely used because they are at least
double size (Unicode storage + Resources tree headers + dummy
Strings, when required) and because they are slower access. So,
you should never consider Resources as a good place where to
store your Applications Strings. Usual Memory is the preferred
place for this."

Because you wouldn't want your RosAsm users to lose confidence
in you by "twigging" (sorry, that's slang: "working out /
realising", it basically means ;) that you never actually ever
understood what the "string" resources were really about, right?
And, also, in that same spirit, I also won't take you to task
over how you're kicking up a big fuss about a few bytes here and
there in character strings bloating file size...but whenever
someone shows RosAsm can't get the smallest file sizes, you
always tell us: "file sizes don't matter"...instead, I'll keep
quiet and await the day when your arguments will finally be
_consistent_ that we can actually consider discussing things
_properly_...

Beth ;)

