Unicode Support



Hi Everyone,

Something that has been bugging me since I started on my own
compiler/assembler is Unicode support. Not the APIs or libraries, or
how Unicode works (it's quite simple once you get your head around it),
but the fact that most assemblers (if not all), and most HLL
compilers (that I've used), still require the source code to be 8-bit
ASCII text (plus code pages for anything beyond it).

So for Randy, Rene, etc., are there any plans to allow source code in
UTF-16 format for your compilers/assemblers? E.g. to allow symbols/labels
to contain non-ASCII characters, and to allow easier Unicode string
support from within the source file itself?

In other words, to be able to support source code like this:

<code = FASM>
org 100h

старт:
mov ah, 9 ; DOS function 09h (print "$"-terminated string) is selected via AH
mov dx, шнур
int 21h
ret

шнур du "여보세요 세계"
db "$"
шнур2 du "Γειάσου κόσμος"
db "$"
</code>
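
To make the intent of `du` concrete, here is a rough Python sketch (not FASM, and not a claim about how any existing assembler behaves) of the bytes I would expect for the first string. It assumes `du` emits UTF-16 little-endian code units with no byte-order mark; the helper name `emit_du` is made up for illustration.

```python
def emit_du(text: str) -> bytes:
    """Hypothetical stand-in for a `du` directive.

    Assumption: each character becomes one 16-bit little-endian code
    unit (two units for characters outside the BMP), with no BOM.
    """
    return text.encode("utf-16-le")


data = emit_du("여보세요 세계")

# Split the byte stream back into 16-bit code units for inspection.
units = [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

print(len(data), "bytes,", len(units), "code units")
print(" ".join(f"{u:04X}" for u in units))
```

The seven characters here are all in the BMP, so the string takes 14 bytes. A character outside the BMP would take two code units (a surrogate pair), which is one of the things an assembler has to decide how to count.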

Obviously all directives and operands should remain as they are (in
English, as defined by Intel/AMD), but it would be nice to have true
support for user-defined labels and strings.

<mini rant>
It is 2005 and most modern OSes support Unicode, so why do the
base tools we use still insist on ASCII source code? We all
want the "viva asm revolution" to happen, but one thing IMHO that we
are lacking is UTF-16 support for source code. Would it give us a one-up on
common HLLs? I don't know, but it would make asm more accessible
to users around the world.
</rant>

PS. If your assembler already supports UTF-16 source code, I
would be very interested in hearing about the challenges of
implementing Unicode support. In particular, did you limit numbers to
the Western digits 0..9, or did you allow other digit sets
(e.g. Arabic-Indic, many of the Asian sets, etc.)? Did you limit the
valid range of characters for labels to the BMP (the first 64K code
points only), or did you allow the full range (17 planes, about 1.1
million code points)? How did you handle compatibility equivalents and
combining characters? What about UTF-8 vs UTF-16 vs UTF-32?
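
For what it's worth, here is how I picture those choices inside a lexer, as a rough Python sketch rather than anything a real assembler does. All function names are hypothetical, and the policies shown (NFC normalization for labels, ASCII-only digits, an optional BMP restriction) are assumptions for illustration, not recommendations from the Unicode spec.

```python
import unicodedata


def normalize_label(name: str) -> str:
    """Assumed policy: fold canonically equivalent spellings to NFC,
    so a precomposed letter and base letter + combining mark compare equal."""
    return unicodedata.normalize("NFC", name)


def is_bmp_only(name: str) -> bool:
    """True if every code point fits in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in name)


def is_ascii_number(token: str) -> bool:
    """Assumed policy: numeric literals may only use the digits 0..9."""
    return token != "" and all(ch in "0123456789" for ch in token)


def encoded_sizes(text: str) -> dict:
    """Byte counts of the same text in the three encoding forms (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Two spellings of the same label: precomposed vs. combining acute accent.
same = normalize_label("caf\u00e9") == normalize_label("cafe\u0301")

# U+0663 is an Arabic-Indic digit: Unicode calls it a digit, this policy does not.
arabic_three = "\u0663"
accepted = is_ascii_number(arabic_three)

# U+1D11E (musical G clef) lies outside the BMP.
clef_in_bmp = is_bmp_only("\U0001D11E")

print(same, arabic_three.isdigit(), accepted, clef_in_bmp)
print(encoded_sizes("шнур"))
```

Without some normalization step, two labels that look identical on screen could end up as different symbols, which is the main reason I ask about combining characters.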

PPS. I know the DOS API doesn't support Unicode strings; I just used
it for the example.

PPPS. The full Unicode 4.1 spec can be downloaded as PDFs from
www.unicode.org.

PPPPS. I use jEdit as my preferred text editor. (It's pure Java, so it
should run on any Java-enabled platform, and it supports UTF-16 natively.)
