Re: Unicode Support
- From: websnarf@xxxxxxxxx
- Date: 19 Apr 2005 00:39:04 -0700
Chewy509@xxxxxxxxxxxxxxxx wrote:
> Something that has been bugging me, since I started on my own
> compiler/assembler is unicode support. Not the API's or libraries, or
> how unicode works (it's quite simple once you get your head around
> it),
Are you sure about that?
> but the fact that most assemblers (if not all assemblers), and most
> HLL Compilers (that I've used) still require the source code to be
> 8bit ASCII. (and the use of code-pages).
>
> So for Randy, Rene, etc, are there any plans to allow source code in
> UTF-16 format for your compilers/assemblers? eg to allow
> symbols/labels to contain non-ASCII characters, and allow easier
> unicode string support from within the source file itself?
I would *HIGHLY* recommend staying away from UTF-16 for raw source
data. Its properties are worse than those of UTF-8, which directly
supports 7-bit ASCII as a proper subset.
OTOH, it does make sense to support string constants that are emitted
as UTF-16 using some special mode like:
dw UTF16"Some data",0
or whatever syntax you feel like (is there a useful standard here?). The
point being that the string would be stored in Windows-compatible UTF-16.
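As a rough sketch of what such a directive could expand to (the UTF16"..." syntax is just my suggestion above, and the helper names here are made up for illustration), this turns a string into little-endian 16-bit words plus a terminator:

```python
import struct


def utf16_words(text, terminator=True):
    """Return the 16-bit code units for a UTF-16 string constant."""
    raw = text.encode("utf-16-le")  # explicit little-endian, no BOM
    words = list(struct.unpack("<%dH" % (len(raw) // 2), raw))
    if terminator:
        words.append(0)  # the trailing ",0" in the dw line
    return words


def as_dw_line(text):
    """Render the constant the way a plain 'dw' line would spell it."""
    return "dw " + ",".join("0x%04X" % w for w in utf16_words(text))


print(as_dw_line("Some data"))
```

Characters outside the BMP come out as two words (a surrogate pair), so the word count is not the character count.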
UTF-16 is basically the "American" approach to Unicode: it was backed by
Sun, Microsoft and IBM, who, insanely, thought 16 bits was good enough
for all characters. That of course came as a bit of a surprise to the
Chinese/Japanese/Koreans, and surrogates were added in *afterwards* to
deal with it. UTF-8 is the more international approach (although it was
actually invented by Ken Thompson): the point of it is that it naturally
allows for a much larger encoding range than the original American
Unicode, so it better supports the original ISO 10646 standard, which
eventually just got folded into the Unicode standard.
- UTF-16 comes in two byte orders, so unless the endianness is fixed out
of band (as with UTF-16LE/UTF-16BE), each string needs a "BOM" at the
beginning to identify it. That means that if you split a string into
two pieces and transmit them separately, you end up gaining an
additional "BOM" character for the second piece.
- UTF-8 directly supports ASCII encoding as a sub-mode of its encoding.
I.e., normal ASCII encoded text is *already* UTF-8 compatible.
Certain ASCII functions like changing English text case, or searching
for ASCII characters can be done directly on UTF-8 data. So '\0'
termination, tabs, or things like CR and LF don't have strange
embodiments or representations, even when viewed with ASCII tools.
UTF-8 encodings are also easy to learn to recognize on sight, even with
ASCII tools.
- UTF-8 can be "resynched" even if a transfer channel is corrupted, or
if you start from the middle of the string, after only a very short
byte scan (at most 3 bytes with the current 4-byte maximum, 5 under the
original 6-byte definition). Compare this to UTF-16, where if you
suddenly become offset by one byte due to some flaw/error, you can have
no idea that your data is all corrupted for an unbounded length of time.
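The three points above can be tried out with a few lines of Python. This is only a sketch; the function names are mine, and the BOM behaviour shown is that of Python's generic "utf-16" codec, which writes a BOM on every encode:

```python
def find_ascii(data: bytes, ch: str) -> int:
    """Plain byte search for an ASCII character in UTF-8 data.

    Bytes below 0x80 never occur inside a multi-byte UTF-8 sequence,
    so a hit is always a real occurrence of that character.
    """
    return data.find(ch.encode("ascii"))


def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte.

    Continuation bytes look like 10xxxxxx; anything else starts a new
    character (or is plain ASCII).
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


def join_bom_pieces(text: str, cut: int) -> str:
    """Encode two halves separately as BOM-prefixed UTF-16, then decode
    the concatenation as a single stream."""
    pieces = text[:cut].encode("utf-16") + text[cut:].encode("utf-16")
    return pieces.decode("utf-16")


utf8 = "na\u00efve \u20ac5\tend".encode("utf-8")
tab_at = find_ascii(utf8, "\t")        # byte offset of the tab
restart = resync(utf8, 8)              # offset 8 is inside the euro sign
tail = utf8[restart:].decode("utf-8")  # decodes cleanly from the boundary

# UTF-16 shifted by one byte still decodes, just to the wrong characters.
shifted = "ABC".encode("utf-16-le")[1:-1].decode("utf-16-le")

# The second piece's BOM survives as a stray U+FEFF in mid-string.
joined = join_bom_pieces("Hello", 2)
```

Note that the shifted UTF-16 decode raises no error at all, which is exactly the problem: nothing in the data tells you it is misaligned.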
Not surprisingly, Microsoft supports UTF-16 pervasively.
> <mini rant>
> Since we are now in 2005, most modern OS's support unicode, why
> do the base tools we use, are still insisting on ASCII source code?
> We all want the "viva asm revolution" to happen, but one thing IMHO
> that we are lacking is UTF-16 support for sourcecode. Would it give
> a one-up on common HLL's. Well I don't know, but it will make asm
> more accessible to more global users around the world.
> </rant>
Because Unicode support is harder than you think.
Read the documentation on normalization. This is important, because
comparison of two Unicode strings does *NOT* reduce to simply comparing
the raw byte data. The Unicode encoding is actually redundant: the same
text can be written as more than one code point sequence. You obviously
have uniqueness requirements for things like labels and variable names
(you can't declare the same label twice in the same scope, right?). So
if someone renders the same Unicode string using two different code
point sequences, you have to have correctly functioning equality
testing. That is a little bit complicated, and it depends on the
UnicodeData.txt file from the Unicode standard, which usually changes,
if even slightly, with every Unicode update.
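To make the label problem concrete, here is a minimal sketch of a symbol table that normalizes names before comparing them. The SymbolTable class is hypothetical, and picking NFC as the normalization form is my assumption; a real assembler would have to decide which form (and which Unicode version) it commits to:

```python
import unicodedata


def label_key(name: str) -> str:
    """Canonical form used for label comparison (NFC is an assumption)."""
    return unicodedata.normalize("NFC", name)


class SymbolTable:
    """Toy symbol table that treats canonically equivalent names as equal."""

    def __init__(self):
        self._symbols = {}

    def define(self, name, value):
        key = label_key(name)
        if key in self._symbols:
            raise ValueError("duplicate label: %r" % name)
        self._symbols[key] = value

    def lookup(self, name):
        return self._symbols[label_key(name)]


precomposed = "caf\u00e9"   # e-acute as one code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

table = SymbolTable()
table.define(precomposed, 0x1000)
same_raw = precomposed == decomposed   # raw comparison says "different"
found = table.lookup(decomposed)       # normalized lookup finds the label
```

The results depend on the Unicode data shipped with your Python build, which is the same versioning headache in miniature.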
All this being said, Java, I believe, just supports UTF-16 based
Unicode right out of the box.
> PS. If your assembler already supports UTF-16 based source code, I
> would be deeply interested in hearing about some of the challenges
> in implementing unicode support. In particular, did you limit
> numbers to the western 0..9 figures, or did you allow other
> numbers to be included, eg arabic, many of the asian sets, etc.
> Did you limit to valid range of characters to the BMP (the first
> 64K characters only), or did you allow for the full range of
> characters (1024K characters) for labels. How did you handle
> compatible encodings, and combining characters? What about UTF-8
> vs UTF-16 vs UTF-32?
UTF-32 is mostly useful as an internal format for programs. I.e., I
don't think supporting it for source code encodings is worthwhile at
all (since it's so inefficient). But for data encodings, I would
recommend supporting all three of them (since programmers may want to
use any of the modes in their programs).
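For a feel of the size trade-off between the three data encodings, here is a small sketch. The UTF8/UTF16/UTF32 mode names are hypothetical, modelled on the UTF16"..." idea earlier in this post, and little-endian output without a BOM is assumed:

```python
# Hypothetical directive prefixes mapped to Python codec names.
ENCODINGS = {"UTF8": "utf-8", "UTF16": "utf-16-le", "UTF32": "utf-32-le"}


def encode_constant(mode: str, text: str) -> bytes:
    """Encode a string constant for the given (hypothetical) mode."""
    return text.encode(ENCODINGS[mode])


def sizes(text: str) -> dict:
    """Byte length of the same text in each of the three encodings."""
    return {mode: len(encode_constant(mode, text)) for mode in ENCODINGS}


ascii_sizes = sizes("Some data")
# 'A', e-acute, euro sign, and a musical symbol from outside the BMP.
mixed_sizes = sizes("A\u00e9\u20ac\U0001D11E")
print(ascii_sizes)
print(mixed_sizes)
```

For plain ASCII text UTF-32 takes four times the space of UTF-8, which is why I would not bother with it for source files.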
> PPS. I know the DOS API doesn't support unicode strings, but just
> used it for the example.
Well, that's actually a kind of non-trivial point. If you support
Unicode as a datatype (no reason why you couldn't), there is the
question of which APIs you intend to pass this data around in.
> PPPS. The full Unicode 4.1 spec can be downloaded as PDF's from
> www.unicode.org.
Yeah, and so can versions 4.0, 3.1, 3.0, ... etc. Taking a step back, one
of the real problems with Unicode is that its rate of evolution is
unusually high for such an important and universal standard.
--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/