Re: Unicode Support



Beth wrote:
> Chewy wrote:
> > Hi Beth,
> > Some of the other issues, such as non-english labels and
> > numbers, can be a pain in the arse. Numbers not so much (AFIAK most

> > of the world recognises the 0..9 numerical set),
>
> Well, sort of...Tamil has characters for "ten", "one hundred" and
> "one thousand", as well as _nine digits_ (no zero!! They've not been
> watching that Stargate episode! ;)...but, yeah, numbers aren't so
> bad...

There are many different numbers in different alphabets.

> > but labels. The biggest issue for me, would be combinations or
> > equivalence. How should those be handled?
>
> There are "rules" for "normalising" strings in the UNICODE book...in
> fact, some UTF-8 sequences are actually delibrately "invalid" so that

> only one way to access a character is allowed and "overlong forms"
> (such as using an "escape sequence" but accessing an ordinary ASCII
> character) are considered "invalid"...

These are just code point representation rules. Beyond that, there are
many illegal, and unassigned code points. Also there are nonsensical
code point expressions like just a singleton "grave" accent (no
language accepts that alone as a character in their alphabet).

> Follow these "rules" to put the string into a "normalised" form (that

> is, there is only one valid way to address any particular character
> :) and then there should be no problems with things like testing for
> equivalence between two strings...

Uhh ... normalized means something different. Unicode normalization is
not the only way a Unicode string should exist, its just a theoretical
form from which you can compare strings. There are two main endorsed
normalization forms -- compatible and canonical. They specify a
mechanism for rearranging and mapping of code point sequences to a more
unified version.

For example, there are letters which can take multiple accents. And
you can often specify them as base-char + accent1 + accent2. The
question is, if you flip the order of the accents, does that represent
a different character or not? The unicode normalization algorithms say
no for some cases, and yes in others.

> > Well, it appears that the answer (for LuxAsm) is don't. Just
> > have labels/identifiers in the normal 7bit ASCII range as already
> > defined by most other assemblers/HLLs.
>
> Basically, yes...
>
> Though, in principle, I don't have a problem with the idea of also
> applying it to the labels / identifiers too...but, in practice, a
> lot of effort for what exactly?

A lot of effort for cornering every market outside of english. If
programmers from non-english speaking countries could code in their own
native tongue, it would probably be very attractive to them.

> In fact, my little "test" which demonstrates that NASM _already_
> deals with UTF-8 comments and strings, proves the point a
> different way...some tools _already_ are "accidentally" supporting
> UTF-8 to that "basic level", without even knowing it...

Well that's kind of the point of UTF-8. For certain semantics, you get
it for free, simply by allowing characters in the range 0x80 - 0xF8 in
your tools. But of course, comments, and embedded strings in this case
are easy because, comments don't need any checking, and assembly
strings are essentially byte sequences which just correspond to raw
data (legality or not is not something that needs to be imposed or
checked).

> [...] and the fact that no-one has actually realised
> this (I didn't either until I thought it would be interesting to
> see what NASM would actually do ;), shows how great the "demand"
> is...if people were regularly wanting UTF-8 source files passed
> through NASM, then your post should have had Frank shouting
> "NASM already does it!!"...

Well hang on. If you can't make identifiers rendered in UTF-8 (or
possibly UTF-16) with proper normalization, then I wouldn't call that
real unicode support.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

.



Relevant Pages

  • Re: Unicode Support
    ... > Not knowing much about UTF-8 (my Unicode knowledge extends as far as ... > literal strings of this form as long as the character code for quote ... > can never appear in a MBCS (multibyte character sequence). ... then XP Notepad directly understands UNICODE and you can ...
    (alt.lang.asm)
  • Re: Need help on string manipulation
    ... better to convert strings to UCS-32 before manipulation? ... Characters represented by wchar_t must use one wchar_t per character, ... which may use a multibyte encoding. ... use some newer Unicode characters, if this is a problem for you, then ...
    (comp.lang.c)
  • Re: Copying string to byte array
    ... of Strings and the CryptEncrypt + CryptDecrypt APIs. ... binary data should not be held in String variables. ... a) not all character codes are valid in a given ...
    (microsoft.public.vb.general.discussion)
  • Re: GetOpenFilename()
    ... Specifies that the File Name list box allows multiple selections. ... the directory and file name strings are NULL ... You advance the pointer one character more. ...
    (microsoft.public.vc.language)
  • Re: ToUpper() Better solution
    ... Looking into it, I see a simpler method: After normalization, use ... characters (which will be the accents) with the empty string. ... every possible character, where it defaults to the Unicode character, ...
    (microsoft.public.dotnet.general)