Re: Unicode Support



websnarf@xxxxxxxxx wrote:
> Chewy509@xxxxxxxxxxxxxxxx wrote:
> > Something that has been bugging me, since I started on my own
> > compiler/assembler is unicode support. Not the API's or libraries, or
> > how unicode works (it's quite simple once you get your head around
> > it),
>
> Are you sure about that?

Having read the spec a few times, I don't think it's that hard
(although combining characters and normalisation do seem to be the most
complex issues).

> > but the fact that most assemblers (if not all assemblers), and most
> > HLL Compilers (that I've used) still require the source code to be
> > 8bit ASCII. (and the use of code-pages).
> >
> > So for Randy, Rene, etc, are there any plans to allow source code in
> > UTF-16 format for your compilers/assemblers? eg to allow
> > symbols/labels to contain non-ASCII characters, and allow easier
> > unicode string support from within the source file itself?
>
> I would *HIGHLY* recommend staying away from UTF-16 for raw source
> data. Its properties are worse than those of UTF-8. UTF-8 gracefully
> directly supports 7-bit ASCII as a proper subset.
>
> OTOH, supporting string constants that are specified in UTF-16 using
> some special mode like:
>
> dw UTF16"Some data",0
>
> or whatever you feel like (is there a useful standard here?) The point
> being that the string would be stored in Windows compatible UTF-16.
>
> UTF-16 is basically the "american" approach to Unicode (because it was
> backed by Sun, Microsoft and IBM, but insanely, they thought 16 bits
> was good enough for all characters -- which of course came as a bit of
> a surprise to the Chinese/Japanese/Koreans; surrogates were added in
> *afterwards* to deal with this), while UTF-8 is the more international
> approach (although it was actually invented by Ken Thompson -- the
> point of it is that it naturally allows for a much larger encoding
> range than the original american Unicode, which better supports the
> original ISO10646 standard which eventually just got folded into the
> Unicode standard).

All the encoding forms, whether UTF-8, UTF-16 or UTF-32, cover the same
range of code points (U+0000 .. U+10FFFF, i.e. 17 planes of 64K); only
the way they are held in memory or on disk differs. I wouldn't call
UTF-8 "the more international approach", but rather the "let's add
backwards compatibility for those that still believe ASCII is the only
real encoding format" approach.
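
To illustrate the point that the three forms differ only in layout, here
is a rough sketch in Python (the function name is mine, purely for
illustration). It assumes little-endian variants so no BOM gets in the
way, and just checks that a code point survives a round trip through
each form and reports how many bytes each one needed:

```python
def round_trip_sizes(code_point):
    """Encode one code point in each form, check it decodes back, and
    return the encoded size in bytes per form."""
    ch = chr(code_point)
    sizes = {}
    for name in ("utf-8", "utf-16-le", "utf-32-le"):
        data = ch.encode(name)
        if data.decode(name) != ch:
            raise AssertionError("round trip failed for " + name)
        sizes[name] = len(data)
    return sizes


if __name__ == "__main__":
    # 'A', the euro sign, and the highest code point any form can hold.
    for cp in (0x41, 0x20AC, 0x10FFFF):
        print(f"U+{cp:04X}", round_trip_sizes(cp))
```

The storage cost varies per form, but the set of representable code
points is the same.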

> - UTF-16 requires a "BOM" at the beginning of any string to distinguish
> endianness, which means that if you split a string into two pieces,
> then transmit them, you end up gaining an additional "BOM" character
> for the second piece.

A BOM is not necessary when the encoding is known or dictated by the
implementation. Since assemblers generally run on the target CPU (I
know there are exceptions, such as cross-compilers and
cross-assemblers), it is fair to say that the byte order can be
dictated by the assembler, whether that be UTF-16BE or UTF-16LE (LE for
x86).
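
A small Python sketch of what I mean (the helper name is made up, and
Python's codecs are only standing in for whatever an assembler would do
internally). With the generic "utf-16" label the encoder has to emit a
BOM; once the byte order is fixed as UTF-16LE there is none, so splitting
a string and sending the pieces separately adds nothing:

```python
import codecs

text = "mov"

# Generic label: the encoder prepends a BOM in the machine's byte order.
with_bom = text.encode("utf-16")

# Byte order dictated up front: no BOM is written at all.
le_only = text.encode("utf-16-le")


def encode_in_pieces(s, split_at):
    """Encode two halves separately as UTF-16LE and concatenate them."""
    return s[:split_at].encode("utf-16-le") + s[split_at:].encode("utf-16-le")


if __name__ == "__main__":
    print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
    print(le_only)
    print(encode_in_pieces(text, 1) == le_only)
```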

> - UTF-8 directly supports ASCII encoding as a sub-mode of its encoding.
> I.e., normal ASCII encoded text is *already* UTF-8 compatible.
> Certain ASCII functions like changing english text case, or searching
> for ASCII characters can be done directly on UTF-8 data. So '\0'
> termination, tabs, or things like CR and LF don't have strange
> embodiments or representations, even when viewed with ASCII tools.
> UTF-8 encodings are also easy to learn to recognize on sight, even with
> ASCII tools.

Granted.
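
For anyone following along, a quick Python sketch of why this works (the
function names are mine, just for illustration): every byte of a
multi-byte UTF-8 sequence has its high bit set, so byte-level ASCII
operations such as searching for a tab or upper-casing a-z cannot touch
the non-ASCII characters by accident.

```python
def ascii_upper(data):
    """Upper-case only the bytes 'a'..'z' in UTF-8 data, leaving every
    byte of a multi-byte sequence (all >= 0x80) untouched."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


def find_ascii(data, ch):
    """Byte offset of an ASCII character in UTF-8 data, or -1."""
    return data.find(ch.encode("ascii"))


if __name__ == "__main__":
    line = "na\u00efve\tdb 0\r\n".encode("utf-8")
    print(ascii_upper(line).decode("utf-8"))
    print(find_ascii(line, "\t"))
```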

> - UTF-8 can be "resynched" even if a transfer channel is corrupted, or
> if you start from the middle of the string after only a very short (I
> think at most 5) character scan. Compare this to UTF-16, where if you
> suddenly become offset by one character due to some flaw/error, you
> will have no idea that your data is all corrupted for an unbounded
> length of time.

Are you sure?

Resynchronising UTF-16 is easier than UTF-8 (it requires at most 1
backstep), and the problem doesn't apply to UTF-32 at all. The way
surrogate pairs work in UTF-16, once a unit is known to be a surrogate
it's just a quick test of one bit to determine whether it is the first
or second unit of the pair.

UTF-16 encodes 0-FFFF (excluding the surrogate range D800-DFFF) into 1
word, and 10000-10FFFF into 2 words as:

110110wwwwxxxxxx 110111xxxxxxxxxx

where wwww = (top 5 bits, i.e. bits 20..16) - 1
and xxxx = lower 16 bits.

So if the first 6 bits are 110110 then we have the first unit of a
pair, else if the 6 bits are 110111 then we have the second unit of a
pair. Any other value in those 6 bits tells me I have a normal
character.
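
As a rough sketch of that test in Python (function names are my own, and
it assumes the data is already aligned on 16-bit code units, i.e. it
does not address being off by a single byte):

```python
def classify(unit):
    """Classify one 16-bit code unit by its top 6 bits."""
    top6 = unit >> 10
    if top6 == 0b110110:
        return "lead"    # first unit of a surrogate pair
    if top6 == 0b110111:
        return "trail"   # second unit of a surrogate pair
    return "single"      # ordinary BMP character


def char_start(units, i):
    """Index where the character containing units[i] begins.
    At most one step back is ever needed."""
    if classify(units[i]) == "trail" and i > 0 and classify(units[i - 1]) == "lead":
        return i - 1
    return i


if __name__ == "__main__":
    # 'A', U+1F600 as a surrogate pair, 'B'
    units = [0x0041, 0xD83D, 0xDE00, 0x0042]
    print([classify(u) for u in units])
    print(char_start(units, 2))
```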

The two ranges covered by the surrogate pairing used in UTF-16 are
reserved ranges within the character set, so there can be no overlap
with assigned encodings.
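
The pair layout above can be sketched in Python as follows (names are
mine; this is an illustration of the bit layout, not code from any
assembler). Subtracting 0x10000 is the same thing as the "top bits - 1"
step, since it leaves a 20-bit value that is split 10/10 across the two
units:

```python
def encode_utf16(cp):
    """Return the list of 16-bit code units for one code point."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate range is reserved")
    if cp < 0x10000:
        return [cp]
    if cp > 0x10FFFF:
        raise ValueError("beyond U+10FFFF")
    v = cp - 0x10000                      # 20 bits remain
    return [0xD800 | (v >> 10),           # 110110 + high 10 bits
            0xDC00 | (v & 0x3FF)]         # 110111 + low 10 bits


def decode_pair(lead, trail):
    """Rebuild the code point from a lead/trail surrogate pair."""
    return 0x10000 + (((lead & 0x3FF) << 10) | (trail & 0x3FF))


if __name__ == "__main__":
    pair = encode_utf16(0x1F600)
    print([hex(u) for u in pair])
    print(hex(decode_pair(*pair)))
```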

> Not surprisingly, Microsoft supports UTF-16 pervasively.

Surprisingly, MS adopted Unicode *before* Unicode 1.0 was finalised
(which is what led to some of the quirks of the MS Unicode
implementation).

<snip>

> > PS. If your assembler already supports UTF-16 based source code, I
> > would be deeply interested in hearing about some of the challenges
> > in implementing unicode support. In particular, did you limit
> > numbers to the western 0..9 figures, or did you allow other
> > numbers to be included, eg arabic, many of the asian sets, etc.
> > Did you limit to valid range of characters to the BMP (the first
> > 64K characters only), or did you allow for the full range of
> > characters (1024K characters) for labels. How did you handle
> > compatible encodings, and combining characters? What about UTF-8
> > vs UTF-16 vs UTF-32?
>
> UTF-32 is mostly useful from a programming internal format. I.e., I
> don't think supporting it for source code encodings is worth while at
> all (since its so inefficient.) But for data encodings, I would
> recommend supporting all three of them (since programmers may want to
> use any of the modes in their programs).

I agree. UTF-32 for source would be a waste. Internally, it would help
somewhat, but I don't think the overhead for the <1% of cases would be
worth it. The Unicode standard does allow an implementation to handle
only the BMP (U+0000 .. U+FFFF), and still be conformant. Maybe
that's an option?
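
If an assembler did go BMP-only for labels, the check itself would be
trivial. A hypothetical sketch in Python (the function name and the idea
of reporting offenders as U+XXXX are my own, not from any existing tool):

```python
def outside_bmp(label):
    """List the characters of a label that fall outside U+0000..U+FFFF."""
    return [f"U+{ord(c):04X}" for c in label if ord(c) > 0xFFFF]


if __name__ == "__main__":
    for label in ("start_\u03a9", "x\U0001F600"):
        bad = outside_bmp(label)
        print(ascii(label), "ok" if not bad else "rejected: " + ", ".join(bad))
```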

> > PPS. I know the DOS API doesn't support unicode strings, but just
> > used it for the example.
>
> Well that's actually a kind of non-trivial point. If you support
> Unicode as datatype (no reason why you couldn't) there is the question
> of what APIs do you intend to pass this data around in?
>
> > PPPS. The full Unicode 4.1 spec can be downloaded as PDF's from
> > www.unicode.org.
>
> Yeah, so is version 4.0, 3.1, 3.0, ... etc. Taking a step back, one of
> the real problems with Unicode is that its rate of evolution is
> unusually high for such an important and universal standard.

However, the standard does make provisions for forward compatibility.
E.g. any implementation that is Unicode 3.0 conformant will also be
Unicode 4.x conformant. While it is an issue, I don't believe there is
enough risk involved to warrant spending too much time on it.

Darran (aka Chewy509).
