Re: Unicode Support




Beth wrote:


>
> In fact, my little "test" which demonstrates that NASM _already_ deals
> with UTF-8 comments and strings, proves the point a different way...some
> tools _already_ are "accidentally" supporting UTF-8 to that "basic level",
> without even knowing it...and the fact that no-one has actually realised
> this (I didn't either until I thought it would be interesting to see what
> NASM would actually do ;), shows how great the "demand" is...if people
> were regularly wanting UTF-8 source files passed through NASM, then your
> post should have had Frank shouting "NASM already does it!!"...but, I bet,
> not even the NASM developers have actually realised that it does already
> work to this "basic level"...indeed, they could cheekily add it to the
> "features list": "NASM has basic UTF-8 support!"...as if they actually
> "intended" it or something...shhh! Don't tell anyone! ;)...

Not knowing much about UTF-8 (my Unicode knowledge extends as far as
UTF-16 and that's about it), I would say that HLA v2.0 would handle
literal strings of this form as long as the character code for quote
can never appear in an MBCS (multibyte character sequence). HLA v1.x,
however, would not be happy with such characters, as Flex rejects all
character codes in the range $80..$FF out of hand.
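
For what it's worth, UTF-8 does meet that condition: every byte of a
multibyte sequence has its high bit set ($80..$FF), so the quote code
($22) can only ever be a real quote. Here is a rough sketch in Python
(not HLA or NASM code) of a byte-wise scan for the closing quote that
knows nothing about UTF-8. The names find_quote and
multibyte_bytes_are_high, and the sample "db" line, are made up purely
for illustration.

```python
QUOTE = 0x22  # ASCII code for the double-quote character


def find_quote(buf: bytes, start: int) -> int:
    """Byte-wise scan for the next quote byte at or after start.

    This is what a scanner unaware of UTF-8 would do. Returns -1 if
    no quote byte is found.
    """
    for i in range(start, len(buf)):
        if buf[i] == QUOTE:
            return i
    return -1


def multibyte_bytes_are_high(text: str) -> bool:
    """Check that every byte of every multibyte sequence is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


# Hypothetical source line holding a string literal with non-ASCII text:
# U+00EF is a Latin letter with a diaeresis, the \u65e5... run is CJK.
payload = "na\u00efve \u65e5\u672c\u8a9e"
source = ('db "' + payload + '", 0').encode("utf-8")

open_q = find_quote(source, 0)
close_q = find_quote(source, open_q + 1)
literal = source[open_q + 1:close_q].decode("utf-8")
```

The scan finds the same closing quote it would find in a pure ASCII
file, so the bytes between the quotes pass through untouched. That only
covers passing strings through, not doing anything with them at compile
time.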

There is, however, another issue that gets you into trouble with MBCS.
When you start adding sophisticated compile-time language facilities,
such as string functions, handling all the different character sets
becomes a nightmare. Then, in HLA's case, there is also the issue that
you need to provide standard library routine equivalents of string
functions for UTF-8 strings (you think zero-terminated strings are
painful to compute the length of? Try UTF-8!).
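
To make the length problem concrete, here is a small Python sketch
(again, not HLA library code; both function names are invented for the
example). It assumes the buffer holds valid UTF-8 and counts code
points by skipping continuation bytes, which all have the bit pattern
10xxxxxx. Note that it counts code points, not what a user would see as
"characters" once combining marks are involved.

```python
def zstr_byte_length(buf: bytes) -> int:
    """Length in bytes of a zero-terminated string (strlen-style)."""
    n = 0
    while n < len(buf) and buf[n] != 0:
        n += 1
    return n


def zstr_utf8_length(buf: bytes) -> int:
    """Length in code points of a zero-terminated UTF-8 string.

    Assumes valid UTF-8. Every byte still has to be visited, and each
    one is tested to see whether it is a continuation byte (10xxxxxx),
    which does not start a new code point.
    """
    count = 0
    i = 0
    while i < len(buf) and buf[i] != 0:
        if (buf[i] & 0xC0) != 0x80:
            count += 1
        i += 1
    return count


# "naive" spelled with U+00EF, which takes two bytes in UTF-8.
sample = "na\u00efve".encode("utf-8") + b"\x00"

byte_len = zstr_byte_length(sample)   # 6 bytes before the terminator
char_len = zstr_utf8_length(sample)   # 5 code points
```

The two lengths disagree as soon as a single non-ASCII character shows
up, so every routine that takes an index or a count has to say which
one it means.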

An assembler like NASM, that doesn't provide much in the way of
compile-time string handling, might actually get away with "accidental"
UTF-8 support. But when you've got a sophisticated macro system and
compile-time language, supporting MBCS turns out to be a *lot* of work.

Cheers,
Randy Hyde
