Re: Fast UTF-8 strlen function



Randy wrote:
> BTW,
> How do I print a UTF-8 string to the console in Linux? Does Linux
> automatically handle UTF-8 or is there a special API or handle I have
> to use?

No special API is needed: "xterm" handles UTF-8 automatically...just
"write" it to "stdout" in the usual way...
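
For anyone who wants the same idea from C rather than assembly, here is a
minimal sketch, assuming a POSIX system. The name "put_utf8" is made up for
illustration; the only real API used is the ordinary write() call. The loop
is there because write() is allowed to write fewer bytes than requested.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: send every byte of a NUL-terminated UTF-8 string
   to the file descriptor fd. Returns 0 on success, -1 on error. */
int put_utf8(int fd, const char *s)
{
    assert(s != NULL);

    /* strlen() counts bytes, which is exactly what write() expects. */
    size_t left = strlen(s);

    while (left > 0) {
        ssize_t n = write(fd, s, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal, try again */
            return -1;           /* real error, errno says which one */
        }
        s += n;                  /* skip what was actually written */
        left -= (size_t)n;
    }
    return 0;
}
```

Calling put_utf8(STDOUT_FILENO, text) with UTF-8 bytes in "text" is all
there is to it; whether the glyphs show up is then down to the terminal
and its fonts.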

[ The "locale" settings (the LANG / LC_* environment variables) select what
character set "xterm" uses...BUT these are usually set to "UTF-8" by
default...so, it already should be set to "UTF-8"...really, since UTF-8
support has been added, these other character set settings are just for
"compatibility"... ]
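
If you would rather check than trust the default, a program can ask the C
library which character set the current locale uses. This is only a sketch,
assuming a POSIX system with nl_langinfo(); the function names
"codeset_is_utf8" and "locale_is_utf8" are invented for the example.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Hypothetical helper: does this codeset name mean UTF-8?
   Both spellings are accepted, ignoring case. */
int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}

/* Hypothetical helper: adopt the locale from the environment
   (LANG, LC_ALL, LC_CTYPE) and report whether it uses UTF-8. */
int locale_is_utf8(void)
{
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

If locale_is_utf8() returns 0, the environment is asking for some other
character set, and UTF-8 output will probably come out as garbage.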

I'm pretty sure that the "pure text-mode" console doesn't handle it
directly, though (mind you, I say this without ever actually having tested
it, I admit)...that's a technical thing, of course, relating to how the
text-mode characters are stored and rendered...but "xterm" is okay because
it's "UTF-8 enabled" and runs in a graphics mode, so it is technically
able to render all those characters...

Did you see my "Japanese Hello World" example? It was in NASM and literally
was not an ounce different to an ordinary Linux "Hello, world!" example,
except that the comments and the text in the string to be printed were in
Japanese (thanks to Babelfish and a few Japanese websites for the
translations! ;)...

But, as I say, otherwise, there was absolutely no difference...I just sent
a UTF-8 string to "stdout" using the plain old "write" function...and, run
under "xterm", the characters appeared (actually, SOME of the characters
appeared - the "Hiragana" characters were okay but not the "ideographs" -
but this was a case of not having the fonts installed: the program itself
and xterm were working correctly...I just need some better Japanese fonts
if I want to write any more Japanese programs, it seems ;)...

XFree86 (and most of its standard applications like "xterm") has had UTF-8
support added...basically, it's much like Windows in that it's now all
"native"...just that it's the UTF-8 encoding that's "native" here, not the
"16-bit per character" encoding (I hesitate to say "UTF-16" in this case
because shouldn't Windows actually handle characters > 0FFFFh with that
"surrogate character" nonsense to technically be "UTF-16"? Chewy reports
that Windows doesn't actually do that and simply "crops" things off at
0FFFFh...a kind of "crippled UTF-16", it seems)...
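
For reference, the "surrogate" scheme itself is only a few lines of
arithmetic. This is a sketch of the standard UTF-16 encoding rule, not
anything Windows-specific; the function name "utf16_encode" is made up for
the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units stored in out (1 or 2),
   or 0 if cp is not a valid code point. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                    /* out of range, or a lone surrogate */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;       /* fits in a single 16-bit unit */
        return 1;
    }

    cp -= 0x10000;                   /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

Anything up to 0FFFFh takes one unit, as in the "16-bit per character"
view; only code points above that need the pair.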

In fact, grabbing it back from Google Groups, here's the program again:

------------------------------------

global _start

section .text

_start:
; ??????????????
;
mov eax, 4 ; 'sys_write' ???????
mov ebx, 1 ; ??????
mov ecx, HelloString ; ?????
mov edx, HelloLength ; ?????
int 80h ; Linux ??????????????

; ???????????
;
mov eax, 1 ; 'sys_exit' ???????
xor ebx, ebx ; ?????
int 80h ; Linux ??????????????

HelloString db "????????!", 10
HelloLength equ $ - HelloString

------------------------------------

...or, at least, I'm going to send this as UTF-8 to the newsgroup...and
we'll see whether the program makes it across "alive", eh?

I used HTML last time to ensure preservation of the characters BUT
Microsoft's implementation of HTML posts is awful: it sends the post
_twice_...once as "plain text" and once as HTML (presumably so non-HTML
browsers can read the plain text? Not that this would be at all useful in
this context)...

Also, last time I tried sending UTF-8, I suspect that I simply didn't
have it all set up correctly, in fact...so, this will confirm whether it
was just "bad settings" or whether, indeed, I can't actually post UTF-8 to
the group because some propagation software along the way "mangles" it
:)...

Oh, and I've changed it a little because "write" clearly must take "count
of bytes", not "count of characters"...though, I've not actually tested the
code (but, well, the "$ - HelloString" length trick has worked before, so
unless I've typo'd or something silly like that here, it should work again,
without me needing to explicitly test it out :)...
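
To make the bytes-versus-characters difference concrete, here is a small
sketch in C. The function name "utf8_chars" is invented for the example,
and it assumes the input is valid UTF-8. It counts characters by skipping
continuation bytes (the ones of the form 10xxxxxx), while plain strlen()
gives the byte count that "write" needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a NUL-terminated,
   valid UTF-8 string. Every byte that is NOT a continuation byte
   (10xxxxxx) starts a new character. */
size_t utf8_chars(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

As an illustration (the sample is mine, not the string from the program
above): the five hiragana U+3053 U+3093 U+306B U+3061 U+306F take three
bytes each in UTF-8, so strlen() reports 15 while utf8_chars() reports 5.
Pass the 15 to "write", never the 5.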

But, as you can see, I'm not doing anything "special" in this
program...it's a normal, simple Linux "Hello, world!", except for the
Japanese string and comments...

Indeed, basically, I had no idea if it would work or not...but, I thought,
"only one way to find out!" and plunged into writing the program "blind"...

And, yeah, NASM "ate" the UTF-8 strings just fine...and "xterm" printed out
the Japanese just fine too (well, apart from the "missing fonts" issue on
the "World!" part, which is ideographic, not syllabic like the "Hello"
part is :)...but it was correctly showing the right number of "missing
glyph" boxes, so it _was_ interpreting those characters correctly...just a
case of "missing fonts" there :)...

Nothing "special" required at all...

????,
Beth :)

P.S. If some of these characters don't transfer, apologies...I've found a
new "setting" in my newsreader for UTF-8...so, though my last "test"
failed miserably, I _MIGHT_ - just might - have found the "magic option" to
make it work...well, one can but Hope, anyway :)...


