Re: Fast UTF-8 strlen function
- From: "Beth" <BethStone21@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Thu, 12 May 2005 00:29:01 GMT
Randy wrote:
> Sevag Krikorian wrote:
> > If you're going with new library routines, why not use UTF-32
> > instead?
>
> UTF-32 is an *okay* internal format, but AFAICT it's not widely
> accepted as an external format.
Yeah; basically because it's "too fat"...4 bytes per character is a bit
"steep" a price to pay, especially if you're not interested in the rarer
ideographs and historic scripts, because that's about all that's up there in
the >64K range...so, that's, like, 2 "pad bytes" on every character...
Worse, UNICODE say they are only ever interested in going up to some 2^20
characters (U+10FFFF, strictly, which is 21 bits), anyway...so, that's over
a byte per character used for absolutely nothing but "alignment", really...
[ Indeed, perhaps a UTF-24 - three bytes per character - could be added? It
isn't one of the UNICODE standard encodings, in fact, and, yeah, that's a
little "unusual"...but don't forget that pixels are already 24-bit in
"true colour" modes...so, "unusual" but not unheard of... ]
The "simplicity" of UTF-32 isn't worth the cost, in an "external"
context...well, not unless you're using a lot of Chinese or Japanese
ideographic characters...perhaps it's "all the rage" to use UTF-32 in the
Far East or something...I don't know...never been there (well, I have been
far enough East but further South and not in the particular countries -
China / Japan / Korea - in question :)...
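To put rough numbers on the "too fat" point, here is a quick sketch in Python (not HLA, just because it's short) that encodes the same text three ways and counts the bytes. The sample strings and the `encoded_sizes` helper are made up for illustration; the "-le" codec names are only there so that no byte-order mark gets counted.

```python
# Compare the storage cost of the same text in UTF-8, UTF-16 and UTF-32.
# The sample strings are illustrative only; they are not from the thread.
samples = {
    "ascii": "Hello, World",
    "japanese": "こんにちは世界",
    # One CJK Extension B ideograph and one musical symbol, both above U+FFFF.
    "supplementary": "\U00020000\U0001D11E",
}


def encoded_sizes(text):
    """Return the number of bytes each encoding needs (no byte-order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for name, text in samples.items():
    # len(text) is the number of code points in the string.
    print(name, len(text), encoded_sizes(text))

# The highest code point, U+10FFFF, fits in 21 bits, so 11 of the 32 bits
# in every UTF-32 unit are always zero.
print((0x10FFFF).bit_length())
```

UTF-32 is always exactly 4 bytes per code point, so it never comes out smaller than the other two; the best it manages is a tie, when every character sits above U+FFFF.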
> Eventually, I do plan to add new library routines to HLA (in v2.0) to
> support all this stuff, but right now I just need to add appropriate
> support to HLA v2.0 (and the ADK) itself. I'm working under the belief
> that if I can support UTF-8 inside HLA v2.0, everyone will be happy
> (e.g., if someone actually produces a UTF-16 text file, it can be run
> through a filter to produce a UTF-8 file).
Same presumption, basically, that we're making with LuxAsm...
And, of course, as you need to "balance" things between Windows and Linux
equally with HLA, then UTF-8 would, I think, be the right "compromise"
solution...it should be perfectly okay, I reckon...because, note, I did
actually use Windows' Notepad to do that "Japanese Hello World"...you can
choose to save files as "UTF-8" easily enough with Notepad...so, this is
not really any great problem with Windows (even if UTF-16 is "native", most
Windows programs that have UNICODE support - such as Notepad and
newsreaders and such - typically also include a "save as UTF-8" option
:)...and UTF-8 is what's "native" on Linux...plus, that "ASCII
compatibility" is a handy feature...
And, as you say, you can really use any of the encodings and then just pass
it through a "filter" to convert it before processing it...
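Such a "filter" can be very small indeed. A minimal sketch in Python, assuming the input really is UTF-16; the names `utf16_to_utf8` and `filter_stream` are hypothetical, not part of HLA or LuxAsm:

```python
import io


def utf16_to_utf8(data):
    """Re-encode UTF-16 bytes as UTF-8 bytes.

    Python's "utf-16" codec uses a leading byte-order mark if there is one
    and otherwise assumes the machine's native byte order.
    """
    return data.decode("utf-16").encode("utf-8")


def filter_stream(src, dst):
    """Read all of a UTF-16 binary stream and write it out as UTF-8."""
    dst.write(utf16_to_utf8(src.read()))


# Usage with in-memory streams standing in for files opened in binary mode.
source = io.BytesIO("Hello, 世界".encode("utf-16"))
target = io.BytesIO()
filter_stream(source, target)
print(target.getvalue())
```

The assembler itself then only ever has to understand one encoding, and the conversion cost is paid once, up front.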
> When it comes time to actually produce a set of standard library
> routines for HLA v2.0, I suspect that having a set of UTF-7.5, UTF-8,
> UTF-16, *and* UTF-32 routines wouldn't be a bad idea.
Yeah; Especially if you want "optimised" routines, as each encoding is
probably "optimised" best in different ways (you know, with UTF-8, you can
grab four bytes at a time to speed things up...but, this, of course, is
just one character for UTF-32, so there's no "optimising" in doing this for
that encoding ;)...
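As a sketch of that "four bytes at a time" idea (in Python for brevity; this is not Randy's routine from the thread, just one way it could go): a UTF-8 continuation byte always looks like 10xxxxxx, so the character count is the byte count minus the continuation bytes, and those can be flagged for four bytes in one go with a couple of shifts and a mask. The name `utf8_strlen` is made up, and the input is assumed to be valid UTF-8.

```python
def utf8_strlen(data):
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    count = 0
    whole = len(data) & ~3  # largest multiple of 4 not exceeding len(data)

    for i in range(0, whole, 4):
        # Load four bytes as one 32-bit little-endian word.
        word = int.from_bytes(data[i:i + 4], "little")
        # For each byte: bit 7 set AND bit 6 clear means "continuation".
        # The mask keeps one flag bit per byte (bits 0, 8, 16 and 24).
        cont = (word >> 7) & (~word >> 6) & 0x01010101
        count += 4 - bin(cont).count("1")

    # Handle the 0 to 3 bytes left over, one at a time.
    for b in data[whole:]:
        if (b & 0xC0) != 0x80:
            count += 1
    return count


print(utf8_strlen("Hello, World".encode("utf-8")))
print(utf8_strlen("こんにちは世界".encode("utf-8")))
```

In assembly the same test is a load, two shifts, a NOT and two ANDs per dword, with a bit count at the end. For UTF-32 none of this is needed: the length is just the byte count divided by four.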
The only real problem, of course, is the "amount of hard work"
involved...basically, "redoing" the library routines for each
encoding...ah, it's the old "trade-off": Quick programming or best
programming...the "cheap and cheerful" way is to just have UTF-32 or
something and then add in "filters"...the better "performance and
flexibility" option is to have a set of routines for each encoding...the
former's less work for you but the latter's less work for your users...and,
really, a tool author should, of course, be thinking more of their users'
"convenience" than their own, generally speaking...
And, anyway, all the encodings may store things differently but it is the
same actual content - UNICODE "code points" - that they are storing...the
"look-up tables" for "case" or whatever will be equally useful for all the
encodings...and so forth...so, it's not as if you'd need to completely
"re-invent the wheel" for each encoding...and you could always do what the
HLA library already does with the arithmetic: "Promote" upwards, do the
maths, convert back down again...so, convert to UTF-32, process, convert
back to UTF-16 or UTF-8 or whatever ;)...
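A rough sketch of that "promote, process, convert back" approach, in Python. Everything here is illustrative: `TOY_UPPER` is a deliberately tiny case table (a-z plus one accented letter), nowhere near a real UNICODE case mapping, and `to_upper` is a hypothetical name. The point is only that a single table keyed on code points serves every encoding.

```python
# Toy upper-case table keyed on code points: a-z plus e-acute.
# A real table would cover far more of UNICODE than this.
TOY_UPPER = {cp: cp - 32 for cp in range(ord("a"), ord("z") + 1)}
TOY_UPPER[0xE9] = 0xC9  # e-acute maps to E-acute


def to_upper(data, encoding):
    """Upper-case encoded text by promoting it to UTF-32 and back."""
    # Promote: whatever the input encoding, get fixed-width 32-bit units.
    utf32 = data.decode(encoding).encode("utf-32-le")
    code_points = [
        int.from_bytes(utf32[i:i + 4], "little")
        for i in range(0, len(utf32), 4)
    ]

    # Process: one table look-up per code point, no variable-width logic.
    mapped = [TOY_UPPER.get(cp, cp) for cp in code_points]

    # Convert back down to the caller's encoding.
    out32 = b"".join(cp.to_bytes(4, "little") for cp in mapped)
    return out32.decode("utf-32-le").encode(encoding)


print(to_upper("héllo".encode("utf-8"), "utf-8"))
print(to_upper("héllo".encode("utf-16-le"), "utf-16-le"))
```

The cost is two conversions per call, which is exactly the "cheap and cheerful" versus "performance" trade-off: the routine in the middle is written once, but a dedicated UTF-8 version could skip both conversions.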
Beth :)