Re: Unicode string libraries
- From: Mark Wooding <mdw@xxxxxxxxxxxxxxxx>
- Date: Sun, 4 Jan 2009 20:51:21 +0000 (UTC)
Paul Hsieh <websnarf@xxxxxxxxx> wrote:
> Well actually I think it's a "Linuxism" to use UTF-8. Or at least a
> non-Sun UNIXism.
If anything, it's an IETF-ism. UTF-8 is the encoding that must be used
in IETF standardized protocols if there's no space in the protocol for
encoding negotiation. This basically makes it the closest thing I can
think of to an `Internet standard text encoding'.
> UTF-8, on average, tends to have the best memory usage of all the
> formats (*unless* you are dominated by Asian characters defined within
> the BMP[1]). In particular, it rules on ASCII, of course. Not
> surprisingly it was the path Perl took.
I was quite surprised given the amount of fairly complex text hacking
that Perl does. I initially thought that the variable-length characters
would have been a major performance hit. Then I realized that Perl
doesn't actually process individual characters very much at all.
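
To put rough numbers on the memory-usage point quoted above, here is a
minimal sketch that reports how many bytes a single code point costs in
each encoding. The function names are made up for illustration; they are
not from any particular library.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one code point in UTF-8 (0 if it cannot be encoded). */
size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80) return 1;                     /* ASCII */
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
    if (cp < 0x10000) return 3;                  /* rest of the BMP */
    if (cp <= 0x10FFFF) return 4;                /* supplementary planes */
    return 0;
}

/* Bytes needed for one code point in UTF-16 (0 if it cannot be encoded). */
size_t utf16_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
    if (cp < 0x10000) return 2;                  /* one 16-bit unit */
    if (cp <= 0x10FFFF) return 4;                /* surrogate pair */
    return 0;
}
```

ASCII costs one byte in UTF-8 against two in UTF-16, while a BMP
ideograph such as U+4E2D costs three against two, which is the exception
noted in the quoted text.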
[UTF-8 and Boyer--Moore.]
> Indeed, but it's a technical solution that's not all that useful in
> practice. To determine if two Unicode strings are equal, binary
> comparisons are typically insufficient.
That depends on your standpoint on canonicalization. If you've either
canonified both strings first, or declared that the whole messy business
is the problem of someone else who actually cares[1], then this isn't a
problem.
[1] Not such a bad idea, actually. I remember the Git revision control
system coming unstuck (a while ago) on Mac OS X because the
filesystem magically canonified UTF-8 filenames, causing the
revision control system and filesystem to end up with different
ideas of what files were called. If the file system had decided to
Leave Well Enough Alone Dammit, and stop trying to be so clever,
there wouldn't have been a problem.
Similar problems used to occur in case-insensitive file systems on
multi-user servers, of course, when different users would want to
use different character sets to view the filesystem...
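
As a small illustration of the canonicalization issue, the sketch below
holds two UTF-8 spellings of the same accented letter that Unicode
treats as canonically equivalent, and compares them byte-wise. The
identifiers are invented for this example; actual normalization needs a
Unicode library and is not attempted here.

```c
#include <string.h>

/* Two UTF-8 spellings of e-acute that are canonically equivalent. */
static const char precomposed[] = "\xC3\xA9";   /* U+00E9 */
static const char decomposed[]  = "e\xCC\x81";  /* U+0065 U+0301 */

/* Plain byte-wise equality, which is all strcmp-based code tests. */
int bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

bytes_equal(precomposed, decomposed) is false even though both render as
the same character. If one party rewrites names into the other form
behind your back, as in the filesystem story above, byte comparison
stops agreeing with what was stored.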
> Well, the same is true of UTF-16 so long as you have matching
> endianness.
Oh, don't get me started on the endianness idiocy of UTF-16 as an
interchange format!
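
For anyone who hasn't been bitten by this: a sketch of the two byte
orders in which the same UTF-16 code unit can go onto the wire. The
helper names are hypothetical.

```c
#include <stdint.h>

/* Serialize one UTF-16 code unit in big-endian byte order. */
void put_u16be(uint16_t unit, unsigned char out[2])
{
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}

/* Serialize one UTF-16 code unit in little-endian byte order. */
void put_u16le(uint16_t unit, unsigned char out[2])
{
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}
```

U+20AC comes out as 20 AC one way and AC 20 the other, so the receiver
needs a byte-order mark or out-of-band agreement to tell them apart.
UTF-8 is defined as a byte sequence (E2 82 AC for the same character),
so the question never arises.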
> The real benefit of using UTF-8 is that you can perform any transform
> on it that works on ASCII embedded in 8-bits that preserves non-ASCII.
> So SIMD tricks that change the case of ASCII, for example, can work
> without modification.
There's that, but also that UTF-8 didn't break when Unicode got extended
to 20.09 bits.
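
A minimal sketch of the ASCII-transform point, written as a plain scalar
loop rather than SIMD (the principle is the same). It relies on the fact
that every byte of a multi-byte UTF-8 sequence has its high bit set, so
no part of a non-ASCII character can be mistaken for an ASCII letter.
The function name is invented for this example.

```c
#include <stddef.h>

/* Upper-case ASCII letters in place; bytes >= 0x80 are never touched,
 * so multi-byte UTF-8 sequences pass through unchanged. */
void ascii_upper_utf8(unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 'a' - 'A';
    }
}
```

Note that this only changes the case of ASCII letters: non-ASCII letters
are left exactly as they were, which is the "preserves non-ASCII"
condition in the quoted text.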
> Well this criminal weakness actually started from the Unicode standard
> itself, remember. They thought 16 bits would be good enough to cover
> everything, without even bothering to count the number of Asian
> characters.
I remember thinking at the time that 65536 was going to be a tight
squeeze.
But that's not what I was describing as `criminal'. Rather, the problem
Java had was that it couldn't later extend its `char' type to cover the
wider ISO10646 space, because it had already been specified as an
unsigned integer type consisting of precisely the integers 0, 1, ...,
65535, with defined arithmetic properties which would break if the set
of represented values was altered. (Ummm... actually, it might have
been possible to define an extended char as being an element of a
product ring, but that's just a little crazy.)
I can't blame Sun for adopting Unicode early: someone had to, or it
would never have got off the ground -- and although Unicode is pretty
awful in a large number of ways, it's still a hell of a lot better than
the vast pile of bizarre stateful national character sets which preceded
it. But I can -- and do! -- blame them for mis-specifying the `char'
type.
Had Sun decided to use an abstract type for characters, with conversion
functions to integers (and maybe handy methods for doing character-y
things like case conversion and so on -- but no, because primitive types
can't have proper methods because ...) then they wouldn't have been
screwed so badly by the Unicode expansion.
Of course, the real criminal incompetents here are Microsoft, who --
having seen Sun getting screwed by Unicode and an insufficiently
abstract character type -- decided that it looked like a wonderful idea
and did exactly the same thing in C#. This is actually rather puzzling,
because C# does seem to avoid most of the screamingly absurd mistakes
that Java made.
Common Lisp implementations have adopted (20.09-bit) Unicode without
doing violence to the language spec, precisely because Lisp treats
characters as an abstract type. Of course, programs which assume that
arrays of size CHAR-CODE-LIMIT are cheap will run out of memory on such
implementations, but nothing in the language spec said that this was a
portable thing to do.
I pick Lisp just because I'm familiar with it. I'm sure there are other
old languages which have adopted Unicode without much pain. Well, Perl
and Python did, certainly. C suffers because it failed early on to
distinguish `char' from `byte' properly, causing a tension when `char'
wanted to expand and `byte' certainly couldn't without wrecking
compatibility with too many old programs; but char as a holder for UTF-8
and wchar_t as a holder for Unicode code points seems both sane and
fairly common practice.
> This hack to use surrogates to cover a much larger range
> caused the UTF-8 range to get truncated to accommodate UTF-16's
> limits (so that they could unify Unicode and ISO 10646).
Huh? UTF-8 doesn't encode surrogate pairs: the code points used for the
surrogates are just not valid in UTF-8 at all.
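
To make both halves of that concrete, here is a sketch of a UTF-8
encoder for a single code point. It refuses the surrogate range
U+D800..U+DFFF outright and stops at U+10FFFF, the ceiling that UTF-16
can reach. The function name and the return-0-on-error convention are
choices made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[4].
 * Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode code space */
}
```

A character outside the BMP is encoded directly as one four-byte
sequence; it is never split into two surrogates that are then encoded
separately.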
> Sun and Microsoft's main problem is that they jumped on the Unicode
> ship too early (or perhaps they were part of the problem -- I am not
> *that* familiar with the early history of Unicode).
Windows NT and Java both jumped on too early, and got burned by the
later expansion. C# was designed /after/ the expansion, but suffers
from stolen braindamage. (I don't have a problem with borrowing good
language features, and Java does have a few. I have a massive problem
with borrowing stupid features, and the overspecified 16-bit-integer
`char'-which-can't-actually-represent-characters is definitely one of
those.)
Interestingly, Plan 9 from Bell Labs is an approximate contemporary of
Windows NT, and another early adopter of Unicode. Except that Plan 9
used UTF-8 throughout, and therefore doesn't afflict all future
developers with the problems caused by the Unicode expansion. (Indeed,
UTF-8 was invented by Rob Pike and Ken Thompson specifically as an
ASCII-compatible and endianness-braindamage-resistant encoding for use
in Plan 9.)
> Essentially using UTF-16 increases the chances of test blind spots.
Oh, yes, indeed. I was thinking about mentioning that, but I thought
I'd gone on long enough already. I've probably outstayed my welcome
quite thoroughly this time.
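
For the record, a sketch of where the blind spot tends to hide: counting
code points in a UTF-16 buffer. The function name is invented, and
treating an unpaired surrogate as one code point is an arbitrary choice
made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Count code points in a UTF-16 buffer of n code units.
 * An unpaired surrogate is counted as one (ill-formed) code point. */
size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* High surrogate followed by a low one: two units, one code point. */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;
        count++;
    }
    return count;
}
```

Test data drawn only from the BMP never enters the surrogate branch, so
code that simply returns n passes every such test and then miscounts
U+1F600 (D83D DE00) as two characters.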
-- [mdw]