Re: Unicode string libraries
- From: Paul Hsieh <websnarf@xxxxxxxxx>
- Date: Mon, 5 Jan 2009 11:27:21 -0800 (PST)
On Jan 4, 12:51 pm, Mark Wooding <m...@xxxxxxxxxxxxxxxx> wrote:
Paul Hsieh <websn...@xxxxxxxxx> wrote:
Well actually I think it's a "Linuxism" to use UTF-8. Or at least a
non-Sun UNIXism.
If anything, it's an IETF-ism. UTF-8 is the encoding that must be used
in IETF standardized protocols if there's no space in the protocol for
encoding negotiation. This basically makes it the closest thing I can think
of to an `Internet standard text encoding'. [...]
That sounds more believable than my speculation; it explains why Linux
would adopt it without controversy.
[...] I'm sure there are other
old languages which have adopted Unicode without much pain. Well, Perl
and Python did, certainly. C suffers because it failed early on to
distinguish `char' from `byte' properly, causing a tension when `char'
wanted to expand and `byte' certainly couldn't without wrecking
compatibility with too many old programs; but char as a holder for UTF-8
and wchar_t as a holder for Unicode code points seems both sane and
fairly common practice.
Exactly what the C language committee was thinking in adopting wchar_t
as some amorphous non-specified alternate text type is anyone's
guess. Today it's trivial to see that anything other than Unicode
is laughable.
This hack to use surrogates to cover a much larger range
caused the UTF-8 range to get truncated to accommodate UTF-16's
limits (so that they could unify Unicode and ISO 10646).
Huh? UTF-8 doesn't encode surrogate pairs: the code points used for the
surrogates are just not valid in UTF-8 at all.
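
For reference, the surrogate arithmetic under discussion can be sketched
in a few lines of C. This is only an illustration: the function names are
made up for this example and do not come from any library. It shows why
the scheme tops out at 0x10FFFF: two 10-bit halves give 20 bits, offset
by 0x10000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): split a supplementary
 * code point into a UTF-16 surrogate pair. Returns 1 on success, or 0
 * if cp lies outside 0x10000..0x10FFFF and so has no pair. */
int utf16_to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(hi != NULL && lo != NULL);

    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;

    cp -= 0x10000;                            /* now a 20-bit value */
    *hi = (uint16_t)(0xD800 | (cp >> 10));    /* top 10 bits */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* bottom 10 bits */
    return 1;
}

/* Hypothetical inverse. Assumes hi is in 0xD800..0xDBFF and lo is in
 * 0xDC00..0xDFFF; the caller must check that first. */
uint32_t utf16_from_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                      | (uint32_t)(lo - 0xDC00));
}
```

The largest possible pair, 0xDBFF 0xDFFF, decodes to 0x10FFFF, which is
where the ceiling on the code space comes from.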
No, I mean that the original UTF-8 encodes 31-bit code points and uses
the full range of bit patterns to do so. The limitations of UTF-16
meant that the surrogate pair solution was really the best they could
do (to maintain backward compatibility with the old UCS-2 scheme and
making enough space available to encode all known and potential text),
and this meant that the Unicode range could not extend past 0x10FFFF. So
to be consistent, the 5- and 6-byte UTF-8 code patterns got dropped and
are now considered, essentially, illegal. In other words, UTF-16
parsers *HAD* to change because it was flawed from the beginning, but
UTF-8 also ended up needing changes to compensate for errors in
UTF-16 that it had nothing to do with.
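
To make the difference concrete, here is a minimal sketch in C of an
encoder that can follow either rule set. The function name and the
`strict` flag are inventions for this example, not part of any library.
With strict off it follows the original 31-bit scheme (up to 6 bytes);
with strict on it applies the later restriction, rejecting surrogates
and anything above 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): encode cp as UTF-8 into
 * out, which must have room for 6 bytes. Returns the number of bytes
 * written, or 0 if cp cannot be encoded under the chosen rules. */
size_t utf8_encode(uint32_t cp, int strict, unsigned char *out)
{
    size_t len, i;

    assert(out != NULL);

    /* Modern rules: no surrogates, nothing past the UTF-16 ceiling. */
    if (strict && (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
        return 0;
    /* Even the original scheme stops at 31 bits. */
    if (cp > 0x7FFFFFFF)
        return 0;

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    else if (cp < 0x800)     len = 2;
    else if (cp < 0x10000)   len = 3;
    else if (cp < 0x200000)  len = 4;
    else if (cp < 0x4000000) len = 5;
    else                     len = 6;

    /* Fill continuation bytes from the end, 6 payload bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* Lead byte: len one-bits, a zero bit, then the remaining payload. */
    out[0] = (unsigned char)((0xFF << (8 - len)) | cp);
    return len;
}
```

Under these assumptions 0x10FFFF encodes as F4 8F BF BF in both modes,
while 0x110000 is rejected in strict mode and becomes F4 90 80 80 in the
original scheme. The 5- and 6-byte forms are only reachable with strict
off.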
Sun and Microsoft's main problem is that they jumped on the Unicode
ship too early (or perhaps they were part of the problem -- I am not
*that* familiar with the early history of Unicode).
Windows NT and Java both jumped on too early, and got burned by the
later expansion. C# was designed /after/ the expansion, but suffers
from stolen braindamage. (I don't have a problem with borrowing good
language features, and Java does have a few. I have a massive problem
with borrowing stupid features, and the overspecified 16-bit-integer
`char'-which-can't-actually-represent-characters is definitely one of
those.)
Microsoft also has a very large investment in their native support for
UTF-16. I am sure this was the most important consideration.
Interestingly, Plan 9 from Bell Labs is an approximate contemporary of
Windows NT, and another early adopter of Unicode. Except that Plan 9
used UTF-8 throughout, and therefore doesn't afflict all future
developers with the problems caused by the Unicode expansion. (Indeed,
UTF-8 was invented by Rob Pike and Ken Thompson specifically as an
ASCII-compatible and endianness-braindamage-resistant encoding for use
in Plan 9.)
Right -- that's why I thought it was a non-Sun UNIXism.
--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/