Re: Unicode string libraries



On Jan 4, 12:51 pm, Mark Wooding <m...@xxxxxxxxxxxxxxxx> wrote:
Paul Hsieh <websn...@xxxxxxxxx> wrote:
Well, actually I think it's a "Linuxism" to use UTF-8.  Or at least a
non-Sun UNIXism.

If anything, it's an IETF-ism.  UTF-8 is the encoding that must be used
in IETF standardized protocols if there's no space in the protocol for
encoding negotiation.  This basically makes it the closest thing I can think
of to an `Internet standard text encoding'. [...]

That sounds more believable than my speculation; it explains why Linux
would adopt it without controversy.

[...] I'm sure there are other
old languages which have adopted Unicode without much pain.  Well, Perl
and Python did, certainly.  C suffers because it failed early on to
distinguish `char' from `byte' properly, causing a tension when `char'
wanted to expand and `byte' certainly couldn't without wrecking
compatibility with too many old programs; but char as a holder for UTF-8
and wchar_t as a holder for Unicode code points seems both sane and
fairly common practice.

Exactly what the C language committee was thinking in adopting wchar_t
as some amorphous non-specified alternate text type is anyone's
guess.  Today it's trivial to see that anything other than Unicode
is laughable.

This hack to use surrogates to cover a much larger range
caused the UTF-8 range to get truncated to accommodate UTF-16's
limits (so that they could unify Unicode and ISO 10646).

Huh?  UTF-8 doesn't encode surrogate pairs: the code points used for the
surrogates are just not valid in UTF-8 at all.
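
To make that point concrete, here is a minimal sketch using Python's built-in codecs (the helper name utf8_accepts is made up for illustration).  It assumes the default strict error handling: a lone surrogate code point is rejected by the UTF-8 encoder, while a code point beyond the BMP is encoded directly in UTF-8 and only becomes a surrogate pair in UTF-16.

```python
def utf8_accepts(code_point: int) -> bool:
    """Return True if the strict UTF-8 codec will encode this code point."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# U+D800 is the first surrogate; U+D7FF is the last code point before the block.
surrogate_ok = utf8_accepts(0xD800)
before_surrogates_ok = utf8_accepts(0xD7FF)

# U+1F600 lies beyond the BMP: four bytes in UTF-8, no surrogates involved.
utf8_form = "\U0001F600".encode("utf-8")

# The same code point in big-endian UTF-16 is the surrogate pair D83D DE00.
utf16_form = "\U0001F600".encode("utf-16-be")
```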

No, I mean that the original UTF-8 encoded 31-bit code points and used
the full range of bit patterns to do so.  The limitations of UTF-16
meant that the surrogate pair solution was really the best they could
do (to maintain backward compatibility with the old UCS-2 scheme while
making enough space available to encode all known and potential text),
and this meant that the Unicode range could not extend past 0x10FFFF.  So
to be consistent the 5- and 6-byte UTF-8 code patterns got dropped and
are now considered, essentially, illegal.  In other words, UTF-16
parsers *HAD* to change because the 16-bit scheme was flawed from the
beginning, but UTF-8 also ended up needing changes to compensate for
errors in UTF-16 that it had nothing to do with.
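
The arithmetic behind that ceiling can be sketched in a few lines of Python.  The names below (MAX_UTF16_CODE_POINT, original_utf8_length) are hypothetical, and the byte-length thresholds are the ones from the original 31-bit UTF-8 scheme as I understand it, so treat this as an illustration rather than a reference implementation.

```python
# Surrogate pairs: 1024 high surrogates times 1024 low surrogates,
# addressing code points starting at 0x10000.
HIGH_COUNT = 0xDBFF - 0xD800 + 1
LOW_COUNT = 0xDFFF - 0xDC00 + 1
MAX_UTF16_CODE_POINT = 0x10000 + HIGH_COUNT * LOW_COUNT - 1

# Exclusive upper limits for 1- to 6-byte sequences in the original scheme.
ORIGINAL_LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def original_utf8_length(code_point: int) -> int:
    """Bytes needed for a code point under the original 31-bit UTF-8."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for length, limit in enumerate(ORIGINAL_LIMITS, start=1):
        if code_point < limit:
            return length
    raise ValueError("code point does not fit in 31 bits")
```

Everything UTF-16 can reach fits in four bytes, which is why the longer patterns became dead weight; the upper part of the four-byte range (above 0x10FFFF) is excluded as well.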

Sun and Microsoft's main problem is that they jumped on the Unicode
ship too early (or perhaps they were part of the problem -- I am not
*that* familiar with the early history of Unicode).

Windows NT and Java both jumped on too early, and got burned by the
later expansion.  C# was designed /after/ the expansion, but suffers
from stolen braindamage.  (I don't have a problem with borrowing good
language features, and Java does have a few.  I have a massive problem
with borrowing stupid features, and the overspecified 16-bit-integer
`char'-which-can't-actually-represent-characters is definitely one of
those.)

Microsoft also has a very large investment in their native support for
UTF-16. I am sure this was the most important consideration.

Interestingly, Plan 9 from Bell Labs is an approximate contemporary of
Windows NT, and another early adopter of Unicode.  Except that Plan 9
used UTF-8 throughout, and therefore doesn't afflict all future
developers with the problems caused by the Unicode expansion.  (Indeed,
UTF-8 was invented by Rob Pike and Ken Thompson specifically as an
ASCII-compatible and endianness-braindamage-resistant encoding for use
in Plan 9.)
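
A small Python sketch of those two properties, using only the standard codecs (the sample string is arbitrary): ASCII text is byte-for-byte identical in UTF-8, whereas UTF-16 yields different byte sequences depending on byte order.

```python
text = "plan9"

# ASCII compatibility: the UTF-8 bytes equal the ASCII bytes.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Endianness: the same text has two distinct UTF-16 serializations.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
```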

Right -- that's why I thought it was a non-Sun UNIXism.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
