Re: Unicode string libraries



On Jan 3, 5:27 am, Mark Wooding <m...@xxxxxxxxxxxxxxxx> wrote:
Jon Harrop <j...@xxxxxxxxxxxxxxxxx> wrote:
I am trying to figure out how best to represent strings in a new
language and am finding it vastly harder than I had imagined.

Yes, it's difficult, sorry!

As one of my coworkers says -- "Just teach everyone Dutch and everyone
will be happy".

What do other language implementations (e.g. Perl, Python, Ruby)
with rich string libraries do? Do they really just reinvent the wheel when
it comes to sequences of characters?

Pretty much, yes.

I know that Perl uses UTF-8 as its internal string representation. This
initially sounds strange, since UTF-8 has serious non-uniformity
problems, much worse than UTF-16.

Well actually I think it's a "Linuxism" to use UTF-8. Or at least a
non-Sun UNIXism. UTF-8, on average, tends to have the best memory
usage of all the formats (*unless* you are dominated by Asian
characters defined within the BMP[1]). In particular, it rules on
ASCII, of course. Not surprisingly it was the path Perl took.
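To put rough numbers on the memory claim, here is a minimal sketch that compares encoded sizes for one ASCII string and one short run of Japanese text that sits entirely within the BMP. The sample strings and the helper name `encoded_sizes` are my own choices for illustration, not anything from Perl.

```python
# Compare the encoded size of the same text in UTF-8, UTF-16 and UTF-32.
samples = {
    "ascii": "Unicode string libraries",
    # Japanese sample text; every code point lies within the BMP.
    "cjk_bmp": "\u65e5\u672c\u8a9e\u306e\u6587\u5b57\u5217",
}

def encoded_sizes(text):
    # The "-le" codecs are used so that no byte order mark is counted.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

for name, text in samples.items():
    print(name, len(text), "code points:", encoded_sizes(text))
```

For the ASCII sample UTF-8 is half the size of UTF-16; for the BMP-only Japanese sample it is the other way around, with UTF-8 at 3 bytes per character against 2 for UTF-16.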

[...] But Perl uses Boyer-Moore string
searching extensively, which involves a number of tables indexed by
`character': if you use UTF-8, these tables remain quite small so they
use less memory and fit well in cache;

Indeed, but it's a technical solution that's not all that useful in
practice. To determine whether two Unicode strings are equivalent,
binary comparisons are typically insufficient. There are other string
searching algorithms that, while not as fast as Boyer-Moore in theory,
are nearly as fast in practice and are much more easily matched to
Unicode's non-unique encoding scheme.
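A small sketch of why binary comparison falls short: the same visible text can be spelled with a precomposed character or with a base letter plus a combining mark. The helper `canonically_equal` is a hypothetical name; it leans on Python's standard `unicodedata.normalize`, and NFC is just one reasonable choice of normal form.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def canonically_equal(a, b):
    # Normalize both sides to the same form before comparing.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)                   # binary comparison
print(canonically_equal(composed, decomposed))  # comparison after normalization
```

A byte-oriented search over either encoding form has the same problem unless the pattern and the text are normalized the same way first.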

[...] the magic of UTF-8 means that a
search on UTF-8 encodings is equivalent to a search on Unicode
characters.

Well, the same is true of UTF-16 so long as you have matching
endianness. The real benefit of using UTF-8 is that you can perform
any transform on it that works on ASCII embedded in 8-bits that
preserves non-ASCII. So SIMD tricks that change the case of ASCII,
for example, can work without modification.
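As an illustration of that property, here is a scalar sketch (not SIMD, and not taken from any particular library) of an ASCII-only upper-casing pass run directly over UTF-8 bytes. The function name `ascii_upper_utf8` is made up for the example.

```python
def ascii_upper_utf8(data: bytes) -> bytes:
    # Only bytes 0x61..0x7A ('a'..'z') are changed. Every byte of a
    # multi-byte UTF-8 sequence has its high bit set (>= 0x80), so
    # non-ASCII characters pass through untouched.
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in data)

text = "na\u00efve caf\u00e9"
result = ascii_upper_utf8(text.encode("utf-8")).decode("utf-8")
print(result)
```

The output is still valid UTF-8, with the ASCII letters upper-cased and the accented letters left as they were. The same byte-wise test is what a vectorized version would apply to 16 or 32 bytes at a time.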

My memory is telling me that Python uses UTF-32 internally, though it's
changed a few times (in a way that's almost transparent to the Python
programmer). Last time I looked, Ruby just didn't do Unicode. SBCL
and GNU CLisp both use UTF-32.

I was initially partial to UTF-32 as well, as it gives you array-like
code point index access. But code point indexes are not the same as
grapheme indexes, so you don't really gain anything by doing this.
You have to compute and remember your grapheme indexes either way. So
it makes sense to choose other criteria for your internal encoding.
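To make the code point versus grapheme distinction concrete, here is a rough sketch. It treats any code point with a nonzero canonical combining class as attached to the preceding one, which is only an approximation of real grapheme cluster segmentation (UAX #29 has many more rules). The name `approx_grapheme_starts` is hypothetical.

```python
import unicodedata

def approx_grapheme_starts(text):
    """Return the code point indexes at which an approximate grapheme begins.

    Simplifying assumption: a code point with a nonzero canonical
    combining class belongs to the grapheme before it. This is not a
    full UAX #29 implementation.
    """
    return [
        i for i, ch in enumerate(text)
        if i == 0 or unicodedata.combining(ch) == 0
    ]

# 'e' + acute, 'a' + diaeresis + dot below, 'z': three graphemes.
text = "e\u0301a\u0308\u0323z"
print(len(text))                     # number of code points
print(approx_grapheme_starts(text))  # where each grapheme starts
```

Even with O(1) code point indexing, finding the third grapheme still means consulting a table like this one, which is the point made above.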

Java and C# both use UTF-16. This is understandable for Java, because
Sun adopted Unicode before it expanded from 16 to 20.09 bits[1], but
less comprehensible for C#. Both of these languages also provide
criminally weak abstractions for characters, inherited from C: their
`char' types are merely 16-bit integers, which means that, in
particular, they're not actually capable of representing all characters,
and occasionally end up representing half of a Unicode surrogate pair
instead.

Well this criminal weakness actually started from the Unicode standard
itself, remember. They thought 16 bits would be good enough to cover
everything, without even bothering to count the number of Asian
characters. This hack to use surrogates to cover a much larger range
caused the UTF-8 range to get truncated to accommodate for UTF-16's
limits (so that they could unify Unicode and ISO 10646). Sun and
Microsoft's main problem is that they jumped on the Unicode ship too
early (or perhaps they were part of the problem -- I am not *that*
familiar with the early history of Unicode).

The Chinese needed a migration path from Big5, and the older 16 bit
(UCS-2) Unicode was clearly not it. It seems that only after it was
obvious that they were going to try to find their own solution that
the Unicode people finally got the hint and adopted the ISO 10646
model. Apparently there are some Asians that are still not satisfied
with it, but at least it's better than Big5. The PRC has decided to go
with their own standard GB 18030, but it's just a translation of
Unicode (it encodes characters outside of the BMP, and a few extended
characters mapped into the private areas.) This is significant
because it means they will recognize things like Tibetan symbols as a
necessary concession to be able to encode the rest of the world's
text. (So there will be no convenient "disappearing" of history for
the sake of politics.)

UTF-16 seems like the worst of both worlds to me: the non-uniformity of
UTF-8 and the large encoding unit of UTF-32. Your mileage may vary.

It's worse than that, because most English and European text can fit
into the BMP. So if as a programmer you make the mistake of thinking
each "16-bit character" is actually a grapheme, you won't know you are
making a mistake unless you test with Asian or other highly exotic
text. Essentially, using UTF-16 increases the chances of test blind
spots.
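A quick sketch of that blind spot: counting 16-bit code units matches the code point count for BMP-only text and silently diverges once anything outside the BMP shows up. The helper `utf16_units` and the sample strings are my own.

```python
def utf16_units(text):
    # Number of 16-bit code units in the UTF-16 encoding (no BOM).
    return len(text.encode("utf-16-le")) // 2

bmp_text = "\u65e5\u672c\u8a9e"       # 3 code points, all within the BMP
# 2 code points beyond the BMP: a CJK Extension B ideograph (U+20BB7)
# and MUSICAL SYMBOL G CLEF (U+1D11E).
astral_text = "\U00020BB7\U0001D11E"

print(len(bmp_text), utf16_units(bmp_text))
print(len(astral_text), utf16_units(astral_text))

# Each non-BMP code point becomes a surrogate pair in UTF-16.
print("\U0001D11E".encode("utf-16-be").hex())
```

Code that assumes one 16-bit unit per character passes every test built from the first kind of string and only breaks on the second.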

[1] BMP = Basic Multilingual Plane = the legal Unicode code points < 65536.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
