Re: Unicode string libraries
- From: Mark Wooding <mdw@xxxxxxxxxxxxxxxx>
- Date: Sat, 3 Jan 2009 13:27:39 +0000 (UTC)
Jon Harrop <jon@xxxxxxxxxxxxxxxxx> wrote:
> I am trying to figure out how best to represent strings in a new
> language and am finding it vastly harder than I had imagined.
Yes, it's difficult, sorry!
> What do other language implementations (e.g. Perl, Python, Ruby) with
> rich string libraries do? Do they really just reinvent the wheel when
> it comes to sequences of characters?
Pretty much, yes.
I know that Perl uses UTF-8 as its internal string representation. This
initially sounds strange, since UTF-8 has serious non-uniformity
problems, much worse than UTF-16. But Perl uses Boyer-Moore string
searching extensively, which involves a number of tables indexed by
`character': if you use UTF-8, these tables remain quite small so they
use less memory and fit well in cache; the magic of UTF-8 means that a
search on UTF-8 encodings is equivalent to a search on Unicode
characters.
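
A minimal sketch of that last property, in Python rather than Perl. The helper name find_char_index_via_bytes is made up for illustration, and bytes.find merely stands in for a Boyer-Moore search (I'm not claiming anything about which algorithm it uses internally). The point is only that a byte-level table needs 256 entries, and that a match on the UTF-8 bytes always lands on a character boundary:

```python
# Hypothetical helper: search the UTF-8 encoding byte-by-byte, then
# translate the byte offset back into a character index.
def find_char_index_via_bytes(haystack: str, needle: str) -> int:
    hay_bytes = haystack.encode("utf-8")
    needle_bytes = needle.encode("utf-8")
    byte_offset = hay_bytes.find(needle_bytes)
    if byte_offset < 0:
        return -1
    # A valid UTF-8 needle starts with a lead (or ASCII) byte, which can
    # never match a continuation byte, so the prefix before the match is
    # itself valid UTF-8 and decodes cleanly.
    return len(hay_bytes[:byte_offset].decode("utf-8"))


text = "naïve café: 日本語のテキスト"
for needle in ["café", "本語", "é", "missing"]:
    # The byte-level search agrees with the character-level search.
    assert find_char_index_via_bytes(text, needle) == text.find(needle)
```

The same search on UTF-32 data would want skip tables indexed by up to 0x110000 distinct values, which is where the cache argument comes from.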
My memory is telling me that Python uses UTF-32 internally, though it's
changed a few times (in a way that's almost transparent to the Python
programmer). Last time I looked, Ruby just didn't do Unicode. SBCL and
GNU CLISP both use UTF-32.
Java and C# both use UTF-16. This is understandable for Java, because
Sun adopted Unicode before it expanded from 16 to 20.09 bits[1], but
less comprehensible for C#. Both of these languages also provide
criminally weak abstractions for characters, inherited from C: their
`char' types are merely 16-bit integers, which means that, in
particular, they're not actually capable of representing all characters,
and occasionally end up representing half of a Unicode surrogate pair
instead.
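
To make the surrogate problem concrete, here is a small Python sketch (the helper utf16_code_units is a made-up name) that lists the 16-bit code units UTF-16 uses for a character. Anything outside the Basic Multilingual Plane needs two of them, so a 16-bit `char' can only ever hold one half:

```python
# Hypothetical helper: return the 16-bit code units of a string's
# UTF-16 encoding (big-endian, no byte-order mark).
def utf16_code_units(s: str) -> list:
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A BMP character fits in a single code unit.
assert utf16_code_units("A") == [0x0041]

# U+1D11E MUSICAL SYMBOL G CLEF does not: it becomes a surrogate pair.
clef = "\U0001D11E"
units = utf16_code_units(clef)
assert units == [0xD834, 0xDD1E]

# One character as far as Unicode is concerned, two `char's in a
# UTF-16-based language.
assert len(clef) == 1 and len(units) == 2
```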
UTF-16 seems like the worst of both worlds to me: the non-uniformity of
UTF-8 and the large encoding unit of UTF-32. Your mileage may vary.
I know of a few text re-encoding packages out there. There's the
POSIX-standard iconv function; there's GNU recode; there's Simon
Tatham's charset library, which is used in PuTTY... I think Perl and
Python each have their own character re-encoding systems.
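
For a feel of what such a system looks like from the programmer's side, here is a sketch using Python's built-in codecs. The function name reencode is my own invention; the decode/encode calls are the standard ones. It does roughly what `iconv -f latin1 -t utf-8` does on the command line, by going through Unicode as the pivot:

```python
# Hypothetical helper: convert bytes from one encoding to another by
# decoding to Unicode and encoding again.
def reencode(data: bytes, source: str, target: str) -> bytes:
    return data.decode(source).encode(target)


latin1 = "café".encode("latin-1")   # e-acute is the single byte 0xE9
utf8 = reencode(latin1, "latin-1", "utf-8")

# In UTF-8 the same character takes two bytes, 0xC3 0xA9.
assert latin1 == b"caf\xe9"
assert utf8 == b"caf\xc3\xa9"
```

A real library also has to decide what to do with characters the target encoding can't represent; this sketch just lets the encode step raise an error.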
[1] Yes, it's a fractional number of bits. Live with it.
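
For the sceptical, the arithmetic behind the footnote, assuming the figure means the base-2 logarithm of the number of code points (U+0000 through U+10FFFF, i.e. 17 planes of 65536 each):

```python
import math

# Code points run from U+0000 to U+10FFFF inclusive.
code_points = 0x10FFFF + 1
assert code_points == 17 * 65536

# Bits needed to distinguish that many values: 16 + log2(17).
bits = math.log2(code_points)
assert round(bits, 2) == 20.09
```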
-- [mdw]