Re: Unicode string libraries



Jon Harrop <jon@xxxxxxxxxxxxxxxxx> wrote:

> I am trying to figure out how best to represent strings in a new
> language and am finding it vastly harder than I had imagined.

Yes, it's difficult, sorry!

> What do other language implementations (e.g. Perl, Python, Ruby) with
> rich string libraries do? Do they really just reinvent the wheel when
> it comes to sequences of characters?

Pretty much, yes.

I know that Perl uses UTF-8 as its internal string representation. This
initially sounds strange, since UTF-8 has serious non-uniformity
problems, much worse than UTF-16. But Perl uses Boyer-Moore string
searching extensively, which involves a number of tables indexed by
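`character': if you use UTF-8, these tables are indexed by byte, so they
stay small (256 entries), use less memory and fit well in cache. And
because UTF-8 is self-synchronizing (the encoding of one character never
appears inside the encoding of another), a byte-wise search on UTF-8
encodings is equivalent to a search on Unicode characters.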
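
To make the table-size point concrete, here is a minimal sketch of a
byte-wise search over UTF-8. It uses the Horspool simplification of
Boyer-Moore, not Perl's actual implementation, and the function name
horspool_find_all and the sample text are made up for illustration.

```python
def horspool_find_all(haystack: bytes, needle: bytes) -> list[int]:
    """Return the byte offsets of every match of needle in haystack."""
    m, n = len(needle), len(haystack)
    if m == 0 or m > n:
        return []
    # Shift table indexed by byte value: always 256 entries, however
    # large the underlying character set is.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    hits = []
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            hits.append(pos)
        # Skip ahead based on the byte aligned with the needle's last byte.
        pos += shift[haystack[pos + m - 1]]
    return hits


text = "naïve café, naïve résumé"
needle = "naïve"
text_utf8 = text.encode("utf-8")

byte_hits = horspool_find_all(text_utf8, needle.encode("utf-8"))

# Each byte offset falls on a character boundary, so decoding the prefix
# recovers the corresponding character offset.
char_hits = [len(text_utf8[:b].decode("utf-8")) for b in byte_hits]
```

The byte offsets differ from the character offsets (the second match is
at byte 14 but character 12), yet the set of matches is the same as a
character-level search would find.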

My memory is telling me that Python uses UTF-32 internally, though it's
changed a few times (in a way that's almost transparent to the Python
programmer). Last time I looked, Ruby just didn't do Unicode. SBCL and
GNU CLISP both use UTF-32.

Java and C# both use UTF-16. This is understandable for Java, because
Sun adopted Unicode before it expanded from 16 to 20.09 bits[1], but
less comprehensible for C#. Both of these languages also provide
criminally weak abstractions for characters, inherited from C: their
`char' types are merely 16-bit integers, which means that, in
particular, they're not actually capable of representing all characters,
and occasionally end up representing half of a Unicode surrogate pair
instead.
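
The surrogate-pair problem is easy to see by looking at the 16-bit units
themselves. The following is only a sketch in Python (not Java or C#);
the helper names utf16_units and is_surrogate are made up for
illustration.

```python
import struct


def utf16_units(s: str) -> list[int]:
    """Return the sequence of 16-bit code units that encode s in UTF-16."""
    data = s.encode("utf-16-be")
    return [unit for (unit,) in struct.iter_unpack(">H", data)]


def is_surrogate(unit: int) -> bool:
    """True if a 16-bit unit is half of a surrogate pair, not a character."""
    return 0xD800 <= unit <= 0xDFFF


bmp_char = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

bmp_units = utf16_units(bmp_char)        # one unit: 0x00E9
astral_units = utf16_units(astral_char)  # two units: 0xD834, 0xDD1E
```

A 16-bit `char' holding 0xD834 or 0xDD1E on its own is not a character
at all, which is exactly the weakness described above.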

UTF-16 seems like the worst of both worlds to me: the non-uniformity of
UTF-8 and the large encoding unit of UTF-32. Your mileage may vary.

I know of a few text re-encoding packages out there. There's the
POSIX-standard iconv function; there's GNU recode; there's Simon
Tatham's charset library, which is used in PuTTY... I think Perl and
Python each have their own character re-encoding systems.
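
As a rough illustration of what such a re-encoding step looks like,
here is a sketch using Python's built-in codecs. The helper name
reencode is made up; the job is roughly what
`iconv -f ISO-8859-1 -t UTF-8' does on the command line.

```python
def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode data from the source encoding, then encode it as target."""
    return data.decode(source).encode(target)


latin1_bytes = b"caf\xe9"  # "café" in ISO-8859-1
utf8_bytes = reencode(latin1_bytes, "latin-1", "utf-8")
round_trip = reencode(utf8_bytes, "utf-8", "latin-1")
```

Going through decoded text in the middle means a character with no
representation in the target encoding raises an error rather than being
silently mangled.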

[1] Yes, it's a fractional number of bits. Live with it.
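
For the curious, the figure comes from the size of the code space:
Unicode has 17 planes of 65536 code points each, i.e. 0x110000 values.
A quick check in Python (the variable names are mine):

```python
import math

CODE_POINTS = 0x110000           # U+0000 through U+10FFFF inclusive
bits = math.log2(CODE_POINTS)    # a little over 20
bits_rounded = round(bits, 2)    # 20.09
```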

-- [mdw]