Re: (OT) Q on Unicode and XML

Hi Helmut,

Helmut Giese writes:
> So someone will devise a neat trick to extend UTF-16 (or has done so
> already),

UTF-16 encodes characters above U+FFFF as pairs of 16-bit code units
taken from a range reserved specifically for this purpose (U+D800 to
U+DFFF, the so-called "surrogate pairs"), and has done so ever since
those code points were allowed.
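
To make the mechanism concrete, here is a minimal sketch of the
arithmetic in Python. The language choice and the function name
to_surrogate_pair are mine, purely for illustration; the constants are
the ones UTF-16 defines.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range.
high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")
```

Python's own "utf-16-be" codec should produce the same two code units
for chr(0x1D11E), which is an easy way to cross-check the sketch.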

Most of the generic text drawing libraries *claim* to support UTF-16
(as opposed to UCS-2, the old 16-bit encoding without surrogate
pairs). But I am sceptical that many real-world texts contain such
characters yet, so there probably hasn't been much experience or
testing.

> but then its advantage (each code point takes 2 bytes) will be gone.

Only if you want to support hieroglyphics, rare kanji and similar
stuff. These code points are used for rather specialized things;
everything up to U+FFFF still takes exactly two bytes.

Other advantages of UTF-16 are that it is smaller than UTF-8 for code
points from U+0800 to U+FFFF (two bytes instead of three; this covers
most Chinese, Japanese and Korean text) and that Windows and Mac OS X
use it as their native encoding.
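
If you want to see the size difference for yourself, a quick sketch
along these lines should do. The sample strings and the helper name
encoded_sizes are just my picks for illustration, and the byte counts
exclude any byte order mark.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count), both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "ASCII": "hello",
    "Japanese": "\u65e5\u672c\u8a9e",     # three characters above U+07FF
    "beyond U+FFFF": "\U0001D11E",        # needs a surrogate pair in UTF-16
}

for label, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

For plain ASCII, UTF-8 is half the size; for the Japanese sample UTF-16
wins by a third; above U+FFFF both need four bytes per character.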

> Doesn't it make sense then to choose a format like UTF-8 where you
> will be safe for the forseeable future? What do you think?

I'd say go with whatever you need and whatever your tools support.
XML as such can use anything: legacy encodings, UTF-8 or UTF-16.
But you want to use it with actual software, so use whatever your
text editor and XML parsers actually handle.
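
One way to find out what a parser actually handles is to feed it the
same small document in each candidate encoding and see whether the
text survives. The sketch below does that for the parser bundled with
Python (xml.etree.ElementTree); the helper name parser_handles and the
sample text are assumptions of mine, and you would repeat the idea
with whatever parser you really use.

```python
import xml.etree.ElementTree as ET


def parser_handles(encoding, text="Gr\u00fc\u00dfe"):
    """Round-trip a tiny document through the parser in one encoding."""
    document = f'<?xml version="1.0" encoding="{encoding}"?><msg>{text}</msg>'
    try:
        data = document.encode(encoding)   # bytes as they would sit on disk
        return ET.fromstring(data).text == text
    except (ET.ParseError, LookupError, ValueError):
        # Parser rejected the input, or Python cannot encode to it at all.
        return False


for encoding in ("utf-8", "utf-16", "iso-8859-1"):
    print(encoding, parser_handles(encoding))
```

This only tells you about the parser; the text editor still has to be
checked by hand.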

Also remember that for translation work you want to be able to work
with non-technical people, so you want to avoid the need for technical
details in the workflow.