Re: (OT) Q on Unicode and XML

Hi Helmut,

Helmut Giese writes:
> So someone will devise a neat trick to extend UTF-16 (or has done so
> already),

UTF-16 encodes characters above U+FFFF as sequences of two 16-bit
numbers from a region specifically reserved for this purpose
(so-called "surrogate pairs") and has done so since those code points
were allowed.
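
To make the surrogate pair arithmetic concrete, here is a minimal Python
sketch. The function name to_surrogate_pair is my own, chosen for
illustration; the constants (0x10000, 0xD800, 0xDC00) are the ones the
UTF-16 scheme uses.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair.

    Hypothetical helper for illustration only; real code would normally
    just call str.encode("utf-16").
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low


# U+13000 is the first code point of the Egyptian Hieroglyphs block.
high, low = to_surrogate_pair(0x13000)
print(f"U+13000 -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x13000).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Software that only understands UCS-2 would see these two 16-bit values
as two separate (and invalid) characters, which is where the "claims to
support UTF-16" problems tend to show up.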

Most of the generic text drawing libraries *claim* to support UTF-16
(as opposed to UCS-2, which is the old 16-bit encoding without
surrogate pairs). But I am sceptical that many real-world texts
contain such characters yet, so there probably hasn't been much
experience or testing with them.

> but then its advantage (each code point takes 2 bytes) will be gone.

Only if you want to support hieroglyphics, rare kanji and similar
characters. These code points are used for rather specialized things.

Other advantages of UTF-16 are that it is smaller than UTF-8 for code
points from U+0800 to U+FFFF (which covers most Chinese, Japanese and
Korean text) and that Windows and Mac OS X use it as their native
encoding.
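
If you want to check the size difference yourself, a rough Python sketch
like the following will do. The helper name encoded_sizes and the sample
characters are my own choices, and the byte counts exclude any byte
order mark.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# One sample character from each UTF-8 length class (illustrative picks).
samples = {
    "U+0041 (Latin A)": "\u0041",
    "U+00E9 (e with acute)": "\u00e9",
    "U+65E5 (CJK ideograph)": "\u65e5",
    "U+13000 (hieroglyph)": "\U00013000",
}

for label, ch in samples.items():
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

For plain ASCII, UTF-8 is half the size; for the CJK range, UTF-16 saves
one byte in three; above U+FFFF both need four bytes per character.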

> Doesn't it make sense then to choose a format like UTF-8 where you
> will be safe for the forseeable future? What do you think?

I'd say go with whatever you need and whatever your tools support.
XML as such can use anything: legacy encodings, UTF-8 or UTF-16.
But you want to use it with actual software, so you want to use
whatever your text editor and XML parsers actually handle.

Also remember that for translation work you want to be able to work
with non-technical people, so you want to avoid the need for technical
details in the workflow.

