Re: Code review: UTF-8

From: Paul Hsieh (qed_at_pobox.com)
Date: 03/06/04


Date: 5 Mar 2004 16:23:03 -0800


"Arthur J. O'Dwyer" <ajo@nospam.andrew.cmu.edu> wrote:
> I'm currently working on stuff involving Unicode encodings.
> I would appreciate it if anyone could look at this C program
>
> http://www.contrib.andrew.cmu.edu/~ajo/utf8latex/unitrans.c

- A Unicode code point is only defined up to the maximum value of
  0x10FFFF. Moreover, the range 0xD800 - 0xDFFF is not valid, and
  neither are the specific values 0xFFFE and 0xFFFF. You need to
  range check your decoding result, and if it is not a valid
  Unicode value you are supposed to read it as the "decoding error"
  code point.

  (Notably, the maximum number of bytes for a UTF-8 code point
  encoding is 4 octets, not 6.)

  Your encoders should also check these value ranges and emit the
  "decoding error" code point, or take some other action upon
  detection of this error.

- There are numerous ways in which a Unicode code point can be encoded
  by the "UTF-8" transformation. However, the unique shortest encoding
  is the only one that's legal. In the decoder you are supposed to
  detect an unnecessarily long redundant encoding and render it as
  the "decoding error" code point. For example, C1 BF is a false
  encoding of 7F.

  With this enforced encoding regime, comparisons can be done on
  either the UTF-8 encoding or the raw Unicode code point values with
  the equivalent result.

- The BOM meta character (FEFF) is meant specifically for the
  encodings built from multi-byte units (UTF-16, UCS2, UTF-32,
  UCS4), and itself has no meaning as a character in UTF-8 streams.
  What this means is that your UTF-8 encoder should not emit a BOM
  character if it encounters one, and your decoder should ignore
  any that it encounters.

  When starting to emit a UTF-16 (or UCS2 or UCS4) sequence you should
  encode the BOM, but then ignore subsequent occurrences in the source
  data. The BOM character is supposed to characterize the encoding,
  and should not be considered actual raw data.

  Reading the first 4 bytes of a file can help you discriminate
  between encodings:

    {=FE, =FF, ???, ???} -> UTF-16/UCS2
    {=FF, =FE, =00, =00} -> reversed endian UTF-16/UCS2 or UTF-32/UCS4
    {=FF, =FE, ???, ???} -> reversed endian UTF-16/UCS2
    {=00, =00, =FE, =FF} -> UTF-32/UCS4
    {<F8, <F8, <F8, <F8} -> UTF-8 or ASCII

  As you can see, there remains some confusion between the 32-bit and
  16-bit formats. Besides the fact that the 32-bit formats are
  necessarily bulky, this is a good reason to avoid 32-bit encoded
  formats in general. I.e., 32-bit formats should only exist as an
  internal format, not something you expose in a stored file.
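
The byte table above could be turned into code along these lines.
This is only a sketch: the enum and the function name sniff_encoding
are made up for illustration (they are not from unitrans.c), and the
result is a guess in the same sense as the table, not a guarantee.

```c
#include <stddef.h>

/* Hypothetical names, one per row of the table. */
enum enc_guess {
    ENC_UTF16_BE,             /* FE FF ?? ?? */
    ENC_UTF16_LE_OR_UTF32_LE, /* FF FE 00 00 (ambiguous) */
    ENC_UTF16_LE,             /* FF FE ?? ?? */
    ENC_UTF32_BE,             /* 00 00 FE FF */
    ENC_UTF8_OR_ASCII,        /* all four bytes below F8 */
    ENC_UNKNOWN
};

/* Guess the encoding from the first n bytes (n is normally 4). */
enum enc_guess sniff_encoding(const unsigned char *p, size_t n)
{
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return ENC_UTF16_BE;
    /* Must be tested before the plain FF FE case below. */
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE &&
        p[2] == 0x00 && p[3] == 0x00)
        return ENC_UTF16_LE_OR_UTF32_LE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return ENC_UTF16_LE;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 &&
        p[2] == 0xFE && p[3] == 0xFF)
        return ENC_UTF32_BE;
    if (n >= 4 && p[0] < 0xF8 && p[1] < 0xF8 &&
        p[2] < 0xF8 && p[3] < 0xF8)
        return ENC_UTF8_OR_ASCII;
    return ENC_UNKNOWN;
}
```

The order of the tests matters: FF FE 00 00 has to be checked before
the shorter FF FE pattern, otherwise the ambiguous case is silently
reported as 16-bit.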

The documentation of Unicode is a little confused, by virtue of it
changing (quite rapidly, as far as international standards go) a
number of times. For example, the older UCS4 standard said that it
could legally encode values up to 0x7FFFFFFF, and that only UTF-32
checked the legal ranges. With the latest Unicode standard, the old
UCS4 encoding has been obsoleted, and now UCS4 adopts the UTF-32 legal
range requirements so that it is essentially equivalent to UTF-32. So
the range check actually *is* required in your UCS4 code.
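
For what it's worth, here is roughly how the range check, together
with the shortest-encoding rule from the list above, might look in C.
It is a sketch under assumptions, not a drop-in for unitrans.c: the
names is_valid_code_point and utf8_decode are hypothetical, and I am
taking U+FFFD (REPLACEMENT CHARACTER) as the "decoding error" code
point. How many bytes to skip after an error is a policy choice; this
version is just one possibility.

```c
#include <stddef.h>

#define REPLACEMENT_CHAR 0xFFFDul /* assumed "decoding error" value */

/* Nonzero if cp passes the range checks described in this post. */
int is_valid_code_point(unsigned long cp)
{
    if (cp > 0x10FFFFul) return 0;                  /* past the maximum */
    if (cp >= 0xD800ul && cp <= 0xDFFFul) return 0; /* surrogate range */
    if (cp == 0xFFFEul || cp == 0xFFFFul) return 0; /* excluded values */
    return 1;
}

/*
 * Decode one code point from s[0..n-1].  Stores the code point, or
 * REPLACEMENT_CHAR on any error, in *out and returns the number of
 * bytes consumed (0 only when n is 0).
 */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *out)
{
    /* Smallest value that legitimately needs a sequence of each length. */
    static const unsigned long min_for_len[5] =
        { 0, 0, 0x80ul, 0x800ul, 0x10000ul };
    unsigned long cp;
    size_t len, i;

    *out = REPLACEMENT_CHAR;
    if (n == 0) return 0;

    if (s[0] < 0x80) { *out = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 1;             /* stray continuation byte, or F8..FF */

    if (n < len) return 1;     /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 1; /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }

    if (cp < min_for_len[len]) return len;   /* overlong, e.g. C1 BF */
    if (!is_valid_code_point(cp)) return len;

    *out = cp;
    return len;
}
```

The same is_valid_code_point test can be applied unchanged to values
read from a UCS4 or UTF-32 stream, and called on the encoder side
before anything is written out.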

You should also realize that UCS2 and UTF-16 are not the same thing.
UTF-16 uses the 0xD800-0xDFFF range (called surrogates, always used in
pairs) as escape characters that allow it to encode the upper Unicode
code point range (from 0x10000 to 0x10FFFF). However, to be correct,
UCS2 must disallow this range (and perform the other legal range
checks) as well. In this way, UCS2 is just a subset of UTF-16. The
problem with the 16-bit encodings is that when eyeballing mostly
western text, you can't be sure of whether it's encoded in UCS2 or
UTF-16.
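
To make the difference concrete, here is a sketch of the two encoders
side by side. The function names are hypothetical, unsigned short is
assumed to hold at least 16 bits (which C guarantees), and the
rejection of 0xFFFE and 0xFFFF simply follows the range rules given
earlier in this post.

```c
#include <stddef.h>

/* Range rules shared by both encoders (as stated in this post). */
static int in_legal_range(unsigned long cp)
{
    return cp <= 0x10FFFFul &&
           !(cp >= 0xD800ul && cp <= 0xDFFFul) &&
           cp != 0xFFFEul && cp != 0xFFFFul;
}

/*
 * UTF-16: returns the number of 16-bit units written to out (1 or 2),
 * or 0 if cp is not a legal code point.
 */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (!in_legal_range(cp)) return 0;
    if (cp < 0x10000ul) {
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000ul;                     /* 20 bits remain */
    out[0] = (unsigned short)(0xD800ul | (cp >> 10));     /* high 10 bits */
    out[1] = (unsigned short)(0xDC00ul | (cp & 0x3FFul)); /* low 10 bits */
    return 2;
}

/*
 * UCS2: exactly one 16-bit unit and no surrogates, so anything at or
 * above 0x10000 is simply unrepresentable.  Returns 1 on success, 0
 * on failure.
 */
int ucs2_encode(unsigned long cp, unsigned short *out)
{
    if (!in_legal_range(cp) || cp > 0xFFFFul) return 0;
    *out = (unsigned short)cp;
    return 1;
}
```

For any code point below 0x10000 the two functions produce the same
single unit, which is exactly why mostly western text gives you no
clue as to which of the two encodings you are looking at.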

I have more discussion of this here:

    http://www.azillionmonkeys.com/qed/unicode.html

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

