Re: Code review: UTF-8
From: Paul Hsieh (qed_at_pobox.com)
Date: 03/06/04
- In reply to: Arthur J. O'Dwyer: "Code review: UTF-8"
Date: 5 Mar 2004 16:23:03 -0800
"Arthur J. O'Dwyer" <ajo@nospam.andrew.cmu.edu> wrote:
> I'm currently working on stuff involving Unicode encodings.
> I would appreciate it if anyone could look at this C program
>
> http://www.contrib.andrew.cmu.edu/~ajo/utf8latex/unitrans.c
- A Unicode code point is only defined up to the maximum value of
0x10FFFF. Moreover, the range 0xD800 - 0xDFFF is not valid, and
neither are the specific values 0xFFFE and 0xFFFF. You need to
range check your decoding result, and if it is not a valid
Unicode value you are supposed to read it as the "decoding error"
code point.
(Notably, the maximum number of bytes for a UTF-8 code point
encoding is 4 octets, not 6.)
Your encoders should also check these value ranges and emit the
"decoding error" code point, or take some other action upon
detection of this error.
- There are numerous ways in which a Unicode code point can be encoded
by the "UTF-8" transformation. However, the unique shortest encoding
is the only one that's legal. In the decoder you are supposed to
detect an unnecessarily long redundant encoding and render it as
the "decoding error" code point. For example, C1 BF is a false
encoding of 7F.
With this enforced encoding regime, comparisons can be done on
either the UTF-8 encoding or the raw Unicode code point values with
the equivalent result.
- The BOM meta character (FEFF) is meaningful only for encodings
whose code units span more than one byte (UTF-16, UCS2, UTF-32,
UCS4), and itself has no meaning as a character in UTF-8 streams.
What this means is that your UTF-8 encoder should not emit a BOM
character when it encounters one, and your decoder should ignore
any it encounters.
When starting to emit a UTF-16 (or UCS2 or UCS4) sequence you should
encode the BOM, but then ignore subsequent occurrences in the source
data. The BOM character is supposed to characterize the encoding,
and should not be considered actual raw data.
Reading the first 4 bytes of a file can help you discriminate
between encodings:
    {=FE, =FF, ???, ???} -> UTF-16/UCS2
    {=FF, =FE, =00, =00} -> reversed endian UTF-16/UCS2 or UTF-32/UCS4
    {=FF, =FE, ???, ???} -> reversed endian UTF-16/UCS2
    {=00, =00, =FE, =FF} -> UTF-32/UCS4
    {<F8, <F8, <F8, <F8} -> UTF-8 or ASCII
As you can see, there remains some ambiguity between the 32-bit and
16-bit formats: FF FE 00 00 is either a reversed endian 32-bit BOM
or a reversed endian 16-bit BOM followed by a NUL character.
Besides the fact that the 32-bit formats are necessarily bulky,
this is a good reason to avoid 32-bit encoded formats altogether.
I.e., 32-bit formats should only exist as an internal format, not
something you expose in a stored file.
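
To make the table above concrete, here is a minimal sketch in C of
sniffing the first four bytes. The enum labels and the function name
sniff_encoding are made up for illustration (they are not from
unitrans.c), and the result is only a guess, since one row is
inherently ambiguous.

```c
#include <stddef.h>

/* Hypothetical labels, one per row of the table above. */
enum bom_guess {
    GUESS_UTF16_OR_UCS2,           /* FE FF ?? ?? */
    GUESS_REVERSED_UTF16_OR_UTF32, /* FF FE 00 00 (ambiguous) */
    GUESS_REVERSED_UTF16_OR_UCS2,  /* FF FE ?? ?? */
    GUESS_UTF32_OR_UCS4,           /* 00 00 FE FF */
    GUESS_UTF8_OR_ASCII,           /* all four bytes below F8 */
    GUESS_UNKNOWN                  /* too short, or no row matches */
};

/* Classify the first four bytes of a buffer according to the table.
 * p points at the start of the data, n is the number of bytes available. */
enum bom_guess sniff_encoding(const unsigned char *p, size_t n)
{
    if (n < 4)
        return GUESS_UNKNOWN;

    if (p[0] == 0xFE && p[1] == 0xFF)
        return GUESS_UTF16_OR_UCS2;

    if (p[0] == 0xFF && p[1] == 0xFE) {
        /* The two trailing zero bytes could be the rest of a 32-bit
         * BOM or a 16-bit NUL character, so this case stays ambiguous. */
        if (p[2] == 0x00 && p[3] == 0x00)
            return GUESS_REVERSED_UTF16_OR_UTF32;
        return GUESS_REVERSED_UTF16_OR_UCS2;
    }

    if (p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return GUESS_UTF32_OR_UCS4;

    if (p[0] < 0xF8 && p[1] < 0xF8 && p[2] < 0xF8 && p[3] < 0xF8)
        return GUESS_UTF8_OR_ASCII;

    return GUESS_UNKNOWN;
}
```

A caller would read the first four bytes of the file into a buffer,
pass them to sniff_encoding, and treat the ambiguous result as a cue
to inspect more of the data.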
The documentation of Unicode is a little confused, by virtue of it
changing (quite rapidly, as far as international standards go) a
number of times. For example, the older UCS4 standard said that it
could legally encode values up to 0x7FFFFFFF, and that only UTF-32
checked the legal ranges. With the latest Unicode standard, the old
UCS4 encoding has been obsoleted, and now UCS4 adopts the UTF-32 legal
range requirements so that it is essentially equivalent to UTF-32. So
the range check actually *is* required in your UCS4 code.
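
As a rough illustration of the range check, and of the
shortest-encoding rule from the points above, here is a sketch of a
checked UTF-8 decoder in C. The names is_valid_code_point,
utf8_decode and DECODE_ERROR are hypothetical, and using U+FFFD as
the "decoding error" code point is my assumption; substitute
whatever your program uses.

```c
#include <stddef.h>

/* Assumption: U+FFFD stands in for the "decoding error" code point. */
#define DECODE_ERROR 0xFFFDul

/* Range check: at most 0x10FFFF, not in 0xD800-0xDFFF, and not
 * 0xFFFE or 0xFFFF. The same check applies to UCS4 input. */
int is_valid_code_point(unsigned long cp)
{
    if (cp > 0x10FFFFul) return 0;
    if (cp >= 0xD800ul && cp <= 0xDFFFul) return 0;
    if (cp == 0xFFFEul || cp == 0xFFFFul) return 0;
    return 1;
}

/* Decode one code point from s[0..n-1].
 * *used receives the number of bytes consumed: 0 if n is 0, 1 for a
 * bad lead byte, truncated or malformed sequence, otherwise the
 * sequence length. Returns the code point or DECODE_ERROR. */
unsigned long utf8_decode(const unsigned char *s, size_t n, size_t *used)
{
    /* Smallest code point that legitimately needs each length. */
    static const unsigned long min_for_len[5] =
        { 0, 0, 0x80ul, 0x800ul, 0x10000ul };
    unsigned long cp;
    size_t len, i;

    *used = (n > 0) ? 1 : 0;
    if (n == 0) return DECODE_ERROR;

    if (s[0] < 0x80) return s[0];                       /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return DECODE_ERROR;  /* stray continuation or 5/6-byte lead */

    if (n < len) return DECODE_ERROR;                   /* truncated */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return DECODE_ERROR; /* not 10xxxxxx */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *used = len;

    if (cp < min_for_len[len]) return DECODE_ERROR;     /* overlong */
    if (!is_valid_code_point(cp)) return DECODE_ERROR;  /* out of range */
    return cp;
}
```

With this sketch, the two bytes C1 BF decode to DECODE_ERROR rather
than 7F, and a 4-byte sequence that works out to a value above
0x10FFFF is rejected by the same range check an encoder would use.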
You should also realize that UCS2 and UTF-16 are not the same thing.
UTF-16 uses the 0xD800-0xDFFF range (the surrogates, which are used in
pairs) as escape characters that allow it to encode the upper Unicode
code point range (from 0x10000 to 0x10FFFF). However, to be correct,
UCS2 must disallow this range (and perform the other legal range
checks) as well. In this way, UCS2 is just a subset of UTF-16. The
problem with the 16-bit encodings is that when eyeballing mostly
western text, you can't be sure of whether it's encoded in UCS2 or
UTF-16.
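
To show the difference in code, here is a small sketch of a UTF-16
encoder in C, with UCS2 treated as the subset that fits in a single
16-bit unit. The function names utf16_encode and ucs2_can_encode are
made up for illustration, and the range checks follow the ones
described earlier in this post.

```c
#include <stddef.h>

/* Encode cp as UTF-16 code units into out[0..1].
 * Returns the number of units written (1 or 2), or 0 if cp fails
 * the legal range checks (nothing is written in that case). */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp > 0x10FFFFul) return 0;
    if (cp >= 0xD800ul && cp <= 0xDFFFul) return 0; /* surrogate range */
    if (cp == 0xFFFEul || cp == 0xFFFFul) return 0;

    if (cp < 0x10000ul) {
        out[0] = (unsigned short)cp;                /* single unit */
        return 1;
    }

    /* Upper range: split the 20 bits of (cp - 0x10000) across a
     * high surrogate (D800-DBFF) and a low surrogate (DC00-DFFF). */
    cp -= 0x10000ul;
    out[0] = (unsigned short)(0xD800ul + (cp >> 10));
    out[1] = (unsigned short)(0xDC00ul + (cp & 0x3FFul));
    return 2;
}

/* UCS2 can represent exactly the code points that UTF-16 encodes
 * in one unit; anything needing a surrogate pair is out of reach. */
int ucs2_can_encode(unsigned long cp)
{
    unsigned short tmp[2];
    return utf16_encode(cp, tmp) == 1;
}
```

For text that stays below 0x10000 the two encodings produce the same
units, which is exactly why mostly western text gives no hint as to
which one was intended.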
I have more discussion of this here:
http://www.azillionmonkeys.com/qed/unicode.html
-- Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/