Re: Multibyte string length

From: Micah Cowan (micah_at_cowan.name)
Date: 10/12/03


Date: 12 Oct 2003 13:29:25 -0700

Sheldon Simms <sheldonsimms@yahoo.com> writes:

> On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
> name wrote:
>
> > in comp.lang.c i read:
> >
> >>Now if wchar_t is not forced to be able to contain a full character then
> >>again we are stuck at our multibyte (multi-some-unit) character
> >>sequence with all of its inconveniences. This IMHO defeats the whole
> >>purpose of wchar_t.
> >
> > wchar_t is required to have a range that can handle all the code points
> > which can arise from the use of any locale supported by the implementation.
> > c99 takes this further: the implementation can indicate to the programmer
> > if iso-10646 is directly supported (though the encoding is *not* required
> > to be ucs-4)
>
> I guess you're saying the encoding is not required to be ucs-4 because
> the standard doesn't explicitly say so:
>
> 6.10.8.2
> ...
> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
> example, 199712L), intended to indicate that values of type wchar_t
> are the coded representations of the characters defined by ISO/IEC
> 10646, along with all amendments and technical corrigenda as of the
> specified year and month.
>
> But if the encoding is not ucs-4, then what could it possibly be?
> 7.17.2 says
>
> wchar_t which is an integer type whose range of values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales;
>
> As I read this, it means that implementations supporting ISO 10646
> must have a wchar_t capable of representing over 1 million distinct
> values. Given this requirement, ucs-4 seems to be the only reasonable
> encoding to use for ISO 10646 wide character strings.

No; the ISO 10646 and Unicode standards are 16-bit
encodings. Some 16-bit codes work together (high/low surrogates)
to produce the effect of a "single" character from two encoded
characters; however, that does not change the fact that the
standards themselves claim to present 16-bit encodings. (Actually,
for ISO 10646 I'm making some assumptions, as I've read only
Unicode and not ISO 10646 itself.) Not only that, but while support
is in place for character codes 0x10000 and above, no character
codes have actually been defined for these values, and so
UCS-2/UTF-16 can safely be used to encode "all members of the
largest extended character set".
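
To make the surrogate mechanism mentioned above concrete, here is a
minimal sketch in C of how a high/low pair of 16-bit units combines
into one code value at or above 0x10000. The function names are
hypothetical, made up for illustration, and not part of any standard
library; the arithmetic is the usual UTF-16 pairing rule.

```c
#include <stdint.h>

/* Hypothetical helper: 1 if unit is a high (leading) surrogate,
   i.e. in the range 0xD800..0xDBFF. */
static int is_high_surrogate(uint16_t unit)
{
    return unit >= 0xD800u && unit <= 0xDBFFu;
}

/* Hypothetical helper: 1 if unit is a low (trailing) surrogate,
   i.e. in the range 0xDC00..0xDFFF. */
static int is_low_surrogate(uint16_t unit)
{
    return unit >= 0xDC00u && unit <= 0xDFFFu;
}

/* Combine a high/low surrogate pair into a single code value.
   Each unit contributes 10 bits; the result is offset by 0x10000.
   Returns UINT32_MAX if the two units do not form a valid pair. */
static uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    if (!is_high_surrogate(high) || !is_low_surrogate(low))
        return UINT32_MAX;
    return 0x10000u + (((uint32_t)(high - 0xD800u) << 10)
                       | (uint32_t)(low - 0xDC00u));
}
```

For example, combine_surrogates(0xD800, 0xDC00) gives 0x10000, the
first value that cannot be stored in a single 16-bit unit.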

> Would an implementation that used utf-8 encoding in wide character
> strings composed of 32-bit wchar_t be conforming?

I don't think so, no.
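
For anyone who wants to see what their own implementation claims, a
rough sketch along these lines reports the value of
__STDC_ISO_10646__ (if defined) and whether a given code value fits
in wchar_t. The helper names are hypothetical; only the macro and
WCHAR_MAX come from the standard, and the sketch assumes WCHAR_MAX
is smaller than the largest unsigned long long value.

```c
#include <wchar.h>   /* WCHAR_MAX */

/* Hypothetical helper: 1 if code_point is no larger than the
   maximum value of wchar_t on this implementation. */
static int wchar_can_hold(unsigned long long code_point)
{
    return code_point <= (unsigned long long)WCHAR_MAX;
}

/* Hypothetical helper: the yyyymmL value of __STDC_ISO_10646__,
   or 0 if the implementation does not define the macro. */
static long iso_10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

With a 16-bit wchar_t, wchar_can_hold(0x10000) is false; with a
32-bit wchar_t it is true.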

-Micah


