Re: Multibyte string length
From: Micah Cowan (micah_at_cowan.name)
Date: 10/12/03
- Next message: Micah Cowan: "Re: How to find out the size of an array?"
- Previous message: Micah Cowan: "Re: why does this work ?"
- In reply to: Sheldon Simms: "Re: Multibyte string length"
- Next in thread: Sheldon Simms: "Re: Multibyte string length"
- Reply: Sheldon Simms: "Re: Multibyte string length"
Date: 12 Oct 2003 13:29:25 -0700
Sheldon Simms <sheldonsimms@yahoo.com> writes:
> On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
> name wrote:
>
> > in comp.lang.c i read:
> >
> >>Now if wchar_t is not forced to be able to contain a full character then
> >>again we are stuck at our multibyte (multi-some-unit) character
> >>sequence with all of its inconveniences. This IMHO defeats the whole
> >>purpose of wchar_t.
> >
> > wchar_t is required to have a range that can handle all the code points
> > which can arise from the use of any locale supported by the implementation.
> > c99 takes this further: the implementation can indicate to the programmer
> > if iso-10646 is directly supported (though the encoding is *not* required
> > to be ucs-4)
>
> I guess you're saying the encoding is not required to be ucs-4 because
> the standard doesn't explicitly say so:
>
> 6.10.8.2
> ...
> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
> example, 199712L), intended to indicate that values of type wchar_t
> are the coded representations of the characters defined by ISO/IEC
> 10646, along with all amendments and technical corrigenda as of the
> specified year and month.
>
> But if the encoding is not ucs-4, then what could it possibly be?
> 7.17.2 says
>
> wchar_t which is an integer type whose range of values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales;
>
> As I read this, it means that implementations supporting ISO 10646
> must have a wchar_t capable of representing over 1 million distinct
> values. Given this requirement, ucs-4 seems to be the only reasonable
> encoding to use for ISO 10646 wide character strings.
No; the ISO 10646 and Unicode standards are 16-bit
encodings. Some 16-bit codes work together (high/low surrogates)
to produce the effect of a "single" character from two encoded
characters; however, that does not change the fact that the
standards themselves claim to present 16-bit encodings. (Actually,
for ISO 10646 I'm making some assumptions, as I've not read it;
I've only read Unicode.) Not only that, while support is in place for
character codes 0x10000 and above, no character codes have
actually been defined for these values, and so UCS-2/UTF-16 can
safely be used to encode "all members of the largest extended
character set".
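To make the surrogate mechanism concrete, here is a minimal sketch of how
a high/low pair of 16-bit units maps to one code point at or above 0x10000,
and of how a length count would treat such a pair as a single character.
The arithmetic follows the UTF-16 definition; the function names
(is_high_surrogate, is_low_surrogate, combine_surrogates,
utf16_count_code_points) are hypothetical and chosen only for illustration,
and the use of uint16_t rather than wchar_t is an assumption made to keep
the example independent of any particular implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* A high (leading) surrogate lies in 0xD800..0xDBFF. */
int is_high_surrogate(uint16_t unit)
{
    return unit >= 0xD800u && unit <= 0xDBFFu;
}

/* A low (trailing) surrogate lies in 0xDC00..0xDFFF. */
int is_low_surrogate(uint16_t unit)
{
    return unit >= 0xDC00u && unit <= 0xDFFFu;
}

/* Combine a high/low surrogate pair into one code point.
   The caller is assumed to have checked both units first. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u
         + ((uint32_t)(high - 0xD800u) << 10)
         + (uint32_t)(low - 0xDC00u);
}

/* Count code points in an array of n UTF-16 code units.
   A well-formed surrogate pair counts once; an unpaired
   surrogate is counted as one unit (a simplification). */
size_t utf16_count_code_points(const uint16_t *units, size_t n)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n) {
        if (is_high_surrogate(units[i]) && i + 1 < n
            && is_low_surrogate(units[i + 1]))
            i += 2;   /* one character spread over two units */
        else
            i += 1;   /* ordinary 16-bit unit, or a stray surrogate */
        count++;
    }
    return count;
}
```

With this sketch, the lowest pair (0xD800, 0xDC00) combines to 0x10000 and
the highest pair (0xDBFF, 0xDFFF) to 0x10FFFF, so a count of characters can
differ from a count of 16-bit units whenever surrogates appear.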
> Would an implementation that used utf-8 encoding in wide character
> strings composed of 32-bit wchar_t be conforming?
I don't think so, no.
-Micah