Re: Multibyte string length
From: Dan Pop (Dan.Pop_at_cern.ch)
Date: 10/13/03
- Next message: Eric Sosman: "Re: How to find out the size of an array?"
- Previous message: Joona I Palaste: "Re: How do I overload functions in C?"
- In reply to: Sheldon Simms: "Re: Multibyte string length"
- Next in thread: Sheldon Simms: "Re: Multibyte string length"
- Reply: Sheldon Simms: "Re: Multibyte string length"
- Reply: Dingo: "Re: Multibyte string length"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 13 Oct 2003 14:18:31 GMT
In <pan.2003.10.12.00.30.01.29779@yahoo.com> Sheldon Simms <sheldonsimms@yahoo.com> writes:
>On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
>name wrote:
>
>> in comp.lang.c i read:
>>
>>>Now if wchar_t is not forced to be able to contain a full character then
>>>again we are stuck at our multibyte (multi-some-unit) character
>>>sequence with all of its inconveniences. This IMHO defeats the whole
>>>purpose of wchar_t.
>>
>> wchar_t is required to have a range that can handle all the code points
>> which can arise from the use of any locale supported by the implementation.
>> c99 takes this further: the implementation can indicate to the programmer
>> if iso-10646 is directly supported (though the encoding is *not* required
>> to be ucs-4)
>
>I guess you're saying the encoding is not required to be ucs-4 because
>the standard doesn't explicitly say so:
>
> 6.10.8.2
> ...
> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
> example, 199712L), intended to indicate that values of type wchar_t
> are the coded representations of the characters defined by ISO/IEC
> 10646, along with all amendments and technical corrigenda as of the
                                                            ^^^^^^^^^
> specified year and month.
  ^^^^^^^^^^^^^^^^^^^^^^^^
>But if the encoding is not ucs-4, then what could it possibly be?
>7.17.2 says
>
> wchar_t which is an integer type whose range of values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales;

Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
from being "the largest extended character set specified among the
supported locales" and, therefore, wchar_t from being defined as char?

>As I read this, it means that implementations implementing ISO 10646
>must have a wchar_t capable of representing over 1 million distinct
>values.

It depends on the actual value of __STDC_ISO_10646__, which could
point to an earlier version of ISO 10646, or not be defined at all,
as in my ASCII example above.
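
A program can check this for itself. The following is a minimal sketch,
not part of the original post; the function names iso10646_version and
report_iso10646 are made up for illustration, and returning 0 for "not
defined" is just a convention chosen here.

```c
#include <stdio.h>

/* Returns the yyyymmL value of __STDC_ISO_10646__ if the implementation
   defines it, or 0 if it does not. The 0 is this sketch's own
   "not defined" marker, not anything the standard specifies. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Prints what the implementation claims about wchar_t and ISO 10646. */
void report_iso10646(void)
{
    long v = iso10646_version();

    if (v == 0L)
        printf("__STDC_ISO_10646__ is not defined\n");
    else
        printf("wchar_t values follow ISO 10646 as of %ld/%02ld\n",
               v / 100, v % 100);
}
```

Calling report_iso10646() from main prints one line. A value that names an
early year and month only promises the repertoire defined at that time.
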
>Given this requirement, ucs-4 seems to be the only reasonable
>encoding to use for ISO 10646 wide character strings.

Only if the implementation chooses to support a recent enough version of
ISO 10646, which the standard allows but doesn't require. The first
incarnation of ISO 10646 specified only 34203 characters, so a 16-bit
wchar_t would be enough for an implementation defining __STDC_ISO_10646__
with a correspondingly early value.
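
Whether a given wchar_t is wide enough for a given repertoire can be
tested against WCHAR_MAX. This is only a sketch added for illustration;
the helper name wchar_can_hold is hypothetical, and it assumes the codes
are numbered 0 .. count-1, as ISO 10646 code positions are.

```c
#include <wchar.h>   /* WCHAR_MAX */

/* Returns 1 if wchar_t can represent `count` distinct non-negative
   codes numbered 0 .. count-1, and 0 otherwise. */
int wchar_can_hold(unsigned long count)
{
    if (count == 0)
        return 1;   /* nothing to represent */
    return (unsigned long long)WCHAR_MAX >= (unsigned long long)count - 1;
}
```

For example, wchar_can_hold(34203UL) is true with a 16-bit wchar_t, while
wchar_can_hold(0x110000UL), the size of the code space in current
versions of ISO 10646, requires a wider type.
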
>Would an implementation that used utf-8 encoding in wide character
>strings composed of 32-bit wchar_t be conforming?

No way. UTF-8 encodings need not fit in a 32-bit wchar_t (they take one
to six octets). They are clearly intended to be used in multibyte
character strings, which are composed of plain chars (e.g. printf's
format string).
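
To illustrate the one-to-six-octet layout and what "length" means for
such a multibyte string, here is a rough sketch that is not part of the
original post. The names utf8_seq_len and utf8_strlen are invented for
this example. It assumes the string is known to be UTF-8 (it does not use
the locale-aware mbrlen/mbstowcs machinery), it does not reject overlong
forms, and it follows the original UTF-8 definition; later revisions of
the format restrict sequences to at most four octets.

```c
#include <stddef.h>

/* Length in octets of a UTF-8 sequence, judged from its first octet.
   Returns 0 for an octet that cannot start a sequence (a continuation
   octet 10xxxxxx, or 0xFE/0xFF). */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    if ((lead & 0xFC) == 0xF8) return 5;   /* 111110xx */
    if ((lead & 0xFE) == 0xFC) return 6;   /* 1111110x */
    return 0;
}

/* Counts characters (not octets) in a NUL-terminated UTF-8 string.
   Returns (size_t)-1 if a sequence has a bad first octet or is cut
   short by a non-continuation octet or the terminating NUL. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        int len = utf8_seq_len((unsigned char)*s);
        int i;

        if (len == 0)
            return (size_t)-1;
        for (i = 1; i < len; i++)
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                return (size_t)-1;
        s += len;
        n++;
    }
    return n;
}
```

The count returned by utf8_strlen can be smaller than strlen's result for
the same string, which is the difference between character length and
byte length that this thread is about.
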
Dan
-- Dan Pop DESY Zeuthen, RZ group Email: Dan.Pop@ifh.de