Re: Multibyte string length

From: Dan Pop (Dan.Pop_at_cern.ch)
Date: 10/13/03


Date: 13 Oct 2003 14:18:31 GMT

In <pan.2003.10.12.00.30.01.29779@yahoo.com> Sheldon Simms <sheldonsimms@yahoo.com> writes:

>On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
>name wrote:
>
>> in comp.lang.c i read:
>>
>>>Now if wchar_t is not forced to be able to contain a full character then
>>>again we are stuck at our multibyte (multi-some-unit) character
>>>sequence with all of its inconveniences. This IMHO defeats the whole
>>>purpose of wchar_t.
>>
>> wchar_t is required to have a range that can handle all the code points
>> which can arise from the use of any locale supported by the implementation.
>> c99 takes this further: the implementation can indicate to the programmer
>> if iso-10646 is directly supported (though the encoding is *not* required
>> to be ucs-4)
>
>I guess you're saying the encoding is not required to be ucs-4 because
>the standard doesn't explicitly say so:
>
> 6.10.8.2
> ...
> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
> example, 199712L), intended to indicate that values of type wchar_t
> are the coded representations of the characters defined by ISO/IEC
> 10646, along with all amendments and technical corrigenda as of the
                                                            ^^^^^^^^^
> specified year and month.
  ^^^^^^^^^^^^^^^^^^^^^^^^
>But if the encoding is not ucs-4, then what could it possibly be?
>7.17.2 says
>
> wchar_t which is an integer type whose range of values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales;

Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
from being "the largest extended character set specified among the
supported locales" and, therefore, wchar_t from being defined as char?

>As I read this, it means that implementations implementing ISO 10646
>must have a wchar_t capable of representing over 1 million distinct
>values.

It depends on the actual value of __STDC_ISO_10646__, which could
point to an earlier version of ISO 10646, or not be defined at all,
as in my ASCII example above.

>Given this requirement, ucs-4 seems to be the only reasonable
>encoding to use for ISO 10646 wide character strings.

Only if the implementation chooses to support a recent enough version of
ISO 10646, which the standard allows but doesn't require. The first
incarnation of ISO 10646 only specified 34203 characters, so a 16-bit
wchar_t would be enough for an implementation defining __STDC_ISO_10646__
with a date that corresponds to that early version.
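
As a rough illustration (a sketch only: the helper names iso10646_version
and wchar_can_hold are made up for this example and come from no standard
header), a program can ask the implementation what it actually promises
instead of assuming ucs-4:

```c
#include <wchar.h>   /* WCHAR_MAX */

/* Returns the value of __STDC_ISO_10646__ (of the form yyyymmL), or 0
   if the implementation does not define the macro, in which case
   nothing is promised about the encoding of wchar_t. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Nonzero if the value 'code' lies within the positive range of
   wchar_t.  This checks range only; it says nothing about encoding. */
int wchar_can_hold(unsigned long code)
{
    return code <= (unsigned long)WCHAR_MAX;
}
```

With a 16-bit wchar_t, wchar_can_hold(0xFFFFUL) may be true while
wchar_can_hold(0x10FFFFUL) is false, which is acceptable as long as
the date reported by iso10646_version() names a version of ISO 10646
whose characters all fit in 16 bits.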

>Would an implementation that used utf-8 encoding in wide character
>strings composed of 32-bit wchar_t be conforming?

No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
to six octets). They are clearly intended to be used in multibyte
character strings, which are composed of plain char's (e.g. printf's
format string).
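
Since the subject of the thread is the length of a multibyte string,
here is a minimal sketch of counting characters rather than bytes. The
function name mb_char_count is invented for this example, and the result
depends on the LC_CTYPE category of the current locale:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Counts the characters (not the bytes) of the multibyte string s, as
   interpreted by the current LC_CTYPE locale.  Returns (size_t)-1 if s
   contains a sequence that is invalid in that locale. */
size_t mb_char_count(const char *s)
{
    mbstate_t state;

    memset(&state, 0, sizeof state);   /* start in the initial state */

    /* With a null destination, mbsrtowcs converts nothing and only
       reports how many wide characters the conversion would need,
       not counting the terminating null. */
    return mbsrtowcs(NULL, &s, 0, &state);
}
```

In the "C" locale this gives the same answer as strlen for plain ASCII
text. To have utf-8 input counted per character, the program first has
to select a suitable locale with setlocale, and the names of such
locales are implementation-specific.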

Dan

-- 
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de

