Re: [OT] Re: wchar_t



"Skarmander" <invalid@xxxxxxxxxxxxxx> wrote in message
news:437f5677$0$11075$e4fe514c@xxxxxxxxxxxxxxxxx

> I'll mark it OT, since we've left C behind quite a bit by now.

Not entirely, since this discussion goes to the very heart of
why we (X3J11) made wchar_t a flexible type, much to the dismay
of the various jingoists who know what the *right* representation
should be. (Hint: they don't all agree.)

>> Here's a coarse scale or two, just from personal experience.
>>
>> -- Number of address bits required to address a "large" memory:
>>
>>   Year   Bits   Machine
>>   1960    15    IBM 7090
>>   1970    20    IBM 360
>>   1980    25    VAX 11/780
>>   1990    30    various
>>   2000    35    various
>>
> Nice, but this misses a point: there is an upper limit.

Okay, what *is* that upper limit? That *was* the point.
Does anybody dare freeze it now?

> Address bits will
> not continue to grow indefinitely, because there is an upper limit to the
> amount of information that will fit in the universe. Or maybe there isn't,
> but then we're talking a radical shift in physics, which may happen but
> doesn't allow for fair comparison anymore.

Good. Now tell me the practical upper limit that we can use
to standardize the all-singing, all-dancing physical address
for now and all future times.

>> -- Number of bits required to represent a (commonly used)
>> character set:
>>
>>   Year   Bits   Character set
>>   1960     6    numerous vendor-specific codes
>>   1970     7    7-bit ASCII
>>   1980     8    extended ASCII
>>   1990    16    DBCS and others
>>   2000    21    Unicode
>>
>> I could make a similar table of "barely adequate" communication
>> speeds, which also continue to expand exponentially.
>>
> But again: it can't go on forever. The question here, therefore, is
> whether we've reached the end of the line, not whether exponential
> expansion is happening.

Yes, that's *exactly* the question I raised.

>> So long as you think in terms of linear increases in demand
>> for bytes or characters, it's easy to believe at each stage
>> that you're through expanding. After all, you currently have
>> a bit of headroom, and what possible need can there be for
>> much larger programs/character sets?
>>
> Don't think this question hasn't been asked,

I indeed *don't* think that. In fact, I believe I said something
quite along those lines.

> unlike those people who
> asserted that "640K ought to be enough for anybody" (which Bill Gates
> famously never said) or "16 bits ought to be enough, since it's better
> than wasting 32 bits". Unicode doesn't say "21 bits ought to be enough for
> anybody". It can say "21 bits is enough for every character known to man",
> because it is. Unlike memory, communication speed and a host of other
> things that keep growing, there is a conceivable upper limit, and it is
> not that unreasonable to state we're close to it.

It may not be unreasonable, but I maintain that, on the basis of
history, it's wildly optimistic. IIRC, SC2/WG2 (the ISO committee
corresponding to the Unicode Consortium) even saw fit to pass
a resolution that UTF-16 will forever more be adequate to express
all expansions of ISO 10646 (the ISO standard corresponding to
Unicode). I consider that either a) a mark of remarkable
self-confidence, or b) whistling in the dark. Take your pick.

>> I personally can't imagine that people will ever want to
>> define common attribute bits for, say:
>>
>> -- roman, italic, bold, underscore
>> -- red, green, blue
>> -- point size
>> -- font
>>
>> But if we did, each attribute bit would double the number
>> of effective character codes, wouldn't it?
>>
>
> That's why Unicode doesn't work that way, and no character set ever has.
> They encode *characters*, not *glyphs*.

I do understand that. Admittedly, the example of one possible
cause for exponential expansion was a lightning rod.

> The point is, effective comparison stops being useful at this point,
> because you've shifted the way you look at what a code point represents.
> As the Unicode FAQ itself states:
>
> "Both Unicode and ISO 10646 have policies in place that formally limit
> future code assignment to the integer range that can be expressed with
> current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other
> UTFs) can represent larger integers, these policies mean that all
> encoding forms will always represent the same set of characters. Over a
> million possible codes is far more than enough for the goal of Unicode of
> encoding characters, not glyphs. Unicode is not designed to encode
> arbitrary data. If you wanted, for example, to give each 'instance of a
> character on paper throughout history' its own code, you might need
> trillions or quadrillions of such codes; noble as this effort might be,
> you would not use Unicode for such an encoding."

So I did recall correctly. The question I raised, however, was whether
Unicode can resist the inevitable pressures to grow beyond its currently
self-imposed barrier of 1,114,112 codes. Again IIRC, the Unicode
Consortium parted company with SC2/WG2 years ago because the former
body was convinced that 65,536 codes would be enough and the latter
was intent on leaving room for 2^31. Microsoft and Sun backed that
play, with Windows and Java (among other products), and now they
have to wrestle with the inconvenience of UTF-16. BTW, I haven't
noticed anybody in the Unicode camp blushing at their earlier
hubris.
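
For readers who want to see where the 1,114,112 figure comes from, here is
a minimal sketch of the UTF-16 encoding arithmetic. The function name
utf16_encode and its interface are hypothetical, chosen only for
illustration; the constants (0xD800, 0xDC00, 0x10FFFF) are the ones the
encoding form itself fixes.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint_least16_t, uint_least32_t */

/* Code points 0..0xFFFF take one 16-bit unit (that count includes the
   2,048 surrogate values); a high/low surrogate pair carries 10 + 10
   bits, i.e. 0x400 * 0x400 further code points. */
static_assert(0x10000UL + 0x400UL * 0x400UL == 1114112UL,
              "UTF-16 code space is 0 .. 0x10FFFF");

/* Hypothetical helper: encode one code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2),
   or 0 if cp cannot be expressed in UTF-16 at all. */
size_t utf16_encode(uint_least32_t cp, uint_least16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (uint_least16_t)cp;   /* single unit */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                      /* past the UTF-16 ceiling */
    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint_least16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint_least16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

The second "return 0" is the barrier under discussion: a code point of
0x110000 or above has no UTF-16 representation at all.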

>> Nor can I imagine that a large government like China might
>> thumb its nose at an international standard and, say,
>> require a parallel set of many ISO 10646 codes.
>>
> It already thumbs its nose to some extent. Unicode is still viewed with
> great suspicion in some parts of the Eastern world, and alternate
> character sets continue to be in use. But the Chinese government can
> require of ISO 10646 what it wants; it's not likely to get it if it can't
> be supported by technical requirements, as opposed to politics.

Oh, my, I think you really believe that. When "politics" is backed
by the odd billion dollars worth of contracts, you'd be surprised
what it can get.

>> For over 40 years I've been reading regular articles by
>> pundits who explain why larger/faster hardware is a waste
>> of time and will never sell. They've all been wrong. And
>> the further back in time you look, the greater the redshift
>> in the predictions.
>>
> These arguments do not cleanly translate to character sets, your little
> tables notwithstanding. The upper limit may not be 21 bits, but if that's
> not the upper limit, it's pretty close to it in orders of magnitude.

Okay. My "argument" was that 21 bits will not long prove to be enough.
Just one order of magnitude will be enough to blow UTF-16 to kingdom
come. And that was my point.
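
The "one order of magnitude" arithmetic can be checked with a short
sketch. The helper name bits_for_count is hypothetical; it simply finds
the smallest number of bits whose range covers a given count of codes.

```c
/* Hypothetical helper: smallest b such that 2^b >= count.
   Stops at 63 bits so the shift below cannot overflow. */
unsigned bits_for_count(unsigned long long count)
{
    unsigned bits = 0;
    unsigned long long capacity = 1;   /* 2^bits */

    while (capacity < count && bits < 63) {
        capacity <<= 1;
        bits++;
    }
    return bits;
}
```

With this, bits_for_count(1114112ULL) is 21, while ten times that many
codes, bits_for_count(11141120ULL), needs 24 bits, which is well beyond
what a UTF-16 surrogate pair can address.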

>> So, you may well be right that the need for larger
>> character sets has finally come to an end. I'll wait
>> and see. Meanwhile, I make sure that the code I write
>> will work with 32- (not 31-) bit character sets. With
>> any luck, the code will have adequate capacity until
>> I retire...
>>
> Fortunately for you, writing code that can handle both 21-bit and 32-bit
> character sets is hardly a challenge, given the current state of computer
> hardware. Even if Unicode had to grow someday (which would have to mean a
> new standard, of course), it wouldn't exactly be hard to implement, at
> least not as far as code point size is concerned.

Also my point. Having just survived several years of UTF-16
jingoism, however, I expect to be ungracious if Unicode does
indeed have to issue a new standard that leaves UTF-16 in the
same rest home as UCS-2. I also hope to remain intellectually
honest enough to issue a mea culpa in five years if I prove
to be wrong.
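
As a rough sketch of what code that does not bake in the 21-bit limit
might look like, assuming only standard C99 headers: the type name
codepoint, the macro UNICODE_MAX, and both function names below are
hypothetical, not taken from any library.

```c
#include <stdint.h>   /* uint_least32_t, WCHAR_MAX */
#include <wchar.h>    /* wchar_t */

/* Hypothetical: an unsigned type with at least 32 value bits, so the
   full 32-bit (not 31-bit) range is usable without a sign bit. */
typedef uint_least32_t codepoint;

/* Today's policy limit, kept in one place so it is easy to raise. */
#define UNICODE_MAX 0x10FFFFUL

/* Nonzero if c is within the range UTF-16 can currently express. */
int fits_in_utf16(codepoint c)
{
    return c <= UNICODE_MAX;
}

/* Nonzero if c can be stored in this implementation's wchar_t,
   whose width the C standard deliberately leaves flexible. */
int fits_in_wchar(codepoint c)
{
    return (unsigned long long)c <= (unsigned long long)WCHAR_MAX;
}
```

Storage and comparison use the wider type throughout; only the two
range checks know about any particular ceiling.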

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

