Re: What kind of unicode?



> 1 - Enable Unicode only if you check a box. UTF-8 encoding

This is what we use here now. A side benefit of UTF-8 over UTF-16 is that developers are aware that a single Unicode character can span several AnsiChars, while many forget that the same is true of WideChar.
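To make the point concrete, here is a small sketch (in Python rather than Delphi, purely for illustration; the function name code_units is made up) that counts how many 8-bit and 16-bit code units a single code point needs:

```python
def code_units(ch):
    """Return (UTF-8 byte count, UTF-16 16-bit unit count) for a one-code-point string."""
    utf8_units = len(ch.encode("utf-8"))
    # utf-16-le avoids the byte-order mark, so two bytes == one 16-bit unit
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_units, utf16_units

# ASCII letter, accented Latin letter, CJK ideograph, character outside the BMP
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    u8, u16 = code_units(ch)
    print(f"U+{ord(ch):04X}: {u8} UTF-8 byte(s), {u16} UTF-16 unit(s)")
```

The last sample is the one people forget: it needs two 16-bit units, so a single WideChar cannot hold it.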

> Also, this isn't exactly fast under Windows, because it will need conversion to UCS-2 to pass to the wide Windows functions, or to ISO to pass to the ANSI Windows functions.

Speed isn't so much the issue (UTF-8 <-> UTF-16 conversion is fast) as the need to wrap every call.
UTF-8 is usually smaller and thus processed faster. Even with Chinese XML files, the UTF-8 size is comparable to or smaller than the UTF-16 size, because so many of the markup characters (tags, separators) take one byte in UTF-8 but would take two in UTF-16.
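A rough way to check the size claim is to encode the same text both ways and compare byte counts. The XML fragment below is invented for illustration, and encoded_sizes is a hypothetical helper; real files will of course vary with their markup-to-text ratio.

```python
# Made-up XML fragment with Chinese text content (illustrative only)
xml = '<book><title lang="zh">红楼梦</title><author>曹雪芹</author></book>'

def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

utf8_size, utf16_size = encoded_sizes(xml)
print("XML fragment:", utf8_size, "bytes in UTF-8,", utf16_size, "bytes in UTF-16")

# Bare Chinese text with no markup goes the other way: 3 bytes vs 2 per character
print("Text only:   ", encoded_sizes("红楼梦"))
```

For this fragment the ASCII markup dominates, so UTF-8 comes out smaller even though each ideograph costs three bytes instead of two.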

For algorithms, if you treat UTF-16 as UTF-16 and not as UCS-2, the complexity is similar to that of UTF-8, since UTF-16 doesn't absolve you from dealing with variable-length characters.
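As a sketch of what "treating UTF-16 as UTF-16" means in practice, the loop below walks a list of 16-bit code units and yields whole code points, combining surrogate pairs along the way. It is written in Python for illustration; utf16_code_points and utf16_units_of are hypothetical names, and the sketch simply rejects unpaired surrogates.

```python
def utf16_units_of(text):
    """Split a string into its 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def utf16_code_points(units):
    """Yield code points from 16-bit code units; raise ValueError on unpaired surrogates."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            # Ordinary BMP character: one unit, one code point
            yield u
            i += 1

units = utf16_units_of("a\U0001F600")
print(len(units), "code units,", len(list(utf16_code_points(units))), "code points")
```

The unit count and the character count differ as soon as anything outside the BMP appears, which is the same bookkeeping a UTF-8 algorithm has to do.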

The "best" solution IMO would be to:
- introduce Unicode string types UTF8String, UTF16String and UTF32String, with automatic casting among the three, but manual (non-automatic) casting between the Unicode types and AnsiString (obviously) and WideString (which has been treated as UCS-2 too often in existing code, so it isn't a safe UTF-16 container)
- UTF8String and UTF16String would not be arrays of characters, but arrays of Byte and Word respectively (no UTF8Char or UTF16Char heresy).
- UTF32String would be an array of UnicodeChar
- components would have Caption, Text and the rest typed as UTF16String (just like Windows)

As for the default aliasing for "String", UTF32String would be the "safe, easy but inefficient" option, UTF8String the "safe, complex but efficient" option, and UTF16String the "unsafe, complex and so-so efficient" option.

As for UTF16String being unsafe, witness all the .NET code out there in which legions of developers assumed that System.Char could hold any Unicode character (it can't; it can hold only characters from the BMP, i.e. UCS-2 characters).
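The arithmetic behind that limitation is short enough to show. The sketch below (Python for illustration, with the made-up name encode_code_point) turns a code point into the 16-bit units UTF-16 uses for it; anything above U+FFFF needs two units, so it cannot fit in a single 16-bit char.

```python
def encode_code_point(cp):
    """Return the list of 16-bit UTF-16 code units for one Unicode code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp <= 0xFFFF:
        # BMP character: fits in one 16-bit unit
        return [cp]
    # Supplementary character: split the 20-bit offset into a surrogate pair
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

for cp in (0x4E2D, 0x1F600):
    units = encode_code_point(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

Code that indexes or truncates a UTF-16 string one 16-bit char at a time can cut such a pair in half.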

Eric