Re: Unicode Delphi Win32 - which approach



"m. Th." <a@xxxxx> wrote:

What are, in your opinion, the disadvantages of string ( := UTF-16) compared
with string ( := UTF-8)?

I like the backwards-compatibility aspects of UTF-8 over UTF-16. While the
UTF-8 encoding is different from ANSI, at least it's still byte-oriented
like 'most' streams of data, and plain ASCII text is unchanged. There are
also the space savings, at least for mostly-ASCII text. In general UTF-8 is
a clever, tight piece of design and a good way to encode a variable-width
character set.
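
A rough way to check the byte-orientation and space claims is to encode a
few strings both ways. This is only a sketch in Python rather than Delphi;
the helper name encoded_sizes and the sample strings are made up for
illustration, and UTF-16 is counted without a BOM.

```python
samples = ["Delphi", "Grüße", "日本語"]

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for s in samples:
    utf8_len, utf16_len = encoded_sizes(s)
    print(f"{s}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_unchanged = "Delphi".encode("utf-8") == "Delphi".encode("ascii")
```

For ASCII-heavy text UTF-8 comes out at half the size, but for CJK text it
is larger than UTF-16, so the space saving depends on the data.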

Also I appreciate the fact that by using UTF-8, a variable-width
encoding, programmers will be forced to "think" Unicode, and not
incorrectly assume that Unicode = 2-byte character set. (Even UTF-16 is
not fixed width: characters outside the BMP need a surrogate pair.)
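
To illustrate why "Unicode = 2 bytes" is a wrong assumption, here is a small
Python sketch (not Delphi; the helper utf16_code_units is a hypothetical
name). It takes one character outside the BMP and counts how many 16-bit
units UTF-16 needs for it.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point above U+FFFF

code_points = len(clef)                # Python counts code points
units = utf16_code_units(clef)         # surrogate pair in UTF-16
utf8_bytes = len(clef.encode("utf-8"))

print(code_points, units, utf8_bytes)
```

So a single character can occupy two UTF-16 units, and code that indexes a
UTF-16 string by unit can land in the middle of a character just as it can
with UTF-8.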

Because we are mainly on Windows (at least for the time being) I'd prefer
a UTF-16 encoding. It seems a more strategic approach, but I don't know how
much work this implies in the internals of the VCL.

Good point. Also I think Delphi.Net and .Net in general are all based on
UTF-16 (and let's face it, this will be the main reason why CodeGear
will be looking towards UTF-16).

Endianness: The Windows native.

Again, with UTF-8 we'll never even need to make that distinction.
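
A quick Python sketch of the point (not Delphi; encodings_of is a made-up
helper name): the same character has two possible UTF-16 byte sequences
depending on byte order, but only one UTF-8 byte sequence.

```python
def encodings_of(text):
    """Return the UTF-8, UTF-16LE and UTF-16BE byte strings for text."""
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-le": text.encode("utf-16-le"),
        "utf-16-be": text.encode("utf-16-be"),
    }

forms = encodings_of("\u00e9")  # U+00E9, e with acute accent
for name, data in forms.items():
    print(name, data.hex())
```

The two UTF-16 forms are byte-swapped versions of each other, which is why
UTF-16 data needs a BOM or an agreed byte order and UTF-8 data does not.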

As an aside, Java and Mac OS X also use UTF-16, and on the Linux side Qt uses it too.
It seems that it will be the future.

Yep. However, in terms of source-level compatibility, ideally there
really shouldn't be any difference between source code written against a
UTF-16 string and source code written against a UTF-8 string.

Unicoding Delphi is not a trivial task. There are so many considerations.
Old code can't be broken. Unicode creeps into so many unexpected places.
(Ever tried to zip a Unicode filename with WinZip?)

Then again, the OS has been almost 100% Unicode based ever since NT4. So
there's no excuse for Delphi not to embrace Unicode 100%.

For a programmer I think the biggest change will be the need to mentally
and explicitly contextualise every string.

Until now most programmers didn't even think consciously about what was
"in" a string, implicitly assuming that it was just a byte string of
(ANSI) characters. Now we need to move to an extended concept.


