Re: Unicode Delphi Win32 - which approach
- From: "m. Th." <a@xxxxx>
- Date: Sat, 09 Jun 2007 14:02:23 +0300
Help wrote:
"m. Th." <a@xxxxx> wrote:
What are, in your opinion, the disadvantages of string ( := UTF-16) compared with string ( := UTF-8)?
I like the backwards compatibility aspects of UTF-8 vs UTF-16. While the
UTF-8 encoding is different from ANSI,...
From the Delphi 2007 help at ms-help://borland.bds5/devwin32/intappswidecharacters_xml.html
(Wide characters)
<quote>
The first 256 Unicode characters map to the ANSI character set. The Windows operating system supports Unicode (UCS-2). The Linux operating system supports UCS-4, a superset of UCS-2. Delphi supports UCS-2 on both platforms.
</quote>
...at least it's still byte oriented
like 'most' streams of data. There are also the space-saving aspects. In
general UTF-8 is a clever piece of design and tight architecture, a good
way to encode multiple width character sets.
Space saving, imho, isn't a concern for us as developers in this discussion. How many Unicode strings will we store? I have just opened a Word 2003 document containing 80 pages of text in ancient Greek: 4146 paragraphs, 4631 lines, 206065 characters (240795 with spaces). The file size is 1.4 MB (in UTF-16). How much would we save in UTF-8? 0.5-0.7 MB? Considering that we are on Windows (and hence not on some extreme embedded system), storage space is abundant (the smallest storage medium I can see right now is a USB flash drive, and those are currently 1 GB+), so I think this shouldn't matter.
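A rough way to check the size estimate: the sketch below (Python, used here only for illustration and not part of the original discussion) compares the encoded size of two short Greek samples. The sample strings are assumptions chosen for the example, not text from the Word document mentioned above.

```python
# Compare how many bytes the same text needs in UTF-8 and UTF-16.
# The sample strings are illustrative assumptions, not taken from the document above.

def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without BOM)."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Monotonic Greek "Καλή μέρα", written with escapes so the code points are unambiguous.
modern = "\u039a\u03b1\u03bb\u03ae \u03bc\u03ad\u03c1\u03b1"
# Polytonic Greek "ἄνθρωπος"; the first letter (U+1F04) lies in the Greek Extended block.
polytonic = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"

modern_sizes = encoded_sizes(modern)
polytonic_sizes = encoded_sizes(polytonic)
print(modern_sizes)      # (9, 17, 18)
print(polytonic_sizes)   # (8, 17, 16)
```

Basic Greek letters take two bytes in either encoding, so UTF-8 saves only on spaces and ASCII punctuation, while accented polytonic letters take three bytes in UTF-8. For mostly Greek text the saving is therefore likely to be well below half the text size.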
Also I appreciate the fact that by using UTF-8, a non fixed width
encoding, programmers will be forced to "think" Unicode, and not
incorrectly assume that Unicode = 2 byte character set.
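The "Unicode = 2 byte character set" assumption can be checked with a small sketch (Python, for illustration only; the helper name `utf16_units` is made up for this example). It shows that UTF-16 is itself a variable-width encoding once a character falls outside the Basic Multilingual Plane.

```python
# UTF-16 is a variable-width encoding too: code points above U+FFFF need two 16-bit units.

def utf16_units(ch):
    """Number of 16-bit code units needed to encode the string in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

bmp_char = "\u039a"          # GREEK CAPITAL LETTER KAPPA, inside the Basic Multilingual Plane
astral_char = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_units(bmp_char))     # 1
print(utf16_units(astral_char))  # 2 (a surrogate pair)
print(astral_char.encode("utf-16-be").hex())  # d834dd1e
```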
Puah! :-)
In fact you are right, but keep in mind that programmers in the real world don't have time to think hard about their programming jobs/tasks. TTM (time to market) is a crucial factor for them (and for us, isn't it?). Forcing them to think about the "inner workings" (instead of 'hiding' them) will not, I think, win us much praise from the community.
Because we are mainly on Windows (at least for the time being), I'd prefer a UTF-16 encoding. It seems the more strategic approach, but I don't know how much work this implies in the internals of the VCL.
Good point. Also I think Delphi.Net and .Net in general are all based on
UTF-16. (and let's face it, this will be the main reason why CodeGear
will be looking towards UTF-16)
Endianness: the Windows-native one (little-endian).
Again, with UTF-8 we'll never even need to make that distinction.
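To make the byte-order point concrete, here is a minimal sketch (Python, for illustration only) that serializes one character in both UTF-16 byte orders and in UTF-8. The choice of the letter kappa is arbitrary.

```python
# The same character serialized in both UTF-16 byte orders and in UTF-8.
kappa = "\u039a"  # GREEK CAPITAL LETTER KAPPA

utf16_le = kappa.encode("utf-16-le")   # little-endian, the order Windows uses
utf16_be = kappa.encode("utf-16-be")   # big-endian
utf8 = kappa.encode("utf-8")           # byte sequence is fixed by the encoding

print(utf16_le.hex())  # 9a03
print(utf16_be.hex())  # 039a
print(utf8.hex())      # ce9a

# Decoding with the wrong byte order silently yields a different character.
wrong = utf16_le.decode("utf-16-be")
print(hex(ord(wrong)))  # 0x9a03
```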
As an aside, Java and Mac OS X also use UTF-16. On the Linux side, Qt uses it too. It seems that it will be the future.
Yep. However, in terms of source level compatibility ideally there
really shouldn't be any difference in source code using UTF-16 and UTF-8
encoding.
Unicoding Delphi is not a trivial task. There are so many considerations.
Old code can't be broken. Unicode creeps into so many unexpected places.
And also, we already have a very important Unicode code base. All the components which use 'unicode' now use the 'WideString' approach, and WideString is UTF-16 AFAIK (see the quote above, even though it seems not to be entirely correct). Personally speaking, I use TVirtualTree, FastReport and TNT Unicode Controls (flawlessly).
Why don't you make String := <reference counted> WideString?
Then you'll have the .dfm streaming engine ready, the comparison routines ready (hence sorting), some tooling already (routines and so on), and the above libraries (except TNT, of course) will work smoothly in a new Delphi. After doing this, you can Unicode-enable Delphi incrementally (as you have done until now). First it was DBX, now the 'Standard' palette, following TNT's example. Perhaps even in Highlander? Why do you want to take this one-off approach?
About 'Unicode creeps into so many unexpected places.': You see, TNT was freeware, one man's work. The TVirtualTree pack is (stunning) freeware, one man's work. CodeGear is an entire company. If you release a public beta, you'll have many more field testers than Mike or Troy.
Then again, the OS has been almost 100% Unicode based ever since NT4. So
there's no excuse for Delphi not to embrace Unicode 100%.
I never said this. Also, I stressed that *this* is one of the main reasons for UTF-16. I can hardly see a coder doing a for i:=1 to 100 do <read an entire stream>, but calling an API 100 times in a loop I can imagine (GDI, for example). Again, in normal usage the frequency of UTF-16 API calls (whether through the VCL or directly) should be much higher than the frequency of reading/writing the 'external' data stream. Also, for a stream (i.e. an external data source) one cannot say what the encoding will be. And you say '_most_ of the sources', so a check/conversion layer (to UTF-8 or UTF-16) MUST exist anyway. Talking to the OS will always be in UTF-16. Why do you want to add conversion here as well? Also, there are 7000 new API calls for Vista. Do you want to do the conversion for each call? Your conversion code will be very fragile here and very hard to maintain, imho.
Another point is the marginal cost. A conversion cost added to reading an external, non-deterministic stream from 'outside' is very small because of the medium where this stream is located (hard disk, flash, LAN, WAN), which has a much lower throughput than memory (the 'medium' for an API call). For example, how much time is spent reading a 'big' stream? Let's say 5 seconds. The conversion time for this big stream: not even 1%. Who will notice? OTOH, on an API call the conversion time can be a much higher percentage of the 'useful' time, and because the calling frequency is much higher, this leads to a much higher marginal cost (i.e. cost per unit). In conclusion, we should avoid conversion here.
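The frequency argument can be sketched as a toy model (Python, for illustration only). `FakeOS`, `App`, `run` and the workload of one file read followed by 100 API calls are all hypothetical, invented for this example. The sketch counts conversions only and says nothing about real timings.

```python
# Toy model: count encoding conversions for two internal string representations.
# `FakeOS`, `App` and the workload numbers are hypothetical, made up for illustration.

class FakeOS:
    """Stands in for an API that only accepts UTF-16 (little-endian) bytes."""
    def draw_text(self, utf16_bytes):
        return len(utf16_bytes) // 2  # pretend to draw; return the code unit count

class App:
    def __init__(self, internal_encoding):
        self.internal_encoding = internal_encoding
        self.conversions = 0

    def load(self, raw, source_encoding):
        """Read external data once, converting only if the encodings differ."""
        if source_encoding != self.internal_encoding:
            raw = raw.decode(source_encoding).encode(self.internal_encoding)
            self.conversions += 1
        return raw

    def call_api(self, os_api, data):
        """Pass the internal string to the OS, converting if it is not UTF-16."""
        if self.internal_encoding != "utf-16-le":
            data = data.decode(self.internal_encoding).encode("utf-16-le")
            self.conversions += 1
        return os_api.draw_text(data)

def run(internal_encoding, api_calls=100):
    app = App(internal_encoding)
    text = app.load("Καλή μέρα".encode("utf-8"), "utf-8")  # one UTF-8 file read
    for _ in range(api_calls):
        app.call_api(FakeOS(), text)
    return app.conversions

print(run("utf-8"))      # 100: one conversion per API call
print(run("utf-16-le"))  # 1: a single conversion at load time
```

In this model the UTF-16 representation pays once at the I/O boundary, while the UTF-8 representation pays on every call to the OS.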
For a programmer I think the biggest change will be the need to mentally
and explicitly contextualise every string.
Beforehand most programmers didn't even think consciously about what was
"in" a string, implicitly assuming that it was just a byte string of
(ANSI) characters. And now we need to move to an extended concept.
Yes and no. Pascal is verbose. It is conceptual, abstract. The vast majority of programmers should not have to think about what is 'underneath'. This is one of the main strengths of Pascal. Usually, as a Pascal programmer, I don't care how many _bytes_ are in s:='Καλή μέρα'. I speak from my own small experience: I wrote all my programs using a non-Latin alphabet for strings and I never had problems with DBCS or the like. What is needed is Length(s)=9, s[1]:='K', and correct transformation functions (AnsiUpperCase, AnsiLowerCase etc. - 'Ansi' ...hehehe...). (Regarding another question you posted in another message: 'char' means 'Character', i.e. it means 'Κ', not a _byte_ (i.e. something in the 0..255 range). Assuming that Char = Byte is the same as assuming that 'Real' has 48 bits. Not Pascal, imho.)
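The character-level view described here can be sketched as follows (Python, for illustration only; Python strings index from 0 rather than 1, and `upper()` stands in for the Delphi case functions, which are not shown). The string is the same one used in the post, written with escapes so the code points are unambiguous.

```python
# Character-level view versus byte-level view of the same string.
s = "\u039a\u03b1\u03bb\u03ae \u03bc\u03ad\u03c1\u03b1"  # "Καλή μέρα"

char_count = len(s)              # counts code points, like the Length(s)=9 expectation
first_char = s[0]                # Python indexes from 0; Pascal strings from 1
upper = s.upper()                # case mapping works on characters, not bytes

utf8_bytes = s.encode("utf-8")
first_byte = utf8_bytes[:1]      # only half of the first character

print(char_count)   # 9
print(first_char)   # Κ
print(upper)        # ΚΑΛΉ ΜΈΡΑ
print(first_byte)   # b'\xce'
```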
BTW, nice discussion,
hth,
m. th.