Re: Unicode Delphi Win32 - which approach
- From: "Arthur Hoornweg" <antispam.hoornweg@xxxxxxxxxxxxx>
- Date: Sun, 10 Jun 2007 14:49:58 +0200
--
To answer me, please just remove the .NET from my e-mail address.
"m. Th." <a@xxxxx> wrote in message
news:466a88c1@xxxxxxxxxxxxxxxxxxxxxxxxx
Help wrote:
"m. Th." <a@xxxxx> wrote:
What are, in your opinion, the disadvantages of string ( := UTF-16)
compared with string ( := UTF-8)?
I like the backwards compatibility aspects of UTF-8 vs UTF-16. While the
UTF-8 encoding is different from ANSI,...
From the Delphi 2007 help at
ms-help://borland.bds5/devwin32/intappswidecharacters_xml.html
(Wide characters)
<quote>
The first 256 Unicode characters map to the ANSI character set. The
Windows operating system supports Unicode (UCS-2). The Linux operating
system supports UCS-4, a superset of UCS-2. Delphi supports UCS-2 on both
platforms.
</quote>
...at least it's still byte oriented
like 'most' streams of data. Also there's the space saving aspects. In
general UTF-8 is a clever piece of design and tight architecture, a good
way to encode multiple width character sets.
Space saving, imho, isn't a concern for us as developers in this
discussion. How many Unicode strings will we store? I just opened a Word
2003 document containing text in ancient Greek: 80 pages, 4146
paragraphs, 4631 lines, 206065 characters (240795 with spaces). The file
size is 1.4 MB (in UTF-16). How much would we save in UTF-8? 0.5-0.7 MB?
Considering that we are on Windows (and hence not on some extreme
embedded system), storage space is abundant (the smallest storage medium
I see nowadays is a USB flash drive, currently 1 GB+), so I think this
shouldn't matter.
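As a rough check on the size estimate above, here is a minimal sketch in
Python (used only for illustration, not Delphi). The sample strings are
assumptions and are not taken from the Word document mentioned above. It
suggests that the saving depends heavily on the script: ASCII halves in
UTF-8, monotonic Greek comes out even, and polytonic (ancient) Greek can
even be slightly larger in UTF-8, because letters from the Greek Extended
block need three bytes each.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical samples, chosen only to illustrate the size differences.
ascii_sample = "plain text"
# Monotonic Greek "logos": every letter is in U+0370..U+03FF (2 bytes in UTF-8).
modern_greek = "\u03bb\u03cc\u03b3\u03bf\u03c2"
# Polytonic Greek "anthropos": U+1F04 is in Greek Extended (3 bytes in UTF-8).
polytonic = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"

for label, sample in [("ASCII", ascii_sample),
                      ("monotonic Greek", modern_greek),
                      ("polytonic Greek", polytonic)]:
    utf8, utf16 = encoded_sizes(sample)
    print(f"{label}: {len(sample)} code points, "
          f"UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

A real document file also carries formatting and metadata, so the file
size on disk says little about the encoded size of the text itself.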
Also I appreciate the fact that by using UTF-8, a non fixed width
encoding, programmers will be forced to "think" Unicode, and not
incorrectly assume that Unicode = 2 byte character set.
Puah! :-)
In fact you are right, but you must consider that programmers in the real
world don't have time to think through their programming jobs/tasks. The
TTM (time to market) is a crucial factor for them (and for us, isn't it?).
Forcing them to think about the "inner workings" (instead of 'hiding'
them) will not, I think, bring us much praise from the community.
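The "Unicode = 2 byte character set" assumption mentioned above can be
checked with a small sketch, again in Python for illustration only. The
two sample characters and the helper name utf16_code_units are
assumptions made for this example. A character outside the Basic
Multilingual Plane needs two 16-bit code units (a surrogate pair) in
UTF-16, so UTF-16 is a variable-width encoding as well.

```python
def utf16_code_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

bmp_char = "\u03a9"         # GREEK CAPITAL LETTER OMEGA, inside the BMP
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_code_units(bmp_char))     # 1
print(utf16_code_units(astral_char))  # 2

# Split the big-endian encoding into its two surrogate code units.
raw = astral_char.encode("utf-16-be")
high = int.from_bytes(raw[:2], "big")
low = int.from_bytes(raw[2:], "big")
print(hex(high), hex(low))            # 0xd834 0xdd1e
```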
Because we are mainly on Windows (at least for the time being) I'd
rather prefer a UTF-16 encoding. It seems a more strategic approach,
but I don't know how much work this implies in the internals of the VCL.
Good point. Also I think Delphi.Net and .Net in general are all based on
UTF-16. (And let's face it, this will be the main reason why CodeGear
will be looking towards UTF-16.)
Endianness: The Windows native.
Again, with UTF-8 we'll never even need to make that distinction.
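The endianness point can be illustrated with a short sketch (Python, for
illustration only; the sample character is an assumption). The same
character has two possible byte orders in UTF-16, which is why a byte
order mark or an agreed convention is needed for external data, while
UTF-8 has a single byte sequence.

```python
text = "\u0394"  # GREEK CAPITAL LETTER DELTA

le = text.encode("utf-16-le")   # little-endian, the Windows native order
be = text.encode("utf-16-be")   # big-endian
utf8 = text.encode("utf-8")     # only one possible byte sequence

print(le.hex(), be.hex(), utf8.hex())  # 9403 0394 ce94

# With a byte order mark in front, the generic "utf-16" codec picks the
# right byte order by itself.
from_le = (b"\xff\xfe" + le).decode("utf-16")
from_be = (b"\xfe\xff" + be).decode("utf-16")
print(from_le == text, from_be == text)  # True True
```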
As an aside, Java and Mac OS X also use UTF-16. Also, on the Linux side,
Qt uses it. It seems that it will be the future.
Yep. However, in terms of source level compatibility ideally there
really shouldn't be any difference in source code using UTF-16 and UTF-8
encoding.
Unicoding Delphi is not a trivial task. There are so many considerations.
Old code can't be broken. Unicode creeps into so many unexpected places.
And also, we already have a very important Unicode code base. All the
components which use 'unicode' now use the 'WideString' approach, and
WideString is UTF-16 AFAIK (see the quote above, even though it seems
that it isn't quite correct). Personally speaking, I use (flawlessly)
TVirtualTree, FastReport and TNT Unicode Controls.
Why don't you make String := <reference counted> WideString?
Then you'll have the .dfm streaming engine ready, the comparison routines
ready (hence sorting), some tooling already in place (routines etc.), and
the above libraries (except TNT, of course) will work smoothly in a new
Delphi. After doing this, you can Unicode-enable Delphi incrementally (as
you have done until now). First was DBX, now the 'Standard' palette
(following TNT's example). Perhaps even in Highlander? Why do you want
to take this one-off approach?
About 'Unicode creeps into so many unexpected places.': You see, TNT was
freeware, one man's work. The TVirtualTree pack is (stunning) freeware,
one man's work. CodeGear is an entire company. If you release a public
beta, you'll have many more field testers than Mike or Troy.
Then again, the OS has been almost 100% Unicode based ever since NT4. So
there's no excuse for Delphi not to embrace Unicode 100%.
I never said this. Also, I stressed that *this* is one of the main reasons
for UTF-16. I can hardly see a coder doing a for i:=1 to 100 do <read an
entire stream>, but calling an API 100 times in a loop I can imagine (GDI,
for example). Again, in normal usage the frequency of UTF-16 API calls
(whether through the VCL or directly) should be much higher than the
frequency of reading/writing the 'external' data stream. Also, for a
stream (i.e. an external data source) one cannot say what the encoding
will be. You also say '_most_ of the sources', so a check/conversion layer
(to UTF-8 or UTF-16) MUST exist. Talking to the OS will always be in
UTF-16. Why do you want to add conversion here as well? Also, there are
7000 new API calls for Vista. Do you want to do the conversion for each
call? Your conversion code will be very sensitive here and very hard to
maintain, imho. Another point is the marginal cost. A conversion cost
added to reading an external, non-deterministic stream from 'outside' is
very small because of the medium where this stream is located (hard disk,
flash, LAN, WAN), which has a much lower throughput than memory (the
'medium' for an API call). (I.e. how much time is spent reading a 'big'
stream? Let's say 5 seconds. The conversion time for this big stream: not
even 1%. Who will notice? OTOH, for an API call the conversion time can be
a much higher percentage of the 'useful' time, and the calling frequency
is much higher, which leads to a much higher marginal cost (i.e. cost per
unit). In conclusion, we should avoid conversion here.)
For a programmer I think the biggest change will be the need to mentally
and explicitly contextualise every string.
Beforehand most programmers didn't even think consciously about what was
"in" a string, implicitly assuming that it was just a byte string of
(ANSI) characters. And now we need to move to an extended concept.
Yes and no. Pascal is verbose. It is conceptual, abstract. The vast
majority of programmers need not think about what is 'underneath'. This is
one of the main strengths of Pascal. Usually, as a Pascal programmer, I
don't care how many _bytes_ are in s:='???? ????'. I speak from my limited
experience. I wrote all my programs using a non-Latin alphabet for strings
and I never had problems with DBCS or such. What is needed is Length(s)=9,
s[1]:='K', and correct transformation functions (AnsiUpperCase,
AnsiLowerCase etc. - 'Ansi' ...hehehe...). (Regarding another question you
posted in another message: 'char' means 'Character', i.e. it means '?',
not a _byte_ (i.e. something in the 0..255 range). Assuming that a Char =
Byte is the same as assuming that 'Real' has 48 bits. Not Pascal, imho.)
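The "characters, not bytes" view described above can be sketched as
follows (Python, for illustration only). The original sample string did
not survive in this archive, so the nine-character Greek phrase below is
a hypothetical stand-in. The character count and indexing stay the same
whatever the storage encoding, while the byte count varies.

```python
# Hypothetical stand-in for the lost sample: the Greek phrase "Kali mera",
# nine characters including the space.
s = "\u039a\u03b1\u03bb\u03ae \u03bc\u03ad\u03c1\u03b1"

print(len(s))                      # 9, like Length(s) in Pascal
print(s[0] == "\u039a")            # True; s[0] here is s[1] in Pascal
print(len(s.upper()))              # 9, case mapping works per character

# The byte count depends on the encoding chosen for storage.
print(len(s.encode("cp1253")))     # 9 bytes in the Windows Greek code page
print(len(s.encode("utf-8")))      # 17 bytes
print(len(s.encode("utf-16-le")))  # 18 bytes
```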
BTW, nice discussion,
hth,
m. th.