Unicode Delphi Win32 - which approach



Seeing the fairly strong interest in a native Unicode VCL in Delphi for
Win32, I wonder about the best approach for implementing this.

Considerations:

1. I think it would be fair to say that Unicode-enabling existing code
is a major priority; few would accept Unicode support being limited to
newly developed applications. In other words, enabling Unicode support
in existing applications shouldn't require extensive rewrites.

2. On the other hand, it would also seem fair to expect at least minor
adjustments in existing code to support Unicode. Since a byte/word no
longer corresponds to a single Unicode character, some adjustments are
inevitable, such as changing your string processing to use proper
iterators. Also, old code that relies on various codepage tricks to
handle multiple languages will need to be phased out and reworked into
Unicode. In short, to truly support international character sets, a
minimum amount of work should be expected and tolerated. Only completely
plain applications should require no reworking at all.

3. In the generic sense, there isn't any material difference between an
ANSI application and a Unicode application, since the only real
difference is the string encoding. So I don't think we should have a
"Create Unicode Application" option in the IDE.

4. Full native first-class Unicode support is a must. This covers the IDE
code editor, resource editors, resource files, component design-time
properties etc. All external tools will need to adapt to Unicode source
too (obviously outside CodeGear's control): Peganza Pascal Analyzer,
Multilizer, diff tools like Beyond Compare, the source viewer in
profiling tools like AQTime, Help file editors etc.

5. Enabling Unicode support in all built-in components is a given.
Enabling Unicode support in 3rd party components like TeeChart and
Report Builder would be very desirable. For the component vendors: The
easier the better.

6. It should still be possible to continue writing plain ANSI apps.

7. Somewhat controversially, I'd suggest dropping any kind of Unicode
support for Win95/98 and focusing only on OS platforms with true Unicode
support (NT, 2000, XP, Vista etc.). This goes both for the produced EXEs
and the IDE.

8. Actually, personally I'd also drop any kind of old-style
codepage/MBCS support as well, simply to encourage anyone writing
internationalised applications to move 100% to Unicode and abandon all
the old stuff completely.

9. The executables should degrade gracefully on 95/98 and run as plain
ANSI. Any Unicode text will output as garbled text, whereas ANSI text
will still render correctly as usual.
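
To make point 9 concrete, here is a small sketch in Python (used here only because it is easy to run; nothing in it is Delphi or VCL code). The function name render_as_ansi and the choice of codepage 1252 are assumptions for illustration. It forces text through a single-byte ANSI codepage the way an ANSI-only platform effectively would: characters the codepage covers survive, everything else is lost.

```python
# Illustrative only: simulate what happens when Unicode text is pushed
# through a single-byte ANSI codepage (cp1252 is an assumed example).

def render_as_ansi(text, codepage="cp1252"):
    """Round-trip text through an ANSI codepage, replacing anything
    the codepage cannot represent with '?'."""
    return text.encode(codepage, errors="replace").decode(codepage)

# Text that fits in the codepage comes back unchanged.
ansi_ok = render_as_ansi("Gr\u00f6\u00dfe")            # "Größe"

# Characters outside the codepage (here two CJK ideographs) are lost.
garbled = render_as_ansi("\u6f22\u5b57 Gr\u00f6\u00dfe")

print(ansi_ok)
print(garbled)
```

The ANSI-representable part still renders as usual, which is the "graceful degradation" described above; the rest is unrecoverable once it has passed through the codepage.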

Given all these, it would seem that...

1. There shouldn't be Unicode & Non-Unicode components. There's only one
TEdit, and it will support both. It will render .Text as ANSI on 95/98
(no matter the string encoding) and as Unicode on later OSes.

2. String ENCODING and TYPE. I think this is the key decision:

2a. We don't want to introduce a new String TYPE.

2b. Introducing and requiring, for instance, a new "UniString" type would
require a massive rewrite of all existing code. Same problem with the
existing WideString type, which also has the drawback that it was
originally designed for a different purpose and isn't ref counted.

2c. With a compiler directive the existing "String" type could be made
identical to "WideString". However, WideString is not ref counted, and it
may also lead to nasty errors if people continue coding as if Unicode
were a fixed-width encoding, a mistake that is easy to overlook when
WideString is treated as UCS-2, which is not the same as UTF-16. Better
to have them start using the new character iterators necessary for
Unicode.
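
As a rough illustration of why indexing 16-bit units is not the same as iterating characters, here is a sketch in Python (not Delphi; the helper names utf16_code_units and iter_code_points are made up for this example). It treats a string as a sequence of UTF-16 code units, the way a WideString buffer would hold it, and walks it with an iterator that combines surrogate pairs.

```python
# Illustrative sketch: a code point iterator over UTF-16 code units.

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def iter_code_points(units):
    """Yield code points, combining a high/low surrogate pair into one."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character (or an unpaired surrogate, passed through as-is)
            yield unit
            i += 1

# "a", U+1D11E (musical G clef, outside the BMP), "b"
units = utf16_code_units("a\U0001D11Eb")
code_points = list(iter_code_points(units))

print(len(units), len(code_points))
```

Code that assumes one 16-bit unit per character sees four "characters" here; the iterator sees three. With UTF-8 the same kind of iterator is needed, only with 1 to 4 bytes per code point.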

Encodings & types:
----------------------

A. Encoding everything as UTF-8 into the existing String type.

This has some major advantages:
- We can continue using the existing String type, thus no major code
rewrites.
- New string iterators will need to be added, although we might, for
instance, want a compiler setting to signify that [] now indexes
characters and not bytes.
- Storage efficiency is great compared to UTF-32, even for character
sets like Chinese.
- Easier transition: ASCII is a proper subset of UTF-8.
- Sorting UTF-8 is easy: a plain byte-wise comparison yields code point
order.
- A lot of external Unicode data sources are encoded as UTF-8. (most of
the Web for one)
- UTF-8 is a single format. There's no possibility of LE/BE confusion
and similar encoding confusion.
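
A few of the claims above are easy to check. The following Python sketch (illustrative only; the sample words are arbitrary) shows that ASCII text is byte-identical in UTF-8, that byte-wise sorting of UTF-8 matches code point order while unit-wise sorting of UTF-16 does not once characters outside the BMP are involved, and how storage compares for a short Chinese string. Note that code point order is a binary ordering, not a language-aware collation.

```python
# Illustrative checks of the UTF-8 properties listed above.

# ASCII is a subset: the bytes are identical.
ascii_same = "Delphi".encode("ascii") == "Delphi".encode("utf-8")

# Sample words, including U+FF21 (fullwidth A) and U+1D11E (outside the BMP).
words = ["zebra", "\u00c5ngstr\u00f6m", "apple", "\u6f22\u5b57",
         "\uff21", "\U0001D11E"]

by_code_point = sorted(words)  # Python compares strings by code point
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_utf16_units = sorted(words, key=lambda w: w.encode("utf-16-be"))

utf8_matches = by_utf8_bytes == by_code_point
utf16_matches = by_utf16_units == by_code_point

# Storage for two Chinese characters, in bytes.
cjk = "\u6f22\u5b57"
sizes = {enc: len(cjk.encode(enc))
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(ascii_same, utf8_matches, utf16_matches, sizes)
```

For this sample UTF-8 needs 6 bytes against 8 for UTF-32, though UTF-16 is smaller still at 4, so the storage argument holds against UTF-32 rather than against UTF-16.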

Disadvantages:
- All Win32 API calls will need translation, not by the user but by the
VCL when interfacing with the API. Major VCL work involved, I'd imagine.
Mitigating factors:
-- API calls aren't that frequent, since they usually just entail the
things you "see on the screen" and not your entire data set.
-- The UTF-8 <-> UTF-16 conversion algorithm is pretty fast.
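
The translation step at the API boundary could look roughly like the following Python sketch. The names to_api_wide and from_api_wide are hypothetical; in a real VCL this would presumably sit inside the component wrappers and use something like the Win32 MultiByteToWideChar call with CP_UTF8, which is an assumption on my part and not shown here.

```python
# Illustrative sketch of converting at the API boundary only.

def to_api_wide(utf8_bytes):
    """UTF-8 bytes -> null-terminated UTF-16LE, the form a Win32
    wide-character ("W") function expects."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le") + b"\x00\x00"

def from_api_wide(wide):
    """Null-terminated UTF-16LE -> UTF-8 bytes."""
    assert wide[-2:] == b"\x00\x00", "expected a terminating null"
    return wide[:-2].decode("utf-16-le").encode("utf-8")

# The application keeps its data as UTF-8 ...
caption = "Gr\u00f6\u00dfe \u6f22\u5b57".encode("utf-8")

# ... and converts only when something is handed to or read from the API.
wide = to_api_wide(caption)
round_trip = from_api_wide(wide)

print(len(caption), len(wide), round_trip == caption)
```

The conversion is lossless in both directions, so the cost is purely the extra pass over the (usually short) strings that actually reach the screen.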

B. Encoding everything as UTF-16 into the existing String type:

Advantages:
- UTF-16: Straight interface to the entire Win32 API and many database
engines.
- UTF-16: Interoperability with .Net code.

C. Encoding everything as UTF-16 into the existing WideString type

D. Encoding everything as UTF-16 into "String", but compiler substitutes with WideString in the background.

------------------------

4. What about the Char TYPE? I guess this will no longer be a fixed-width
set of bytes, unless we look at this type a little more
pragmatically and just retain the "old" meaning of Char = Byte.

5. All old source/dpr/text dfm will be converted to Unicode (UTF-8/16)
when opened in Unicode Delphi.
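
For the file conversion in point 5, the core step is just a decode from the legacy codepage followed by an encode to UTF-8. A minimal Python sketch, assuming the original codepage is known (cp1252 here is only an example, and convert_legacy_source is a made-up name):

```python
# Illustrative sketch: convert legacy ANSI source text to UTF-8.

def convert_legacy_source(raw, codepage="cp1252"):
    """Decode bytes written under an ANSI codepage and re-encode as UTF-8.
    The codepage must be known; it cannot be reliably detected from the
    bytes alone."""
    return raw.decode(codepage).encode("utf-8")

# A line of source containing non-ASCII characters, as saved under cp1252.
legacy = "Caption := 'Gr\u00f6\u00dfe';".encode("cp1252")
converted = convert_legacy_source(legacy)

# Pure ASCII source is unchanged by the conversion.
plain = b"begin end."
plain_converted = convert_legacy_source(plain)

print(len(legacy), len(converted), plain_converted == plain)
```

Files that are pure ASCII come through byte for byte, so only sources with codepage-specific characters in literals or comments are actually touched.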

Ok. So what's your take on all this?

Personally I'd probably go for UTF-8. It just seems like the simplest
and best solution. But that's just my opinion.

Not introducing a new string type seems like the strongest criterion,
thus making sure that all existing 3rd party components can be made
Unicode compatible as easily as possible.