Suggested Alternative Unicode Implementation (for Rudy+ misc others)




Without going into the whys and wherefores behind this post, here, for
the benefit of anyone who missed it before, is one idea for an
alternative Unicode implementation in Tiburon that would avoid many of
the pitfalls that the implementation (as currently described) is going
to encounter (and cause).


The suggestion/idea:


Extend String RTTI (for the purposes of this post, RTTI here refers to
the runtime properties of a string, i.e. Length and Reference Count).

String RTTI would be extended to include encoding information. For
access efficiency this would likely be a 32-bit value.

Some encoding values would be reserved for specific system
interpretation, representing:

UTF8
UTF16
UTF32
ANSI (system cp)

Remaining values would identify a specific codepage of an ANSI encoded
string.
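To make the layout concrete, here is a minimal sketch of such an extended
header. It is an illustration only: the record name TStrHeader, the field
order and the IsUnicode helper are all hypothetical, and borrowing the
Win32 codepage identifiers (0 for the system codepage, 1200 for UTF-16LE,
12000 for UTF-32LE, 65001 for UTF-8) as the reserved values is just one
convenient assumption, not part of the suggestion itself.

```pascal
program StrHeaderSketch;
{$APPTYPE CONSOLE}

const
  // Assumed reserved values, borrowed from the Win32 codepage identifiers.
  encANSI  = 0;      // system codepage (CP_ACP)
  encUTF16 = 1200;   // UTF-16LE
  encUTF32 = 12000;  // UTF-32LE
  encUTF8  = 65001;  // CP_UTF8

type
  // Hypothetical header held ahead of the character data.
  TStrHeader = packed record
    Encoding: Cardinal;  // proposed new 32-bit field
    RefCount: Integer;   // existing reference count
    Len:      Integer;   // existing length
  end;

// True for the reserved Unicode values. Anything else is treated as ANSI,
// with the value itself naming the codepage (e.g. 1251).
function IsUnicode(Encoding: Cardinal): Boolean;
begin
  Result := (Encoding = encUTF8) or (Encoding = encUTF16) or
            (Encoding = encUTF32);
end;

var
  h: TStrHeader;
begin
  h.Encoding := 1251;  // an ANSI string in the Cyrillic codepage
  h.RefCount := 1;
  h.Len := 0;
  WriteLn('Header size: ', SizeOf(h));
  WriteLn('Unicode? ', IsUnicode(h.Encoding));
end.
```

With values chosen this way, a single field answers both "is this Unicode,
and which UTF?" and "if ANSI, which codepage?", which is all the RTL would
need in order to pick a transcoding.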

i.e. at the implementation level, there would continue to be only one
actual "type" of string, but the formal type of any given instance of a
String would include its encoding.

There would exist, for the purposes of declarations in code:

UTF8String
UTF16String
UTF32String
ANSIString

and

String


String would "map" to one of the string types based on a project
setting. i.e. for an existing application one would most likely choose
to continue with String => ANSIString, but for a new application one
could choose to map String to the UTF encoding of Unicode most
appropriate to that application's needs.
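The effect of that mapping can be roughly approximated today with a
conditional define and a type alias. This is only a sketch under stated
assumptions: STRING_IS_UNICODE and AppString are hypothetical names, the
existing WideString stands in for the proposed UTF16String, and in the
suggestion itself the switch would be a project option acting on String
directly rather than a define in source.

```pascal
program StringMapSketch;
{$APPTYPE CONSOLE}

// Hypothetical project-level switch. Remove the dot to enable it.
{.$DEFINE STRING_IS_UNICODE}

type
{$IFDEF STRING_IS_UNICODE}
  AppString = WideString;  // stands in for String => UTF16String
{$ELSE}
  AppString = AnsiString;  // stands in for String => ANSIString
{$ENDIF}

var
  s: AppString;
begin
  // The same source compiles under either mapping.
  s := 'hello';
  WriteLn(Length(s));
end.
```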


RTL support for strings would be extended to incorporate appropriate,
implicit transcodings. For ANSI => Unicode these would be lossless. For
Unicode => ANSI the compiler could emit a warning.

Specific transcoding support would provide the means for addressing such
warnings if it were not desirable to simply disable that warning in a
project.

e.g. given that the VCL would be fully Unicode:

var
  s: ANSIString;  // or String, where String => ANSIString

  s := Edit1.Text;  // WARN: Implicit conversion from Unicode to ANSI


The warning could be addressed by either:

- Changing the declaration of 's' to any Unicode string type
(UTF8, 16, 32)

or

- Utilising an explicit transcode:

s := UnicodeToANSI(Edit1.Text);
or
s := UnicodeToANSI(Edit1.Text, cp1251); // etc


or

- Disabling the warning in the project options (likely to be
acceptable for the majority of existing ANSI applications)
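For illustration, an explicit transcode of the kind mentioned in the second
option might be written along these lines. The function name comes from the
examples above, but the body is only one possible sketch: it assumes
Windows, uses the existing WideString to stand in for a UTF16 string, and
relies on the Win32 WideCharToMultiByte call. Characters with no
representation in the target codepage are replaced by that codepage's
default character, which is exactly the loss the warning is about.

```pascal
program UnicodeToANSISketch;
{$APPTYPE CONSOLE}

uses
  Windows;

// Sketch of an explicit Unicode => ANSI transcode. CodePage defaults to the
// system codepage, matching the one-argument form used in the post.
function UnicodeToANSI(const ws: WideString;
  CodePage: UINT = CP_ACP): AnsiString;
var
  Len: Integer;
begin
  Result := '';
  if ws = '' then
    Exit;
  // First call: ask how many bytes the converted text needs.
  Len := WideCharToMultiByte(CodePage, 0, PWideChar(ws), Length(ws),
    nil, 0, nil, nil);
  SetLength(Result, Len);
  // Second call: perform the conversion into the allocated buffer.
  if Len > 0 then
    WideCharToMultiByte(CodePage, 0, PWideChar(ws), Length(ws),
      PAnsiChar(Result), Len, nil, nil);
end;

var
  ws: WideString;
  s: AnsiString;
begin
  ws := 'Privet';
  s := UnicodeToANSI(ws);        // system codepage
  WriteLn(Length(s));
  s := UnicodeToANSI(ws, 1251);  // specific codepage
  WriteLn(Length(s));
end.
```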


Note that explicit transcoding for ANSI => Unicode is not required (in
order to address warnings), since such transcoding could be lossless:
both the specific codepage of the source and the required UTF encoding
of the destination are available in the RTTI, so no warning would be
needed:

e.g.

Edit1.Text := s; // Edit1.Text is UTF16, codepage of ANSI s
// is in RTTI. Compiler silently injects RTL
// transcoding for lossless conversion
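The transcoding the compiler would inject for such an assignment could look
something like the sketch below. It is an assumption-laden illustration,
not a description of any actual RTL routine: ANSIToUTF16 is a hypothetical
name, the codepage is passed as a parameter because today's AnsiString has
no such field to read it from, WideString stands in for a UTF16 string, and
the conversion relies on the Win32 MultiByteToWideChar call.

```pascal
program ANSIToUTF16Sketch;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: converts ANSI text in the given codepage to UTF-16.
// Under the proposal, CodePage would come from the string's own RTTI.
function ANSIToUTF16(const s: AnsiString; CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if s = '' then
    Exit;
  // First call: ask how many UTF-16 code units are needed.
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(s), Length(s), nil, 0);
  SetLength(Result, Len);
  // Second call: perform the conversion into the allocated buffer.
  if Len > 0 then
    MultiByteToWideChar(CodePage, 0, PAnsiChar(s), Length(s),
      PWideChar(Result), Len);
end;

var
  s: AnsiString;
  ws: WideString;
begin
  s := 'abc';
  ws := ANSIToUTF16(s, 1251);
  WriteLn(Length(ws));
end.
```

Because the source codepage is known, nothing has to be guessed at the
point of conversion, which is why no warning is called for in this
direction.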





In general, the only encoding characteristic of a string that may be
changed would be the codepage of an ANSI string.

It would not be possible to otherwise change the encoding of a string
"in place". Attempting to do so, or attempting operations that rely on
it being possible, would result in a compilation error:

e.g.

var
s: UTF8String;

s := UnicodeToANSI(Edit1.Text); // ERROR: Incompatible types





That's covered the basics I think. I'm running out of time (now gone
5pm on a Friday afternoon and I have to go collect my daughters from
after school care).


IANACW, so I would prefer it if people commenting on the idea could
concentrate on the idea and NOT on nitpicking about what is or isn't
"RTTI", what is or isn't an "encoding", what is or isn't "transcoding"
etc etc.

If any questions arise from inappropriate use of such terminology kindly
restrict comments on that score to clarifying for others, if such
clarification is genuinely needed.


Enjoy,

Jolyon Smith