Re: Unicode in Delphi: just deprecate WideString/WideChar



UTF8 is how we're currently dealing with Unicode here, essentially because it's the only efficient string type currently in Delphi. That said...

> There might be some problems when trying to use the index [] or doing copy commands on returning partial UTF8 characters, but these are relatively minor.

These aren't so minor. This class of problem is actually the reason a very large proportion of string-manipulating .Net applications out there aren't fully Unicode capable and only deal well with UCS-2 (i.e. text without surrogate pairs).
It shows up not only as trimming/cutting at the wrong place, but also as the assumption that a System.Char (a single 16-bit UTF16 code unit) can hold any character, which then gets used, for instance, as a function parameter.
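
To make the "cutting at the wrong place" problem concrete, here is a minimal sketch. It is in Python rather than Delphi, purely to show what happens at the byte and code-unit level; the names `SAMPLE`, `is_valid` and `truncate_utf8` are made up for the illustration. It shows a naive cut failing in both UTF8 and UTF16, plus one possible precaution for UTF8 that assumes the input is valid UTF8 to begin with.

```python
# 'a' (1 byte in UTF-8), 'é' (2 bytes), U+1D11E musical G clef
# (4 bytes in UTF-8, a surrogate pair in UTF-16).
SAMPLE = "a\u00e9\U0001D11E"


def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 to at most max_bytes without splitting a character."""
    if max_bytes >= len(data):
        return data
    end = max(max_bytes, 0)
    # Continuation bytes have the bit pattern 10xxxxxx. If the first byte
    # we would drop is one of them, the cut is mid-character: back up.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


utf8 = SAMPLE.encode("utf-8")
utf16 = SAMPLE.encode("utf-16-le")

# Naive cuts: 2 bytes of UTF-8 splits 'é'; 3 code units of UTF-16
# keeps only the high surrogate of the last character.
naive_utf8_cut = utf8[:2]
naive_utf16_cut = utf16[:3 * 2]

print(is_valid(naive_utf8_cut, "utf-8"))       # naive UTF-8 cut
print(is_valid(naive_utf16_cut, "utf-16-le"))  # naive UTF-16 cut
print(truncate_utf8(utf8, 2))                  # boundary-aware cut
```

The point is that the UTF16 cut is just as broken as the UTF8 one; it only goes unnoticed longer because it takes a character outside the BMP to trigger it.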

> Mostly all these are used to parse strings, and the parsing tokens are usually in the 0..127 range (so 1 byte = 1 utf8 char).
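
The quoted point does hold at the byte level: in UTF8, bytes below 128 never occur inside a multi-byte sequence, so splitting on an ASCII token cannot cut a character in half. A small sketch of that, in Python for illustration only (the `record` contents and the `;` delimiter are invented for the example):

```python
# A made-up record with non-ASCII values and ASCII delimiters.
record = "nom=Zo\u00eb;ville=\u6771\u4eac".encode("utf-8")

# Split directly on the raw bytes, using an ASCII delimiter.
fields = record.split(b";")

# Every piece is still valid UTF-8 and decodes on its own.
decoded = [field.decode("utf-8") for field in fields]
print(decoded)
```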

Don't underestimate the legions of developers who, when they don't find the function they need in a 30-second search, proceed to implement their own...
Granted, with UTF8 at least every developer will expect variable-length characters and will take precautions (which most don't take when dealing with UTF16), so given a choice between full WideString, full UTF and UTF-8 only, I would pick UTF8. For the Delphi side, that's good, but...

> I really don't see any need to support UTF16.

...Windows interfaces are in UTF16, so you have to convert to UTF16 and back every time you call them. Conversion is reasonably fast, but it means every call has to be wrapped.
Right now, for us and UTF8, this is a necessity arising from the lack of Unicode support in Delphi, but IMO for a "Unicode-compliant Delphi" it would be quite a shame not to be able to have and use UTF8/UTF16/UTF32 string types directly.
The other side of the coin is being able to expose DLLs and interfaces with UTF16 string parameters to other applications.

> Almost all big text documents around nowadays are in UTF8, simply because
> it is almost always more economical. The only languages where there might
> be a slight size increase of UTF8 vs. UTF16 are Japanese and Chinese.

From what we encountered (the largest strings were XML), UTF8 is still more compact even in those cases, thanks to the legions of tags and other XML bits (which are all below ASCII 128).
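
As a rough illustration of that effect (Python again, with an invented XML fragment and a hypothetical helper `encoded_sizes`, not one of our real documents): the CJK text alone is smaller in UTF16, but once it is wrapped in ASCII markup the balance tips back to UTF8.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Two CJK characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
cjk_only = "\u6771\u4eac"

# The same text inside a small, made-up XML fragment.
xml = '<item id="1"><name>\u6771\u4eac</name></item>'

for label, text in (("text only", cjk_only), ("inside XML", xml)):
    size_utf8, size_utf16 = encoded_sizes(text)
    print(f"{label}: UTF-8 = {size_utf8} bytes, UTF-16 = {size_utf16} bytes")
```

How far the balance tips obviously depends on the ratio of markup to content in the actual documents.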


Eric