Re: What kind of unicode?
- From: Eric Grange <egrangeNO@xxxxxxxxxxxxxxx>
- Date: Fri, 13 Oct 2006 14:06:01 +0200
1 - Enable unicode only if you check a box. UTF-8 encoding
This is what we use here now. A side benefit of UTF-8 over UTF-16 is that developers are aware that a single Unicode character can span several AnsiChars, while many forget that the same is true of WideChars.
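To make the point concrete, here is a small sketch in Python (not Delphi, purely for illustration) that counts how many code units a single character needs in each encoding. The helper name `code_units` is made up for this example.

```python
def code_units(ch):
    """Return (UTF-8 code units, UTF-16 code units) needed for one character."""
    utf8_units = len(ch.encode("utf-8"))            # 1 unit = 1 byte
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 1 unit = 2 bytes
    return utf8_units, utf16_units

# A Latin letter, an accented letter, a CJK ideograph, and a character
# outside the BMP (U+1D11E, MUSICAL SYMBOL G CLEF).
samples = ["A", "\u00e9", "\u4e2d", "\U0001D11E"]
counts = {ch: code_units(ch) for ch in samples}

for ch, (u8, u16) in counts.items():
    print(f"U+{ord(ch):04X}: {u8} UTF-8 unit(s), {u16} UTF-16 unit(s)")
```

The last sample is the one people forget: it takes two 16-bit units, so a single WideChar cannot hold it.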
Also, this isn't exactly fast under Windows, because it will need conversion to UCS2 to pass to Wide windows functions, or to ISO to pass to Ansi windows functions.
Speed isn't so much an issue (UTF8<->UTF16 is fast) as the need to wrap every call.
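A rough sketch of what "wrapping every call" amounts to, again in Python for illustration only. `fake_set_window_text_w` is a hypothetical stand-in for a wide Windows function, not a real API; the wrapper simply converts every UTF-8 argument to a NUL-terminated UTF-16 buffer before forwarding the call.

```python
import functools

def utf8_to_wide(data: bytes) -> bytes:
    """Convert UTF-8 bytes to a NUL-terminated UTF-16LE buffer."""
    return data.decode("utf-8").encode("utf-16-le") + b"\x00\x00"

def wraps_wide_call(wide_func):
    """Decorator: convert every bytes argument from UTF-8 to UTF-16 first."""
    @functools.wraps(wide_func)
    def wrapper(*args):
        converted = [utf8_to_wide(a) if isinstance(a, bytes) else a for a in args]
        return wide_func(*converted)
    return wrapper

# Hypothetical stand-in for a wide API: it only reports how many UTF-16
# code units it received, excluding the terminator.
def fake_set_window_text_w(handle, text_w):
    return len(text_w) // 2 - 1

set_window_text = wraps_wide_call(fake_set_window_text_w)
received = set_window_text(1, "h\u00e9llo".encode("utf-8"))
```

The conversion itself is one line; the cost is that every entry point needs such a wrapper.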
UTF-8 is usually smaller and thus processed faster. Even for Chinese XML files, the resulting size in UTF-8 is comparable to or smaller than in UTF-16, because there are so many markers, tags and separators that fit in one byte in UTF-8 but take two in UTF-16.
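The size argument is easy to check on a toy document. The XML fragment below is made up for this sketch; real files will of course vary with the ratio of markup to text.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Six Chinese characters wrapped in ASCII-only markup.
doc = (
    '<?xml version="1.0"?><books><book id="1">'
    "<title>\u7ea2\u697c\u68a6</title><author>\u66f9\u96ea\u82b9</author>"
    "</book></books>"
)

doc_utf8, doc_utf16 = encoded_sizes(doc)
text_utf8, text_utf16 = encoded_sizes("\u7ea2\u697c\u68a6")

print("XML document:", doc_utf8, "bytes in UTF-8,", doc_utf16, "bytes in UTF-16")
print("bare text:   ", text_utf8, "bytes in UTF-8,", text_utf16, "bytes in UTF-16")
```

The bare Chinese text is larger in UTF-8 (three bytes per character against two), but as soon as it is wrapped in tags the ASCII markup dominates and UTF-8 comes out ahead.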
For algorithms, if you treat UTF-16 as UTF-16 and not as UCS-2, the complexity is similar to UTF-8, as UTF-16 doesn't absolve you from the need of dealing with variable length characters.
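As a sketch of why the complexity is similar, here are two minimal code point decoders side by side, written in Python for illustration. Both assume well-formed input and do no validation; the function names are invented for this example.

```python
import struct

def utf16_code_points(units):
    """Decode a sequence of 16-bit code units, combining surrogate pairs."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: one code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

def utf8_code_points(data):
    """Decode UTF-8 bytes; the lead byte gives the sequence length."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 1, b
        elif b < 0xE0:
            n, cp = 2, b & 0x1F
        elif b < 0xF0:
            n, cp = 3, b & 0x0F
        else:
            n, cp = 4, b & 0x07
        for c in data[i + 1:i + n]:
            cp = (cp << 6) | (c & 0x3F)  # append 6 payload bits per trail byte
        out.append(cp)
        i += n
    return out

text = "a\u4e2d\U0001D11E"
raw16 = text.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(raw16) // 2), raw16)
from_utf16 = utf16_code_points(units)
from_utf8 = utf8_code_points(text.encode("utf-8"))
```

Either way you end up with a loop that advances by a variable number of units per character; UTF-16 only has fewer cases.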
The "best" solution IMO would be to:
- introduce Unicode string types, UTF8String, UTF16String, UTF32String with automatic casting between the three types, but manual (non-automatic) casting between Unicode types and AnsiString (obviously) and WideString (because it has been treated as UCS-2 too often in existing code, so this string type isn't a safe UTF-16 container)
- UTF8String and UTF16String would not be arrays of characters, but of Byte and Word (no UTF8Char or UTF16Char heresy).
- UTF32String would be an array of UnicodeChar
- components would have Caption, Text and the rest be of type UTF16String (just like Windows)
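The conversion rules in the list above could be modelled roughly as follows. This is a Python sketch of the idea only, not a proposal for actual Delphi syntax; the class layout, `from_ansi` and `unit_count` are assumptions made for the example.

```python
class UnicodeString:
    """Base for the three Unicode string types; storage is raw code units."""
    encoding = None
    unit_size = None

    def __init__(self, source):
        if isinstance(source, UnicodeString):
            text = source.to_text()  # automatic cast between Unicode types
        elif isinstance(source, str):
            text = source
        else:
            # Raw ANSI bytes are refused: the caller must name a code page.
            raise TypeError("explicit conversion required; use from_ansi()")
        self.data = text.encode(self.encoding)

    @classmethod
    def from_ansi(cls, raw, codepage):
        """Manual (non-automatic) conversion from ANSI bytes."""
        return cls(raw.decode(codepage))

    def to_text(self):
        return self.data.decode(self.encoding)

    def unit_count(self):
        return len(self.data) // self.unit_size

class UTF8String(UnicodeString):
    encoding, unit_size = "utf-8", 1      # array of Byte

class UTF16String(UnicodeString):
    encoding, unit_size = "utf-16-le", 2  # array of Word

class UTF32String(UnicodeString):
    encoding, unit_size = "utf-32-le", 4  # array of UnicodeChar

s8 = UTF8String("a\U0001D11E")
s16 = UTF16String(s8)   # implicit, lossless
s32 = UTF32String(s16)  # implicit, lossless
```

Only in UTF32String does the unit count equal the character count, which is the reason for calling it the "easy" option.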
As for the default aliasing for "String", UTF32String would be the "safe, easy but inefficient" option, UTF8String the "safe, complex but efficient" option, and UTF16String the "unsafe, complex and so-so efficient" option.
As for UTF16String being unsafe, witness all the .NET code out there in which legions of developers assumed that System.Char could hold any Unicode character (it can't: it can hold only characters from the BMP, i.e. UCS-2 characters).
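A typical symptom of that assumption is code that cuts a string at a fixed number of 16-bit units. The Python sketch below (helper names invented for the example) shows a naive truncation splitting a surrogate pair, next to a version that backs off by one unit when needed. It assumes the input is well-formed UTF-16.

```python
def to_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def is_well_formed(units):
    """True if every surrogate is part of a proper high/low pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            return False  # low surrogate with no preceding high surrogate
        else:
            i += 1
    return True

def naive_truncate(units, n):
    """UCS-2 thinking: one unit is one character."""
    return units[:n]

def safe_truncate(units, n):
    """Drop a trailing high surrogate whose partner was cut off."""
    cut = units[:n]
    if cut and 0xD800 <= cut[-1] <= 0xDBFF:
        cut = cut[:-1]
    return cut

units = to_units("ab\U0001D11E")  # 4 code units, 3 characters
broken = naive_truncate(units, 3)
intact = safe_truncate(units, 3)
```

Here `broken` ends in a lone high surrogate, which is no longer valid UTF-16, while `intact` keeps only the two complete characters.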
Eric