Re: Getting prepared for Unicode



Hi Allen,

SizeOf(Char) = 2
string = UnicodeString

If I may throw in a request here: Please don't neglect the UTF8String type.

UTF-8 uses less storage than UTF-16 when working with mostly European languages. Also, neither encoding uses a fixed number of bytes for every Unicode code point, so there is not much difference in the degree of code complexity required to manipulate strings using either encoding.
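
To make the storage point concrete, here is a rough sketch in Python (not Delphi, simply because it is easy to run). The helper name `encoded_sizes` and the sample strings are my own illustration, not anything from the product. It reports the encoded size of a string in both encodings and shows that a code point outside the Basic Multilingual Plane needs more than one code unit in UTF-16 as well.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text.

    'utf-16-le' is used so that no byte order mark is counted.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples (assumptions): plain ASCII, German text with
# two non-ASCII letters, and U+1D11E (musical G clef), which lies
# outside the Basic Multilingual Plane.
samples = ["Hello", "Gr\u00fc\u00dfe", "\U0001D11E"]

for s in samples:
    u8, u16 = encoded_sizes(s)
    print(f"{len(s)} code point(s): UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

The last sample takes four bytes in either encoding: a four-byte sequence in UTF-8 and a surrogate pair in UTF-16, so code that indexes by code unit has to cope with variable width in both cases.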

Since conversion between UTF-8 and UTF-16 is quite fast, I generally prefer to store and manipulate all text as UTF-8, converting to UTF-16 only when displaying the text through the Windows API. (The CPU time used for conversion from UTF-8 to UTF-16 is insignificant compared to the time taken to render the display output.)
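
The workflow I have in mind looks roughly like the following sketch, again in Python rather than Delphi. The function names `to_utf16_for_display` and `from_utf16` are hypothetical, and the actual Windows API call is left out; the point is only that text stays in UTF-8 and is converted at the display boundary, and that the conversion is lossless in both directions.

```python
def to_utf16_for_display(utf8_bytes: bytes) -> bytes:
    """Convert stored UTF-8 text to UTF-16LE, the layout used by
    the wide-character Windows APIs. (Hypothetical helper.)"""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


def from_utf16(utf16_bytes: bytes) -> bytes:
    """Convert UTF-16LE text back to UTF-8 for storage.
    (Hypothetical helper.)"""
    return utf16_bytes.decode("utf-16-le").encode("utf-8")


# Text is kept and manipulated as UTF-8 ...
stored = "Gr\u00fc\u00dfe, Pierre".encode("utf-8")

# ... and converted only when it is about to be displayed.
wide = to_utf16_for_display(stored)

# The round trip returns exactly the bytes we started with.
assert from_utf16(wide) == stored
```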

It would be useful if the language supported implicit conversions between UnicodeString and UTF8String, similar to the way it currently does with WideString and AnsiString.

Thanks,
Pierre


