Re: Pre Delphi 2008-9 Unicode Do's and Dont's



John Herbster wrote:

Then for "surrogate pairs" which require two WideChars for their
representation, it seems to be that "exactly as before" character
indexing will require sometimes stepping over two WideChars instead
of one.
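
To make the "stepping over two WideChars" point concrete, here is a minimal sketch in Python (used only for illustration, not Delphi). The helper names `utf16_units` and `char_starts` are made up for this example; the sketch assumes the string is held as a list of 16-bit code units, which is what an array of WideChars amounts to.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def char_starts(units):
    """Yield (index, length) per character; length is 1 or 2 code units."""
    i = 0
    while i < len(units):
        is_lead = 0xD800 <= units[i] <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # Step over both halves of a well-formed pair, otherwise one unit.
        step = 2 if is_lead and has_trail else 1
        yield i, step
        i += step


units = utf16_units("a\U0001D11Eb")  # U+1D11E needs a surrogate pair
print([hex(u) for u in units])
print(list(char_starts(units)))
```

The second character occupies two code units, so the index advances by two there and by one everywhere else.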

UTF-16 has the huge advantage that the values for singletons, leading
surrogates and trailing surrogates do not overlap:

"In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead
units the range D800-DBFF and trail units the range DC00-DFFF".
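
Because the three ranges are disjoint, a single code unit can be classified without looking at its neighbours. A small sketch in Python (illustration only; `classify` is a hypothetical helper name):

```python
def classify(unit):
    """Classify one 16-bit UTF-16 code unit by its value alone."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if 0xD800 <= unit <= 0xDBFF:
        return "lead"
    if 0xDC00 <= unit <= 0xDFFF:
        return "trail"
    return "singleton"  # 0000-D7FF and E000-FFFF


for unit in (0x005C, 0xD7FF, 0xD800, 0xDBFF, 0xDC00, 0xDFFF, 0xE000):
    print(f"{unit:04X} {classify(unit)}")
```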

As a result, for code like "split this string into individual
strings at each \" and a lot of other string processing that works
on a per-character basis, you don't have to worry about surrogate
pairs, because neither unit of a pair can ever be mistaken for some
other valid character.
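
A sketch of that claim, in Python for illustration only: splitting at the backslash by comparing raw 16-bit code units gives the same pieces as splitting by character, even when a piece contains a surrogate pair. The helpers `to_units`, `from_units` and `split_units` are hypothetical names for this example.

```python
BACKSLASH = 0x005C


def to_units(text):
    """UTF-16 code units of text, as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def from_units(units):
    """Rebuild a string from UTF-16 code units."""
    return b"".join(u.to_bytes(2, "little") for u in units).decode("utf-16-le")


def split_units(units, sep):
    """Split on a separator code unit, with no surrogate handling at all."""
    parts, current = [], []
    for u in units:
        if u == sep:
            parts.append(current)
            current = []
        else:
            current.append(u)
    parts.append(current)
    return parts


path = "dir\\\U0001D11E\\file"  # middle piece is one surrogate pair
pieces = [from_units(p) for p in split_units(to_units(path), BACKSLASH)]
print(pieces == path.split("\\"))
```

The pair D834 DD1E passes through untouched because neither of its units equals 005C.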

Are the individual WideChars stored big or little endian?
In memory, usually whatever your current hardware platform prefers.

If little endian in Intel RAM, how are they stored in disk "text"
files and communicated over wires?

That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark

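All UTF-16 text that goes "over the wire" or onto disk should be
prefixed by a BOM, the character U+FEFF. It is written as the bytes
FF FE for little-endian data or FE FF for big-endian data. A reader
that decodes the first unit as U+FFFE (a noncharacter) knows it has
the byte order backwards.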
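
As a rough sketch of how a reader can use the BOM (Python for illustration; `detect_utf16_order` and `decode_utf16` are hypothetical helpers, and falling back to little endian when no BOM is present is an assumption of this example): the BOM is the character U+FEFF, which shows up as the bytes FF FE in little-endian data and FE FF in big-endian data.

```python
import codecs


def detect_utf16_order(data):
    """Return 'little', 'big', or None if the data has no UTF-16 BOM."""
    if data[:2] == codecs.BOM_UTF16_LE:  # bytes FF FE
        return "little"
    if data[:2] == codecs.BOM_UTF16_BE:  # bytes FE FF
        return "big"
    return None


def decode_utf16(data, default="little"):
    """Decode UTF-16 bytes, honouring a BOM if one is present."""
    order = detect_utf16_order(data)
    body = data[2:] if order else data
    codec = "utf-16-le" if (order or default) == "little" else "utf-16-be"
    return body.decode(codec)


little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(detect_utf16_order(little), detect_utf16_order(big))
print(decode_utf16(little) == decode_utf16(big))
```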

What about the surrogate pairs? Is the low or high part of the pair
at the lower address? And ditto for disk files and communications?
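The order within a surrogate pair always remains the same: the leading
unit comes before the trailing one, in memory, on disk and over the
wire. Byte order only affects the two bytes inside each 16-bit unit.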
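
A quick way to see this, sketched in Python for illustration (`units` is a hypothetical helper): encode one supplementary character in both byte orders and compare. The bytes inside each 16-bit unit swap, but the lead unit stays first.

```python
def units(data, byteorder):
    """Read bytes as 16-bit code units in the given byte order."""
    return [int.from_bytes(data[i:i + 2], byteorder)
            for i in range(0, len(data), 2)]


text = "\U0001D11E"  # one character, two UTF-16 code units
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

print(le.hex(" "))  # 34 d8 1e dd
print(be.hex(" "))  # d8 34 dd 1e
print([hex(u) for u in units(le, "little")])
print([hex(u) for u in units(be, "big")])
```

Both unit lists come out as lead D834 followed by trail DD1E.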

Does that mean that UTF-16 characters are limited to 4-bytes?
Yes. That's why they are called "surrogate pairs" and not "surrogate
sequences" or something like that. You either have a singleton
(2 bytes) or a pair of a leading and a trailing surrogate (4 bytes).
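
The arithmetic behind that limit can be sketched as follows (Python for illustration; `encode_utf16` is a hypothetical helper, not a library function). A code point above U+FFFF has 0x10000 subtracted, leaving at most 20 bits, which are split 10/10 across the lead and trail units, so there is never a third unit.

```python
def encode_utf16(cp):
    """Return the UTF-16 code units (one or two ints) for a code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x10000:
        return [cp]                      # singleton
    v = cp - 0x10000                     # at most 20 bits remain
    return [0xD800 | (v >> 10),          # lead: high 10 bits
            0xDC00 | (v & 0x3FF)]        # trail: low 10 bits


for cp in (0x41, 0x1D11E, 0x10FFFF):
    print(f"U+{cp:04X}", [f"{u:04X}" for u in encode_utf16(cp)])
```

The largest code point, U+10FFFF, maps to DBFF DFFF, the top of both surrogate ranges.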

