Re: Pre Delphi 2008-9 Unicode Do's and Dont's
- From: "Thorsten Engler [NexusDB]" <thorsten.engler@xxxxxxxxxxx>
- Date: 21 Jul 2008 14:55:43 -0700
John Herbster wrote:
Then for "surrogate pairs" which require two WideChars for their
representation, it seems to me that "exactly as before" character
indexing will require sometimes stepping over two WideChars instead
of one.
UTF-16 has the huge advantage that the value ranges for singletons,
leading surrogates and trailing surrogates do not overlap:
"In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead
units the range D800-DBFF and trail units the range DC00-DFFF".
As a result, code like "split this string into individual strings at
each \", and a lot of other string processing that works on a
per-character basis, does not have to worry about surrogate pairs,
because neither half of a pair can ever be mistaken for some other
valid character.
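
To make that concrete, here is a minimal sketch (in Python rather than Delphi, purely for illustration) that splits a UTF-16 code unit sequence at each "\" without any surrogate handling. The helper names classify, to_units, from_units and split_units are made up for this example; the ranges are the ones quoted above.

```python
def classify(unit: int) -> str:
    """Classify one 16-bit UTF-16 code unit by its value range."""
    if 0xD800 <= unit <= 0xDBFF:
        return "lead"
    if 0xDC00 <= unit <= 0xDFFF:
        return "trail"
    return "singleton"


def to_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def from_units(units: list) -> str:
    """Turn a list of UTF-16 code units back into a string."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le")


def split_units(units: list, sep: int) -> list:
    """Split at every code unit equal to sep, looking at one unit at a time."""
    parts, current = [], []
    for u in units:
        if u == sep:
            parts.append(current)
            current = []
        else:
            current.append(u)
    parts.append(current)
    return parts


# U+1D11E (musical G clef) is outside the BMP and needs a surrogate pair.
path = "music\\\U0001D11E\\score"
units = to_units(path)
pieces = [from_units(p) for p in split_units(units, ord("\\"))]
```

The split never inspects more than one code unit at a time, yet the pair D834 DD1E stays together, because neither of those values can equal the code unit for "\" (005C).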
Are the individual WideChars stored big or little endian?
In memory, usually whatever your current hardware platform prefers.
If little endian in Intel RAM, how are they stored in disk "text"
files and communicated over wires?
That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark
All UTF-16 strings that go "over the wire" or onto disk should be
prefixed by a BOM, the character U+FEFF. It shows up as the bytes FE FF
in big-endian data and as FF FE in little-endian data, so the reader
can tell the byte order of the data that follows.
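
A rough sketch of how a writer and a reader might use the BOM, again in Python for illustration only. encode_with_bom and decode_with_bom are hypothetical helper names; codecs.BOM_UTF16_LE and codecs.BOM_UTF16_BE are the standard library's constants for the two byte sequences.

```python
import codecs


def encode_with_bom(text: str, byteorder: str) -> bytes:
    """Encode text as UTF-16 in the given byte order, prefixed by U+FEFF."""
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    return ("\ufeff" + text).encode(codec)


def decode_with_bom(data: bytes) -> str:
    """Pick the byte order from the leading BOM, then decode the rest."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    raise ValueError("no UTF-16 BOM found")


little = encode_with_bom("A", "little")   # FF FE 41 00
big = encode_with_bom("A", "big")         # FE FF 00 41
```

The writer always emits the same character, U+FEFF, and the reader works out the byte order from which of the two byte sequences it sees first.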
What about the surrogate pairs? Is the low or high part of the pair
at the lower address? And ditto for disk files and communications?
The order of the surrogate pairs always remains the same: the leading
one comes before the trailing one.
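
A small illustration (Python, for demonstration only): byte order changes the two bytes inside each 16-bit unit, but not the order of the units. The helper name units_of is made up for this example.

```python
def units_of(raw: bytes, byteorder: str) -> list:
    """Read raw bytes as a list of 16-bit code units in the given byte order."""
    return [int.from_bytes(raw[i:i + 2], byteorder) for i in range(0, len(raw), 2)]


clef = "\U0001D11E"            # needs the pair D834 (lead), DD1E (trail)
le = clef.encode("utf-16-le")  # bytes 34 D8 1E DD
be = clef.encode("utf-16-be")  # bytes D8 34 DD 1E

le_units = units_of(le, "little")
be_units = units_of(be, "big")
```

In both encodings the lead unit D834 sits at the lower address and the trail unit DD1E follows it.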
Does that mean that UTF-16 characters are limited to 4 bytes?
That's why they are called "surrogate pairs" and not "surrogate
sequences" or something like that. You either have a singleton or a
pair of a leading and trailing surrogate.
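
For code that really does need to index by character, here is a minimal sketch of stepping over one or two code units at a time, in Python for illustration only. code_points is a hypothetical helper name, and passing an unpaired surrogate through unchanged is a choice made for this example, not something the standard prescribes.

```python
def code_points(units: list) -> list:
    """Walk UTF-16 code units, combining each lead/trail pair into one code point."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Pair: 10 bits from the lead unit, 10 bits from the trail unit.
            result.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # Singleton (or an unpaired surrogate, passed through as-is here).
            result.append(u)
            i += 1
    return result


decoded = code_points([0x0041, 0xD834, 0xDD1E, 0x0042])
```

Each step consumes either one unit or exactly two, never more, which is why a character never takes more than 4 bytes in UTF-16.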