Re: Defacto standard string library
- From: Stephen Sprunk <stephen@xxxxxxxxxx>
- Date: Sat, 03 Jan 2009 14:56:30 -0600
Bartc wrote:
But I'm guessing that a lot of string code in C assumes chars and Characters are interchangeable. And for 8-bit Characters this might be true. But when UTF-8 is being manipulated, this is going to cause problems.
That depends what you do to the strings; a heck of a lot of code can handle UTF-8 strings without modification because it doesn't break up multibyte characters.
Just think about what a simple reversestring() function might do to a UTF-8 sequence. Or a sort on the characters of a string (so that "bartc" becomes "abcrt"). Or using the values of a char to index into an array.
Indeed, those scenarios are problematic for _any_ multibyte encoding.
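To make the reversal case concrete, here is a minimal sketch of a reverse that keeps multibyte sequences intact. The names utf8_reverse, utf8_seq_len and reverse_bytes are made up for illustration, and the code assumes the input is already valid UTF-8. It first reverses the bytes inside each sequence, then reverses the whole buffer, so every sequence ends up back in its original byte order.

```c
#include <stddef.h>
#include <string.h>

/* Reverse the bytes s[lo..hi] (inclusive) in place. */
static void reverse_bytes(char *s, size_t lo, size_t hi)
{
    while (lo < hi) {
        char tmp = s[lo];
        s[lo++] = s[hi];
        s[hi--] = tmp;
    }
}

/* Byte length of the UTF-8 sequence that starts with lead byte b.
   Assumption: input is valid UTF-8; anything unexpected counts as 1. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: reverse a UTF-8 string by code point, in place. */
void utf8_reverse(char *s)
{
    size_t n = strlen(s);
    size_t i = 0;

    if (n == 0) return;

    /* Pass 1: reverse the bytes inside each multibyte sequence. */
    while (i < n) {
        size_t len = utf8_seq_len((unsigned char)s[i]);
        if (len > n - i) len = n - i;   /* truncated final sequence */
        reverse_bytes(s, i, i + len - 1);
        i += len;
    }

    /* Pass 2: reverse the whole buffer; each sequence is restored. */
    reverse_bytes(s, 0, n - 1);
}
```

This only keeps code points whole. A base letter followed by a combining mark would still come out in the wrong order, and that problem exists with wide characters too.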
strcmp() will only work on UTF-8 if you make use of the result as either 0 or not 0.
The nonzero results also work fine for sorting UTF-8 strings; the magnitude may vary from what wcscmp() would return, but the sign will be the same from both functions, and that's often enough.
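A small sketch of why the sign agrees: strcmp() compares bytes as unsigned char, and UTF-8 is designed so that byte order matches code point order. In the code below, arrays of uint32_t stand in for UTF-32 strings (an assumption, to avoid depending on the platform's wchar_t width), and same_order, codepoint_cmp and sign_of are names invented for the example.

```c
#include <stdint.h>
#include <string.h>

/* Collapse any int to -1, 0 or +1. */
static int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/* Compare two zero-terminated arrays of code points; returns -1, 0 or +1. */
static int codepoint_cmp(const uint32_t *a, const uint32_t *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}

/* Hypothetical check: returns 1 if strcmp() on the UTF-8 forms has the
   same sign as a code point comparison of the same two strings.
   The caller supplies both encodings of each string. */
int same_order(const char *u8a, const uint32_t *cpa,
               const char *u8b, const uint32_t *cpb)
{
    return sign_of(strcmp(u8a, u8b)) == codepoint_cmp(cpa, cpb);
}
```

For example, "\xC3\xA9" (U+00E9) sorts after "z" (U+007A) under both comparisons. One caveat: where wchar_t is 16 bits and holds UTF-16, wcscmp() can order characters above U+FFFF differently from code point order, so the agreement is with UTF-32 ordering specifically.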
And if you use strcmp() on mixed UTF-8 and ordinary strings, then the result might be meaningless (a string containing a single encoded Unicode Character could match a string of several ordinary chars).
Only if the "ordinary" strings contain characters with the high bit set, in which case they're not so "ordinary". However, the same problem exists when you're comparing strings encoded with two different variants of ISO-8859; comparisons only work when the strings use the same encoding. And, if you're dealing with multiple encodings, the only effective solution is to convert all your strings to a common universal encoding, e.g. UTF-8 or UTF-32.
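The "high bit set" test is easy to express in code. A minimal sketch, with is_pure_ascii being a name made up for this example: a string that passes has the same bytes whether it is read as ASCII, UTF-8, or any ISO-8859 variant, so comparing it against UTF-8 is safe.

```c
#include <stdbool.h>

/* Hypothetical helper: true if no byte in s has the high bit set. */
bool is_pure_ascii(const char *s)
{
    /* Go through unsigned char so the test does not depend on
       whether plain char is signed on this platform. */
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if (*p & 0x80) {
            return false;
        }
    }
    return true;
}
```

The same cast to unsigned char is what you want before using a char value as an array index, since a high-bit byte may otherwise become a negative index.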
What I'm saying is that I think it's a bad idea to use C string functions on strings known to contain UTF-8.
The primary purpose of UTF-8 was to get preexisting code to accept Unicode characters, though it has since become pretty common in file and communications formats that require a dense, universal encoding.
And that maybe you shouldn't be doing any processing on UTF-8 itself, but on proper wide char arrays (although I'm sure there are libraries containing some hairy code for working directly with UTF-8).
If you're doing extensive manipulation of Unicode strings, converting to wide characters is almost always the correct solution. For relatively minor manipulation, though, UTF-8 is often sufficient.
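As a rough illustration of the conversion step, here is a hand-rolled decoder from UTF-8 to an array of code points. The function name utf8_to_utf32 and the use of uint32_t in place of wchar_t are assumptions for the sketch. In real code the standard route would be mbstowcs() after selecting a UTF-8 locale with setlocale(), or a dedicated library. The sketch checks sequence structure but does not reject overlong forms or surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode NUL-terminated UTF-8 into out, which has
   room for cap code points. Returns the number of code points written,
   or (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != 0) {
        uint32_t cp;
        int extra;   /* continuation bytes still expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or bad lead */
        p++;

        while (extra-- > 0) {
            /* A NUL here fails the test too, so truncation is caught. */
            if ((*p & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
        }

        if (n == cap) return (size_t)-1;
        out[n++] = cp;
    }
    return n;
}
```

Once the text is in this form, reversing, sorting, or indexing by character works on whole code points, and the result can be re-encoded to UTF-8 at the end.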
S