Re: Defacto standard string library
- From: "Bartc" <bartc@xxxxxxxxxx>
- Date: Sat, 03 Jan 2009 18:05:20 GMT
"Keith Thompson" <kst-u@xxxxxxx> wrote in message news:ln1vvl5k58.fsf@xxxxxxxxxxxxxxxxxx
user923005 <dcorbit@xxxxxxxxx> writes:
The str* functions don't work properly and the mem* functions do.
No, the str* functions work just fine.
Actually, I see no advantage in using mem* functions other than
possibly not trying something that I should not be trying.
I would never use standard C library functions to localize when ICU is
available. It's utter insanity.
Ok, ICU might be the greatest thing since sliced bread; perhaps I'll
take a look at it. That doesn't alter the fact that C's str*
functions *can* be used with UTF-8 strings. They don't, of course, do
everything, but what they do, they do correctly (assuming you use them
correctly).
Let's use 'char' for a typical 8-bit C character, and 'Character' for a proper wide character, ASCII or Unicode, of 8, 16, 32 or however many bits it needs to represent it fully. Then:
I thought UTF-8 was created for the transmission and storage of Character data on systems that can only handle 8-bit char data.
Trying to use C string functions on UTF-8 data seems highly dangerous. Perhaps at the very lowest level, where the C code merely stores the bytes (char s[10]) or copies them around (strcpy()), you might get away with it.
But I'm guessing that a lot of string code in C assumes chars and Characters are interchangeable. And for 8-bit Characters this might be true. But when UTF-8 is being manipulated, this is going to cause problems.
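A minimal sketch of where chars and Characters stop being interchangeable: strlen() counts chars, while counting Characters means skipping UTF-8 continuation bytes. The function name utf8_count is hypothetical, and the sketch assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>   /* strlen(), used by callers for comparison */

/* Count Characters (Unicode code points) in a NUL-terminated UTF-8
   string. Continuation bytes have the bit pattern 10xxxxxx, so every
   byte that does NOT match that pattern starts a new Character.
   Assumption: the input is valid UTF-8; nothing is validated here. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "na" "\xC3\xAF" "ve" (the word "naive" with a diaeresis on the i), strlen() and utf8_count() disagree, so any code that uses one where it means the other will go wrong.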
Just think about what a simple reversestring() function might do to a UTF-8 sequence. Or a sort on the characters of a string (so that "bartc" becomes "abcrt"). Or using the values of a char to index into an array.
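To make the reversestring() case concrete, here is a sketch of a naive byte reversal next to one that keeps multibyte sequences intact. The names reverse_bytes and reverse_utf8 are hypothetical, the input is assumed to be valid UTF-8, and "Character" here means a code point, so combining marks would still end up attached to the wrong base letter.

```c
#include <stddef.h>
#include <string.h>

/* Reverse the bytes of s in the half-open range [lo, hi). */
static void reverse_range(char *s, size_t lo, size_t hi)
{
    while (lo + 1 < hi) {
        char t = s[lo];
        s[lo++] = s[--hi];
        s[hi] = t;
    }
}

/* Naive reversal: treats every char as a Character. On UTF-8 this
   scrambles the bytes of each multibyte sequence. */
void reverse_bytes(char *s)
{
    reverse_range(s, 0, strlen(s));
}

/* UTF-8-aware reversal: reverse all bytes, then re-reverse each
   multibyte sequence so its lead byte comes first again.
   Assumption: s holds valid UTF-8. */
void reverse_utf8(char *s)
{
    size_t len = strlen(s);
    size_t i = 0;

    reverse_range(s, 0, len);
    while (i < len) {
        size_t j = i;
        /* After the byte reversal, continuation bytes (10xxxxxx)
           precede the lead byte of their sequence. */
        while (j < len && ((unsigned char)s[j] & 0xC0) == 0x80)
            j++;
        if (j < len)
            j++;                     /* include the lead byte */
        reverse_range(s, i, j);
        i = j;
    }
}
```

With the input "a" "\xC3\xA9" "b", reverse_bytes() leaves the invalid sequence A9 C3 in the middle, while reverse_utf8() yields "b" "\xC3\xA9" "a".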
strcmp() will only work on UTF-8 if you treat the result as either 0 or not 0, that is, as an equality test. And if you use strcmp() on a mix of UTF-8 and ordinary strings, the result might be meaningless (a string containing a single encoded Unicode Character could match a string of several ordinary chars).
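A small sketch of the mixed-encoding match described above. It assumes "ordinary" means a single-byte encoding such as ISO-8859-1; the identifiers are hypothetical and exist only for illustration.

```c
#include <string.h>

/* U+00E9 (e with acute accent) encoded as UTF-8:
   one Character, two chars. */
static const char utf8_e_acute[] = "\xC3\xA9";

/* The same two bytes read as ISO-8859-1: A with tilde followed by
   the copyright sign, i.e. two Characters, two chars.
   Assumption: the "ordinary" string is Latin-1. */
static const char latin1_two_chars[] = "\xC3\xA9";

/* strcmp() compares bytes only; it has no idea which encoding each
   buffer is supposed to be in. */
int bytes_match(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here bytes_match(utf8_e_acute, latin1_two_chars) reports a match even though the two strings mean different things to their owners.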
What I'm saying is that I think it's a bad idea to use C string functions on strings known to contain UTF-8. And that maybe you shouldn't be doing any processing on UTF-8 itself, but on proper wide char arrays (although I'm sure there are libraries containing some hairy code for working directly with UTF-8).
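As a rough sketch of the "convert first, process afterwards" approach, the following decodes UTF-8 into an array of 32-bit code points. The function utf8_decode is hypothetical and deliberately minimal: it rejects bad lead bytes, bad continuation bytes and truncated sequences, but it does not reject overlong forms or surrogates, which a production decoder should.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into out[], writing at most
   cap code points. Returns the number written, or (size_t)-1 if the
   input is malformed. */
size_t utf8_decode(const char *s, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != 0 && n < cap) {
        uint32_t cp;
        int extra;                   /* continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation or bad lead */
        p++;

        while (extra-- > 0) {
            /* A NUL here fails this check too, so a truncated
               sequence is reported as malformed. */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
        }
        out[n++] = cp;
    }
    return n;
}
```

Once the data is in the uint32_t array, each element is one Character, so reversing, sorting or indexing by element behaves the way char-based code expects.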
--
Bartc