Re: Defacto standard string library




"Keith Thompson" <kst-u@xxxxxxx> wrote in message news:ln1vvl5k58.fsf@xxxxxxxxxxxxxxxxxx
user923005 <dcorbit@xxxxxxxxx> writes:

The str* functions don't work properly and the mem* functions do.

No, the str* functions work just fine.

Actually, I see no advantage in using mem* functions other than
possibly not trying something that I should not be trying.

I would never use standard C library functions to localize when ICU is
available. It's utter insanity.

Ok, ICU might be the greatest thing since sliced bread; perhaps I'll
take a look at it. That doesn't alter the fact that C's str*
functions *can* be used with UTF-8 strings. They don't, of course, do
everything, but what they do, they do correctly (assuming you use them
correctly).

Let's use 'char' for a typical 8-bit C character, and 'Character' for a proper wide character, ASCII or Unicode, of 8, 16, 32 or however many bits it needs to fully represent it. Then:

I thought UTF-8 was created for the transmission and storage of Character data on systems adept only at 8-bit char data.

Trying to use C string functions on UTF-8 data seems highly dangerous. Perhaps at the very lowest level, where the C functions just do the equivalent of storing (char s[10]) or transmitting (strcpy()), you might get away with it.

But I'm guessing that a lot of string code in C assumes chars and Characters are interchangeable. And for 8-bit Characters this might be true. But when UTF-8 is being manipulated, this is going to cause problems.
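To make the char/Character distinction concrete, here is a minimal sketch. The helper name utf8_count is made up for illustration, and it assumes the input is already valid UTF-8 (it does no validation). It counts Characters by skipping continuation bytes, whereas strlen() counts chars.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count Characters (code points) in a UTF-8 string.
   Continuation bytes have the bit pattern 10xxxxxx, so every byte that
   does not match that pattern starts a new Character.
   Assumes valid UTF-8; no validation is attempted. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "caf\xC3\xA9" (the last Character is U+00E9, two bytes in UTF-8), strlen() reports 5 chars while utf8_count() reports 4 Characters. Code that treats the two numbers as the same thing is the kind of code I mean.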

Just think about what a simple reversestring() function might do to a UTF-8 sequence. Or a sort on the characters of a string (so that "bartc" becomes "abcrt"). Or using the values of a char to index into an array.
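As a rough sketch of the reversal problem, here is the sort of char-by-char reverse that a lot of C code would contain. The name reverse_bytes is only for illustration; it stands in for the reversestring() mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Reverse a string in place, one char (byte) at a time.
   Correct for 8-bit Characters, but it also reverses the bytes
   inside each multi-byte UTF-8 sequence. */
void reverse_bytes(char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len / 2; i++) {
        char tmp = s[i];
        s[i] = s[len - 1 - i];
        s[len - 1 - i] = tmp;
    }
}
```

Applied to "a\xC3\xA9" (the letter a followed by U+00E9), the result is the bytes 0xA9 0xC3 'a'. That now begins with a continuation byte, so it is no longer valid UTF-8. A UTF-8-aware reverse would have to move each multi-byte sequence as a unit.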

strcmp() will only work on UTF-8 if you make use of the result as either 0 or not 0. And if you use strcmp() on mixed UTF-8 and ordinary strings, then the result might be meaningless (a string containing a single encoded Unicode Character could match a string of several ordinary chars).
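A small sketch of the mixed-string case, assuming the "ordinary" strings are ISO 8859-1 (Latin-1); that choice of encoding and the names below are my own for illustration. strcmp() sees only bytes, so it cannot tell one UTF-8 Character from two Latin-1 chars with the same byte values.

```c
#include <string.h>

/* Hypothetical helper: true when strcmp() considers the strings equal. */
int bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* One Character, U+00E9 (e with acute accent), encoded as UTF-8. */
const char utf8_text[] = "\xC3\xA9";

/* Two Latin-1 chars: 0xC3 (A with tilde) and 0xA9 (copyright sign). */
const char latin1_text[] = { (char)0xC3, (char)0xA9, '\0' };
```

bytes_equal(utf8_text, latin1_text) returns nonzero, even though one string holds a single Character and the other holds two.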

What I'm saying is that I think it's a bad idea to use C string functions on strings known to contain UTF-8. And that maybe you shouldn't be doing any processing on UTF-8 itself, but on proper wide char arrays (although I'm sure there are libraries containing some hairy code for working directly with UTF-8).

--
Bartc
