Re: Defacto standard string library



Bartc wrote:
But I'm guessing that a lot of string code in C assumes chars and Characters are interchangeable. And for 8-bit Characters this might be true. But when UTF-8 is being manipulated, this is going to cause problems.

That depends what you do to the strings; a heck of a lot of code can handle UTF-8 strings without modification because it doesn't break up multibyte characters.

Just think about what a simple reversestring() function might do to a UTF-8 sequence. Or a sort on the characters of a string (so that "bartc" becomes "abcrt"). Or using the values of a char to index into an array.

Indeed, those scenarios are problematic for _any_ multibyte encoding.
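
To make the reversal case concrete, here is a minimal sketch of a reversal that keeps multibyte characters intact. The names utf8_reverse() and reverse_bytes() are made up for illustration, and the code assumes the input is already valid UTF-8. It reverses the bytes of each character first and then reverses the whole buffer, so each character's bytes end up back in their original order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reverse the bytes of s[lo..hi) in place. */
static void reverse_bytes(char *s, size_t lo, size_t hi)
{
    while (lo + 1 < hi) {
        char tmp = s[lo];
        s[lo++] = s[--hi];
        s[hi] = tmp;
    }
}

/* Hypothetical helper: reverse a valid UTF-8 string by code point.
 * Assumes well-formed input; nothing is validated. */
void utf8_reverse(char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    size_t start = 0;

    for (size_t i = 1; i <= len; i++) {
        /* A character ends at i unless s[i] is a continuation byte
         * (bit pattern 10xxxxxx). */
        if (i == len || ((unsigned char)s[i] & 0xC0) != 0x80) {
            reverse_bytes(s, start, i);
            start = i;
        }
    }
    /* Reversing the whole buffer restores the byte order inside each
     * character while reversing the order of the characters. */
    reverse_bytes(s, 0, len);
}
```

Note that this only keeps code points whole. A base letter followed by a combining accent would still come out in the wrong order, and that problem exists for wide characters too.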

strcmp() will only work on UTF-8 if you make use of the result as either 0 or not 0.

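The nonzero results also work fine for sorting UTF-8 strings: strcmp() compares bytes as unsigned char, and UTF-8 bytewise order matches code point order. The magnitude may vary from what wcscmp() would return, but the sign will be the same from both functions (assuming wchar_t holds whole code points rather than UTF-16 units), and that's often enough.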
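
As a rough illustration of the sorting case, the sketch below hands strcmp() to qsort() for an array of UTF-8 strings. The names cmp_utf8() and sort_utf8() are hypothetical. The result is code point order, which is not the same thing as a locale-aware collation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for an array of char * holding UTF-8 strings.
 * Only the sign of the strcmp() result matters here. */
static int cmp_utf8(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Hypothetical helper: sort n UTF-8 strings into code point order. */
void sort_utf8(const char **strings, size_t n)
{
    assert(strings != NULL || n == 0);
    if (n > 0)
        qsort(strings, n, sizeof *strings, cmp_utf8);
}
```

Given "€" (U+20AC), "z", "é" (U+00E9) and "a", this should order them as "a", "z", "é", "€", the same order a comparison of the decoded code points would give.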

And if you use strcmp() on mixed UTF-8 and ordinary strings, then the result might be meaningless (a string containing a single encoded Unicode Character could match a string of several ordinary chars).

Only if the "ordinary" strings contain characters with the high bit set, in which case they're not so "ordinary". However, the same problem exists when you're comparing strings encoded with two different variants of ISO-8859; comparisons only work when the strings use the same encoding. And, if you're dealing with multiple encodings, the only effective solution is to convert all your strings to a common universal encoding, e.g. UTF-8 or UTF-32.
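
For one concrete case of converting to a common encoding, here is a minimal sketch that turns ISO-8859-1 into UTF-8. The function name latin1_to_utf8() is made up, and the sketch only covers ISO-8859-1, whose byte values are identical to code points U+0000 through U+00FF; the other ISO-8859 variants need a mapping table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert an ISO-8859-1 string to UTF-8.
 * out must have room for 2 * strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t latin1_to_utf8(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (c < 0x80) {
            /* 7-bit characters are identical in both encodings. */
            out[n++] = (char)c;
        } else {
            /* High-bit characters become a two-byte sequence:
             * 110xxxxx 10xxxxxx. */
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Strings with no high-bit characters pass through unchanged, which is why plain 7-bit strings can be compared against UTF-8 without any conversion.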

What I'm saying is that I think it's a bad idea to use C string functions on strings known to contain UTF-8.

The primary purpose of UTF-8 was to get preexisting code to accept Unicode characters, though it's become pretty common in file/communications formats that require a dense, universal encoding.

And that maybe you shouldn't be doing any processing on UTF-8 itself, but on proper wide char arrays (although I'm sure there are libraries containing some hairy code for working directly with UTF-8).

If you're doing extensive manipulation of Unicode strings, converting to wide characters is almost always the correct solution. For relatively minor manipulation, though, UTF-8 is often sufficient.
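
A sketch of what the conversion step might look like, under some loud assumptions: the name utf8_decode() is hypothetical, the input is assumed to be valid UTF-8, and the output is a uint32_t array rather than wchar_t, since wchar_t is only 16 bits on some platforms. The standard mbstowcs() does the same job for wchar_t, but only when the current locale uses UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a UTF-8 string into code points.
 * Writes at most cap code points to out and returns the count.
 * No validation is done: overlong forms, surrogates and truncated
 * sequences are not rejected. */
size_t utf8_decode(const char *s, uint32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != '\0' && n < cap) {
        uint32_t cp;
        int extra;  /* number of continuation bytes expected */

        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;

        /* Fold in continuation bytes (10xxxxxx), six bits at a time.
         * The check also stops the loop at the terminator. */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (*p++ & 0x3F);

        out[n++] = cp;
    }
    return n;
}
```

Once the string is an array of code points, things like reversal or indexing by character position become simple array operations, at the cost of up to four bytes per character.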

S