Re: Defacto standard string library



"J. J. Farrell" <jjf@xxxxxxxxxx> writes:

Ben Bacarisse wrote:
"J. J. Farrell" <jjf@xxxxxxxxxx> writes:

Ben Bacarisse wrote:
"J. J. Farrell" <jjf@xxxxxxxxxx> writes:

user923005 wrote:
<snip>
I guess that the general philosophy of C is that "I know what I am
doing so let me do it" does come into play here. And if you are very
careful you could use C's str*() functions to fiddle with 8 bit
Unicode.
<snip>
It depends entirely on what you need to do. Lots of str*()-based
string manipulation code works as well and correctly with UTF-8
multibyte character strings as it does with ASCII strings. Some things
need some minor coding (for example, strlen() tells you the number of
data bytes in the string, not the number of characters; to get the
number of characters just walk the string as an unsigned char array
and count the number of bytes with values less than 128).
I don't think so.
Good grief; thanks Ben, my memory of UTF-8 encoding is worse than I
thought. That algorithm is nonsense.

Did you mean to add "and >= 192"?
No; I meant what I said, but I was wrong. For some reason I was
thinking that each UTF-8 character encoding ended with a byte in the
ASCII range. As you imply, in UTF-8 each character includes either one
byte in the ASCII range or one byte >= 0xC2. "x < 0x80 || x > 0xC0"
would work correctly.

Sorry to be picky, but your text is wrong because there is a typo (0xC2
when you mean 0xC0), and in the place where you correct that (in the C
expression) you replace the correct >= with an incorrect >.

In summary (and here is where I will make some mistake if Usenet
protocols are to be observed) to count UTF-8 characters:

size_t utf_8_len(const unsigned char *cp)
{
    size_t len = 0;
    while (*cp) {
        len += *cp < 0x80 || *cp >= 0xC0;
        cp += 1;
    }
    return len;
}
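
As a rough illustration of the difference between strlen() and the
counting loop above, here is a minimal sketch. The sample string and
the name `sample` are assumptions added for this example; it spells
"naive" with an i-diaeresis (U+00EF), which UTF-8 encodes as the two
bytes 0xC3 0xAF.

```c
#include <stddef.h>
#include <string.h>

/* Count UTF-8 characters by counting every byte that is not a
   continuation byte (continuation bytes lie in 0x80..0xBF).
   Only meaningful if the input is valid UTF-8 to begin with. */
static size_t utf_8_len(const unsigned char *cp)
{
    size_t len = 0;
    while (*cp) {
        len += *cp < 0x80 || *cp >= 0xC0;
        cp += 1;
    }
    return len;
}

/* Hypothetical sample: 'n' 'a' 0xC3 0xAF 'v' 'e'.
   The literal is split so the hex escapes cannot absorb
   the characters that follow them. */
static const char sample[] = "na" "\xC3\xAF" "ve";
```

With this input, strlen(sample) should report the six data bytes,
while utf_8_len((const unsigned char *)sample) should report five
characters.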

I believe I'm right this time (though the subtle difference doesn't
matter - your code would be just as correct). The lowest possible
value which can appear in the first byte of a multi-byte UTF-8
sequence is 0xC2 (when encoding character value 0x80). The byte values
0xC0 and 0xC1 can't appear in a UTF-8 string.

Yes, my turn to stop thinking. 0xC0 and 0xC1 always indicate an
overlong encoding. There is no harm in a looser test (just as I don't
exclude 0xFE and 0xFF either) since all bets are off for this sort of
code if the sequence is not valid UTF-8 to start with.
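
For comparison, a stricter start-byte test might look like the sketch
below. The names utf_8_is_start and utf_8_len_strict are hypothetical,
and the upper bound of 0xF4 assumes the RFC 3629 definition of UTF-8
(nothing above U+10FFFF). It still does not validate continuation
bytes, so it is not a UTF-8 validator; on valid input it should give
the same count as the looser test.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if b can begin a character in valid
   UTF-8, i.e. it is ASCII (0x00..0x7F) or a lead byte in 0xC2..0xF4.
   Excluded: continuation bytes 0x80..0xBF, the overlong-only lead
   bytes 0xC0 and 0xC1, and 0xF5..0xFF (assumed invalid per RFC 3629). */
static int utf_8_is_start(unsigned char b)
{
    return b < 0x80 || (b >= 0xC2 && b <= 0xF4);
}

/* Same counting loop as the looser version, using the stricter test.
   Bytes that can never start a character are simply not counted. */
static size_t utf_8_len_strict(const unsigned char *cp)
{
    size_t len = 0;
    while (*cp) {
        len += utf_8_is_start(*cp);
        cp += 1;
    }
    return len;
}
```

For example, the two-byte string 0xC2 0x80 (character value 0x80)
counts as one character, while a stray 0xC0 or 0xFF contributes
nothing to the total.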

--
Ben.