Re: Defacto standard string library
- From: Ben Bacarisse <ben.usenet@xxxxxxxxx>
- Date: Sun, 04 Jan 2009 02:56:00 +0000
"J. J. Farrell" <jjf@xxxxxxxxxx> writes:
> Ben Bacarisse wrote:
>> "J. J. Farrell" <jjf@xxxxxxxxxx> writes:
>>> Ben Bacarisse wrote:
>>>> "J. J. Farrell" <jjf@xxxxxxxxxx> writes:
>>>>> user923005 wrote:
<snip>
>>>>>> I guess that the general philosophy of C is that "I know what I am
>>>>>> doing so let me do it" does come into play here. And if you are very
>>>>>> careful you could use C's str*() functions to fiddle with 8 bit
>>>>>> Unicode.
>>>>>
>>>>> It depends entirely on what you need to do. Lots of str*()-based
>>>>> string manipulation code works as well and correctly with UTF-8
>>>>> multibyte character strings as it does with ASCII strings. Some things
>>>>> need some minor coding (for example, strlen() tells you the number of
>>>>> data bytes in the string, not the number of characters; to get the
>>>>> number of characters just walk the string as an unsigned char array
>>>>> and count the number of bytes with values less than 128).
>>>>
>>>> I don't think so.
>>>
>>> Good grief; thanks Ben, my memory of UTF-8 encoding is worse than I
>>> thought. That algorithm is nonsense.
>>>
>>>> Did you mean to add "and >= 192"?
>>>
>>> No; I meant what I said, but I was wrong. For some reason I was
>>> thinking that each UTF-8 character encoding ended with a byte in the
>>> ASCII range. As you imply, in UTF-8 each character includes either one
>>> byte in the ASCII range or one byte >= 0xC2. "x < 0x80 || x > 0xC0"
>>> would work correctly.
>>
>> Sorry to be picky, but your text is wrong because there is a typo (0xC2
>> when you mean 0xC0), and in the place where you correct that (in the C
>> expression) you replace the correct >= with an incorrect >.
>>
>> In summary (and here is where I will make some mistake if Usenet
>> protocols are to be observed) to count UTF-8 characters:
>>
>> size_t utf_8_len(const unsigned char *cp)
>> {
>>     size_t len = 0;
>>     while (*cp) {
>>         len += *cp < 0x80 || *cp >= 0xC0;
>>         cp += 1;
>>     }
>>     return len;
>> }
>
> I believe I'm right this time (though the subtle difference doesn't
> matter - your code would be just as correct). The lowest possible
> value which can appear in the first byte of a multi-byte UTF-8
> sequence is 0xC2 (when encoding character value 0x80). The byte values
> 0xC0 and 0xC1 can't appear in a UTF-8 string.

Yes, my turn to stop thinking. 0xC0 and 0xC1 always indicate an
overlong encoding. There is no harm in a looser test (just as I don't
exclude 0xFE and 0xFF either) since all bets are off for this sort of
code if the sequence is not valid UTF-8 to start with.
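
For anyone who wants to try this, here is a minimal sketch of the two
tests discussed in this thread. It assumes the input is already valid,
NUL-terminated UTF-8. The names utf8_count and utf8_is_valid_lead are
made up for illustration and come from no library. The first function
counts every byte that is not a continuation byte (10xxxxxx), which
gives the same result as the "< 0x80 || >= 0xC0" test; the second is
the stricter lead-byte check, which also rejects 0xC0, 0xC1 and
anything above 0xF4.

```c
#include <stddef.h>

/* Count characters (code points) in a NUL-terminated string that is
   assumed to be valid UTF-8. Every byte that is NOT a continuation
   byte (bit pattern 10xxxxxx) starts a new character. */
size_t utf8_count(const char *s)
{
    const unsigned char *cp = (const unsigned char *)s;
    size_t len = 0;

    for (; *cp; cp++)
        len += (*cp & 0xC0) != 0x80;
    return len;
}

/* Stricter test (hypothetical helper): return 1 if b can be the first
   byte of a character in valid UTF-8. That is either an ASCII byte or
   a lead byte in the range 0xC2..0xF4; 0xC0 and 0xC1 would only start
   overlong encodings, and 0xF5..0xFF never appear. */
int utf8_is_valid_lead(unsigned char b)
{
    return b < 0x80 || (b >= 0xC2 && b <= 0xF4);
}
```

With the string "na\xC3\xAFve \xE2\x82\xAC" (the word "naive" with a
diaeresis on the i, a space, and a euro sign), strlen() reports 10
bytes while utf8_count() reports 7 characters. On input that is not
valid UTF-8 neither count means much, which is the point made above.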
--
Ben.