Re: Defacto standard string library
- From: "J. J. Farrell" <jjf@xxxxxxxxxx>
- Date: Sun, 04 Jan 2009 01:57:55 +0000
Bartc wrote:
"Keith Thompson" <kst-u@xxxxxxx> wrote in message news:lnvdsv3ouk.fsf@xxxxxxxxxxxxxxxxxx
Phil Carmody <thefatphil_demunged@xxxxxxxxxxx> writes:
"Bartc" <bartc@xxxxxxxxxx> writes:
[...]
"Phil Carmody" <thefatphil_demunged@xxxxxxxxxxx> wrote in message news:87eizk5fr0.fsf@xxxxxxxxxxxxxxxxxxxxxxx
[...]
How does "\xEF\xBB\xBF\x40" compare against "\x41" using strcmp()?
Apparently the EF BB BF 40 sequence would be invalid UTF-8 (because
it's not the shortest way of encoding x40).
It's the first line I read from the UTF-8 encoded file that I just
fopen()ed.
Assuming Bartc is correct, if your file contains "\xEF\xBB\xBF\x40"
then it's not a UTF-8 encoded file. If the question is whether you
can compare UTF-8 vs. ASCII, presenting an example that's neither
UTF-8 nor ASCII is not particularly useful.
Probably not. I assumed EF BB BF 40 was another way of encoding x40. But EF BB BF is some sort of marker according to:
"Stephen Sprunk" <stephen@xxxxxxxxxx> wrote in message news:3pQ7l.516$%54.471@xxxxxxxxxxxxxxxxxxxxxxx
(EF BB BF is UTF-8 for 0xFEFF)
It is actually a valid Unicode character, the "zero-width no-break space". It's used as a byte-order mark - embedded meta-data - since it's an "invisible" character which makes no difference to the appearance of the file when it's printed. UTF-8 doesn't need a BOM, but the same concept has been extended as a means of distinguishing a UTF-8 file from every other type of file (sort of). It can cause many problems.
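To see why U+FEFF shows up as the bytes EF BB BF, here is a minimal sketch of a UTF-8 encoder. The name utf8_encode is hypothetical and not part of any standard library; the function only illustrates the bit layout and assumes the caller supplies a buffer of at least four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[].
   Returns the number of bytes written, or 0 if cp is a surrogate
   or lies above U+10FFFF. out must have room for 4 bytes. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0xFEFF, buf) writes the three bytes EF BB BF, while utf8_encode(0x40, buf) writes the single byte 40, so the four-byte sequence in the question is simply U+FEFF followed by '@'.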
The comparison would correctly report a difference since the first characters of the file are different when treated as UTF-8 encoded characters. If one string has junk on the front of it before the data you're interested in, you either need to strip that junk off or make sure the same junk is on the other string.
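A small sketch of the "strip that junk off" option, assuming the BOM can only appear at the very start of the string. The name skip_utf8_bom is made up for this illustration.

```c
#include <string.h>

/* Hypothetical helper: return a pointer just past a leading UTF-8
   BOM (EF BB BF) if one is present, otherwise return s unchanged.
   strncmp stops at the terminating NUL, so short strings are safe. */
const char *skip_utf8_bom(const char *s)
{
    if (strncmp(s, "\xEF\xBB\xBF", 3) == 0)
        return s + 3;
    return s;
}
```

With the raw line, strcmp("\xEF\xBB\xBF\x40", "\x41") is positive, because strcmp compares bytes as unsigned char and 0xEF is greater than 0x41. After skipping the BOM, strcmp(skip_utf8_bom("\xEF\xBB\xBF\x40"), "\x41") is negative, since 0x40 is less than 0x41.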
C's str*() functions can be effectively and correctly used to handle UTF-8 encoded strings, providing a large subset of the functionality they provide for ASCII strings. They certainly don't know about meta-data embedded in data strings, whether those strings contain single-byte characters or multi-byte characters.
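As a rough illustration of which parts carry over: strlen() still works but counts bytes, strstr() finds a multi-byte character as an ordinary byte sequence, and strcmp() on valid UTF-8 orders strings by code point. Counting characters is one thing the standard functions do not provide; the sketch below does it with a hypothetical helper, utf8_count, which assumes its input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a valid UTF-8 string
   by counting every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "na\xC3\xAFve" (the word "naive" with U+00EF in place of the plain i), strlen() gives 6 and utf8_count() gives 5, and strstr() locates "\xC3\xAF" at offset 2.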