Re: Defacto standard string library



Bartc wrote:

"Keith Thompson" <kst-u@xxxxxxx> wrote in message news:lnvdsv3ouk.fsf@xxxxxxxxxxxxxxxxxx
Phil Carmody <thefatphil_demunged@xxxxxxxxxxx> writes:
"Bartc" <bartc@xxxxxxxxxx> writes:
"Phil Carmody" <thefatphil_demunged@xxxxxxxxxxx> wrote in message
news:87eizk5fr0.fsf@xxxxxxxxxxxxxxxxxxxxxxx
[...]
How does "\xEF\xBB\xBF\x40" compare against "\x41" using strcmp()?

Apparently the EF BB BF 40 sequence would be invalid UTF-8 (because
it's not the shortest way of encoding x40).

It's the first line I read from the UTF-8 encoded file that I just
fopen()ed.
[...]

Assuming Bartc is correct, if your file contains "\xEF\xBB\xBF\x40"
then it's not a UTF-8 encoded file. If the question is whether you
can compare UTF-8 vs. ASCII, presenting an example that's neither
UTF-8 nor ASCII is not particularly useful.

Probably not. I assumed EF BB BF 40 was another way of encoding x40. But EF BB BF is some sort of marker according to:

"Stephen Sprunk" <stephen@xxxxxxxxxx> wrote in message news:3pQ7l.516$%54.471@xxxxxxxxxxxxxxxxxxxxxxx
(EF BB BF is UTF-8 for 0xFEFF)

It is actually a valid Unicode character, U+FEFF, the "zero-width no-break space". It's used as a byte-order mark (embedded meta-data) because it's an "invisible" character that makes no difference to the appearance of the file when it's printed. UTF-8 is a byte-oriented encoding with no byte-order ambiguity, so it doesn't need a BOM, but the same concept has been extended as a signature for distinguishing a UTF-8 file from every other type of file (sort of). That signature can cause many problems for software that doesn't expect it.
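To see where EF BB BF comes from, here is a minimal sketch of the one- to three-byte UTF-8 encoding rules. The function name utf8_encode_bmp is made up for illustration; it covers only code points up to U+FFFF and does not reject surrogates.

```c
#include <stddef.h>

/* Hypothetical helper: encode one code point in the range 0..0xFFFF as
   UTF-8. Writes up to 3 bytes to out and returns the byte count, or 0 if
   cp is out of the range handled here. Surrogates are not rejected. */
size_t utf8_encode_bmp(unsigned long cp, unsigned char out[3])
{
    if (cp < 0x80) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

For 0xFEFF the top four bits are 1111, the middle six are 111011 and the low six are 111111, which gives the bytes EF, BB and BF.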

The comparison would correctly report a difference since the first characters of the file are different when treated as UTF-8 encoded characters. If one string has junk on the front of it before the data you're interested in, you either need to strip that junk off or make sure the same junk is on the other string.
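As a rough sketch of the "strip that junk off" option: the helper below (skip_utf8_bom is a name I've made up, not a standard function) steps past a leading EF BB BF if one is present, so the rest of the line can be handed to strcmp() as usual.

```c
#include <string.h>

/* Hypothetical helper: return a pointer just past a leading UTF-8 BOM
   (the bytes EF BB BF) if s starts with one, otherwise return s itself.
   strncmp stops at the terminating null, so short strings are safe. */
const char *skip_utf8_bom(const char *s)
{
    static const char bom[] = "\xEF\xBB\xBF";
    return strncmp(s, bom, 3) == 0 ? s + 3 : s;
}
```

With this, strcmp(skip_utf8_bom(line), "\x41") compares "\x40" against "\x41" and returns a negative value, whereas strcmp() on the unstripped line returns a positive one, because strcmp() compares bytes as unsigned char and 0xEF is greater than 0x41.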

C's str*() functions can be effectively and correctly used to handle UTF-8 encoded strings, providing a large subset of the functionality they provide for ASCII strings. They certainly don't know about meta-data embedded in data strings, whether those strings contain single-byte characters or multi-byte characters.
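A small illustration of what does and doesn't carry over, assuming valid UTF-8 input. strcmp() and strstr() work byte-wise and give correct results; strlen() returns bytes, not characters, so counting characters needs something like the hypothetical utf8_length below.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a valid UTF-8 string by
   counting every byte that is not a continuation byte (10xxxxxx).
   No validation is attempted. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "na" "\xC3\xAF" "ve" (the word "naive" with a diaeresis on the i), strlen() gives 6 and utf8_length() gives 5, and strstr() finds "\xC3\xAF" at offset 2. strcmp() also orders UTF-8 strings by code point, so "z" sorts before "\xC3\xA9" (U+00E9).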