Re: Defacto standard string library




"Phil Carmody" <thefatphil_demunged@xxxxxxxxxxx> wrote in message
news:87eizk5fr0.fsf@xxxxxxxxxxxxxxxxxxxxxxx
richard@xxxxxxxxxxxxxxx (Richard Tobin) writes:
In article <ANN7l.14634$Sp5.13522@xxxxxxxxxxxxxxxxxxxxxxxxx>,
Bartc <bartc@xxxxxxxxxx> wrote:

strcmp() will only work on UTF-8 if you make use of the result as either
0 or not 0.

No, it will give Unicode code point ordering.

And if you use strcmp() on mixed UTF-8 and ordinary strings, then
the result might be meaningless (a string containing a single encoded
Unicode character could match a string of several ordinary chars).

If you use strcmp() between strings in different encodings, of course
the result is likely to be meaningless. However, UTF-8 has the advantage
that it can be compared against ASCII, since ASCII is a subset of UTF-8.
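
The ordering point can be checked with a short sketch. strcmp() compares
bytes as unsigned char, and in valid UTF-8 the byte order of two strings
agrees with their code point order, with ASCII sorting first. The sample
strings and the name is_sorted_by_strcmp are chosen for illustration and
are not from the thread.

```c
#include <stddef.h>
#include <string.h>

/* Sample strings (chosen for illustration), listed in code point order. */
const char *const samples[] = {
    "A",                 /* U+0041, plain ASCII   */
    "z",                 /* U+007A, plain ASCII   */
    "\xC3\xA9",          /* U+00E9, 2-byte UTF-8  */
    "\xE2\x82\xAC",      /* U+20AC, 3-byte UTF-8  */
    "\xF0\x9F\x98\x80"   /* U+1F600, 4-byte UTF-8 */
};
const size_t sample_count = sizeof samples / sizeof samples[0];

/* Returns 1 if strcmp() considers v[0..n-1] strictly ascending, else 0. */
int is_sorted_by_strcmp(const char *const *v, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (strcmp(v[i - 1], v[i]) >= 0)
            return 0;
    }
    return 1;
}
```

Calling is_sorted_by_strcmp(samples, sample_count) should return 1. This
is code point order only, not a locale-aware collation; strcoll() is the
standard function for the latter.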

How does "\xEF\xBB\xBF\x40" compare against "\x41" using strcmp()?

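The EF BB BF prefix is the UTF-8 encoding of U+FEFF (the byte order mark),
so the sequence is valid UTF-8: a BOM followed by x40. strcmp() compares
bytes, so it orders this string after "\x41", even though "@" on its own
sorts before "A".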
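
A minimal sketch of that comparison, assuming the leading EF BB BF bytes
are a byte order mark that the caller wants ignored. skip_utf8_bom and
strcmp_ignoring_bom are hypothetical helper names, not standard functions.

```c
#include <string.h>

/* Hypothetical helper: returns s past a leading EF BB BF, else s unchanged. */
const char *skip_utf8_bom(const char *s)
{
    /* strncmp() stops at a terminating NUL, so short strings are safe. */
    return strncmp(s, "\xEF\xBB\xBF", 3) == 0 ? s + 3 : s;
}

/* Compare two UTF-8 strings, ignoring a leading BOM on either side. */
int strcmp_ignoring_bom(const char *a, const char *b)
{
    return strcmp(skip_utf8_bom(a), skip_utf8_bom(b));
}
```

With plain strcmp(), "\xEF\xBB\xBF\x40" compares greater than "\x41"
because 0xEF is larger than 0x41 as an unsigned char. With
strcmp_ignoring_bom() the comparison is between "@" and "A", so the
result is negative.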

People keep saying UTF-8 is compatible with all these string functions, but
I'm not too happy about it myself. The functions aren't used in isolation,
and a lot of user code (existing and future) needs to be aware of the pitfalls.

--
Bartc
