Re: Defacto standard string library



"Bartc" <bartc@xxxxxxxxxx> writes:
"Phil Carmody" <thefatphil_demunged@xxxxxxxxxxx> wrote in message
news:87eizk5fr0.fsf@xxxxxxxxxxxxxxxxxxxxxxx
richard@xxxxxxxxxxxxxxx (Richard Tobin) writes:
In article <ANN7l.14634$Sp5.13522@xxxxxxxxxxxxxxxxxxxxxxxxx>,
Bartc <bartc@xxxxxxxxxx> wrote:

strcmp() will only work on UTF-8 if you make use of the result as either
0 or not 0.

No, it will give Unicode ordering.

And if you use strcmp() on mixed UTF-8 and ordinary strings, then
the result might be meaningless (a string containing a single encoded
Unicode Character could match a string of several ordinary chars).

If you use strcmp() between strings in different encodings, of course
the result is likely to be meaningless. However, UTF-8 has the advantage
that it can be compared against ASCII, since ASCII is a subset of UTF-8.
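
A minimal sketch of the ordering claim, assuming both strings are valid UTF-8 with no BOM. strcmp() compares bytes as unsigned char, and UTF-8 is laid out so that byte order matches code point order. cmp_sign() is a hypothetical helper, not a standard function; it only reduces strcmp()'s result to -1, 0 or 1.

```c
#include <string.h>

/* Hypothetical helper: reduce strcmp()'s result to -1, 0 or 1.
 * Assumes both arguments are valid, BOM-free UTF-8 (or plain ASCII). */
int cmp_sign(const char *a, const char *b)
{
    int r = strcmp(a, b);   /* bytes are compared as unsigned char */
    return (r > 0) - (r < 0);
}
```

With this, cmp_sign("z", "\xC3\xA9") is negative (U+007A sorts before U+00E9), and cmp_sign("\xEF\xBF\xBD", "\xF0\x9F\x98\x80") is negative too (U+FFFD sorts before U+1F600). Note that this is code point order only; it says nothing about locale collation.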

How does "\xEF\xBB\xBF\x40" compare against "\x41" using strcmp()?

Apparently the EF BB BF 40 sequence would be invalid UTF-8 (because it's not
the shortest way of encoding x40).

It's the first line I read from the UTF-8 encoded file that I just
fopen()ed. "\x41" was the first line I read from the ASCII encoded
file that I also just fopen()ed. How do these two lines compare?
You cannot demand that I unconditionally drop any "\xEF\xBB\xBF"
from the first line of a file before performing the comparison. Were
you to do so, you'd bugger any ISO 8859-15 file beginning "ï»¿".
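
To make the BOM case concrete, here is a rough sketch. has_utf8_bom() and skip_utf8_bom() are hypothetical helpers, not part of any standard library, and skipping is only safe when the caller already knows the data is UTF-8. The same three bytes are ordinary text in ISO 8859-15, so the bytes alone cannot tell you which case you are in.

```c
#include <string.h>

/* Hypothetical helper: does the line start with the bytes EF BB BF?
 * strncmp() stops at the terminating NUL, so short strings are safe. */
int has_utf8_bom(const char *line)
{
    return strncmp(line, "\xEF\xBB\xBF", 3) == 0;
}

/* Hypothetical helper: return a pointer past a leading BOM, if any.
 * ASSUMPTION: the caller knows the line is UTF-8. On an ISO 8859-15
 * line this would silently discard three real characters. */
const char *skip_utf8_bom(const char *line)
{
    return has_utf8_bom(line) ? line + 3 : line;
}
```

Compared raw, strcmp("\xEF\xBB\xBF\x40", "\x41") is positive, because 0xEF is greater than 0x41. After skip_utf8_bom() on the first argument the comparison is "\x40" against "\x41", which is negative. So the answer flips depending on a decision strcmp() itself cannot make.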

People keep saying UTF-8 is compatible with all these string functions but
I'm not too happy about it myself. The functions aren't used in isolation
and a lot of user code (existing and future) needs to be aware of pitfalls.

UTF-8 strings, as sequences of Unicode characters, aren't arrays.
Anything which treats them as arrays can potentially have pitfalls.
So it's not just the str*()s that are the problem.
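
One small illustration of the array pitfall, as a sketch that assumes valid UTF-8 input. utf8_code_points() is a hypothetical helper; it counts every byte that is not a continuation byte (10xxxxxx), which equals the number of code points only if the input is well formed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated string.
 * ASSUMPTION: the input is valid UTF-8; no validation is done here. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* Continuation bytes look like 10xxxxxx; skip them. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For "na\xC3\xAFve", strlen() reports 6 but utf8_code_points() reports 5, so s[2] is the first half of a character rather than the third character. Any code that indexes, truncates or reverses by byte has the same problem.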

Phil
--
I tried the Vista speech recognition by running the tutorial. I was
amazed, it was awesome, recognised every word I said. Then I said the
wrong word ... and it typed the right one. It was actually just
detecting a sound and printing the expected word! -- pbhj on /.