Re: Defacto standard string library



Phil Carmody wrote:
Stephen Sprunk <stephen@xxxxxxxxxx> writes:
Phil Carmody wrote:
"Bartc" <bartc@xxxxxxxxxx> writes:
"Phil Carmody" <thefatphil_demunged@xxxxxxxxxxx> wrote in message
How does "\xEF\xBB\xBF\x40" compare against "\x41" using strcmp()?

Apparently the EF BB BF 40 sequence would be invalid UTF-8 (because
it's not the shortest way of encoding x40).

It's the first line I read from the UTF-8 encoded file that I just
fopen()ed. "\x41" was the first line I read from the ASCII encoded
file that I also just fopen()ed. How do these two lines compare?
You cannot demand that I unconditionally drop any "\xEF\xBB\xBF"
from the first line of a file before performing the comparison. Were
you to do so, you'd bugger any ISO 8859-15 file beginning "ï»¿".
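
To make the question concrete, here is a minimal sketch of the comparison with and without the three leading bytes. The helper names skip_utf8_bom and cmp_sign are hypothetical, not part of any library discussed in the thread. Skipping the bytes is only safe when the input is already known to be UTF-8, which is exactly the point in dispute.

```c
#include <string.h>

/* Hypothetical helper: return s advanced past a leading EF BB BF, if present.
 * Assumes the caller already knows the string is UTF-8; for an
 * ISO 8859-15 string this would silently drop three real characters. */
static const char *skip_utf8_bom(const char *s)
{
    return strncmp(s, "\xEF\xBB\xBF", 3) == 0 ? s + 3 : s;
}

/* Hypothetical helper: reduce strcmp()'s result to -1, 0 or 1.
 * strcmp() compares the bytes as unsigned char, so 0xEF sorts above 0x41. */
static int cmp_sign(const char *a, const char *b)
{
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

With these, cmp_sign("\xEF\xBB\xBF\x40", "\x41") is positive because the first bytes compared are 0xEF and 0x41, while cmp_sign(skip_utf8_bom("\xEF\xBB\xBF\x40"), "\x41") is negative because the comparison then starts at 0x40.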

How many valid 8859-15 text files actually begin with that sequence?

How hard do you think it would be for me to create one?

Absolutely trivial. However, I challenge you to find a file in the wild (i.e. not created for the purpose of making your point) that starts with that sequence where it is _not_ a BOM encoded in UTF-8.

What percentage of files that begin with that sequence use some
encoding _other than_ UTF-8? Hint: virtually none.

How many primes are even? It is inappropriate to use an absolute
(which was the context I was responding to) when it clearly is not
absolute.

I'm not trying to defend an absolute; I acknowledge that the algorithm will theoretically be wrong for some files. However, the original context was strings that were known to be UTF-8, so that problem does not apply.

A UTF-8 encoded BOM at the start of a file is almost certain to mean
the file is UTF-8, and many programs do insert it silently to ensure
that other programs can recognize the encoding.

Almost certain is not good enough to justify an absolute.

Of course, UTF-8 doesn't need a byte 'order' other than the
monotonic order of the bytes themselves.

Of course. However, since the BOM convention developed for UTF-16 files and was successful at marking the encoding used, it was a logical extension to do the same with UTF-8.

Note the same problem exists with UTF-16 BOMs: a file in some other encoding could potentially start with FE FF or FF FE. And how do you distinguish a BOM from a file without a BOM that starts with a Zero-Width No-Break Space? It's unlikely but theoretically possible.
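
As a rough illustration of the byte patterns mentioned above, the following sketch classifies the start of a buffer by its leading bytes. The enum and the function name sniff_bom are hypothetical. The result is only a hint: as noted, a file in another encoding can begin with the same bytes, and a match cannot tell a BOM apart from a leading Zero-Width No-Break Space.

```c
#include <stddef.h>

/* Hypothetical result type for the sketch below. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Hypothetical helper: inspect at most the first three bytes of buf.
 * A match means "these bytes look like a BOM", not "this file is
 * definitely in that encoding". */
static enum bom_kind sniff_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;       /* U+FEFF encoded as UTF-8 */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;   /* U+FEFF, big-endian UTF-16 */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;   /* U+FEFF, little-endian UTF-16 */
    return BOM_NONE;
}
```

A caller would read the first few bytes of the file into a buffer, pass them to sniff_bom, and treat BOM_NONE as "fall back to heuristics or an explicit setting".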

Without the BOM, a program has to guess what encoding is used for
the file based on heuristics or format-specific information in the
file -- and they're not particularly good at it, in my experience.

Ditto. In some ways incompatibility would have been better, as
then there'd be no exceptional corner cases.

While I can appreciate the sentiment, I also understand that "perfect" is the enemy of "good enough". If it's absolutely critical that you not misinterpret a file as UTF-8 when it's not, provide the user (or file format) with a mechanism to explicitly specify the encoding -- but users will appreciate having the "automagic" encoding detection when it works.
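
One way the "explicit mechanism first, automagic second" idea could look in code is sketched below. The function name choose_encoding, its parameters, and the encoding labels are all assumptions made for illustration; the thread does not specify an interface.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decide which encoding label to use for a file.
 *   explicit_name - encoding given by the user or file format, or NULL
 *   buf, len      - the first bytes of the file
 *   fallback      - label to use when nothing else is known
 * An explicit choice always wins; BOM sniffing is only a convenience. */
static const char *choose_encoding(const char *explicit_name,
                                   const unsigned char *buf, size_t len,
                                   const char *fallback)
{
    if (explicit_name != NULL && explicit_name[0] != '\0')
        return explicit_name;
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    return fallback;
}
```

With this ordering, a user who knows the file is ISO 8859-15 can say so and the leading EF BB BF bytes are left alone, while everyone else gets the detection when it works.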

S