Re: Defacto standard string library
- From: Stephen Sprunk <stephen@xxxxxxxxxx>
- Date: Mon, 05 Jan 2009 01:09:58 -0600
Phil Carmody wrote:
Stephen Sprunk <stephen@xxxxxxxxxx> writes:
Phil Carmody wrote:
"Bartc" <bartc@xxxxxxxxxx> writes:
"Phil Carmody" <thefatphil_demunged@xxxxxxxxxxx> wrote in message
How does "\xEF\xBB\xBF\x40" compare against "\x41" using strcmp()?
Apparently the EF BB BF 40 sequence would be invalid UTF-8 (because
it's not the shortest way of encoding x40).
It's the first line I read from the UTF-8 encoded file that I just
fopen()ed. "\x41" was the first line I read from the ASCII encoded
file that I also just fopen()ed. How do these two lines compare?
You cannot demand that I unconditionally drop any "\xEF\xBB\xBF"
from the first line of a file before performing the comparison. Were
you to do so, you'd bugger any ISO 8859-15 file beginning "ï»¿".
How many valid 8859-15 text files actually begin with that sequence?
How hard do you think it would be for me to create one?
Absolutely trivial. However, I challenge you to find a file in the wild (i.e. not created for the purpose of making your point) that starts with that sequence where it is _not_ a BOM encoded in UTF-8.
What percentage of files that begin with that sequence use some
encoding _other than_ UTF-8? Hint: virtually none.
How many primes are even? It is inappropriate to use an absolute (which was the context I was responding to) when it clearly is not absolute.
I'm not trying to defend an absolute; I acknowledge that the algorithm will theoretically be wrong for some files. However, the original context was strings that were known to be UTF-8, so that problem does not apply.
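For the known-to-be-UTF-8 case, the strcmp() question from earlier in the thread can be made concrete. What follows is only a minimal sketch: `skip_utf8_bom` and `compare_first_lines` are hypothetical helper names, not part of any standard library, and the code assumes the caller already knows both strings are UTF-8 and are the first line of their files.

```c
#include <string.h>

/* Hypothetical helper: return a pointer just past a leading UTF-8 BOM
 * (EF BB BF), or s itself if there is none. Assumes the caller already
 * knows s is UTF-8 and is the first line read from a file. */
static const char *skip_utf8_bom(const char *s)
{
    /* strncmp stops at the terminating null, so short strings are safe. */
    if (strncmp(s, "\xEF\xBB\xBF", 3) == 0)
        return s + 3;
    return s;
}

/* Hypothetical helper: compare two first lines, ignoring a BOM on either. */
static int compare_first_lines(const char *a, const char *b)
{
    return strcmp(skip_utf8_bom(a), skip_utf8_bom(b));
}
```

A plain strcmp("\xEF\xBB\xBF\x40", "\x41") returns a positive value, because strcmp() compares bytes as unsigned char and 0xEF is greater than 0x41. With the BOM skipped, the comparison is "\x40" against "\x41" and the result is negative.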
A UTF-8 encoded BOM at the start of a file is almost certain to mean
the file is UTF-8, and many programs do insert it silently to ensure
that other programs can recognize the encoding.
Almost certain is not good enough to justify an absolute.
Of course, UTF-8 doesn't need a byte 'order' other than the
monotonic order of the bytes themselves.
Of course. However, since the BOM convention developed for UTF-16 files and was successful at marking the encoding used, it was a logical extension to do the same with UTF-8.
Note the same problem exists with UTF-16 BOMs: a file in some other encoding could potentially start with FE FF or FF FE. And how do you distinguish a BOM from a file without a BOM that starts with a Zero-Width No-Break Space? It's unlikely but theoretically possible.
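The byte patterns mentioned above can be checked with a few comparisons. This is a rough sketch, not a complete detector: `bom_kind` and `sniff_bom` are hypothetical names, UTF-32 is deliberately ignored, and a match is only a guess, for exactly the reasons given above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Guess at a BOM from the first bytes of a buffer. A match is a
 * heuristic only: a file in some other encoding could begin with the
 * same bytes. UTF-32 BOMs are not handled here (FF FE 00 00 would be
 * reported as UTF-16 LE). */
static enum bom_kind sniff_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

The buffer is taken as unsigned char so that the comparisons against 0xFE and 0xFF do not depend on whether plain char is signed.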
Without the BOM, a program has to guess what encoding is used for
the file based on heuristics or format-specific information in the
file -- and they're not particularly good at it, in my experience.
Ditto. In some ways incompatibility would have been better, as
then there'd be no exceptional corner cases.
While I can appreciate the sentiment, I also understand that "perfect" is the enemy of "good enough". If it's absolutely critical that you not misinterpret a file as UTF-8 when it's not, provide the user (or file format) with a mechanism to explicitly specify the encoding -- but users will appreciate having the "automagic" encoding detection when it works.
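One way to combine the two is to let an explicit choice always win and use detection only as a fallback. A small sketch under that assumption follows; `choose_encoding` is a hypothetical function and the encoding labels are just illustrative strings.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: pick the encoding label to use for a file.
 * explicit_enc is whatever the user or file format specified (NULL or
 * empty if nothing was given) and always wins. Only when it is absent
 * do we look for a UTF-8 BOM in the first bytes; otherwise the caller's
 * fallback label is returned. */
static const char *choose_encoding(const char *explicit_enc,
                                   const unsigned char *head, size_t len,
                                   const char *fallback)
{
    if (explicit_enc != NULL && explicit_enc[0] != '\0')
        return explicit_enc;
    if (len >= 3 && memcmp(head, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    return fallback;
}
```

With this ordering, an ISO 8859-15 file that really does begin with EF BB BF is still read correctly as long as the user says so, while everyone else gets the automatic behaviour.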
S