Re: Defacto standard string library
- From: Phil Carmody <thefatphil_demunged@xxxxxxxxxxx>
- Date: Mon, 05 Jan 2009 12:25:00 +0200
Stephen Sprunk <stephen@xxxxxxxxxx> writes:
Phil Carmody wrote:
[SNIP - BOM gibbering, we're actually almost entirely on the
same paragraph of the same page, I'm just playing devil's advocate
a bit, as I've seen too many programmers (or customers of their
software) bitten by such issues.]
I'm not trying to defend an absolute; I acknowledge that the algorithm
will theoretically be wrong for some files. However, the original
context was strings that were known to be UTF-8, so that problem does
not apply.
Ah. I interpreted the 'different' more liberally than you in the
"use strcmp() between strings in different encodings". I viewed
that to imply it might be any of strcmp(ascii, ascii),
strcmp(ascii, utf8), strcmp(utf8, ascii), or strcmp(utf8, utf8) -
i.e. the strings (both of them, independently of each other)
might be in different encodings at different times. (Such as if
you open two arbitrary files in order to cmp/diff them.)
A UTF-8 encoded BOM at the start of a file is almost certain to mean
the file is UTF-8, and many programs do insert it silently to ensure
that other programs can recognize the encoding.
Almost certain is not good enough to justify an absolute.
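For reference, the check under discussion amounts to a three-byte comparison. A minimal sketch in C follows; the name has_utf8_bom is hypothetical, and a match is only a strong hint that the data is UTF-8, not proof:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf starts with EF BB BF
 * (U+FEFF encoded as UTF-8), otherwise 0.
 * A hit suggests UTF-8 but cannot prove it: text in some other
 * 8-bit encoding may legitimately begin with the same three bytes. */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    assert(buf != NULL || len == 0);
    return len >= 3 && memcmp(buf, bom, sizeof bom) == 0;
}
```

A caller would typically skip the three bytes on a hit and still be prepared for invalid UTF-8 further into the data.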
Of course, UTF-8 doesn't need a byte 'order' other than the
monotonic order of the bytes themselves.
Of course. However, since the BOM convention developed for UTF-16
files and was successful at marking the encoding used, it was a
logical extension to do the same with UTF-8.
As a purely historical aside - was it a logical extension (i.e.
something that was thought about and designed in advance), or
did people simply realise that when they ran BOM'ed UTF-16
files through their naive UTF-16..UTF-8 converter, those three
bytes were consistently squirted out first and that could be
post-facto used as a file type identifier? If the latter, then
had they been using a less naive UTF-16..UTF-8 converter which
recognised the /semantics/ of the BOM and justifiably dropped it
upon conversion, the UTF-8 BOM may never have been invented?
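To make the "three bytes squirted out first" concrete: a converter that treats U+FEFF like any other code point will emit EF BB BF for it. A minimal sketch of such an encoder, assuming nothing about any real converter (utf8_encode is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoder: writes code point cp as UTF-8 into out
 * (room for 4 bytes) and returns the number of bytes written,
 * or 0 for surrogates and values above U+10FFFF.
 * It knows nothing about BOM semantics, so U+FEFF is encoded
 * like any other character in the three-byte range. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Feeding it 0xFEFF yields the bytes EF BB BF. A converter that recognised the BOM's semantics would have to special-case a leading U+FEFF before ever calling something like this.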
Note the same problem exists with UTF-16 BOMs: a file in some other
encoding could potentially start with FE FF or FF FE. And how do you
distinguish a BOM from a file without a BOM that starts with a
Zero-Width No-Break Space? It's unlikely but theoretically possible.
Absolutely.
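The UTF-16 case looks like this as code. This is only a sketch with hypothetical names (utf16_bom, sniff_utf16_bom), and it has exactly the weakness described above: it sees two bytes and cannot tell a BOM from a leading Zero-Width No-Break Space, or from another encoding that happens to start with the same bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the sniffing function below. */
enum utf16_bom { UTF16_BOM_NONE, UTF16_BOM_BE, UTF16_BOM_LE };

/* Reports which UTF-16 byte order the first two bytes suggest.
 * FE FF suggests big-endian, FF FE suggests little-endian.
 * This is a guess: ISO-8859-1 text starting with the two
 * characters 0xFE 0xFF would be reported as big-endian UTF-16. */
enum utf16_bom sniff_utf16_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return UTF16_BOM_NONE;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return UTF16_BOM_BE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return UTF16_BOM_LE;
    return UTF16_BOM_NONE;
}
```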
Without the BOM, a program has to guess what encoding is used for
the file based on heuristics or format-specific information in the
file -- and they're not particularly good at it, in my experience.
Ditto. In some ways incompatibility would have been better, as
then there'd be no exceptional corner cases.
While I can appreciate the sentiment, I also understand that "perfect"
is the enemy of "good enough". If it's absolutely critical that you
not misinterpret a file as UTF-8 when it's not, provide the user (or
file format) with a mechanism to explicitly specify the encoding --
but users will appreciate having the "automagic" encoding detection
when it works.
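One way to arrange that, sketched with hypothetical names (pick_encoding and its parameters are not from any real library): an explicit choice always wins, the BOM is consulted only when nothing was specified, and a caller-supplied default covers the rest.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy function.
 * user_choice: encoding named by the user or file format, or NULL/"".
 * buf, len:    the first bytes of the file.
 * fallback:    what to assume when nothing else is known.
 * Returns the encoding name to use; no string is copied. */
const char *pick_encoding(const char *user_choice,
                          const unsigned char *buf, size_t len,
                          const char *fallback)
{
    assert(fallback != NULL);

    /* 1. An explicit choice overrides any detection. */
    if (user_choice != NULL && user_choice[0] != '\0')
        return user_choice;

    /* 2. "Automagic" step: a UTF-8 BOM is taken as a hint. */
    if (buf != NULL && len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";

    /* 3. Nothing known: use the caller's default. */
    return fallback;
}
```

The ordering is the design choice that matters: detection can only be wrong in cases where the user has not said otherwise.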
This is one reason why I much prefer my code talking to
hardware rather than to users! :-D
Phil
--
I tried the Vista speech recognition by running the tutorial. I was
amazed, it was awesome, recognised every word I said. Then I said the
wrong word ... and it typed the right one. It was actually just
detecting a sound and printing the expected word! -- pbhj on /.