Re: Defacto standard string library



Stephen Sprunk <stephen@xxxxxxxxxx> writes:
Phil Carmody wrote:

[SNIP - BOM gibbering, we're actually almost entirely on the
same paragraph of the same page, I'm just playing devil's advocate
a bit, as I've seen too many programmers (or customers of their
software) bitten by such issues.]

I'm not trying to defend an absolute; I acknowledge that the algorithm
will theoretically be wrong for some files. However, the original
context was strings that were known to be UTF-8, so that problem does
not apply.

Ah. I interpreted the 'different' more liberally than you in the
"use strcmp() between strings in different encodings". I viewed
that to imply it might be any of strcmp(ascii, ascii),
strcmp(ascii, utf8), strcmp(utf8, ascii), or strcmp(utf8, utf8) -
i.e. the strings (both of them, independently of each other)
might be in different encodings at different times. (Such as if
you open two arbitrary files in order to cmp/diff them.)

A UTF-8 encoded BOM at the start of a file is almost certain to mean
the file is UTF-8, and many programs do insert it silently to ensure
that other programs can recognize the encoding.

Almost certain is not good enough to justify an absolute.
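
For concreteness, the check under discussion amounts to something like the following. This is only a minimal sketch; the name has_utf8_bom is made up for illustration and does not come from any particular library. It reports whether a buffer begins with EF BB BF, which, as noted above, is a strong hint rather than proof.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf starts with the three bytes
 * EF BB BF (U+FEFF encoded as UTF-8), 0 otherwise.
 * A positive result is a hint about the encoding, not a guarantee:
 * a file in some other encoding may begin with the same bytes. */
static int has_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    return len >= 3 && memcmp(buf, bom, sizeof bom) == 0;
}
```

A caller that finds the signature would normally skip those three bytes before handing the rest to strcmp() or similar.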

Of course, UTF-8 doesn't need a byte 'order' other than the
monotonic order of the bytes themselves.

Of course. However, since the BOM convention developed for UTF-16
files and was successful at marking the encoding used, it was a
logical extension to do the same with UTF-8.

As a purely historical aside - was it a logical extension (i.e.
something that was thought about and designed in advance), or
did people simply realise that when they ran BOM'ed UTF-16
files through their naive UTF-16..UTF-8 converter, those three
bytes were consistently squirted out first and that could be
post-facto used as a file type identifier? If the latter, then
had they been using a less naive UTF-16..UTF-8 converter which
recognised the /semantics/ of the BOM and justifiably dropped it
upon conversion, the UTF-8 BOM may never have been invented?
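
To see where the three bytes come from, here is a sketch of the kind of naive converter meant above. The function name utf8_encode_bmp is hypothetical, and it assumes the input is a single code unit in the range U+0000..U+FFFF with no surrogate handling. It has no special case for U+FEFF, so a leading BOM simply gets encoded like any other character.

```c
#include <stddef.h>

/* Hypothetical naive encoder: writes the UTF-8 form of one BMP code
 * point (assumed <= 0xFFFF, surrogates not handled) into out and
 * returns the number of bytes written (1 to 3). */
static size_t utf8_encode_bmp(unsigned int cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Feeding it 0xFEFF yields EF BB BF. A converter that understood the BOM's semantics would instead drop that first code unit before encoding.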

Note the same problem exists with UTF-16 BOMs: a file in some other
encoding could potentially start with FE FF or FF FE. And how do you
distinguish a BOM from a file without a BOM that starts with a
Zero-Width No-Break Space? It's unlikely but theoretically possible.

Absolutely.
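
The UTF-16 case can be sketched the same way. The enum and the function name sniff_utf16_bom below are assumptions for illustration only. The check looks at nothing but the first two bytes, which is exactly why it cannot tell a real BOM from, say, an ISO-8859-1 file that happens to begin with the bytes FE FF.

```c
#include <stddef.h>

/* Hypothetical result type for the sketch below. */
enum utf16_bom { BOM_NONE, BOM_UTF16_BE, BOM_UTF16_LE };

/* Hypothetical helper: classifies the first two bytes of buf.
 * FE FF suggests big-endian UTF-16, FF FE suggests little-endian.
 * Like the UTF-8 signature, this is a heuristic: other encodings
 * can legitimately start with the same two bytes. */
static enum utf16_bom sniff_utf16_bom(const unsigned char *buf, size_t len)
{
    if (len < 2)
        return BOM_NONE;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```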

Without the BOM, a program has to guess what encoding is used for
the file based on heuristics or format-specific information in the
file -- and they're not particularly good at it, in my experience.

Ditto. In some ways incompatibility would have been better, as
then there'd be no exceptional corner cases.

While I can appreciate the sentiment, I also understand that "perfect"
is the enemy of "good enough". If it's absolutely critical that you
not misinterpret a file as UTF-8 when it's not, provide the user (or
file format) with a mechanism to explicitly specify the encoding --
but users will appreciate having the "automagic" encoding detection
when it works.

This is one reason why I much prefer my code talking to
hardware rather than to users! :-D
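
The "explicit mechanism plus automagic fallback" suggestion could look roughly like this. It is a sketch under assumptions: the enum values and the name pick_encoding are invented here, and the only detection done is the UTF-8 signature check. An encoding declared by the user or the file format always wins; the BOM is consulted only when nothing was declared.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical set of encodings for the sketch. */
enum encoding { ENC_UNKNOWN, ENC_UTF8, ENC_OTHER };

/* Hypothetical helper: an explicitly declared encoding takes priority.
 * Only when none was declared (ENC_UNKNOWN) do we fall back to
 * looking for the UTF-8 signature EF BB BF at the start of buf. */
static enum encoding pick_encoding(enum encoding declared,
                                   const unsigned char *buf, size_t len)
{
    if (declared != ENC_UNKNOWN)
        return declared;
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return ENC_UTF8;
    return ENC_UNKNOWN;  /* caller must guess further or ask the user */
}
```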

Phil
--
I tried the Vista speech recognition by running the tutorial. I was
amazed, it was awesome, recognised every word I said. Then I said the
wrong word ... and it typed the right one. It was actually just
detecting a sound and printing the expected word! -- pbhj on /.