Re: Judge the encode system used by the file.



richard@xxxxxxxxxxxxxxx (Richard Tobin) writes:

> In article <490ae4b8.604086120@xxxxxxxxxxxxxx>,
> Richard Bos <rlb@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>>> The heuristic is: if the file contains bytes >= 128, and it would be
>>> legal UTF-8, then it's very likely that it *is* UTF-8. As I said,
>>> I would be interested if you can come up with any real document for
>>> which this heuristic fails.

>> *Shrug* You speak English, and you're willing to take that risk. I speak
>> a language which _does_ use diacritics, and I'm not.

> As *** Winter's (constructed) example indicates, the chance of error
> is probably higher for English documents than for ones with a lot
> of diacritics. The more non-ASCII characters you have, the lower
> the chance of them accidentally being legal UTF-8.

It is not that hard to work out what is permitted and what is not.
For a file that uses an 8-bit single-byte encoding to look like valid
UTF-8 it must consist of sequences made up of the following patterns:

[01234567]x
[CD]x [89AB]x
Ex [89AB]x [89AB]x
F[01234567] [89AB]x [89AB]x [89AB]x

(this is a sort of made-up hex pattern notation).
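
The four patterns can be checked mechanically. What follows is only a
minimal Python sketch: the names LOOSE_UTF8 and looks_like_utf8 are
made up for illustration, and the test follows the loose patterns
exactly as listed above. A real UTF-8 decoder is stricter still (it
also rejects C0, C1, F5 to F7, overlong forms and surrogates).

```python
import re

# One alternative per pattern line above; the whole input must be a
# concatenation of such sequences.
LOOSE_UTF8 = re.compile(
    rb"(?:[\x00-\x7F]"                  # [01234567]x
    rb"|[\xC0-\xDF][\x80-\xBF]"         # [CD]x [89AB]x
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"      # Ex [89AB]x [89AB]x
    rb"|[\xF0-\xF7][\x80-\xBF]{3})*"    # F[01234567] + three follow-on bytes
)


def looks_like_utf8(data: bytes) -> bool:
    """True if data consists only of the byte patterns listed above."""
    return LOOSE_UTF8.fullmatch(data) is not None
```

Pure ASCII input passes trivially, which is why the heuristic under
discussion also requires at least one byte >= 128 before concluding
anything.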

For example, if any of the 8 characters F0 to F7 appears, it must be
followed by exactly three characters in the range 80 to BF. Any of
the 32 characters C0 to DF must be followed by exactly one such
character. These "follow-on" characters come to our aid, since half
of them (80 to 9F) are very rarely used control characters and the
others (A0 to BF) are all less than common (hardly any of them are
letters, for example).
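
The claim about the follow-on range can be checked against the Unicode
character database, since each ISO-8859-1 byte maps to the code point
with the same number. This is a rough sketch, with latin1_category a
made-up helper name; it counts the control characters in 80 to 9F and
lists any bytes in A0 to BF that Unicode classes as letters.

```python
import unicodedata


def latin1_category(byte: int) -> str:
    """Unicode general category of a byte read as ISO-8859-1."""
    # ISO-8859-1 byte values coincide with code points U+0000..U+00FF.
    return unicodedata.category(chr(byte))


# Follow-on bytes 80..9F: how many are control characters (category Cc)?
controls = [b for b in range(0x80, 0xA0) if latin1_category(b) == "Cc"]

# Follow-on bytes A0..BF: which ones are letters (categories L*)?
letters = [b for b in range(0xA0, 0xC0) if latin1_category(b).startswith("L")]

print(len(controls))
print([f"{b:02X}" for b in letters])
```

As far as I can tell this reports all 32 bytes of the lower half as
controls, and only AA, B5 and BA (the ordinal indicators and the micro
sign) as letters in the upper half.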

Taking ISO-8859-1 as an example, the document can't include (anywhere)
thorn, small o with a slash, small y with either an acute or diaeresis
nor small u with any accent. In addition it can't have any accented
letter followed by either another one or by any "plain" character
whatsoever. Every small accented a, e or i (the Ex range) must be
followed by exactly two of the rather odd bunch like pilcrow, micro,
plus/minus etc. None of the "matching pairs" like « and », ¿ and ?
can appear in a normal position (preceded by a space, newline or
tab for example). The best real-world use case I can see is a word
that has one and only one final accented character followed by
something like the registered symbol, the copyright symbol or maybe a
superscript number.
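
To make this concrete, here is a small Python sketch that feeds a few
ISO-8859-1 byte strings to Python's built-in UTF-8 decoder. The
function name and the sample words are invented for illustration, and
the built-in decoder is somewhat stricter than the patterns given
earlier, though that makes no difference for these samples.

```python
def latin1_bytes_pass_as_utf8(raw: bytes) -> bool:
    """True if raw (intended as ISO-8859-1) also decodes as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


samples = [
    b"caf\xe9",         # "cafe" with e-acute: Ex byte with nothing after it
    b"na\xefve",        # "naive" with i-diaeresis: Ex byte then a plain "v"
    b"\xabquoted\xbb",  # guillemets: follow-on bytes with no lead byte
    b"CAF\xc9\xae",     # capital E-acute followed by the registered sign
    b"\xe9\xae\xb2",    # e-acute, registered sign, superscript two
]

for raw in samples:
    print(raw, latin1_bytes_pass_as_utf8(raw))
```

Only the last two samples get through, and both are of the "accented
letter followed by an odd symbol" kind described above.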

Other 8-bit encodings (like the multi-byte Chinese ones) might well have
patterns of use that do fit the requirements of the UTF-8 scheme, but
it is not likely to be common for the 8859 family.

--
Ben.