Re: Judge the encode systm used by the file.
- From: rlb@xxxxxxxxxxxxxxxxxxxxxx (Richard Bos)
- Date: Fri, 31 Oct 2008 10:58:37 GMT
richard@xxxxxxxxxxxxxxx (Richard Tobin) wrote:
> Richard Bos <rlb@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>> Possibly, but are you willing to rely on this, given the thousands of
>> languages out there, most of them, _unlike_ English, written in a Latin
>> script which uses diacritics to a greater or smaller degree?
> Yes. It's very unlikely that all the sequences of 8859 characters used
> in such a document will be legal UTF-8.
> The heuristic is: if the file contains bytes >= 128, and it would be
> legal UTF-8, then it's very likely that it *is* UTF-8. As I said,
> I would be interested if you can come up with any real document for
> which this heuristic fails.
*Shrug* You speak English, and you're willing to take that risk. I speak
a language which _does_ use diacritics, and I'm not.
Richard
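
For readers following the thread, the heuristic Richard Tobin describes can be sketched in C roughly as below. This is only an illustration under stated assumptions, not code posted by either participant: the function name looks_like_utf8 is hypothetical, the whole file is assumed to be already in memory, and "legal UTF-8" is taken to mean well-formed in the strict sense (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stddef.h>

/* Hypothetical helper, not from the thread.
 * Returns 1 if buf[0..len) contains at least one byte >= 0x80 AND the
 * whole buffer is well-formed UTF-8; returns 0 otherwise (including for
 * pure ASCII input, about which the heuristic says nothing). */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    size_t k;
    int saw_high = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point valid for this length */

        if (c < 0x80) {     /* plain ASCII byte */
            i++;
            continue;
        }
        saw_high = 1;

        if (c >= 0xC2 && c <= 0xDF) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < need)
            return 0;       /* sequence truncated at end of buffer */

        for (k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return 0;       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;       /* UTF-16 surrogate, not allowed in UTF-8 */
        if (cp > 0x10FFFF)
            return 0;       /* beyond the Unicode range */

        i += need + 1;
    }
    return saw_high;
}
```

With this sketch, the ISO 8859-1 bytes for "café" (0x63 0x61 0x66 0xE9) are rejected, because 0xE9 is a lead byte with no continuation bytes after it. The disagreement in the thread is about the opposite case: the two bytes 0xC3 0xA9 are the UTF-8 encoding of "é", but they are also the perfectly legal ISO 8859-1 text "Ã©". The check accepts them either way, so it cannot by itself rule out an 8859 file whose every high-byte run happens to form valid UTF-8. How often that occurs in real documents is exactly the point on which the two posters differ.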