Re: Judge the encoding system used by the file.



In article <K9ICop.LFE@xxxxxx>, Dik T. Winter <Dik.Winter@xxxxxx> wrote:

I think this is *very* rare. An English language file that uses a few
accented characters from 8859-something will not be legal UTF-8,
because in UTF-8 characters above 127 always come in groups of at
least two.

Who is talking about English language files?

See my other response.

I would be interested to see a real-life 8859 file that's also legal
UTF-8.

Start the other way. Every UTF-8 file is also a correct 8859 file. But if
you want to omit the higher control characters, then every UTF-8 file that
does not contain a byte in the range 80-9F (hex) is a correct 8859 file.
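
A minimal Python sketch of that check, for illustration only. It assumes
ISO 8859-1 specifically (Python's "latin-1" codec, which assigns all 256
byte values); other 8859 parts leave some positions unassigned. The helper
name has_c1_bytes is made up here, and 0x80-0x9F is the C1 ("higher")
control range.

```python
def has_c1_bytes(data: bytes) -> bool:
    """Return True if data contains a byte in the C1 control range 0x80-0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)


# U+00E9 is encoded in UTF-8 as C3 A9: neither byte is a C1 control,
# so read as 8859-1 the file holds two printable characters.
e_acute = "\u00e9".encode("utf-8")
latin1_view = e_acute.decode("latin-1")   # decodes without error, as mojibake

# U+00C0 is encoded in UTF-8 as C3 80: the second byte is a C1 control.
a_grave = "\u00c0".encode("utf-8")

print(has_c1_bytes(e_acute), has_c1_bytes(a_grave))
```

The first sample is legal under both readings but means something
different in each, which is the overlap the thread is arguing about.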

I know that. But in real life, the chance that a file is 8859 when it
is also legal UTF-8 is negligible. You can distinguish the two with
high reliability by testing for legality as UTF-8.
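
The test described above might look roughly like this in Python. This is
a sketch, not a complete detector: the function name guess_encoding and
the returned labels are invented for illustration, and the 8859 fallback
does not say which part of 8859 is in use.

```python
def guess_encoding(data: bytes) -> str:
    """Guess 'ascii', 'utf-8' or 'iso-8859-x' by testing UTF-8 legality."""
    if all(b < 0x80 for b in data):
        return "ascii"            # no high bytes: identical in every candidate
    try:
        data.decode("utf-8")      # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "iso-8859-x"       # high bytes present, but not legal UTF-8
    return "utf-8"


print(guess_encoding("caf\u00e9".encode("latin-1")))   # lone high byte E9
print(guess_encoding("caf\u00e9".encode("utf-8")))     # high bytes C3 A9
```

The heuristic can only be fooled by 8859 text whose high bytes happen to
form legal UTF-8 sequences, such as an A-tilde immediately followed by a
copyright sign (C3 A9), which is rare in ordinary text.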

-- Richard
--
Please remember to mention me / in tapes you leave behind.