Re: Trouble saving unicode text to file



John Machin wrote:
> Terminology disambiguation: what I call "users" wouldn't know what
> 'cp1252' and 'iso-8859-1' were. They're not expected to know. They
> just type in whatever characters they can see on their keyboard or
> find in the charmap utility. It's what I'd call 'admins' and
> 'developers' who should know better, but often don't.

I was talking about 'users' of Python, so they are 'developers'.
They often don't know what cp1252 is.

> 1. When exchanging data across systems, should not utf-8 be
> preferred???

It depends on the data, of course. People writing UTF-8 into
text files often find that their editors don't display them
correctly, in which case UTF-8 might not be the best choice.
For example, the Python source code in CVS is required to be
iso-8859-1, primarily because this is what interoperates best
across all development platforms.

For data in XHTML, the answer would be different: every XML
processor is supposed to support UTF-8.

> 2. If the Windows *users* have been using characters that are in
> cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
> will cause an exception.

Correct.
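
A minimal sketch of that failure, written with Python 3 string semantics (an assumption; the thread itself does not name a Python version) and a made-up sample string:

```python
# Illustrative only: the sample text is hypothetical, not data from the thread.
text = "price: 10 \u20ac"  # ends with EURO SIGN (U+20AC)

# cp1252 has a slot for the euro sign (byte 0x80), so this succeeds.
cp1252_bytes = text.encode("cp1252")

# iso-8859-1 has no euro sign, so strict encoding raises.
try:
    text.encode("iso-8859-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(cp1252_bytes, latin1_failed)
```

Passing errors="replace" to encode() would avoid the exception, but only by silently substituting "?" for the euro sign.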

> I find it a bit hard to imagine that the euro sign wouldn't get a fair
> bit of usage in Swedish data processing even if it's not their own
> currency.

Yes, so the question is how to represent it. It all depends on the
application, but for the moment it is safer to assume only iso-8859-1,
unless it is guaranteed that all code that reads the file really
knows what cp1252 is, and what \x80 means in that charset.
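
To make the \x80 point concrete, here is a small sketch (Python 3 semantics assumed) of how the same byte reads under the two charsets:

```python
import unicodedata

raw = b"\x80"

as_cp1252 = raw.decode("cp1252")      # the euro sign
as_latin1 = raw.decode("iso-8859-1")  # U+0080, an invisible C1 control character

print(unicodedata.name(as_cp1252))      # EURO SIGN
print(unicodedata.category(as_latin1))  # Cc (control character)
```

A reader that assumes iso-8859-1 will therefore not fail on the byte; it will quietly turn the euro sign into a control character.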

> 3. How portable is a character set that doesn't include the euro sign?

Well, how portable is ASCII? It doesn't support certain characters,
sure. If you don't need these characters, this is not a problem. If
you do need the extra characters, you need to think carefully about
which encoding best meets your needs. I was merely suggesting that
cp1252 is often used without that thought, causing mojibake later.
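
As an illustration of how mojibake arises (a sketch with Python 3 semantics assumed; the scenario of a UTF-8 file being read as cp1252 is a made-up example, not one from this thread):

```python
original = "\u20ac"  # EURO SIGN

# The writer stores the text as UTF-8: three bytes, E2 82 AC.
utf8_bytes = original.encode("utf-8")

# The reader wrongly assumes cp1252 and gets three unrelated characters.
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible only while every byte survived the wrong decode.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled), repaired == original)
```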

If representation of the euro sign is an issue, the choices are
iso-8859-15, cp1252, and UTF-8. Of those three, I would pick
cp1252 last if at all possible, because it is specific to a
vendor (i.e. non-standard).
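
For comparison, a short sketch (Python 3 semantics assumed) of how each of those three encodings stores the euro sign:

```python
euro = "\u20ac"

# Map each candidate encoding to the bytes it produces for the euro sign.
encoded = {name: euro.encode(name) for name in ("iso-8859-15", "cp1252", "utf-8")}

for name, data in encoded.items():
    print(f"{name:12} {data.hex()}")

# The iso-8859-15 byte is not safe under iso-8859-1 either:
# 0xA4 there is the generic currency sign, not the euro.
misread = encoded["iso-8859-15"].decode("iso-8859-1")
```

All three disagree on the bytes, so whichever one is chosen has to be declared or agreed on by every program that reads the file.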

Regards,
Martin