Re: unicode by default



On 12/05/2011 02:22, harrismh777 wrote:
John Machin wrote:
(1) You cannot work without using bytes sequences. Files are byte
sequences. Web communication is in bytes. You need to (know / assume / be
able to extract / guess) the input encoding. You need to encode your
output using an encoding that is expected by the consumer (or use an
output method that will do it for you).

(2) You don't need to use bytes to specify a Unicode code point. Just use
an escape sequence e.g. "\u0404" is a Cyrillic character.


Thanks John. In reverse order, I understand point (2). I'm less clear on
point (1).

If I generate a string of characters that I presume to be ascii/utf-8
(no \u0404 type characters) and write them to a file (stdout), how does
the default encoding affect that file by default? I'm not seeing that
there is anything unusual going on... If I open the file with vi? If I
open the file with gedit? emacs?

....

Another question... in mail I'm receiving many small blocks that look
like sprites with four small hex codes, scattered about the mail...
mostly punctuation, maybe? ... guessing, are these unicode code points,
and if so what is the best way to 'guess' the encoding? ... is it coded
in the stream somewhere...protocol?

You need to understand the difference between characters and bytes.

A string contains characters, a file contains bytes.

The encoding specifies how a character is represented as bytes.

For example:

In the Latin-1 encoding, the character "£" is represented by the byte 0xA3.

In the UTF-8 encoding, the character "£" is represented by the byte sequence 0xC2 0xA3.

In the ASCII encoding, the character "£" can't be represented at all.
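
As a quick check of those three cases, here is a minimal sketch, assuming Python 3 (where str holds characters and bytes holds bytes). The variable names are just for illustration.

```python
# The same character, encoded three different ways.
pound = "\u00a3"  # the character "£"

latin1_bytes = pound.encode("latin-1")  # one byte: 0xA3
utf8_bytes = pound.encode("utf-8")      # two bytes: 0xC2 0xA3

print(latin1_bytes)
print(utf8_bytes)

# ASCII has no representation for "£", so encoding raises an error.
try:
    pound.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print("representable in ASCII:", ascii_ok)
```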

The advantage of UTF-8 is that it can represent _all_ Unicode
characters (codepoints, actually) as byte sequences, and all those in
the ASCII range are represented by the same single bytes which the
original ASCII system used. Use the UTF-8 encoding unless you have to
use a different one.
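
To tie this back to the question about writing to a file: the sketch below (again assuming Python 3; the temporary file name is made up for the example) shows that ASCII-only text produces identical bytes under ASCII and UTF-8, which is why vi, gedit or emacs show nothing unusual. The encoding only starts to matter once a character outside the ASCII range appears.

```python
import os
import tempfile

# ASCII-range text encodes to the same bytes in ASCII and in UTF-8.
text = "plain ascii text"
same_bytes = text.encode("ascii") == text.encode("utf-8")

# A non-ASCII character needs more than one byte in UTF-8.
cyrillic = "\u0404"
cyrillic_bytes = cyrillic.encode("utf-8")

# Write characters with an explicit encoding, then read back the raw bytes.
path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file
with open(path, "w", encoding="utf-8") as f:
    f.write(text + cyrillic)
with open(path, "rb") as f:
    raw = f.read()

print(same_bytes)
print(cyrillic_bytes)
print(raw)
```

Passing encoding="utf-8" explicitly avoids depending on whatever the platform's default encoding happens to be.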

A file contains only bytes, a socket handles only bytes. Which encoding
you should use for characters is down to protocol. A system such as
email, which can handle different encodings, should have a way of
specifying the encoding, and perhaps also a default encoding.
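
For email, that specification is the charset parameter of the Content-Type header. A rough sketch of reading it with the standard library's email package follows; the message itself is invented for the example, and the fallback to us-ascii is the MIME default for text when no charset is given.

```python
from email import message_from_bytes

# A made-up message whose body is Latin-1 encoded (0xA3 is "£").
raw_message = (
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Price: \xa310\r\n"
)

msg = message_from_bytes(raw_message)

# The declared encoding, taken from the Content-Type header.
charset = msg.get_content_charset()

# The body as bytes, then decoded using the declared charset.
payload = msg.get_payload(decode=True)
body = payload.decode(charset or "us-ascii", errors="replace")

print(charset)
print(body)
```

If a mail reader ignores or mis-guesses that header, the bytes get decoded with the wrong encoding, which is one common reason punctuation shows up as odd symbols or boxes of hex digits.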