Re: LWP and Unicode



On 2006-10-11 18:08, Ted Zlatanov <tzz@xxxxxxxxxxxx> wrote:
On 10 Oct 2006, hjp-usenet2@xxxxxx wrote:
On 2006-10-10 18:15, Jürgen Exner <jurgenex@xxxxxxxxxxx> wrote:
Ted Zlatanov wrote:
2) why not do a vote to change the charter to make UTF-8 the charset
for c.l.p.m?
[...]
Seriously: As much as I liked the USEFOR proposal to make UTF-8 the
default charset (instead of ASCII) on usenet, and as much as I dislike
MIME, I don't think declaring UTF-8 to be the default charset for a
single group would be a good idea. Charsets should be properly declared
in a MIME Content-Type header. As long as the charset is correctly
encoded, I think any reasonably widespread charset (and that includes
UTF-8) should be acceptable.

Thanks, Peter. I agree with all you said, except I think UTF-8 is not
a charset, contrary to what MIME claims, right?

Right. The terminology is a mess. What MIME calls a "charset" is more
commonly known as a "character encoding". (In fact, I thought that one of
the MIME RFCs mentions this, but I can't find it right now.)

When I try to explain that stuff I distinguish between

* character set - a set of characters in the mathematical sense, i.e.
unordered.

* coded character set - as above, but each character is associated with
a numerical code.

* character encoding - a particular mapping of a coded character set
onto sequences of octets (or bits). MIME calls this a "charset", the
Unicode standard calls it a "transformation format".

I use the term "charset" only when I talk about MIME, otherwise I talk
of "(coded) character sets" and don't abbreviate them to "charset".
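
To make the three terms concrete, here is a small sketch in Python (my
choice of language for illustration; nothing in it comes from the
thread). It takes one character, looks up its code in the coded
character set UCS, and then shows that different character encodings map
that one code onto different octet sequences.

```python
# One character from the repertoire (the "character set").
ch = "ü"

# Coded character set: UCS assigns this character the code U+00FC.
code = ord(ch)

# Character encodings: different mappings of that code onto octets.
as_utf8 = ch.encode("utf-8")          # two octets
as_utf16be = ch.encode("utf-16-be")   # two octets, different values

# ISO 8859-1 is a different coded character set which happens to give
# this character the same code; its encoding is one octet per code.
as_latin1 = ch.encode("iso-8859-1")

print(f"U+{code:04X}", as_utf8.hex(), as_utf16be.hex(), as_latin1.hex())
# prints: U+00FC c3bc 00fc fc
```

The code stays the same in every case; only the octets differ, which is
why "charset=utf-8" really names an encoding.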


UCS is the charset, UTF-8 is an encoding. Is UCS the real charset
when Content-Type specifies "charset=utf-8"?

Yes.
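
As a rough illustration, assuming Python's standard email package and a
made-up two-header message, this is what a reader does with such a
declaration: the charset parameter says how to turn the octets back into
UCS characters.

```python
from email import message_from_bytes

# Hypothetical minimal message; the body octets are UTF-8.
raw = (
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"J\xc3\xbcrgen\n"
)

msg = message_from_bytes(raw)
charset = msg.get_content_charset()      # the declared MIME "charset"
octets = msg.get_payload(decode=True)    # body as raw octets
text = octets.decode(charset).strip()    # octets -> UCS characters

# The characters themselves are UCS code points, independent of UTF-8.
code_points = [f"U+{ord(c):04X}" for c in text]
print(charset, code_points)
```

The second code point printed is U+00FC, the UCS code for "ü", even
though the message carried it as the two octets C3 BC.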

This layers bizarrely on top of the MIME Content-Transfer-Encoding, of
course. Will UCS data be encoded twice in the end?

This can happen, yes. If a message with UTF-8 content is to be
transmitted over a channel which isn't 8bit clean, a
Content-Transfer-Encoding of quoted-printable or base64 must be applied.
Think of UTF-8 as a mapping from a sequence of UCS codes (16-bit or
32-bit quantities, depending on how you store them) onto a sequence of
8-bit quantities, and of quoted-printable or base64 as a mapping from a
sequence of 8-bit quantities onto a sequence of 7-bit quantities. (The
remaining Content-Transfer-Encodings, 7bit, 8bit and binary, are
transparent.)
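
The two layers can be sketched with Python's standard quopri and base64
modules (again only an illustration; the sample string is mine). The
first step is the character encoding, the second the
Content-Transfer-Encoding.

```python
import base64
import quopri

text = "Jürgen"

# Layer 1: character encoding, UCS codes -> 8-bit octets.
octets = text.encode("utf-8")

# Layer 2: Content-Transfer-Encoding, 8-bit octets -> 7-bit-safe octets.
qp = quopri.encodestring(octets)
b64 = base64.b64encode(octets)

# Both results use only values below 0x80 ...
assert all(b < 0x80 for b in qp + b64)

# ... and both layers are undone in reverse order on the receiving side.
assert quopri.decodestring(qp).decode("utf-8") == text
assert base64.b64decode(b64).decode("utf-8") == text

print(octets, qp, b64)
```

Note that base64.b64encode does not insert the line breaks a real MIME
body needs for longer content; base64.encodebytes does.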

(Of course it doesn't stop there: SMTP and NNTP do a trivial bit of
extra encoding ("dot-stuffing"), TCP and IP only paste their headers
before chunks of data, but PPP for example is a bit more complicated,
and I don't really want to know what a DSL modem does to my precious
bits :-)).
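
For the curious, dot-stuffing really is trivial. A minimal sketch in
Python, with function names of my own invention: a line consisting of a
single "." ends the message, so the sender prefixes an extra "." to
every body line that starts with one, and the receiver strips it again.

```python
def dot_stuff(lines):
    """Sender side: double the leading dot of any line that has one."""
    return ["." + line if line.startswith(".") else line for line in lines]


def dot_unstuff(lines):
    """Receiver side: drop the leading dot that the sender added."""
    return [line[1:] if line.startswith(".") else line for line in lines]


body = ["hp", ".", "..signature"]

# What goes over the wire: stuffed body plus the terminating lone dot.
on_wire = dot_stuff(body) + ["."]
print(on_wire)
# prints: ['hp', '..', '...signature', '.']

# The receiver stops at the lone dot and unstuffs everything before it.
assert dot_unstuff(on_wire[:-1]) == body
```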

hp

--
_ | Peter J. Holzer | > Why should one invent something that does not
|_|_) | Sysadmin WSR | > exist?
| | | hjp@xxxxxx | What else would be the point of inventing?
__/ | http://www.hjp.at/ | -- P. Einstein and V. Gringmuth in desd