Re: LWP and Unicode
- From: "Peter J. Holzer" <hjp-usenet2@xxxxxx>
- Date: Wed, 11 Oct 2006 23:04:40 +0200
On 2006-10-11 18:08, Ted Zlatanov <tzz@xxxxxxxxxxxx> wrote:
> On 10 Oct 2006, hjp-usenet2@xxxxxx wrote: [...]
>> On 2006-10-10 18:15, Jürgen Exner <jurgenex@xxxxxxxxxxx> wrote:
>>> Ted Zlatanov wrote:
>>>> 2) why not do a vote to change the charter to make UTF-8 the charset
>>>> for c.l.p.m?
>> Seriously: As much as I liked the USEFOR proposal to make UTF-8 the
>> default charset (instead of ASCII) on usenet, and as much as I dislike
>> MIME, I don't think declaring UTF-8 to be the default charset for a
>> single group would be a good idea. Charsets should be properly declared
>> in a MIME Content-Type header. As long as the charset is correctly
>> encoded, I think any reasonably widespread charset (and that includes
>> UTF-8) should be acceptable.
> Thanks, Peter. I agree with all you said, except I think UTF-8 is not
> a charset, contrary to what MIME claims, right?
Right. The terminology is a mess. What MIME calls a "charset" is more
commonly known as a "character encoding". (In fact, I thought that one of
the MIME RFCs mentions that, but I can't find it right now.)
When I try to explain that stuff I distinguish between:
* character set - a set of characters in the mathematical sense, i.e.
  unordered.
* coded character set - as above, but each character is associated with
  a numerical code.
* character encoding - a particular mapping of a coded character set
  onto sequences of octets (or bits). MIME calls this a "charset", the
  Unicode standard calls it a "transformation format".
I use the term "charset" only when I talk about MIME, otherwise I talk
of "(coded) character sets" and don't abbreviate them to "charset".
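The three terms above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the sample characters and the helper name `octets_hex` are made up for the example, and Unicode code points stand in for the UCS coded character set.

```python
# Character set: an unordered collection of characters.
character_set = {"A", "é", "€"}

# Coded character set: every character is paired with a number.
# ord() returns the Unicode (UCS) code point of a character.
coded_character_set = {ch: ord(ch) for ch in character_set}


def octets_hex(ch, encoding):
    """Character encoding: map a character's code point to octets (shown as hex).

    Hypothetical helper, named for this example only.
    """
    return ch.encode(encoding).hex()


for ch in sorted(character_set, key=ord):
    # Same code point, different octet sequences depending on the encoding.
    print(f"U+{coded_character_set[ch]:04X}",
          octets_hex(ch, "utf-8"),
          octets_hex(ch, "utf-16-be"))
```

The code point stays the same in every row; only the octets change with the encoding, which is the part MIME labels "charset".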
> UCS is the charset, UTF-8 is an encoding. Is UCS the real charset
> when Content-Type specifies "charset=utf-8"?
Yes.
> This layers bizarrely on top of the MIME Content-Transfer-Encoding, of
> course. Will UCS data be encoded twice in the end?
This can happen, yes. If a message with UTF-8 content is to be
transmitted over a channel which isn't 8bit clean, a
Content-Transfer-Encoding of quoted-printable or base64 must be applied.
Think of UTF-8 as a mapping from a sequence of code points (typically
held as 16-bit or 32-bit quantities) onto a sequence of 8-bit
quantities, and quoted-printable or base64 as a mapping from a sequence
of 8-bit quantities onto a sequence of 7-bit quantities. (The remaining
Content-Transfer-Encodings, 7bit, 8bit and binary, are transparent.)
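The two layers can be sketched in Python with the standard `quopri` and `base64` modules. This is a rough illustration, not a full MIME implementation: the function names `to_wire` and `from_wire` and the sample text are assumptions made for the example, and header generation and line wrapping are ignored.

```python
import base64
import quopri


def to_wire(text, cte):
    """Apply both layers: charset (UTF-8) first, then the transfer encoding."""
    octets = text.encode("utf-8")            # code points -> 8-bit octets
    if cte == "quoted-printable":
        return quopri.encodestring(octets)   # 8-bit octets -> 7-bit-safe octets
    if cte == "base64":
        return base64.b64encode(octets)      # 8-bit octets -> 7-bit-safe octets
    return octets                            # 7bit, 8bit, binary: transparent


def from_wire(data, cte):
    """Undo the layers in reverse order."""
    if cte == "quoted-printable":
        data = quopri.decodestring(data)
    elif cte == "base64":
        data = base64.b64decode(data)
    return data.decode("utf-8")


sample = "Jürgen"
for cte in ("8bit", "quoted-printable", "base64"):
    wire = to_wire(sample, cte)
    print(cte, wire, from_wire(wire, cte) == sample)
```

With "8bit" the UTF-8 octets pass through untouched; with the other two, every octet on the wire fits in 7 bits.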
(Of course it doesn't stop there: SMTP and NNTP do a trivial bit of
extra encoding ("dot-stuffing"), TCP and IP only paste their headers
before chunks of data, but PPP for example is a bit more complicated,
and I don't really want to know what a DSL modem does to my precious
bits :-)).
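For the curious, the dot-stuffing step looks roughly like this in Python. It is a minimal sketch under simplifying assumptions: the body is already split into lines without their CRLF terminators, and the function names `dot_stuff` and `dot_unstuff` are invented for the example.

```python
from typing import List


def dot_stuff(lines: List[bytes]) -> bytes:
    """Prefix an extra "." to every line that starts with ".", then end the
    message with a line holding a single "."."""
    out = []
    for line in lines:
        if line.startswith(b"."):
            line = b"." + line
        out.append(line + b"\r\n")
    out.append(b".\r\n")  # end-of-message marker
    return b"".join(out)


def dot_unstuff(wire: bytes) -> List[bytes]:
    """Reverse the step: stop at the lone ".", drop one leading "." elsewhere."""
    lines = []
    for line in wire.split(b"\r\n"):
        if line == b".":
            break
        lines.append(line[1:] if line.startswith(b".") else line)
    return lines


body = [b"first line", b".", b"..hidden"]
print(dot_stuff(body))
print(dot_unstuff(dot_stuff(body)) == body)
```

A line consisting of a single dot inside the body becomes ".." on the wire, so it can't be mistaken for the end of the message.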
hp
--
_ | Peter J. Holzer | > Why should anyone invent something that
|_|_) | Sysadmin WSR | > doesn't exist?
| | | hjp@xxxxxx | What else would be the point of inventing?
__/ | http://www.hjp.at/ | -- P. Einstein and V. Gringmuth in desd