Re: unichr() question
From: Martin v. Löwis (martin_at_v.loewis.de)
Date: 11/05/03
Date: 05 Nov 2003 20:27:59 +0100
"Ezequiel, Justin" <j.ezequiel@spitech.com> writes:
> I am converting XML files with entities to utf-8 using a lookup table:
>
> ⏞ 0FE37
> ⏟ 0FE38
> <sc>O</sc> 1D4AA
The last one is not an XML entity reference, of course. Also, you are
not converting to UTF-8, at least not in this table - you convert to
Unicode code points.
> I have no idea what I am doing but I sure think that I absolutely
> need it.
If you eventually need UTF-8, you might just as well create a mapping
table that translates to UTF-8.
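A minimal sketch of such a table, assuming Python 3 (where chr() covers
every code point; the unichr() of the original question raises
ValueError for non-BMP values on a UCS-2 build). The two entity names
are hypothetical placeholders, and replace_all() is only an
illustrative helper; the hex values are the ones from your table.

```python
# Lookup table as quoted in the post: source text -> hex code point.
# "&overbrace;" and "&underbrace;" are hypothetical placeholder names.
CODE_POINTS = {
    "&overbrace;": "0FE37",
    "&underbrace;": "0FE38",
    "<sc>O</sc>": "1D4AA",
}


def to_utf8(hex_code_point):
    """Return the UTF-8 bytes for a code point given as a hex string."""
    return chr(int(hex_code_point, 16)).encode("utf-8")


# Precompute the UTF-8 form once, so the conversion is a plain lookup.
UTF8_TABLE = {key: to_utf8(value) for key, value in CODE_POINTS.items()}


def replace_all(text, table=UTF8_TABLE):
    """Encode text as UTF-8 and substitute every key found in the table."""
    data = text.encode("utf-8")
    for key, value in table.items():
        data = data.replace(key.encode("utf-8"), value)
    return data


if __name__ == "__main__":
    for key, value in UTF8_TABLE.items():
        print(key, value)
```

Working on bytes avoids the question of how the interpreter stores
non-BMP characters internally.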
> Can you explain more on non-BMP characters (and Python's
> capabilities to represent these) and how it applies (if it does) to
> my needs?
Well, the BMP (basic multilingual plane) is the first 65536 characters
of Unicode. Recent Unicode revisions added characters beyond the first
64k, mostly ones that are rarely used; the MathML characters were
allocated there as well.
Python has traditionally used a two-byte type to represent Unicode,
so it cannot represent characters outside the BMP, at least not in
Unicode strings of length 1. If you compile Python with
--enable-unicode=ucs4, you can readily represent all these characters.
If you have only UCS-2, you need two-character surrogate pairs to
represent non-BMP characters; this is called UTF-16.
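The surrogate-pair arithmetic can be sketched as follows, using the
formula from RFC 2781. This assumes Python 3; the function names are
illustrative and not part of any library.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % code_point)
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D4AA)

# Cross-check against the built-in UTF-16 codec (big-endian, no BOM).
codec_units = struct.unpack(">2H", chr(0x1D4AA).encode("utf-16-be"))

if __name__ == "__main__":
    print("U+1D4AA -> %04X %04X" % (high, low))
    print("codec   -> %04X %04X" % codec_units)
```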
If you want to learn more about UTF-16, see
http://www.wikipedia.org/wiki/UTF-16
http://www.faqs.org/rfcs/rfc2781.html
Python supports UTF-16 in the following contexts:
- encoding and decoding surrogate pairs in the UTF-8 codec
- representing surrogate pairs as a single \U unicode string
escape sequence.
Other aspects of UTF-16, such as distinguishing between the length of
a string in code points and its length in code units, are not
handled.
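To illustrate the two lengths, here is a small sketch assuming Python
3.3 or later, where len() counts code points. On a UCS-2 build of the
Python discussed here, len(u"\U0001D4AA") would be 2, because the
string holds the two surrogates. The helper names are illustrative.

```python
def code_point_length(s):
    """Length in code points (what len() reports on Python 3.3+)."""
    return len(s)


def utf16_code_unit_length(s):
    """Length in 16-bit UTF-16 code units (two bytes per unit, no BOM)."""
    return len(s.encode("utf-16-le")) // 2


sample = "O = \U0001D4AA"   # four BMP characters plus one non-BMP character

if __name__ == "__main__":
    print(code_point_length(sample), utf16_code_unit_length(sample))
```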
Regards,
Martin