Re: unichr() question

From: Martin v. Löwis (martin_at_v.loewis.de)
Date: 11/05/03


Date: 05 Nov 2003 20:27:59 +0100


"Ezequiel, Justin" <j.ezequiel@spitech.com> writes:

> I am converting XML files with entities to utf-8 using a lookup table:
>
> &OverBrace; 0FE37
> &UnderBrace; 0FE38
> <sc>O</sc> 1D4AA

The last one is not an XML entity reference, of course. Also, you are
not converting to UTF-8, at least not in this table - you convert to
Unicode code points.

> I have no idea what I am doing but I sure think that I absolutely
> need it.

If you eventually need UTF-8, you might just as well create a mapping
table that translates to UTF-8.
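
A minimal sketch of such a table, assuming a current Python 3 (where
chr() accepts any code point) rather than the Python of this thread.
The three code points come from the lookup table quoted above; the
names ENTITY_CODE_POINTS, ENTITY_UTF8 and replace_entities are made up
for illustration.

```python
# Code points taken from the lookup table quoted above.
ENTITY_CODE_POINTS = {
    "&OverBrace;": 0xFE37,
    "&UnderBrace;": 0xFE38,
    "<sc>O</sc>": 0x1D4AA,  # markup fragment, not an entity reference
}

# Precompute the UTF-8 byte sequence for every key, so the conversion
# step is a plain byte-string replacement.
ENTITY_UTF8 = {
    key: chr(code_point).encode("utf-8")
    for key, code_point in ENTITY_CODE_POINTS.items()
}


def replace_entities(data):
    """Replace each known key in an ASCII/UTF-8 byte string by its UTF-8 form."""
    for key, utf8 in ENTITY_UTF8.items():
        data = data.replace(key.encode("ascii"), utf8)
    return data
```

Working on bytes means the result can be written to the output file
directly, without a separate encoding step.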

> Can you explain more on non-BMP characters (and Python's
> capabilities to represent these) and how it applies (if it does) to
> my needs?

Well, the BMP (basic multilingual plane) is the first 65536 code points
of Unicode. Recent Unicode revisions added characters beyond the first
64k, mostly rarely used ones; the MathML characters got allocated
there as well.

Python has traditionally used a two-byte type to represent Unicode,
so it cannot represent characters outside the BMP, at least not in
Unicode strings of length 1. If you compile Python with --enable-ucs4,
you can readily represent all these characters. If you have only
UCS-2, you need two-character surrogate pairs to represent non-BMP
characters; this is called UTF-16.
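
To make the surrogate-pair idea concrete, here is a small sketch of the
arithmetic defined by UTF-16 (see RFC 2781). The function name
to_surrogate_pair is made up for illustration, and the code assumes a
current Python 3.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % code_point)
    offset = code_point - 0x10000       # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


# The script O from the table above, U+1D4AA.
high, low = to_surrogate_pair(0x1D4AA)
print("%04X %04X" % (high, low))        # D835 DCAA
```

These two 16-bit values are what a UCS-2 build has to store as two
separate characters.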

If you want to learn more about UTF-16, see

http://www.wikipedia.org/wiki/UTF-16
http://www.faqs.org/rfcs/rfc2781.html

Python supports UTF-16 in the following contexts:
- encoding and decoding surrogate pairs in the UTF-8 codec
- representing surrogate pairs as a single \U unicode string
  escape sequence.

Other aspects of UTF-16, such as distinguishing between the length of
a string in code points vs. the length of the string in code units, are
not considered.
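
The difference between the two lengths can be illustrated as follows.
This sketch assumes Python 3.3 or later, where a string is always a
sequence of code points; on a UCS-2 build as described above, len() of
the same string would report 2. The helper utf16_length is made up for
illustration.

```python
s = "\U0001D4AA"    # a single \U escape for a non-BMP character

code_points = len(s)                            # length in code points
utf16_units = len(s.encode("utf-16-le")) // 2   # length in 16-bit code units
utf8_bytes = len(s.encode("utf-8"))             # length in UTF-8 bytes


def utf16_length(text):
    """Count UTF-16 code units: each non-BMP character counts twice."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


print(code_points, utf16_units, utf8_bytes)     # 1 2 4
```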

Regards,
Martin


