Re: How to get the ascii code of Chinese characters?
- From: Gerhard Fiedler <gelists@xxxxxxxxx>
- Date: Sat, 19 Aug 2006 18:02:15 -0300
On 2006-08-19 16:54:36, Peter Maas wrote:
Gerhard Fiedler wrote:
Well, ASCII can represent the Unicode numerically -- if that is what the OP
wants.
No. The ASCII character range is 0..127 while the Unicode character range is
at least 0..65535.
Actually, Unicode goes beyond 65535. But right in this sentence, you
represented the number 65535 with ASCII characters, so it doesn't seem to
be impossible.
For example, "U+81EC" (all ASCII) is one possible -- not very
readable though <g> -- representation of a Hanzi character (see
http://www.cojak.org/index.php?function=code_lookup&term=81EC).
U+81EC means a Unicode character which is represented by the number
0x81EC.
Exactly. Both versions represented in ASCII right in your message :)
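To make the point concrete, here is a minimal Python 3 sketch (the thread predates Python 3, so the syntax is an assumption on my part) that turns a single character into its "U+XXXX" notation and checks that the result contains only ASCII characters. The helper name to_u_notation is hypothetical, chosen just for this illustration.

```python
def to_u_notation(ch):
    """Return the 'U+XXXX' notation for one character (hypothetical helper)."""
    # ord() gives the Unicode code point as an integer; %04X pads to
    # at least four uppercase hex digits, as in the usual U+ notation.
    return "U+%04X" % ord(ch)


hanzi = "\u81ec"                 # the Hanzi character discussed above
label = to_u_notation(hanzi)
print(label)                     # U+81EC

# Every character of the label is in the ASCII range 0..127.
print(all(ord(c) < 128 for c in label))

# Code points beyond 65535 work the same way, just with more hex digits.
print(to_u_notation("\U00020000"))
```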
UTF-8 maps Unicode strings to sequences of bytes in the range 0..255,
UTF-7 maps Unicode strings to sequences of bytes in the range 0..127.
You *could* read the latter as ASCII sequences but this is not correct.
Of course not "correct". I guess the only "correct" representation is the
original Chinese character. But the OP doesn't seem to want this... so a
non-"correct" representation is necessary anyway.
How to do it in Python? Let chinesePhrase be a Unicode string with
Chinese content. Then
chinesePhrase_7bit = chinesePhrase.encode('utf-7')
will produce a sequence of bytes in the range 0..127 representing
chinesePhrase and *looking like* a (meaningless) ASCII sequence.
Actually, no. There are quite a few code positions in the range 0..127 that
don't "look like" anything (non-printable). And, as you say, this is rather
meaningless.
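For anyone who wants to try the UTF-7 route, a small sketch in Python 3 syntax (an assumption; the original line was written for the Python of the time). It only checks the two properties claimed above: every byte is in 0..127, and the bytes decode back to the original string. The function name encode_7bit and the sample phrase are made up for illustration.

```python
def encode_7bit(text):
    """Encode text as UTF-7 bytes (hypothetical helper for illustration)."""
    return text.encode("utf-7")


chinese_phrase = "\u81ec\u4e2d"          # sample two-character string
seven_bit = encode_7bit(chinese_phrase)

# Iterating over bytes yields integers in Python 3.
print(all(b < 128 for b in seven_bit))   # every byte fits in 7 bits

# The encoding is reversible, so no information is lost.
print(seven_bit.decode("utf-7") == chinese_phrase)
```

The result fits in 7 bits, but it is an encoding of the text and not a list of character codes, so it is not something a reader can interpret by eye.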
chinesePhrase_16bit = chinesePhrase.encode('utf-16be')
will produce a sequence with Unicode numbers packed in a byte
string in big endian order. This is probably closest to what
the OP wants.
That's what you think... but it's not really ASCII. If you want this in
ASCII, and readable, I still suggest transforming this sequence of 2-byte
values (for most Chinese characters it will be 2 bytes per character) into a
sequence of something like U+81EC (or 0x81EC if you are a C fan, or 81EC if
you can imply the rest)... that's where we come back to my original
suggestion :)
Gerhard
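The transformation suggested above could be sketched roughly like this in Python 3 (an assumption; the thread itself is older). It encodes the string as UTF-16BE, unpacks the bytes into 16-bit values, and prints each one in U+XXXX form. The function name utf16be_to_u_notation is hypothetical. Note that this works on 16-bit code units: characters outside the Basic Multilingual Plane would show up as two surrogate values, so iterating over the string with ord() is the simpler choice if such characters matter.

```python
import struct


def utf16be_to_u_notation(text):
    """Render text as space-separated U+XXXX values (hypothetical helper).

    Works on UTF-16 code units, so it matches code points only for
    characters in the Basic Multilingual Plane.
    """
    data = text.encode("utf-16be")
    # ">" = big endian, "H" = unsigned 16-bit; one value per 2 bytes.
    units = struct.unpack(">%dH" % (len(data) // 2), data)
    return " ".join("U+%04X" % u for u in units)


print(utf16be_to_u_notation("\u81ec\u4e2d"))   # U+81EC U+4E2D
```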