Re: accessing individual characters in unicode strings
- From: John Machin <sjmachin@xxxxxxxxxxx>
- Date: Sat, 12 Apr 2008 02:33:58 -0700 (PDT)
On Apr 12, 3:45 pm, Peter Robinson <pe...@xxxxxxxxxxxxxxx> wrote:
> Dear list
> I am at my wits' end on what seemed a very simple task:
> I have some Greek text, nicely encoded in utf8, going in and out of an
> xml database, being passed over and beautifully displayed on the web.
> For example: the most common Greek word of all 'kai' (or και if your
> mailer can see utf8)
> So all I want to do is:
> step through this string a character at a time, and do something for
> each character (actually set a width attribute somewhere else for each
> character)
> Should be simple, yes?
> Turns out to be near impossible. I tried using a simple index
> character routine such as ustr[0]..ustr[1]... and this gives rubbish.
> So I use len() to find out how long my simple Greek string is, and of
> course it is NOT three characters long.
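The utf8-encoded incarnation is three characters long and it's six
bytes long; len() on the undecoded byte string counts bytes, not
characters. UTF-8 is not Unicode: it is one way of encoding Unicode
as bytes.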
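
To make the bytes-versus-characters point concrete, here is a minimal sketch. The variable names are illustrative only, and it is written with a b'' literal and .decode() so that it should behave the same on recent Python 2 and on Python 3:

```python
# UTF-8 bytes for the Greek word 'kai' (kappa, alpha, iota).
raw = b'\xce\xba\xce\xb1\xce\xb9'

# Decoding turns the bytes into a unicode string of code points.
text = raw.decode('utf8')

byte_count = len(raw)    # counts bytes
char_count = len(text)   # counts characters (code points)

print(byte_count)
print(char_count)
```

Each of these Greek letters takes two bytes in UTF-8, which is why the two counts differ.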
> A day of intensive searching around the lists tells me that unicode
> and python is a moving target: so many fixes are suggested for similar
> problems, none apparently working with mine.
> Here is the best I can do, so far
> I convert the utf8 string using
> ustr = repr(unicode(thisword, 'iso-8859-7'))
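Don't do that: repr() gives you a string of escape sequences rather
than the unicode object itself, and iso-8859-7 is the wrong codec for
utf8 data. If you have a utf8 string, convert it to unicode like
this: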
ustr = unicode(the_utf8_string, 'utf8')
If you have a string encoded in iso-8859-7, convert it to unicode like
this:
ustr = unicode(the_iso_8859_7_string, 'iso-8859-7')
Then inspect it like this:
print repr(ustr)
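
As a rough illustration of why the codec has to match the data, the sketch below decodes the same UTF-8 bytes with both codecs. The names are made up for the example, and the ISO-8859-7 byte values for 'kai' are my own addition, not taken from the original post:

```python
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'   # 'kai' encoded as UTF-8
greek_bytes = b'\xea\xe1\xe9'              # 'kai' encoded as ISO-8859-7 (assumed sample)

# Matching codec: three Greek letters.
right = utf8_bytes.decode('utf8')

# Mismatched codec: every byte becomes its own (wrong) character.
wrong = utf8_bytes.decode('iso-8859-7')

# ISO-8859-7 data decoded with its own codec gives the same text.
also_right = greek_bytes.decode('iso-8859-7')

print(len(right))
print(len(wrong))
print(right == also_right)
```

Decoding with the wrong codec does not raise an error here; it silently produces six unrelated characters, which matches the "rubbish" described in the question.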
Here's a sample interactive session:
>>> thisword = '\xce\xba\xce\xb1\xce\xb9'
>>> ustr = unicode(thisword, 'utf8')
>>> len(ustr)
3
>>> print repr(ustr)
u'\u03ba\u03b1\u03b9'
>>> import unicodedata
>>> [unicodedata.name(x) for x in ustr]
['GREEK SMALL LETTER KAPPA', 'GREEK SMALL LETTER ALPHA', 'GREEK SMALL
LETTER IOTA']
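
Once the string is decoded, stepping through it a character at a time is an ordinary loop. The following is only a sketch of the per-character task described in the question: char_widths and default_width are hypothetical names, and the constant width stands in for whatever real lookup is needed.

```python
import unicodedata

def char_widths(utf8_bytes, default_width=1):
    """Decode UTF-8 bytes and return one (char, name, width) tuple per character.

    default_width is a placeholder; a real width lookup would go here.
    """
    text = utf8_bytes.decode('utf8')
    result = []
    for ch in text:
        # The second argument is returned for characters that have no name.
        name = unicodedata.name(ch, 'UNKNOWN')
        result.append((ch, name, default_width))
    return result

widths = char_widths(b'\xce\xba\xce\xb1\xce\xb9')
for ch, name, width in widths:
    print('%s %d' % (name, width))
```

If the text can contain combining accents, each accent is a separate item in this loop; normalizing first with unicodedata.normalize('NFC', text) may be worth considering in that case.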
Suggested reading: the Python Unicode HOWTO at http://www.amk.ca/python/howto/unicode
This may be handy: http://unicode.org/charts/PDF/U0370.pdf
HTH,
John