Re: accessing individual characters in unicode strings



On Apr 12, 3:45 pm, Peter Robinson <pe...@xxxxxxxxxxxxxxx> wrote:
> Dear list
> I am at my wits' end on what seemed a very simple task:
> I have some Greek text, nicely encoded in utf8, going in and out of an
> XML database, being passed over and beautifully displayed on the web.
> For example: the most common Greek word of all, 'kai' (or και if your
> mailer can see utf8).
> So all I want to do is:
> step through this string a character at a time, and do something for
> each character (actually set a width attribute somewhere else for each
> character).
>
> Should be simple, yes?
> Turns out to be near impossible. I tried using a simple index
> character routine such as ustr[0]..ustr[1]... and this gives rubbish.
> So I use len() to find out how long my simple Greek string is, and of
> course it is NOT three characters long.

The utf8-encoded incarnation represents three characters, but it is six
bytes long, and len() on a byte string counts bytes. UTF-8 is not
Unicode; it is one encoding of it.


> A day of intensive searching around the lists tells me that Unicode
> and Python are a moving target: so many fixes are suggested for similar
> problems, none apparently working with mine.
>
> Here is the best I can do, so far.
> I convert the utf8 string using
> ustr = repr(unicode(thisword, 'iso-8859-7'))

Don't do that. If you have a utf8 string, convert it to unicode like
this:

ustr = unicode(the_utf8_string, 'utf8')

If you have a string encoded in iso-8859-7, convert it to unicode like
this:

ustr = unicode(the_iso_8859_7_string, 'iso-8859-7')

Then inspect it like this:
print repr(ustr)
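
To see why the choice of codec matters, here is a small sketch (an illustration added for clarity, not code from the original exchange) that decodes the same six bytes both ways. It uses bytes.decode(), which should behave the same in Python 2.6+ and Python 3; the names raw, right and wrong are made up for the example.

```python
# The six UTF-8 bytes for the Greek word 'kai' (kappa, alpha, iota).
raw = b'\xce\xba\xce\xb1\xce\xb9'

# Correct: the bytes are UTF-8, so decode them as UTF-8.
right = raw.decode('utf8')

# Wrong: ISO-8859-7 is a single-byte codec, so each of the six bytes
# is turned into its own, unrelated character.
wrong = raw.decode('iso-8859-7')

print('%d bytes' % len(raw))
print('%d characters when decoded as utf8' % len(right))
print('%d characters when decoded as iso-8859-7' % len(wrong))
```

The counts come out as 6, 3 and 6: only the utf8 decoding gives the three characters the word actually has.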

Here's a sample interactive session:

>>> thisword = '\xce\xba\xce\xb1\xce\xb9'
>>> ustr = unicode(thisword, 'utf8')
>>> len(ustr)
3
>>> print repr(ustr)
u'\u03ba\u03b1\u03b9'
>>> import unicodedata
>>> [unicodedata.name(x) for x in ustr]
['GREEK SMALL LETTER KAPPA', 'GREEK SMALL LETTER ALPHA', 'GREEK SMALL LETTER IOTA']
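
Once the bytes are decoded, stepping through the word one character at a time is an ordinary loop. The following is a minimal sketch of that, written so it should run on both Python 2.6+ and Python 3 (hence bytes.decode() rather than unicode()); the list name chars is hypothetical, and the per-character "do something" step is only a print here, since the actual width-setting code is not shown in the thread.

```python
import unicodedata

# The six UTF-8 bytes for the Greek word 'kai'.
raw = b'\xce\xba\xce\xb1\xce\xb9'

# Decode once; after this, indexing and len() work per character.
word = raw.decode('utf8')

# Collect (index, code point, Unicode name) for each character.
chars = []
for i, ch in enumerate(word):
    chars.append((i, ord(ch), unicodedata.name(ch)))

# Placeholder for the real per-character work (e.g. setting a width).
for i, codepoint, name in chars:
    print('%d U+%04X %s' % (i, codepoint, name))
```

Indexing works the same way: word[0] is the kappa, word[1] the alpha and word[2] the iota.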

Suggested reading: the Python Unicode HOWTO at http://www.amk.ca/python/howto/unicode

This may be handy: http://unicode.org/charts/PDF/U0370.pdf

HTH,
John