Re: Sorting a list of Unicode strings?



oliver@xxxxxxxxxxxx <oliver@xxxxxxxxxxxx> wrote:
> ...
> Maybe I'm missing something fundamental here, but if I have a list of
> Unicode strings, and I want to sort these alphabetically, then it
> places those that begin with unicode characters at the bottom.
> ...
> Anyway, I know _why_ it does this, but I really do need it to sort
> them correctly based on how humans would look at it.

Depending on the nationality of those humans, you may need very
different sorting criteria; indeed, in some countries, different sorting
criteria apply to different use cases (such as sorting surnames versus
sorting book titles, etc; sorry, I don't recall specific examples, but
if you delve into sites about i18n issues you'll find some).

In both Swedish and Danish, I believe, A-with-ring sorts AFTER the
letter Z in the alphabet; so, having Aaland (where I'm using Aa for
A-with-ring, since this newsreader has some problem in letting me enter
non-ascii characters;-) sort "right at the bottom", while it "doesn't
look right" to YOU (maybe an English-speaker?), may look right to the
inhabitants of that locality (be they Danes or Swedes -- but I believe
Norwegian may also work similarly in terms of sorting).
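
To make the difference concrete, here is a minimal sketch (Python 3 syntax) that compares plain code-point sorting with a hand-rolled "Swedish alphabet" key. The names SWEDISH_ALPHABET, RANK and swedish_key are made up for this illustration, and the alphabet string is a deliberate simplification: real Swedish collation has more rules than "a-ring, a-umlaut and o-umlaut go after z".

```python
# Assumed, simplified Swedish alphabet order: the three extra letters
# come after z. This is an illustration, not a full collation.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(s):
    # Letters outside the alphabet string rank after all known letters
    # and are then ordered among themselves by code point.
    return [(RANK.get(ch, len(RANK)), ch) for ch in s.lower()]

names = ["Örebro", "Zorn", "Åland", "Ängelholm", "Arboga"]

# Default sort compares code points, so the non-ASCII names end up
# last, in code-point order rather than alphabet order.
by_code_point = sorted(names)

# With the key function, the tail follows the (assumed) Swedish order.
by_swedish = sorted(names, key=swedish_key)

print(by_code_point)
print(by_swedish)
```

Both results put the non-ASCII names at the bottom; they differ only in how those names are ordered among themselves, which is exactly the kind of detail that depends on who is reading the list.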

The Unicode consortium does define a standard collation algorithm (UCA)
and table (DUCET) to use when you need a locale-independent ordering; at
<http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm>
you'll be able to obtain James Tauber's Python implementation of UCA, to
work with the DUCET, which the Unicode Consortium publishes as
allkeys.txt at <http://www.unicode.org/Public/UCA/latest/allkeys.txt>.

I suspect you won't like the collation order you obtain this way, but
you might start from there, subsetting and tweaking the DUCET into an
OUCET (Oliver Unicode Collation Element Table;-) that suits you better.
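
To give a feel for what a collation element table does, here is a toy sketch (Python 3 syntax). It is NOT James Tauber's implementation and the weights are invented, not taken from the DUCET; the names TOY_TABLE and toy_sort_key are hypothetical. It only shows the multi-level idea: compare all primary weights (base letters) first, and fall back to secondary weights (accents) only to break ties.

```python
# Toy collation element table: character -> (primary, secondary).
# The weights are invented for illustration; they are not DUCET values.
TOY_TABLE = {
    "a": (1, 0), "á": (1, 1),   # same primary weight, different secondary
    "b": (2, 0),
    "e": (3, 0), "é": (3, 1),
}

def toy_sort_key(s):
    # Look up the collation element of every character (raises KeyError
    # for characters missing from the toy table).
    elements = [TOY_TABLE[ch] for ch in s]
    # Level 1: all primary weights; level 2: all secondary weights.
    primary = tuple(p for p, _ in elements)
    secondary = tuple(sec for _, sec in elements)
    return (primary, secondary)

words = ["ab", "áa", "aa", "be", "bé"]
result = sorted(words, key=toy_sort_key)
print(result)
```

Subsetting or tweaking a real table amounts to changing these weights, which is why the ordering you get is only as good as the table you feed in.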

A simpler, rougher approach, if you think the "right" collation is
obtained by ignoring accents, diacritics, etc (even though the speakers
of many languages that include diacritics, &c, disagree;-) is to use the
key=coll argument in your sorting call, passing a function coll that
maps any Unicode string to what you _think_ it should be like for
sorting purposes. The .translate method of Unicode string objects may
help there: it takes a dict mapping Unicode ordinals to ordinals or
strings (or None for characters you want to delete as part of the
translation).
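
As a small sketch of the three kinds of values such a dict may hold (written for Python 3, where every str is a Unicode string; the table contents are just an example, not a recommended mapping):

```python
# Keys are Unicode ordinals; values may be an ordinal, a string, or None.
table = {
    ord("é"): ord("e"),   # ordinal -> ordinal: replace with one character
    ord("ß"): "ss",       # ordinal -> string: replace with several characters
    ord("-"): None,       # None: delete the character
}

result = "café-straße".translate(table)
print(result)  # cafestrasse
```

Characters that do not appear as keys in the dict pass through unchanged.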

For example, suppose that what we want is the following somewhat silly
collation: we only care about ISO-8859-1 characters, and want to ignore
for sorting purposes any accent (be it grave, acute or circumflex),
umlauts, slashes through letters, tildes, cedillas. htmlentitydefs has
a useful dict called codepoint2name that helps us identify those "weirdly
decorated foreign characters".

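def make_transdict():
    # Build a dict for unicode.translate that maps each accented
    # ISO-8859-1 code point to its base letter, relying on HTML entity
    # names such as 'eacute' (base letter followed by a suffix).
    import htmlentitydefs
    cp2n = htmlentitydefs.codepoint2name
    suffixes = 'acute grave circ uml slash tilde cedil'.split()
    td = {}
    for x in range(128, 256):
        if x not in cp2n: continue
        n = cp2n[x]
        for s in suffixes:
            if n.endswith(s):
                # strip the suffix: 'eacute' -> u'e'; a bare accent such
                # as 'uml' maps to u'' and is dropped from the string
                td[x] = unicode(n[:-len(s)])
                break
    return td

def coll(us, td=make_transdict()):
    # the translation dict is built once, when coll is defined
    return us.translate(td)

listofus.sort(key=coll)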


I haven't tested this code, but it should be reasonably easy to fix any
problems it might have, as well as making make_transdict "richer" to
meet your goals. Just be aware that the resulting collation (e.g.,
sorting a-ring just as if it was a plain a) will be ABSOLUTELY WEIRD to
anybody who knows something about Scandinavian languages...!!!-)
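
If you do want the "ignore every diacritic" collation without maintaining a suffix list, a different technique from the entity-name trick above is to decompose each string with the standard unicodedata module and drop the combining marks. This is only a sketch for Python 3; strip_marks and accentless_key are names made up for the example, and letters with no decomposition (such as o-slash) are left untouched.

```python
import unicodedata

def strip_marks(s):
    # NFD splits e.g. A-with-ring into 'A' plus a combining ring above.
    decomposed = unicodedata.normalize("NFD", s)
    # unicodedata.combining() is nonzero for combining marks; drop those.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accentless_key(s):
    # Case-insensitive key that ignores diacritics.
    return strip_marks(s).lower()

names = ["zebra", "Åland", "apple", "éclair"]
result = sorted(names, key=accentless_key)
print(result)
```

The same caveat applies: this files a-ring under plain a, which is convenient for some audiences and plainly wrong for Scandinavian ones.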


Alex