Re: LANG, locale, unicode, setup.py and Debian packaging



Martin,
Thanks, food for thought indeed.

On Unix, yes. On Windows, NTFS and VFAT represent file names as Unicode
strings always, independent of locale. POSIX file names are byte
strings, and there isn't any good support for recording what their
encoding is.
I get my filenames from two sources:
1. A wxPython treeview control (unicode build)
2. os.listdir() with a unicode path passed to it

I have found that os.listdir() does not always return unicode objects when
passed a unicode path. Sometimes "byte strings" are returned in the list,
mixed-in with unicodes.

I will try the technique given at
http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html#guessing-the-encoding
Perhaps that will help.

Re os.listdir():
If you think you may have file names with mixed locales, and
the current locale might not match the file name's locale, you should
be using the byte string variant on Unix (which it seems you are already
doing).
I gather you mean that I should take a unicode path, encode it to a byte
string, and then pass that to os.listdir().
Then, I suppose, I will have to decode each resulting byte string (via the
detect routines mentioned in the link above) back into unicode, passing over
those I simply cannot interpret.
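
The round trip described above could be sketched roughly as follows. This is written for Python 3, where bytes and str stand in for the byte strings and unicode objects discussed in this thread. The helper names decode_names and list_font_names are made up for illustration, and a plain strict decode stands in for the guessing routines from the linked article.

```python
import os


def decode_names(raw_names, encoding):
    """Pair each raw byte-string name with its decoded form, or None."""
    pairs = []
    for raw in raw_names:
        try:
            decoded = raw.decode(encoding)  # strict: fail rather than guess
        except UnicodeDecodeError:
            decoded = None  # cannot interpret; the caller decides what to do
        pairs.append((raw, decoded))
    return pairs


def list_font_names(path, encoding="utf-8"):
    """List a directory through the byte-string API and decode what we can."""
    raw_path = os.fsencode(path)      # text path -> byte string
    raw_names = os.listdir(raw_path)  # bytes in, bytes out
    return decode_names(sorted(raw_names), encoding)
```

Keeping the raw name next to the decoded one means a file can still be opened even when its name could not be interpreted.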

Then, if the locale's encoding cannot decode the file names, you have
several options
a) don't try to interpret the file names as character strings, i.e.
don't decode them. Not sure why you need the file names - if it's
only to open the files, and never to present the file name to the
user, not decoding them might be feasible
So, you reckon I should stick to byte-strings for the low-level file open
stuff? It's a little complicated by my using Python Imaging to access the
font files. It hands it all over to Freetype and really leaves my sphere of
savvy.
I'll do some testing with PIL and byte-string filenames. I wish my memory was
better, I'm pretty sure I've been down that road and all my results kept
pushing me to stick to unicode objects as far as possible.

b) guess an encoding. For file names on Linux, UTF-8 is fairly common,
so it might be a reasonable guess.
c) accept lossy decoding, i.e. decode with some encoding, and use
"replace" as the error handler. You'll have to preserve the original
file names along with the decoded versions if you later also want to
operate on the original file.
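
Options (b) and (c) might be combined along these lines: guess UTF-8, decode lossily with the "replace" error handler, and keep every original byte string so the file can still be operated on. This is only a sketch in Python 3 terms; display_name and index_by_display are hypothetical helper names, not anything from PIL or wxPython.

```python
def display_name(raw, guess="utf-8"):
    """Decode for display only; undecodable bytes become U+FFFD."""
    return raw.decode(guess, "replace")


def index_by_display(raw_names, guess="utf-8"):
    """Map each display string to the list of raw names that produced it.

    Lossy decoding can make two different raw names look identical,
    so each display string maps to a list rather than a single name.
    """
    index = {}
    for raw in raw_names:
        index.setdefault(display_name(raw, guess), []).append(raw)
    return index
```

The display string goes to the GUI or console; the raw name from the index is what gets passed to open() and friends.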
Okay, I'm getting your drift.

That's not true. Try open("\xff","w"), then try interpreting the file
name as UTF-8. Some byte strings are not meaningful UTF-8, hence that
approach cannot work.
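
The point about "\xff" can be checked without touching the file system. A minimal sketch in Python 3 terms (is_valid_utf8 is a made-up helper name):

```python
def is_valid_utf8(raw):
    """Return True only if the byte string decodes strictly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


lone_ff = is_valid_utf8(b"\xff")                           # False
latin1_name = is_valid_utf8(b"M\xd6gul")                   # False
utf8_name = is_valid_utf8("M\u00d6gul".encode("utf-8"))    # True
```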
Okay.

That's correct, and there is no solution (not in Python, not in any
other programming language). You have to make trade-offs. For that,
you need to analyze precisely what your requirements are.
I would say the requirements are:
1. To open font files from any source (locale).
2. To display their filename on the gui and the console.
3. To fetch some text meta-info (family etc.) via PIL/Freetype and display
same.
4. To write the path and filename to text files.
5. To make soft links (path + filename) to another path.

So, there's a lot of unicode + unicode and os.path.join and so forth going on.
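
For requirement 5, one possible arrangement is to do all the joining and linking on byte strings, so that no decoding is involved at that level. This is a sketch for Python 3 on a POSIX system; link_font is a hypothetical helper, and it assumes every path component is already a byte string.

```python
import os


def link_font(raw_dir, raw_name, raw_target_dir):
    """Create a soft link in raw_target_dir that points at raw_dir/raw_name."""
    source = os.path.join(raw_dir, raw_name)       # bytes + bytes -> bytes
    link = os.path.join(raw_target_dir, raw_name)  # same name in the target
    os.symlink(source, link)
    return link
```

os.path.join refuses to mix byte strings and text, which at least makes accidental mixing show up early.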

I went through this exercise recently and had no joy. It seems the string
I chose to use simply would not render - even under 'ignore' and
'replace'.
I don't understand what "would not render" means.
I meant that it would not print the name; it constantly threw ASCII-related
errors.

I don't know if the character will survive this email, but the text I was
trying to display (under LANG=C) in a python script (not the immediate-mode
interpreter) was: "MÖgul". The second character is a capital O with an umlaut
(double-dots I think) above it. For some reason I could not get that to
display as "M?gul" or "Mgul".
BTW, I just made that up - it means nothing (to me). I hope it's not a swear
word in some other language :)
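
For what it is worth, the two outputs mentioned above can be produced by encoding explicitly with an error handler before printing, instead of relying on an implicit strict ASCII encode. A minimal sketch (the variable names are just for illustration):

```python
name = "M\u00d6gul"  # capital O with an umlaut as the second character

as_question = name.encode("ascii", "replace")  # unencodable char -> "?"
as_dropped = name.encode("ascii", "ignore")    # unencodable char removed
```

Here as_question is the byte string "M?gul" and as_dropped is "Mgul".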

As for font files - I don't know what encoding the family is in, but
I would sure hope that the format specification of the font file format
would also specify what the encoding for the family name is, or that
there are at least established conventions.
You'd think. It turns out that font files are anything but simple. I am doing
my best to avoid being sucked-into the black hole of complexity they
represent. I must stick to what PIL/Freetype can do. The internals of
font-files are waaaaaay over my head.

I would avoid locale.getlocale. It's a pointless function (IMO).
As a consequence, it will return None if it doesn't know better.
If all you want is the charset of the locale, use
locale.getpreferredencoding().
Brilliant summary - thanks a lot for that.
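
A small sketch of that advice: ask for the locale's charset with locale.getpreferredencoding() and fall back to ASCII if Python has no codec under the reported name. The wrapper locale_charset and its default are assumptions added for illustration.

```python
import codecs
import locale


def locale_charset(default="ascii"):
    """Return the locale's charset name, or default if it is unusable."""
    name = locale.getpreferredencoding()
    try:
        codecs.lookup(name)  # make sure Python has a codec for it
    except LookupError:
        return default
    return name
```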

You could just leave out the languages parameter, and trust gettext
to find some message catalog.
Right - I'll give that a go.
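
Leaving out the languages parameter might look like this. gettext.translation() then picks the language from the environment, and fallback=True makes it return a do-nothing translator instead of raising when no catalog is found. The function name and the domain passed to it are placeholders.

```python
import gettext


def load_translations(domain, localedir=None):
    """Let gettext choose the catalog from the environment."""
    # No languages argument: gettext consults LANGUAGE, LC_ALL,
    # LC_MESSAGES and LANG on its own.
    return gettext.translation(domain, localedir=localedir, fallback=True)
```

With no catalog installed, the returned object's gettext() method simply hands back the original message.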

This would mean cutting-out a percentage of the external font files that
can be used by the app.
See above. There are other ways to trade-off. Alternatively, you could
require that the program finds a richer locale, and bail out if the
locale is just "C".
That's kind of what the OP is all about. If I make this a 'design decision'
then it means I have a problem with the Debian packaging (and RPM?) rules
that require support for the "C" locale.
I think I shall have to break the links between my setup.py and the rest of my
app - so that setup.py will allow LANG=C but the app (when run) will not.
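
The run-time half of that split could be sketched as a start-up check that lives only in the application, never in setup.py. The function names and the error message are invented for illustration. Note also that newer Python 3 releases may report UTF-8 even under LANG=C, so this check mostly matters on older interpreters.

```python
import codecs
import locale


def is_ascii_only(encoding_name):
    """True if the named charset is ASCII under any of its aliases."""
    try:
        return codecs.lookup(encoding_name).name == "ascii"
    except LookupError:
        return True  # unknown charset: treat it as unusable


def require_rich_locale():
    """Bail out at application start-up when the locale is ASCII-only."""
    encoding = locale.getpreferredencoding()
    if is_ascii_only(encoding):
        raise SystemExit("Please run under a non-C locale, e.g. a UTF-8 one.")
    return encoding
```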

That doesn't help. For Turkish in particular, the UTF-8 locale is worse
than the ISO-8859-9 locale, as the lowercase I takes two bytes in UTF-8,
so tolower can't really work in the UTF-8 locale (but can in the
ISO-8859-9 locale).
Wow. I still get cold chills -- but I assume that once the right encoding is
known this sort of thing will be okay.
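
The byte counts behind the Turkish example are easy to confirm. A quick sketch, taking the lowercase form of "I" in Turkish to be the dotless i (U+0131):

```python
dotless_i = "\u0131"  # Turkish lowercase dotless i

utf8_bytes = dotless_i.encode("utf-8")         # two bytes
latin5_bytes = dotless_i.encode("iso-8859-9")  # one byte
```

A byte-at-a-time tolower() can map one byte to one byte, which works for the ISO-8859-9 form but not for the two-byte UTF-8 form.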

Thanks again. It's coming together slowly.
\d


