Re: How to use 8bit character sets?



copx wrote:
For some reason Python (on Windows) doesn't use the system's default character set and that's a serious problem for me.
I need to process German text files (containing umlauts and other characters outside 7-bit ASCII) and generally work with strings that need to be processed using the local encoding (for example, I need to display the text in a Tk-based GUI). The only solution I managed to find was converting between Unicode and latin-1 all the time (the text files aren't Unicode, and the output of the program isn't supposed to be Unicode either). Everything worked fine until I tried to run the program on a Windows 9x machine. It seems that Python on Win9x doesn't really support Unicode (IIRC Win9x doesn't have real Unicode support, so that's not surprising).
Is it possible to tell Python to use an 8-bit charset (latin-1 in my case) for text file and string processing by default?


copx


1. Your description of your problem is extremely vague. If you were to supply a minimal script that "works" [on what platform? what version of Python?], with a description of what you understand by "works", and of what happens differently when you run that script on a Win9x box [for what value(s) of x? what version of Python?], we might be able to help you. N.B. Somewhere near the top of the script you should have something like:

import sys
print "Python version:", sys.version
print "platform:", sys.platform
print "default encoding:", sys.getdefaultencoding()
try:
    print "Windows version:", sys.getwindowsversion()
except AttributeError:
    print "sys.getwindowsversion not available"

2. You should read this:

http://www.catb.org/~esr/faqs/smart-questions.html

3. You should not rely on a crutch like a default encoding, especially one obtained by a kludge like sitecustomize.py. If your app expects to receive data in encoding x and send data in encoding y, these facts are properties of the application and the data, NOT the box you are running on. If you had a requirement to read MacCyrillic from a Classic Mac and write KOI8 for consumption on a Windows PC, you should be able to do it on a SPARC Solaris box in Timbuktu or Walla Walla, Wa., without having to fiddle with site-wide configuration.

4. AFAIK, support for Unicode is provided by Python with no assistance from the operating system. The multitudinous deficiencies in Win9x should have no bearing on the problem. Have you tried to run your program on a Win2K or WinXP box?
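
To make point 3 concrete, here is a minimal sketch of what "the encodings are properties of the application" could look like. The names INPUT_ENCODING, OUTPUT_ENCODING, read_text, write_text and demo are made up for illustration, and latin-1 is assumed only because that is what you said your files use. It sticks to plain binary file I/O plus explicit decode/encode calls, so it does not depend on any default encoding; the u"..." literals assume Python 2.x or Python 3.3+.

```python
import os
import tempfile

# Hypothetical names: the encodings belong to the data, so they are spelled
# out in the application instead of being taken from a site-wide default.
INPUT_ENCODING = "latin-1"
OUTPUT_ENCODING = "latin-1"


def read_text(path, encoding=INPUT_ENCODING):
    """Read raw bytes from path and decode them to a unicode string."""
    f = open(path, "rb")
    try:
        return f.read().decode(encoding)
    finally:
        f.close()


def write_text(path, text, encoding=OUTPUT_ENCODING):
    """Encode a unicode string and write the resulting bytes to path."""
    f = open(path, "wb")
    try:
        f.write(text.encode(encoding))
    finally:
        f.close()


def demo():
    """Round-trip a German word through a temporary latin-1 file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # "Gruesse" spelled with u-umlaut and sharp s, written as escapes
        # so the source file itself stays pure ASCII.
        write_text(path, u"Gr\u00fc\u00dfe")
        return read_text(path), os.path.getsize(path)
    finally:
        os.remove(path)
```

Calling demo() should hand back the same five-character string together with a file size of five bytes, since latin-1 stores each of these characters in one byte. Inside the program everything is unicode; the 8-bit encoding only appears at the two points where bytes enter and leave.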

HTH,

John