Re: PEP 263 status check

From: Hallvard B Furuseth (h.b.furuseth_at_usit.uio.no)
Date: 08/08/04


Date: 08 Aug 2004 01:05:07 +0200

John Roth wrote:
>"Hallvard B Furuseth" <h.b.furuseth@usit.uio.no> wrote in message
>news:HBF.20040806qchc@bombur.uio.no...
>>An addition to Martin's reply:
>>John Roth wrote:
>>>"Martin v. Löwis" <martin@v.loewis.de> wrote in message
>>>news:41137799.70808@v.loewis.de...
>>>>
>>>> To be more specific: In an UTF-8 source file, doing
>>>>
>>>> print "ö" == "\xc3\xb6"
>>>> print "ö"[0] == "\xc3"
>>>>
>>>> would print two times "True", and len("ö") is 2.
>>>> OTOH, len(u"ö")==1.
>>>
>>> (...)
>>> I'd expect that the compiler would reject anything that
>>> wasn't either in the 7-bit ascii subset, or else defined
>>> with a hex escape.
>>
>> Then you should also expect a lot of people to move to
>> another language - one whose designers live in the real
>> world instead of your Utopian Unicode world.
>
> Rudeness objection to your characterization.

Sorry, I guess that was a bit over the top. I've just gotten so fed up
with bad charset handling, including over-standardization, over the
years. And as you point out, I misunderstood the scope of your
suggestion. But you have been saying that people should always use
Unicode, and things like that.

> Please see my response to Martin - I'm talking only,
> and I repeat ONLY, about scripts that explicitly
> say they are encoded in utf-8. Nothing else. I've
> been in this business for close to 40 years, and I'm
> quite well aware of backwards compatibility issues
> and issues with breaking existing code.
>
> Programmers in general have a very strong, and
> let me repeat that, VERY STRONG assumption
> that an 8-bit string contains one byte per character
> unless there is a good reason to believe otherwise.

Often true in our part of the world. However, another VERY STRONG
assumption is that if we feed the computer a raw character string and
ensure that it doesn't do any fancy charset handling, the program won't
mess with the string and things will Just Work. Well, except that
programs that strip the 8th bit are a problem. But there is no longer
any telling what a program will do if it gets the idea that it can be
helpful about the character set.

The biggest problem with labeling anything as Unicode may be that it
will have to be converted back before it is output, but the program
often does not know which character set to convert it to. It might not
be running on a system where "the charset" is available in some standard
location. It might not be able to tell from the name of the locale. In
any case, the desired output charset might not be the same as that of
the current locale. So the program (or some module it is using) can
decide to guess, which can give very bad results, or it can fail, which
is no fun either. Or the programmer can set a default charset, even
though he does not know that the user will be using this charset. Or
the program can refuse to run unless the user configures the charset,
which is often nonsense.
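
A minimal sketch of those choices, written in present-day Python 3 as an
assumption (the thread concerns Python 2, where 'str' was the byte
string). The helper name emit and the sample text are hypothetical:

```python
# Hypothetical helper: convert text back to bytes before output.
def emit(text, charset):
    """Encode text in the given charset; may raise UnicodeEncodeError."""
    return text.encode(charset)

text = "bl\u00e5b\u00e6r"  # the Norwegian word "blåbær"

# Guessing: the same text becomes different bytes depending on the guess.
as_latin1 = emit(text, "iso-8859-1")
as_utf8 = emit(text, "utf-8")

# Failing: a charset that cannot represent the text raises an error.
try:
    emit(text, "ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# A wrong guess shows up as garbage when the reader assumes another charset.
garbled = as_utf8.decode("iso-8859-1")
```

None of the three outcomes is right unless the program actually knows
what the receiving end expects.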

The rest of my reply to that grew to a rather large rant with very
little relevance to PEP 263, so I moved it to the end of this message.

Anyway, the fact remains that in quite a number of situations, the
simplest way to do charset handling is to keep various programs firmly
away from charset issues. If a program does not know which charset is
in use, the best way is to not do any charset handling. In the case of
Python strings, that means 'str' literals instead of u'Unicode'
literals. Then the worst that can happen if the program is run with an
unexpected charset/encoding is that the strings built into the program
will not be displayed correctly.
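
A sketch of what "no charset handling" can look like, assuming Python 3,
where the closest equivalent of the 'str' literals discussed here is
bytes. The function copy_raw and the sample data are hypothetical:

```python
import io

def copy_raw(src, dst):
    """Copy lines from src to dst as raw bytes, with no decoding at all."""
    for line in src:
        dst.write(line)

# The same code handles either encoding, because it never interprets the bytes.
latin1_data = b"r\xf8d gr\xf8t\n"        # "rød grøt" in iso-8859-1
utf8_data = b"r\xc3\xb8d gr\xc3\xb8t\n"  # the same text in UTF-8

out = io.BytesIO()
copy_raw(io.BytesIO(latin1_data), out)
copy_raw(io.BytesIO(utf8_data), out)
```

Whether the result displays correctly is then left to whatever reads it,
which is the point of the approach.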

It would be nice to have a machinery to tag all strings, I/O channels
and so on with their charset/encoding and with what to do if a string
cannot be converted to that encoding, but lacking that (or lacking
knowledge of how to tag some data), no charset handling will remain
better than guesstimate charset handling in some situations.

> This assumption is built into various places, including
> all of the string methods.

I don't agree with that, but maybe it's a matter of how we view function
and type names, or something.

> The current design allows accidental inclusion of
> a character that is not in the 7bit ascii subset ***IN
> A PROGRAM THAT HAS A UTF-8 CHARACTER
> ENCODING DECLARATION*** to break that
> assumption without any kind of notice. That in
> turn will break all of the assumptions that the string
> module and string methods are based on. That in
> turn is likely to break lots of existing modules and
> cause a lot of debugging time that could be avoided
> by proper design.

For programs that think they work with Unicode strings, yes. For
programs that have no charset opinion, quite the opposite is true.

>> And tell me why I shouldn't be allowed to work easily with raw
>> UTF-8 strings, if I do use coding:utf-8.
>
> First, there's nothing that's stopping you. All that
> my proposal will do is require you to do a one
> time conversion of any strings you put in the
> program as literals. It doesn't affect any other
> strings in any other way at any other time.

It is not a one-time conversion if it's inside a loop or a small
function which is called many times. It would have to be moved out
to a global variable or something, which makes the program a lot more
cumbersome.

Second, any time one has to write more complex expressions to achieve
something, it becomes easier to introduce bugs. In particular when
people's solution will sometimes be to write '\xc3\xb8' instead of 'ø'
and add a comment with the real string. If the comment is wrong, which
happens, the bug may survive for a long time.
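
A small sketch of that failure mode, assuming Python 3 with bytes
standing in for the 8-bit strings discussed here; the variable names are
hypothetical:

```python
# The escape and the real character are the same two bytes in UTF-8.
escaped = b"\xc3\xb8"               # meant to be "ø"
literal = "\u00f8".encode("utf-8")  # what 'ø' is in a UTF-8 source file
same = (escaped == literal)

# A mistyped escape runs without any complaint; only the comment is wrong.
typo = b"\xc3\xb6"                  # comment claims "ø", bytes are "ö"
actual = typo.decode("utf-8")
```

Nothing in the language checks the comment against the bytes, so the
mismatch is only found when someone looks at the output.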

> I'll withdraw my objection if you can seriously
> assure me that working with raw utf-8 in
> 8-bit character string literals is what most programmers
> are going to do most of the time.

Of course it isn't. Nor is working with a lot of other Python features.

> I'm not going to accept the very common need
> of converting unicode strings to 8-bit strings so
> they can be written to disk or stored in a data base
> or whatnot (or reversing the conversion for reading.)

That's your choice, of course. It's not mine.

> That has nothing to do with the current issue - it's
> something that everyone who deals with unicode
> needs to do, regardless of the encoding of the
> source program.

I'm not even sure which issue is the 'current issue',
if it makes that irrelevant.

========

<rant>
  I've been a programmer for about 20 years, and for most of that time
  the solution to charset issues in my environment (Tops-20, Unix, no
  multi-cultural issues) has been for the user to take care of the
  matter.
  
  At first, the computer thought it was using ASCII, we were using
  terminals and printers with NS_4551-1 - not that I knew a name for it
  - and that was that. (NS_4551-1 is ASCII with [\]{|} replaced with
  ÆØÅæøå.) If we wanted to print an ASCII file, there might be a switch
  to get an ASCII font, we might have an ASCII printer/terminal, or we
  just learned to read ÆØÅ as [\] and vice versa. A C program which
  should output a Norwegian string would use [\\] as ÆØÅ - or the other
  way around, depending on how one displayed the program.
  
  Then some programs began to become charset-aware, but they "knew" that
  we were using ASCII, and began to e.g. label everyone's e-mail
  messages with "X-Charset: ASCII" or something. So such labels came in
  practice to mean 'any character set'. The solution was to ignore that
  label and get on with life. Maybe a program had to be tweaked a bit
  to achieve that, but usually not. And it might or might not be
  possible to configure a program to label things correctly, but since
  everyone ignored the label anyway, who cared?
  
  Then 8-bit character sets and MIME arrived, and the same thing
  happened again: 'Content-Type: text/plain; charset=iso-8859-1' came to
  mean 'any character set or encoding'. After all, programmers knew
  that this was the charset everyone was using if they were not using
  ASCII. This time it couldn't even be blamed on poor programmers: If I
  remember correctly, MIME says the default character set is ASCII, so
  programs _have_ to label 8-bit messages with a charset even if they
  have no idea which charset is in use. Programs can make the charset
  configurable, of course, but most users didn't know or care about such
  things, so that was really no help.
  
  Fortunately, most programs just displayed the raw bytes and ignored
  the charset, so it was easy to stay with the old solution of ignoring
  charset labels and get on with life. Same with e.g. the X window
  system: Parts of it (cut&paste buffers? Don't remember) were defined to
  work with latin-1, but NS_4551-1 fonts worked just fine. Of course,
  if we pasted from an NS_4551-1 window to a latin-1 window we got
  {|}, but that was what we had learned to expect anyway. I don't
  remember if one had to do some tweaking to convince X not to get
  clever, but I think not.
  
  Locales arrived too, and they might be helpful - except several
  implementations were so buggy that programs crashed or misbehaved if
  one turned them on. Also, it might or might not be possible to deduce
  which character set was in use from the names of the locales. So, on
  many machines, ignore them and move on.
  
  Then UTF-8 arrived, and things got messy. We actually began to need
  to deal with different encodings as well as character sets.

  UTF-8 texts labeled as iso-8859-1 (these still occur once in a while)
  have to be decoded, it's not enough to switch the window's font if the
  charset is wrong. Programs expecting UTF-8 would get a parse error on
  iso-8859-1 input, it was not enough to change font.

  There is a Linux box I'm sometimes doing remote logins to which I
  can't figure out how to display non-ASCII characters. It insists that
  I'm using UTF-8. My X11 font is latin-1. I can turn off the
  locale settings, but then 8-bit characters are not displayed at all.
  I'm sure there is some way to fix that, but I haven't bothered to find
  out. I didn't need to dig around in manuals to find out that sort of
  thing before.

  I remember we had 3 LDAPv2 servers running for a while - one with
  UTF-8, one with iso-8859-1, and one with T.61, which is the character
  set which the LDAPv2 standard actually specified. Unless the third
  server used NS_4551-1; I don't remember.
  
  I've mentioned elsewhere that I had to downgrade Perl5.8 to a Unicode-
  unaware version when my programs crashed. There was a feature to turn
  off Unicode, but it didn't work. It seems to work in later versions.
  Maybe it's even bug-free this time. I'm not planning to find out,
  since we can't risk that these programs produce wrong output.

  And don't get me started on Emacs MULE, a charset solution so poor
  that from what I hear even Microsoft began to abandon it a decade
  earlier (code pages). For a while the --unibyte helped, but after a
  while that got icky. Oh well, most of the MULE bugs seem to be gone
  now, after - is it 5 years?

  The C language recently got both 8-bit characters and Unicode tokens
  and literals (\unnnn). As far as I can tell, what it didn't get was
  any provision for compilers and linkers which don't know which
  character set is in use and therefore can't know which native
  character should be translated to which Unicode character or vice
  versa. So my guess is that compilers will just pick a character set
  which seems likely if they aren't told. Or use the locale, which may
  have nothing at all to do with which character set the program source
  code is using. I may be wrong there, though; I only remember some of
  the discussions on comp.std.c, I haven't checked the final standard.
</rant>

Of course, there are a lot of good sides to the story too - even locales
got cleaned up a lot, for example. And you'd get a very different story
from people in different environments (e.g. multi-cultural ones) or with
different operating systems even in Norway, but you already know that.

-- 
Hallvard