Re: Writing UTF-8 string to UNICODE file

From: Francis Avila (francisgavila_at_yahoo.com)
Date: 11/11/03


Date: Tue, 11 Nov 2003 13:43:20 -0500


"Michael Weir" <mweir@transres.com> wrote in message
news:4e9sb.154$s8.2312@news.on.tac.net...
> I'm sure this is a very simple thing to do, once you know how to do it, but
> I am having no fun at all trying to write utf-8 strings to a unicode file.
> Does anyone have a couple of lines of code that
> - opens a file appropriately for output
> - writes to this file
> Thanks very much.
> Michael Weir

I don't quite understand, since you seem to be talking about "unicode" as if
it were a distinct encoding. Unicode is not an encoding, but a mapping of
numbers (code points) to abstract characters (letters, digits, whatever).
There's no such thing as a "unicode file", strictly speaking, because a file
is a byte stream and unicode by itself says nothing about bytes. Of course,
loosely speaking, "unicode file" means "a file which uses one of those
byte-stream encodings (utf-8, utf-16, ...) by which any unicode code point
can be represented."
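
To make the code point v. byte distinction concrete, here is a small sketch.
It assumes a current Python 3, where every str is already a unicode string;
the sample text is only an illustration.

```python
# One sequence of code points, three different byte streams.
text = "na\u00efve"   # 5 code points; U+00EF is "i" with a diaeresis

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
latin1_bytes = text.encode("latin-1")

print(len(text))          # number of code points
print(len(utf8_bytes))    # U+00EF needs two bytes in utf-8
print(len(utf16_bytes))   # two bytes per code point for this text
print(len(latin1_bytes))  # one byte per code point

# Decoding with the matching codec gets the same code points back.
assert utf8_bytes.decode("utf-8") == text
```

The string is the same in every case; only the serialization differs.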

If you mean, "how do I encode a unicode string as utf-8", do it like this:

>>> u"I'm a unicode string in utf-8 encoding.".encode('utf-8')
"I'm a unicode string in utf-8 encoding."

This serializes an ordered collection of unicode code points into a byte
stream, using the encoding method "utf-8". You want to write this byte
stream to a file? Go right ahead.
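
A minimal sketch of "encode, then write", assuming a current Python 3 (where
encode() returns a bytes object). The file name is made up for the example
and lives in a temporary directory.

```python
import os
import tempfile

text = "I'm a unicode string with an accent: caf\u00e9"
data = text.encode("utf-8")          # code points -> byte stream

# Hypothetical file name, placed in a temp directory for the example.
path = os.path.join(tempfile.mkdtemp(), "example_utf8.txt")

# Binary mode: the bytes go to disk exactly as encoded.
with open(path, "wb") as f:
    f.write(data)

# Reading the raw bytes back and decoding recovers the original text.
with open(path, "rb") as f:
    round_trip = f.read().decode("utf-8")
```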

If you write a unicode string to something that wants a byte stream, Python
will not choose utf-8 for you. With an ordinary file opened in text mode it
tries to encode the string with the default encoding (normally ascii), and
you get a UnicodeEncodeError as soon as a non-ascii character turns up. I
doubt this is what you want. You have to encode the unicode string first.
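
On a current Python 3 (an assumption; it is not the Python this thread was
written against), handing text to a byte stream is rejected outright, which
makes the point rather nicely. io.BytesIO stands in for a file opened in
binary mode in this sketch.

```python
import io

buf = io.BytesIO()   # stands in for a file opened in binary mode

try:
    buf.write("caf\u00e9")           # text, not bytes
    rejected = False
except TypeError:
    rejected = True                  # Python 3 refuses to guess an encoding

# Encoding explicitly is what makes the write legal.
buf.write("caf\u00e9".encode("utf-8"))
```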

To avoid having to do explicit conversions for every unicode string you want
to write to a file, use codecs.open to open the file. This will wrap all
reads/writes in an encoder/decoder, and all reads will give you a unicode
string. However, I don't think you'll be able to write raw byte streams
anymore--even normal strings will be reencoded. Also, be sure not to
accidentally open the file using file() later--you'll be reading and writing
raw byte streams, and will make a big mess of things.
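
Roughly what the codecs.open approach looks like, as a sketch rather than a
recipe. It is written for a current Python 3, where the built-in
open(path, mode, encoding="utf-8") does the same job; the file name is
hypothetical.

```python
import codecs
import os
import tempfile

# Hypothetical file name in a temp directory, just for the example.
path = os.path.join(tempfile.mkdtemp(), "example_codecs.txt")

# Writes are encoded to utf-8 on the way out.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9\n")

# Reads are decoded on the way in, so you get a unicode string back.
with codecs.open(path, "r", encoding="utf-8") as f:
    decoded = f.read()

# Opening the same file without a codec shows the bytes actually on disk.
with open(path, "rb") as f:
    raw = f.read()
```

The last read is the "raw byte stream" view: the accented character shows up
as two bytes, which is why mixing the two ways of opening the file gets messy.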

Perhaps Python should have all "strings" be unicode strings, and make a
distinct "byte stream" type? This might make the "codepoint v.
representation" distinction cleaner and more explicit, and allow us to go
raw if we really want (although, mixing text and binary in a single file
isn't such a good idea). It'd also be incredibly messy to change things,
and less efficient if all you do is ascii text all day. Oh well.

--
Francis Avila

