Re: Problem reading file with umlauts



Thanks a lot. Now I am one step further but I get another strange error:

Traceback (most recent call last):
  File "./read.py", line 12, in <module>
    of.write(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

According to Google, u'\ufeff' is the byte order mark (BOM), so it has something to do with byte order.

I use a Linux system; maybe this helps to find the error.
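
For reference, here is a minimal sketch of one way around this error. It assumes the input really is UTF-8 with a leading BOM, which is what the u'\ufeff' at position 0 suggests. The helper names read_text and write_text are made up for illustration and are not part of read.py. The "utf-8-sig" codec drops a leading BOM while decoding, and opening the output with an explicit encoding avoids the implicit ASCII encode that raises the UnicodeEncodeError.

```python
# -*- coding: utf-8 -*-
import io


def read_text(path, encoding="utf-8-sig"):
    """Read a file as unicode text; 'utf-8-sig' drops a leading BOM if present."""
    with io.open(path, encoding=encoding) as f:
        return f.read()


def write_text(path, text, encoding="utf-8"):
    """Write unicode text with an explicit encoding instead of the ASCII default."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)
```

If the file turns out to be Latin1 after all, passing encoding="latin-1" to read_text would be the thing to try instead; a Latin1 file has no BOM to strip.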

Claus

Claus Hausberger wrote:

I have a text file which is encoded in Latin1 (ISO-8859-1). I can't
change that as I do not create those files myself. I have to read
those files and convert the umlauts like ö to stuff like &ouml; as
the text files should become html files.
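
A rough sketch of that conversion, under the assumption that the input is Latin1 as described: decode the file to unicode, then replace every non-ASCII character with its named HTML entity, falling back to a numeric reference where no name exists. The function names to_html_entities and convert_file are hypothetical; the entity table comes from the standard library (htmlentitydefs on Python 2, html.entities on Python 3).

```python
# -*- coding: utf-8 -*-
import io

try:
    from html.entities import codepoint2name   # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2


def to_html_entities(text):
    """Replace each non-ASCII character with a named or numeric HTML entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                                # plain ASCII stays as it is
        elif cp in codepoint2name:
            out.append(u"&%s;" % codepoint2name[cp])      # e.g. &ouml;
        else:
            out.append(u"&#%d;" % cp)                     # numeric fallback
    return u"".join(out)


def convert_file(src, dst, encoding="latin-1"):
    """Decode src with the given encoding and write a pure-ASCII dst."""
    with io.open(src, encoding=encoding) as f:
        text = f.read()
    with io.open(dst, "w", encoding="ascii") as f:
        f.write(to_html_entities(text))
```

With this, to_html_entities(u"Röhre") gives u"R&ouml;hre". Note that the sketch does not escape &, < or >, which a real HTML conversion would probably also need.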

umlaut-in.txt:
----
This file is contains data in the unicode
character set and is encoded with utf-8.
Viele Röhre. Macht spaß! Tsüsch!


umlaut-in.txt hexdump:
----
000000: 54 68 69 73 20 66 69 6C 65 20 69 73 20 63 6F 6E This file is con
000010: 74 61 69 6E 73 20 64 61 74 61 20 69 6E 20 74 68 tains data in th
000020: 65 20 75 6E 69 63 6F 64 65 0D 0A 63 68 61 72 61 e unicode..chara
000030: 63 74 65 72 20 73 65 74 20 61 6E 64 20 69 73 20 cter set and is
000040: 65 6E 63 6F 64 65 64 20 77 69 74 68 20 75 74 66 encoded with utf
000050: 2D 38 2E 0D 0A 56 69 65 6C 65 20 52 C3 B6 68 72 -8...Viele R..hr
000060: 65 2E 20 4D 61 63 68 74 20 73 70 61 C3 9F 21 20 e. Macht spa..!
000070: 20 54 73 C3 BC 73 63 68 21 0D 0A 00 00 00 00 00 Ts..sch!.......


umlaut.py:
----
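# -*- coding: utf-8 -*-
import codecs

# Decode the UTF-8 input into a unicode string.
text = codecs.open("umlaut-in.txt", encoding="utf-8").read()
# Transliterate the German characters to plain ASCII.
text = text.replace(u"ö", u"oe")
text = text.replace(u"ß", u"ss")
text = text.replace(u"ü", u"ue")
# Plain open() encodes unicode as ASCII on write (Python 2); this only
# works because no non-ASCII characters are left at this point.
of = open("umlaut-out.txt", "w")
of.write(text)
of.close()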


umlaut-out.txt:
----
This file is contains data in the unicode
character set and is encoded with utf-8.
Viele Roehre. Macht spass! Tsuesch!


umlaut-out.txt hexdump:
----
000000: 54 68 69 73 20 66 69 6C 65 20 69 73 20 63 6F 6E This file is con
000010: 74 61 69 6E 73 20 64 61 74 61 20 69 6E 20 74 68 tains data in th
000020: 65 20 75 6E 69 63 6F 64 65 0D 0D 0A 63 68 61 72 e unicode...char
000030: 61 63 74 65 72 20 73 65 74 20 61 6E 64 20 69 73 acter set and is
000040: 20 65 6E 63 6F 64 65 64 20 77 69 74 68 20 75 74 encoded with ut
000050: 66 2D 38 2E 0D 0D 0A 56 69 65 6C 65 20 52 6F 65 f-8....Viele Roe
000060: 68 72 65 2E 20 4D 61 63 68 74 20 73 70 61 73 73 hre. Macht spass
000070: 21 20 20 54 73 75 65 73 63 68 21 0D 0D 0A 00 00 ! Tsuesch!.....





--
"The ability of the OSS process to collect and harness
the collective IQ of thousands of individuals across
the Internet is simply amazing." - Vinod Valloppillil
http://www.catb.org/~esr/halloween/halloween4.html



