Unicode characters, XML/RSS



So I wrote a little video podcast downloading script that checks a
list of RSS feeds and downloads any new videos. Every once in a while
it finds a character outside the ASCII range (0-127) in a feed and my
script blows up:

Traceback (most recent call last):
  File "C:\Users\Adam\Desktop\Rev3 DL\Rev3.py", line 88, in <module>
    mainloop()
  File "C:\Users\Adam\Desktop\Rev3 DL\Rev3.py", line 75, in mainloop
    update()
  File "C:\Users\Adam\Desktop\Rev3 DL\Rev3.py", line 69, in update
    couldhave = getshowlst(x[1],episodecnt)
  File "C:\Users\Adam\Desktop\Rev3 DL\Rev3.py", line 30, in getshowlst
    masterlist = XMLWorkspace.parsexml(url)
  File "C:\Users\Adam\Desktop\Rev3 DL\XMLWorkspace.py", line 54, in parsexml
    parse(url, FeedHandlerInst)
  File "C:\Python25\lib\xml\sax\__init__.py", line 33, in parse
    parser.parse(source)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "C:\Users\Adam\Desktop\Rev3 DL\XMLWorkspace.py", line 51, in characters
    self.data.append(string)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 236: ordinal not in range(128)


Now it's my understanding that XML can contain non-ASCII Unicode
characters as long as the encoding is specified, which it is (UTF-8).
The feed passes every validator I've run it through, and every program
I open it with seems to be OK with it, except my Python script. Why?
Here is the URL of the feed in question: http://revision3.com/winelibraryreserve/
My script is complaining about the accented e (è) in Mourvèdre.

At first glance I thought it was the data.append(string) call that was
not accepting the Unicode, but even if I put a return in the characters
handler, it still breaks. What am I doing wrong?
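
For reference, here is a minimal sketch of the situation, not the actual script. It assumes a tiny stand-in document instead of the real feed, and the names FEED, TitleHandler and parse_title are made up for illustration. It suggests that the SAX parser hands the handler already-decoded text, that appending that text to a list involves no encoding at all, and that the same 'ascii' codec error only appears once the text is encoded to ASCII somewhere (explicitly here; in Python 2 this can also happen implicitly, for example when mixing unicode with byte strings or writing to an ASCII stream).

```python
import xml.sax

# Hypothetical stand-in for the real feed: a tiny UTF-8 document that
# contains the same non-ASCII character (e with grave accent, U+00E8).
FEED = u'<?xml version="1.0" encoding="UTF-8"?><title>Mourv\xe8dre</title>'.encode("utf-8")


class TitleHandler(xml.sax.ContentHandler):
    """Collects character data exactly as the parser delivers it."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.data = []

    def characters(self, content):
        # content is already decoded text; appending it to a list
        # does not encode anything.
        self.data.append(content)


def parse_title(raw):
    """Parse raw XML bytes and return the collected text."""
    handler = TitleHandler()
    xml.sax.parseString(raw, handler)
    # The parser may deliver text in several chunks, so join them.
    return u"".join(handler.data)


title = parse_title(FEED)

# Encoding explicitly as UTF-8 works for any character.
utf8_bytes = title.encode("utf-8")

# Encoding as ASCII is what raises UnicodeEncodeError for U+00E8.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

If this sketch matches the real case, the place to look would be wherever the collected text is later converted to a byte string (print, str(), file writes, string concatenation), rather than the append itself.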