Re: UnicodeDecodeError help please?
- From: "Fredrik Lundh" <fredrik@xxxxxxxxxxxxxx>
- Date: Fri, 7 Apr 2006 18:52:52 +0200
Robin Haswell wrote:
> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
Python 2.4.1 (#2, May 6 2005, 11:22:24)
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
that's bad. do not hack the default encoding. it'll only make you sorry
when you try to port your code to some other python installation, or use
a library that relies on the factory settings being what they're supposed
to be. do not hack the default encoding.
back to your code:
>>> import htmlentitydefs
>>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML &copy; - a copyright symbol
>>> print char
©
that's a standard (8-bit) string:
>>> type(char)
<type 'str'>
>>> ord(char)
169
>>> len(char)
1
one byte that contains the value 169. looks like ISO-8859-1 (Latin-1) to me.
let's see what the documentation says:
    entitydefs
        A dictionary mapping XHTML 1.0 entity definitions to their replacement
        text in ISO Latin-1.
alright, so it's an ISO Latin-1 string.
>>> str = u"Apple"
>>> print str
Apple
>>> type(str)
<type 'unicode'>
>>> len(str)
5
that's a 5-character unicode string.
>>> str + char
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
you're trying to combine an 8-bit string with a Unicode string, and you've
told Python (by hacking the site module) to treat all 8-bit strings as if they
contain UTF-8. UTF-8 != ISO-Latin-1.
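to see the mismatch concretely, here's a minimal sketch. the variable names are made up for illustration, and it uses byte literals, so it assumes Python 2.6 or later rather than the 2.4 shown above. it names the codecs explicitly instead of leaning on the default encoding. the byte 0xa9 is a complete character in Latin-1, but in UTF-8 it can only appear as a continuation byte, so decoding it on its own fails:

```python
# the single byte that entitydefs["copy"] holds in the session above
raw = b"\xa9"

# as Latin-1, one byte maps to exactly one character (U+00A9)
as_latin1 = raw.decode("iso-8859-1")

# as UTF-8, a lone 0xa9 is not a valid sequence
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# the same character takes two bytes when encoded as UTF-8
utf8_bytes = as_latin1.encode("utf-8")
```

here `utf8_ok` ends up False, and `utf8_bytes` is the two-byte sequence 0xc2 0xa9.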
so, you can of course convert the string you got from the entitydefs dict
to a unicode string before you combine the two strings:
>>> unicode(char, "iso-8859-1") + str
u'\xa9Apple'
but the htmlentitydefs module offers a better alternative:
    name2codepoint
        A dictionary that maps HTML entity names to the Unicode
        codepoints. New in version 2.3.
which allows you to do
>>> char = unichr(htmlentitydefs.name2codepoint["copy"])
>>> char
u'\xa9'
>>> char + str
u'\xa9Apple'
without having to deal with things like
>>> len(htmlentitydefs.entitydefs["copy"])
1
>>> len(htmlentitydefs.entitydefs["rarr"])
7
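the name2codepoint lookup can be wrapped in a small helper. this is only a sketch: `entity_to_unicode` is a hypothetical name, not something the module provides, and the fallback import and the `unichr = chr` alias are assumptions added so the same code also runs on Python 3, where the module is called html.entities.

```python
try:
    # Python 2, as used in this thread
    from htmlentitydefs import name2codepoint
except ImportError:
    # Python 3 moved the same table to html.entities
    from html.entities import name2codepoint

try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # Python 3: chr() already returns a unicode character


def entity_to_unicode(name):
    """Return the one-character unicode string for an entity name, or None.

    Hypothetical helper for illustration; unknown names give None
    instead of raising KeyError.
    """
    codepoint = name2codepoint.get(name)
    if codepoint is None:
        return None
    return unichr(codepoint)
```

every known entity then comes back as a single unicode character, so "copy" and "rarr" no longer need different handling.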
> Basically my app is a search engine - I'm grabbing content from pages
> using HTMLParser and storing it in a database but I'm running in to these
> problems all over the shop (from decoding the entities to calling
> str.lower()) - I don't know what encoding my pages are coming in as, I'm
> just happy enough to accept that they're either UTF-8 or latin-1 with
> entities.
UTF-8 and Latin-1 are two different things, so your (international) users
will hate you if you don't do this right.
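one way to "do this right" is to decode every page to unicode once, at the point where the bytes come in, and work with unicode strings from then on. the sketch below is an assumption about how that could look, not something prescribed in this thread: `to_unicode` and its `declared` parameter are hypothetical names, and the "try UTF-8, then fall back to Latin-1" order is a common heuristic, not a guarantee of the right answer.

```python
def to_unicode(raw, declared=None):
    """Decode page bytes to a unicode string (hypothetical helper).

    Tries the declared charset first (if any), then UTF-8, and finally
    falls back to ISO-8859-1, which accepts every possible byte.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    candidates.append("utf-8")

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # wrong guess, or a charset name Python does not know
            continue

    # last resort: never raises, but may be the wrong interpretation
    return raw.decode("iso-8859-1")
```

guessing is a last resort. if the HTTP headers or a meta tag declare a charset, pass it in as `declared`. once the text is unicode, operations like lower() work on characters and not on raw bytes.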
> It's even worse that I've written the same app in PHP before with none of
> these problems - and PHP4 doesn't even support Unicode.
a PHP4 application without I18N problems? I'm not sure I believe you... ;-)
</F>