Re: LWP and Unicode
- From: Ben Morrow <benmorrow@xxxxxxxxxxxxx>
- Date: Mon, 2 Oct 2006 22:40:23 +0100
Quoth "Dale" <dale.gerdemann@xxxxxxxxxxxxxx>:
I have a couple of questions/problems concerning LWP and
Unicode. Here's an ultra-simple program that goes to a web page,
downloads its contents and prints them out in a semi-readable form:
----------------------------------
#!/.../perl-5.8.8/bin/perl -CSDA
I presume this isn't your real #! line...
Do you know what -CSDA does? In this case it is useless, unless it
interferes with LWP's filehandle encodings. It is probably best avoided
until you understand Perl's (slightly odd) Unicode handling better.
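For reference, the effect of -CSDA can be approximated inside the script itself, which makes it easier to see what is being switched on. This is only a rough sketch: the "use open" pragma is lexically scoped whereas -C is global, and the @args variable is a hypothetical name used for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Roughly -CS and -CD: put a UTF-8 layer on STDIN, STDOUT and STDERR,
# and make it the default for handles opened later in this scope.
use open qw(:std :utf8);

# Roughly -CA: treat the command-line arguments as UTF-8 text.
my @args = map { decode('UTF-8', $_) } @ARGV;

# Show which PerlIO layers ended up on STDOUT.
my @layers = PerlIO::get_layers(*STDOUT);
print "STDOUT layers: @layers\n";
```

Spelling the layers out like this also makes it obvious when a handle already encodes on output, which matters for the questions that follow.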
use utf8;
use LWP;
use Encode;
use URI::Escape;
my $browser = LWP::UserAgent->new;
$browser->parse_head(0);
my $url = 'http://bg.wiktionary.org/wiki/LotsaCyrillic';
Please don't post 8-bit data (including UTF8) to Usenet unless the
group's charter explicitly permits it.
my $response = $browser->get(encode("utf8", $url));
my $content = decode("utf8", uri_unescape($response->content));
print "$content\n";
----------------------------------
Question 1: Why do I need the line that says
$browser->parse_head(0);
You don't. The docs for this are (surprisingly) in perldoc
LWP::UserAgent.
Question 2: Why do I need to explicitly say:
decode("utf8", ...)
Isn't there a way to tell LWP that the content is utf8? Or more
precisely, that it is utf8 with some URI percent escapes.
Not AFAIK. Note that the order does matter here: the percent escapes
stand for bytes, so when they encode UTF-8 text you have to uri_unescape
before you decode, as your program already does.
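The ordering can be checked on a small sample. The sketch below is an illustration only: unescape_percent is a hypothetical stand-in for uri_unescape (assumed to do the same basic %XX substitution), and the sample string is made up. Under those assumptions the two orders agree on the raw text but differ on the percent-escaped part, which suggests unescaping the bytes first when the escapes themselves encode UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for URI::Escape::uri_unescape:
# turn each %XX into the single byte/character with that value.
sub unescape_percent {
    my ($string) = @_;
    $string =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $string;
}

# Render a string as a list of code points, so the output stays ASCII.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

# Made-up sample: U+0431 once as percent escapes, once as raw UTF-8 bytes.
my $bytes = "%D0%B1 \xD0\xB1";

my $unescape_first = decode('UTF-8', unescape_percent($bytes));
my $decode_first   = unescape_percent(decode('UTF-8', $bytes));

print "unescape, then decode: ", codepoints($unescape_first), "\n";
print "decode, then unescape: ", codepoints($decode_first), "\n";
```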
Question 3: If you change the pragma "use utf8" to "use encoding
'utf8'" then you don't need the call to "decode("utf8", ...)". Why
should this be? What's the difference between "use utf8" and "use
encoding 'utf8'"? The perldoc:perlunicode is no help here.
The differences are
1. encoding supports many encodings.
2. encoding is probably negligibly slower.
3. encoding gives decent error recovery (as opposed to crashing
perl).
4. encoding sets a default PerlIO layer on STDIN and STDOUT, unless
you've already done so with the -C switch.
I can see no reason why the two should give different results in this
case; but perhaps your -CSDA is interfering.
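One way an output layer (from -C or from the encoding pragma) can change the result is double encoding: if undecoded UTF-8 bytes are printed through a UTF-8 layer, each byte is treated as a Latin-1 character and encoded again. A minimal sketch, using an in-memory filehandle in place of STDOUT; the helper name through_utf8_layer is an assumption made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Print a string through a UTF-8 output layer and return the bytes written.
sub through_utf8_layer {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer or die "open: $!";
    print {$fh} $string;
    close $fh or die "close: $!";
    return $buffer;
}

my $bytes = "\xD0\xB1";    # the UTF-8 encoding of U+0431

my $out_raw     = through_utf8_layer($bytes);                   # not decoded
my $out_decoded = through_utf8_layer(decode('UTF-8', $bytes));  # decoded first

printf "undecoded input: %d bytes written\n", length $out_raw;
printf "decoded input:   %d bytes written\n", length $out_decoded;
```

Only the decoded string comes back out as the original two bytes; the undecoded one grows, which is the usual sign of mojibake on the terminal.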
Question 4: In the original program, replace the line
my $content = decode("utf8", uri_unescape($response->content));
with
my $content = $response->content;
utf8::upgrade($content);
The perldoc:perlunicode page says you should do this when, for some
reason, Unicode does not happen. But this does nothing for me. I still
end up with bytes.
IMHO perlunicode is wrong in this regard :). The utf8::* functions are
part of the internal implementation of utf8-handling; users should never
have cause to use them.
As of 5.8, Perl strings have an internal flag that marks them as being
stored in utf8. What utf8::upgrade does is
1. If the string already has the UTF8 flag on, quit.
2. For every top-bit-set byte in the string:
3. Look up the appropriate character in ISO8859-1, and
4. Replace the byte with that character's 2-byte encoding in
utf8.
5. Set the UTF8 flag on the string, so that Perl now sees those
2-byte sequences as one character each.
The net result, from the Perl level, is that *absolutely nothing has
changed*. The *only* Perl-visible change is that utf8::is_utf8 now
returns true, even if it returned false before; but you *shouldn't be
concerned with that*.
The correct function for 'this bunch of bytes happens to be a piece of
UTF8-encoded text; decode it and give me a string containing those
characters' is Encode::decode, as you have established.
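The difference between the two functions can be seen on a two-byte sample. This is a sketch under the assumption that the input is the UTF-8 encoding of a single Cyrillic letter; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xD0\xB1";    # the UTF-8 encoding of U+0431

# utf8::upgrade only changes the internal representation.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Encode::decode interprets the bytes as UTF-8 text.
my $decoded = decode('UTF-8', $bytes);

printf "upgraded: length %d, same as the bytes: %s, UTF8 flag: %s\n",
    length $upgraded,
    ($upgraded eq $bytes ? 'yes' : 'no'),
    (utf8::is_utf8($upgraded) ? 'on' : 'off');

printf "decoded:  length %d, first character U+%04X\n",
    length $decoded, ord $decoded;
```

The upgraded string still holds two characters and compares equal to the original; only the decoded one holds the single character the bytes stood for.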
Ben
--
For far more marvellous is the truth than any artists of the past imagined!
Why do the poets of the present not speak of it? What men are poets who can
speak of Jupiter if he were like a man, but if he is an immense spinning
sphere of methane and ammonia must be silent?~Feynmann~benmorrow@xxxxxxxxxxxxx