Re: XML::Simple XMLIn() and odd chars



On 08/05/2008 03:37 PM, dr fence wrote:
I broke it down to a small example. I don't understand why I don't get the same output on both systems.

--------- <OUTPUT System A> ----------------

perl tiny_xml.pl tiny.xml
reading tiny.xml
And They're Off<br>2008
$VAR1 = {
'title' => "And They\x{e2}\x{80}\x{99}re Off<br>2008"
};

[A]$ cat tiny.txt
And They're Off<br>2008
[...]

\x{e2}\x{80}\x{99} is the UTF-8 byte sequence for the Unicode character \x{2019} (’), the right single quotation mark, which is often used as a typographic apostrophe; the plain ASCII apostrophe is \x{27} ('). For some reason, your input file is getting the typographic character in it. When I copied and pasted your tiny.xml, I didn't get that character. The copy of tiny.xml that I have base64-encodes to this:

PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iaXNvLTg4NTktMSI/Pgo8Ym9vaz4KICAgIDx0
aXRsZT5BbmQgVGhleSdyZSBPZmYmbHQ7YnImZ3Q7MjAwODwvdGl0bGU+CjwvYm9vaz4K
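
Both claims can be checked with a few lines of Perl. This is only a sketch, assuming the core Encode and MIME::Base64 modules are installed; the variable names are mine and not part of your script. It decodes the three bytes as UTF-8, then decodes the base64 text above and counts any bytes outside 7-bit ASCII:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use MIME::Base64 qw(decode_base64);

# The three bytes Data::Dumper printed; read as UTF-8 they form one character.
my $bytes = "\xe2\x80\x99";
my $char  = decode('UTF-8', $bytes);
printf "%d bytes -> %d character, U+%04X\n",
    length($bytes), length($char), ord($char);

# The base64 text quoted above, decoded back into the XML file contents.
my $xml = decode_base64(
    'PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iaXNvLTg4NTktMSI/Pgo8Ym9vaz4KICAgIDx0'
  . 'aXRsZT5BbmQgVGhleSdyZSBPZmYmbHQ7YnImZ3Q7MjAwODwvdGl0bGU+CjwvYm9vaz4K'
);

# Count bytes outside 7-bit ASCII; a typographic quote would show up here.
my $non_ascii = () = $xml =~ /[^\x00-\x7f]/g;
print "non-ASCII bytes in my copy: $non_ascii\n";
```

Running the same non-ASCII count against your own tiny.xml should show whether the typographic quote is really in the file on disk.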

Also, the shebang line (first line) of your tiny_xml.pl script was wrong (missing a "!"). These files produce what I think is the right output on my system:

--------------file:tiny_xml.pl--------------
#!/usr/bin/perl
use XML::Simple;
use Data::Dumper;
use CGI qw/header/;

print header(
'-content-type' => 'text/plain',
-charset => 'iso-8859-1',
);
my $xs = XML::Simple->new();
my $filename = 'tiny.xml';
print "reading $filename\n";
my $xml = $xs->XMLin($filename);

print "$xml->{title}\n";
print Dumper($xml);

my $tinytxt = '/dev/shm/tiny.txt';
open RF, '>', $tinytxt or die("open failed: $!\n");
print RF "\$xml->{title}: $xml->{title}\n";
close RF;
chmod 0666, $tinytxt;

------------file:tiny.xml----------------
<?xml version="1.0" encoding="iso-8859-1"?>
<book>
<title>And They're Off&lt;br&gt;2008</title>
</book>

------------OUTPUT---------------------
HTTP/1.1 200 OK
Date: Tue, 05 Aug 2008 22:45:15 GMT
Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch11 mod_perl/2.0.2 Perl/v5.8.8
Connection: close
Content-Type: text/plain; charset=iso-8859-1

reading tiny.xml
And They're Off<br>2008
$VAR1 = {
'title' => 'And They\'re Off<br>2008'
};
------------end--------------

So you seem to have two problems: \x{2019} appears where you don't want it to, and system B is not set up to display UTF-8 correctly. The second is not a major problem, since the output is supposed to go to a browser, not the console. Just make sure that the locale en_US.UTF-8 is enabled, run the script from the webserver, and provide the proper HTTP header specifying the charset utf-8.
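
As a rough sketch of what a matching header and body look like, the following builds a response whose declared charset agrees with the bytes actually sent. It assumes the title has already been decoded into a real \x{2019} character; utf8_response is a hypothetical helper, and the header is written by hand only to keep the example free of dependencies (CGI's header() serves the same purpose):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a complete CGI response as UTF-8 bytes.
sub utf8_response {
    my ($text) = @_;
    # The charset declared here must match the encoding of the body below.
    my $header = "Content-Type: text/plain; charset=utf-8\r\n\r\n";
    return $header . encode('UTF-8', $text);
}

# Assumed sample title containing a real U+2019 character.
my $title = "And They\x{2019}re Off";

binmode STDOUT;    # the body is already encoded, so write raw bytes
print utf8_response("$title\n");
```

If the string still holds the three separate characters \x{e2}\x{80}\x{99}, encoding it again would double-encode it, so the source of those characters needs fixing first.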

If you wish to run from an X-terminal for debugging purposes (like I do), then you'll need to set LANG=en_US.UTF-8 and start X under that. A terminal emulator that can handle unicode (such as urxvt) is also a good idea.