Re: XML::Simple and utf8 woes



corff@xxxxxxxxxxxxxxxxxx wrote:

If, in a fit of desperation, I modify the output of XMLout() with
NumericEscape=>2, all I get in the output is that, e.g., the a umlaut of
Morgendämmerung (sorry for this encoding-independent symbolic
notation here!) is represented as &#195;&#164;, which happens to be the
decimal values of the two octets comprising U+00e4, or Latin small a
with umlaut.
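
The symptom described above can be reproduced with Encode alone. This is only a sketch of the effect, not of what XMLout() does internally: it assumes the string reached the escaping step as UTF-8 octets rather than as decoded characters, and the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode ();

# One character: U+00E4, Latin small a with umlaut.
my $char = "\x{e4}";

# The same character as UTF-8 octets: 0xC3 0xA4 (decimal 195 and 164).
my $octets = Encode::encode('UTF-8', $char);

# Escape every element of the string as a decimal character reference.
my $per_char  = join '', map { sprintf '&#%d;', ord } split //, $char;
my $per_octet = join '', map { sprintf '&#%d;', ord } split //, $octets;

print "$per_char\n";     # escaping the character:  &#228;
print "$per_octet\n";    # escaping the two octets: &#195;&#164;
```

If the output shows the two-octet form, the data was probably still undecoded bytes when it was escaped.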

I've been following this thread because I have been struggling with XML::Simple writing and reading back an XML file in cp932 encoding. NumericEscape is what resolved the writing. Getting XML::Simple to read the file back properly required setting the encoding in the XML declaration of the cp932-encoded file to x-sjis-cp932, which took me a while to figure out :-(.

#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;
use Data::Dumper;
use Encode qw(:all);

my $file = $ARGV[0];
my $outfile = "cp932out.xml";

# Read the source as UTF-8 and write the result through a cp932 layer.
open my $utf8in, "<:encoding(utf8)", $file or die "In $file: $!";
open my $cp932out, ">:encoding(cp932)", $outfile or die "Out $outfile: $!";

my $utf8So = XMLin($utf8in, KeepRoot => 1, ForceArray => 1, SuppressEmpty => undef);
print Dumper($utf8So);

# NumericEscape => 1 is what lets the written file parse again; the
# declaration names the encoding as x-sjis-cp932.
XMLout($utf8So, OutputFile => $cp932out,
       AttrIndent => 1, KeepRoot => 1,
       NumericEscape => 1,
       XMLDecl => "<?xml version='1.0' encoding='x-sjis-cp932'?>");

close $utf8in;
close $cp932out;

# Read the cp932 file back to confirm that XML::Simple can parse it.
open my $cp932in, "<:encoding(cp932)", $outfile or die "XML In $outfile: $!";
my $cp932So = XMLin($cp932in, ForceArray => ['Line_Items'], SuppressEmpty => undef);
print Dumper($cp932So);

Without NumericEscape in the XMLout call, I get the following error when writing the cp932-encoded data:
not well-formed (invalid token) at line 75, column 41, byte 3001 at /opt/perl/lib/site_perl/5.8.0/PA-RISC1.1-thread-multi/XML/Parser.pm line 185
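
For readers unfamiliar with the option: as I understand the XML::Simple documentation, NumericEscape => 1 escapes characters above 0xFF and NumericEscape => 2 escapes characters above 0x7F. The sketch below imitates that behaviour with a plain substitution. The helper numeric_escape is hypothetical and is not XML::Simple's own code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of XML::Simple: replace characters above a
# threshold with decimal numeric character references.
sub numeric_escape {
    my ($text, $level) = @_;
    if ($level == 2) {
        # Escape everything outside 7-bit ASCII.
        $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    }
    elsif ($level == 1) {
        # Escape only characters above 0xFF.
        $text =~ s/([^\x00-\xFF])/'&#' . ord($1) . ';'/ge;
    }
    return $text;
}

# "cafe" with e-acute (U+00E9), a space, and the kanji U+65E5.
my $text = "caf\x{e9} \x{65e5}";

my $level1 = numeric_escape($text, 1);    # only U+65E5 is escaped
my $level2 = numeric_escape($text, 2);    # U+00E9 and U+65E5 are escaped

print "$level2\n";
```

Once a character is written as a numeric reference, only ASCII reaches the output layer for it, so the declared encoding no longer matters for that character.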

My first attempt was to just use IO layers:

open my $utf8in, "<:encoding(utf8)", $file or die "In $file: $!";
open my $cp932out, ">:encoding(cp932)", $outfile or die "Out $outfile: $!";

# Discard the original XML declaration and write one that names cp932.
my $fline = <$utf8in>;
print $cp932out qq~<?xml version='1.0' encoding='x-sjis-cp932'?>~;
# Copy the rest of the file, letting the layers transcode UTF-8 to cp932.
while (<$utf8in>) { print $cp932out $_; }

# Close both handles so the output is flushed before it is read back.
close $utf8in;
close $cp932out;

open my $cp932in, "<:encoding(cp932)", $outfile or die "XML In $outfile: $!";
my $cp932So = XMLin($cp932in, ForceArray => ['Line_Items'], SuppressEmpty => undef);
print Dumper($cp932So);

This results in:
not well-formed (invalid token) at line 37, column 35, byte 886 at /opt/perl/lib/site_perl/5.8.0/PA-RISC1.1-thread-multi/XML/Parser.pm line 185
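
The error message does not say which character the parser rejected. One thing that may be worth ruling out with the IO-layer approach is input containing characters that have no cp932 mapping at all. The sketch below checks for that with Encode; the helper fits_cp932 is hypothetical, and this is a diagnostic idea rather than a confirmed cause of the error.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if every character in $text has a cp932 mapping.
sub fits_cp932 {
    my ($text) = @_;
    my $copy = $text;    # encode() may modify its argument when CHECK is set
    return eval { Encode::encode('cp932', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Two kanji (U+65E5 U+672C), then a word containing U+00E4.
for my $sample ("\x{65e5}\x{672c}", "Morgend\x{e4}mmerung") {
    print fits_cp932($sample) ? "fits cp932\n" : "has unmappable characters\n";
}
```

Running such a check over each input line before printing it would show whether the file contains anything the cp932 layer cannot represent.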

Cheers

Dennis