Re: [XML::Simple-2.12] problems parsing non ASCII strings



On Tue, 12 Jul 2005 19:16:53 +0200, Michel Rodriguez wrote:

> Jul wrote:
>> module: XML::Simple-2.12 (also tried 2.14)
>> perl version: 5.00503
>
> Wahouh! Do you know how old this is? 5, 6 years old?

I know it's very, very old; that's why I mentioned it. I'm looking for a
way to trick it, as I did for the other perl 5.6 modules I use :o)
I guess we could sometimes rename "hosting solutions" to "hosting problems",
but that would be less attractive to the customer ;-)

>> I need to parse and write an XML configuration file which contains
>> non-ASCII characters (like 'é', in French). I've chosen XML::Simple
>> with XML::Parser for these tasks, but everything works fine if and only
>> if I do not include any special character in the file; otherwise the HASH
>> returned by XMLin() is totally messed up.
>
> What is the encoding of your file? My guess is that it is in either
> ISO-8859-1 (or -15) or some kind of windows-12nn
>
> What happens is that the data is read, probably by expat, and converted
> to UTF-8. The "totally messed up" characters are in fact perfectly valid
> UTF-8 characters, that your terminal (or whatever you use to display
> them) is not set to display.
>
> If XML::Simple can read it then the encoding must be declared in the XML
> declaration, at the beginning of the XML file.

The default encoding should be ISO-8859-1 or -15, which is why I
expected to retrieve the data in the same encoding.
With the encoding attribute set in the declaration it works better, you're
right, and I was surprised to see that UTF-8 is also supported, even
with perl 5.005 :-)
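
For anyone who finds this thread later, here is a small sketch of why the
declaration matters. It is written in Python purely for illustration (it is
not what I run on the perl 5.005 host), and the <config>/<name> element names
are made up. Python's standard XML parser also sits on top of expat, so it
should behave much like the situation described above: with the encoding
declared the bytes are decoded properly, without it the parser assumes UTF-8
and chokes on the lone 0xE9 byte.

```python
import xml.etree.ElementTree as ET

# Hypothetical config fragment containing 'é', stored as ISO-8859-1 bytes
# (in that encoding 'é' is the single byte 0xE9).
body = "<config><name>caf\u00e9</name></config>".encode("iso-8859-1")

# With the XML declaration, the parser knows how to decode the bytes.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
name = ET.fromstring(declared).findtext("name")

# Without it, the parser assumes UTF-8 and rejects the stray 0xE9 byte.
try:
    ET.fromstring(body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(repr(name), undeclared_ok)
```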

> Your choices are either to convert those characters back to the original
> encoding, look at the Unicode::* modules on CPAN, or to bite the Unicode
> bullet and learn how to work with UTF-8 data. In the long run the second
> option makes more sense, but YMMV.

Now the original character is displayed as ISO-8859-15, but encoded
as UTF-8. You're right again! lol
At this point, I wonder whether UTF-8 is the default charset, or whether
there is an option for this in XML::Simple or XML::Parser. I looked through
those modules' documentation but didn't find much.
Otherwise, I'll try to convert the data outside XML::Simple.
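
To make the "displayed as ISO-8859-15 but encoded as UTF-8" effect concrete,
here is a minimal sketch, again in Python only as an illustration and not as
the perl 5.005 solution (there I would still need one of the Unicode::*
modules mentioned above, or something similar). The variable names are mine.
It shows the two bytes that UTF-8 uses for 'é', what a Latin-9 terminal makes
of them, and the conversion back to a single ISO-8859-15 byte.

```python
# 'é' as the parser hands it back: two UTF-8 bytes, 0xC3 0xA9.
utf8_bytes = "\u00e9".encode("utf-8")

# What an ISO-8859-15 terminal shows when it interprets those bytes
# one by one: two unrelated characters instead of 'é'.
mojibake = utf8_bytes.decode("iso-8859-15")

# Converting outside the parser: decode as UTF-8, re-encode as ISO-8859-15.
# errors="replace" substitutes '?' for characters Latin-9 cannot represent.
latin9_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-15", errors="replace")

print(utf8_bytes, mojibake, latin9_bytes)
```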

> But really, processing XML with perl 5.00503 seems like a bad idea to me.

I agree with you, but I have no choice right now. I've got perl 5.005 in one
hand and a project to get off the ground in the other. That is what I have
to deal with. Maybe another way of parsing a configuration file would be
easier, but I like having a reason to play with XML, and I didn't really
find what I wanted in the modules I tested previously.


Thank you very much for your help; it has been really useful to me.


Julien