Re: [XML::Simple-2.12] problems parsing non ASCII strings



Jul wrote:
module: XML::Simple-2.12 (also tried 2.14)
perl version: 5.00503

Wow! Do you know how old this is? Five or six years?

I need to parse and write an XML configuration file which contains
non-ASCII characters (like 'é', in French).
I've chosen XML::Simple with XML::Parser for these tasks, but everything
works fine if and only if I do not include any special character in the
file; otherwise the HASH returned by XMLin() is totally messed up.

What is the encoding of your file? My guess is that it is either ISO-8859-1 (or -15) or some kind of windows-12nn code page.


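What happens is that the data is read, probably by expat, and converted to UTF-8. The "totally messed up" characters are in fact perfectly valid UTF-8 characters that your terminal (or whatever you use to display them) is not set up to display. For example, 'é' is the single byte 0xE9 in ISO-8859-1 but the two bytes 0xC3 0xA9 in UTF-8, which a Latin-1 terminal shows as 'Ã©'.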

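If XML::Simple can read the file at all, then the encoding must be declared in the XML declaration at the beginning of the XML file. Without a declaration the parser assumes UTF-8 and would reject the file as not well-formed at the first accented character.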
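
For illustration, a file in that situation would start something like the fragment below. The element names are made up, and ISO-8859-1 is only my guess at your encoding; the point is the encoding attribute on the first line.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- hypothetical element names, for illustration only -->
<config>
  <name>café</name>
</config>
```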

Your choices are either to convert those characters back to the original encoding (look at the Unicode::* modules on CPAN), or to bite the Unicode bullet and learn how to work with UTF-8 data. In the long run the second option makes more sense, but YMMV.
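
As a rough sketch of the first option: the code below assumes perl 5.8 or later, where the Encode module ships with perl, so it will not run on 5.00503. The helper name utf8_to_latin1 is hypothetical, and ISO-8859-1 is assumed as the target encoding. It takes a string of UTF-8 bytes and returns the same text as ISO-8859-1 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of XML::Simple): convert a string of
# UTF-8 bytes into the equivalent ISO-8859-1 bytes.
sub utf8_to_latin1 {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);   # bytes -> characters
    return encode('ISO-8859-1', $chars);        # characters -> Latin-1 bytes
}

my $utf8   = "caf\xC3\xA9";                     # "café" as UTF-8 bytes
my $latin1 = utf8_to_latin1($utf8);

printf "%d bytes in UTF-8, %d bytes in ISO-8859-1\n",
    length($utf8), length($latin1);
```

Characters that have no ISO-8859-1 equivalent cannot survive this conversion, which is one more reason the second option tends to age better.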

But really, processing XML with perl 5.00503 seems like a bad idea to me.

--
mirod