Re: [XML::Simple-2.12] problems parsing non ASCII strings



Jul wrote:
module: XML::Simple-2.12 (also tried 2.14)
perl version: 5.00503

Wow! Do you know how old this is? 5 or 6 years old?

I need to parse and write an XML configuration file which contains
non-ASCII characters (like 'é', in French).
I've chosen XML::Simple with XML::Parser for these tasks, but everything
works fine if and only if I do not include any special character in the
file; otherwise the HASH returned by XMLin() is totally messed up.

What is the encoding of your file? My guess is that it is either ISO-8859-1 (or -15) or some kind of windows-12nn code page.


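What happens is that the data is read, probably by expat, and converted to UTF-8. The "messed up" characters are in fact perfectly valid UTF-8 characters that your terminal (or whatever you use to display them) is not set up to display. An 'é', for instance, is a single byte in ISO-8859-1 but two bytes in UTF-8, so a Latin-1 terminal shows it as two odd-looking characters.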

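If XML::Simple can read the file at all, then the encoding must be declared in the XML declaration, at the beginning of the XML file. Without a declaration the parser assumes UTF-8 and would most likely reject the file as not well-formed.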
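
If you want to check what the file declares, something like the following rough sketch should do. It only looks at the start of the document with a regexp, and declared_encoding is a name I made up for illustration, not a function of XML::Simple or XML::Parser.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: return the encoding named in the XML declaration,
# or nothing if the document does not declare one.
sub declared_encoding {
    my ($xml) = @_;
    return $1 if $xml =~ /^<\?xml[^>]*\bencoding\s*=\s*["']([^"']+)["']/;
    return;    # no declaration: the parser falls back to UTF-8
}

# Sample document, assumed here to be stored as ISO-8859-1.
my $doc = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<config/>\n};
my $enc = declared_encoding($doc);
print "declared encoding: ", (defined $enc ? $enc : "(none)"), "\n";
```

It uses no modules, so it should run on an old perl too.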

Your choices are either to convert those characters back to the original encoding (look at the Unicode::* modules on CPAN for that), or to bite the Unicode bullet and learn how to work with UTF-8 data. In the long run the second option makes more sense, but YMMV.
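
For the first option, here is a minimal sketch of the conversion done by hand, assuming the original encoding was ISO-8859-1 and that the data only contains characters in that range. utf8_to_latin1 is a hypothetical helper, not something XML::Simple provides, and a CPAN module will be more robust than this.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: turn a UTF-8 byte string back into ISO-8859-1.
# Only U+0080..U+00FF are handled; in UTF-8 those are a lead byte
# 0xC2 or 0xC3 followed by one continuation byte (0x80..0xBF).
# Anything outside that range is left untouched.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    $bytes =~ s/([\xC2\xC3])([\x80-\xBF])/chr(((ord($1) & 0x03) << 6) | (ord($2) & 0x3F))/ge;
    return $bytes;
}

# "café" as UTF-8 bytes: the 'é' (U+00E9) is the two bytes 0xC3 0xA9.
my $utf8   = "caf\xC3\xA9";
my $latin1 = utf8_to_latin1($utf8);

printf "%d bytes in, %d bytes out\n", length($utf8), length($latin1);
# prints "5 bytes in, 4 bytes out"
```

You would apply this to each value in the hash returned by XMLin() before printing it on a Latin-1 terminal.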

But really, processing XML with perl 5.00503 seems like a bad idea to me.

--
mirod