LibXML UTF8 - Input is not proper UTF-8, indicate encoding !

From: Vlajko Knezic (vava_at_vkkk.net)
Date: 03/05/05

  • Next message: Jeff Seale: "displaying STDOUT containing multiple data entries"
    Date: Fri, 4 Mar 2005 23:34:55 -0500
    
    

    I am not sure what is going on here, but it seems to have something to do
    with the way UTF-8 is handled in Perl and/or LibXML.

    The script below:

    - accepts a value from a form text field,

    - builds an XML document around it,

    - serializes the document to a string using toString(),

    - parses the string back into an XML document using parse_string(),

    - transforms the XML document into an HTML document using an XSL
    transformation.

    Everything works well until a non-ASCII character (for example é) is
    entered in the text field. In that case the call to parse_string() dies
    with the message:

    =====================================================================

    :2: parser error : Input is not proper UTF-8, indicate encoding !
    <test><test_text>abcé</test_text></test>
    ^
    :2: error: Bytes: 0xE9 0x3C 0x2F 0x74
    <test><test_text>abcé</test_text></test>
    ^
    at C:/_work/vsurvey/site/test1.cgi line 24

    =====================================================================
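
    The bytes in the message can be checked without LibXML. The sketch below
    uses only the core Encode module to show that 0xE9 followed by 0x3C is not
    a valid UTF-8 sequence, and what the same character looks like once it is
    encoded as UTF-8. It rests on an assumption: that the form value reached
    the script as ISO-8859-1 bytes, which is consistent with the single byte
    0xE9 for é but is not confirmed by the post. The variable names are
    hypothetical and $raw_param merely stands in for the CGI parameter.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# The four bytes reported in the error message: 0xE9 0x3C 0x2F 0x74 ("\xE9</t").
my $reported = "\xE9\x3C\x2F\x74";

# Strict decoding as UTF-8 dies here: 0xE9 starts a three-byte sequence and
# must be followed by two continuation bytes in the range 0x80..0xBF,
# but the next byte is 0x3C ("<").
my $is_valid_utf8 =
    eval { decode('UTF-8', $reported, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;

# ASSUMPTION: the form value arrived as ISO-8859-1 bytes.
# $raw_param is a hypothetical stand-in for $mCGI->param('test').
my $raw_param = "abc\xE9";
my $chars     = decode('ISO-8859-1', $raw_param);  # bytes -> Perl characters
my $utf8      = encode('UTF-8', $chars);           # characters -> UTF-8 bytes

# Show the re-encoded value byte by byte.
my $hex = join ' ', map { sprintf '0x%02X', ord } split //, $utf8;

print "reported bytes are valid UTF-8: $is_valid_utf8\n";
print "re-encoded value: $hex\n";
```

    If the assumption holds, the character string in $chars (rather than the
    raw parameter) is what one would expect to hand to appendTextNode(); if the
    browser actually submits another encoding, the name passed to decode() has
    to change accordingly.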

    I know that the code below does not make much sense on its own, but it is
    an abstraction of much more complex code. The environment is Perl 5.8,
    Apache, and Windows XP.

    Hints and/or explanations of what was coded wrong and how it should be
    fixed are very much appreciated.

    Vlajko Knezic,

    Toronto, Ontario

    ---------------------------------------------------------------------------------------------------------------------

    test.cgi

    #! c:/Perl/bin/Perl.exe

    use CGI;

    use XML::LibXML;

    use XML::LibXSLT;

    use CGI::Carp qw( fatalsToBrowser );

    use Encode;

    my $mDocument = XML::LibXML::Document-> new();

    my $parser = XML::LibXML->new();

    $mDocument->setEncoding("UTF8");

    my $mCGI = new CGI;

    print $mCGI->header;

    my $mTest_text = $mCGI->param('test');

    my $mTest = $mDocument-> createElement("test");

    my $mTestText = $mDocument-> createElement("test_text");

    $mTestText->appendTextNode($mTest_text);

    $mTest->appendChild($mTestText);

    $mDocument->setDocumentElement( $mTest );

    $mDocument->setEncoding("UTF8");

    my $mTestXML = $mDocument->toString();

    my $mParsedTestXML = $parser->parse_string($mTestXML);

    my $mParsedXMLXSL = $parser->parse_file('test.xsl');

    my $mParserXSL = XML::LibXSLT->new();

    my $mParsedXSL = $mParserXSL->parse_stylesheet($mParsedXMLXSL);

    my $mPageHTML = $mParsedXSL->transform($mParsedTestXML);

    my $mPrintPageHTML = $mParsedXSL->output_string($mPageHTML);

    print $mPrintPageHTML;

    test.xsl

    <?xml version="1.0"?>

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

    <xsl:output method="html" encoding="UTF-8" indent="yes"
    omit-xml-declaration="yes"/>

    <xsl:template match="//test">

      <head>

        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

      </head>

      <html>

        <body>

        <xsl:value-of select="test_text"/>

        <form name="test" type="post" target="_self">

          <input type="text" name="test" /><input type="submit" name="button"/>

        </form>

        </body>

      </html>

    </xsl:template>

    </xsl:stylesheet>

