Re: output ampersand using XML::Twig

From: Michel Rodriguez (mirod_at_xmltwig.com)
Date: 10/27/03

    Date: Mon, 27 Oct 2003 13:49:27 +0100
    
    

    Dave Roe wrote:
    > I am using XML::Twig to generate HTML output during an Apache request.
    > How can I output '&nbsp;' without it being converted into '&amp;nbsp'?
    > (the ampersand is converted into &amp; and I lose the last semi-colon.)
    > Is it an encoding issue or something that can be resolved with a CDATA
    > section?

    > #!/usr/bin/perl -w
    >
    > use strict;
    > use XML::Twig;
    >
    > my $box = new XML::Twig::Elt('box');
    > $box->set_content('&nbsp');
    >
    > # I've also tried this:
    > # my $box = new XML::Twig::Elt('#CDATA' => '&nbsp;')->wrap_in('box');
    >
    > my $twig = new XML::Twig();
    > $twig->set_root($box);
    >
    > my $xml = $twig->sprint();
    > print STDERR "$xml\n";

    Hi,

    The easy answer first: the reason you lose the semicolon is... because
    it was never there: you wrote $box->set_content('&nbsp') ;--)

    Now for the real problem:

    When you create the element, using XML::Twig::Elt, the string is stored
    directly as the content of the element, and then escaped when output. So
    the ampersand is normally escaped when you output it as xml, which is
    what sprint does.
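
    To see the escaping at work, here is a minimal sketch (an illustration
    only; show_escaping is a made-up helper name). It stores a string
    containing an ampersand and compares what text returns with what sprint
    outputs:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# hypothetical helper: returns the stored text and the serialized element
sub show_escaping {
    my ($content) = @_;
    my $box = XML::Twig::Elt->new( box => $content );
    return ( $box->text, $box->sprint );
}

my ( $stored, $serialized ) = show_escaping('&nbsp;');
print "stored:     $stored\n";        # the string exactly as it was given
print "serialized: $serialized\n";    # the & is escaped on output
```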

    BTW, the reason for this is that when you use XML::Twig to process
    existing XML data, it receives unescaped utf8 strings from the parser
    (expat, through XML::Parser): if you have &nbsp; in the original XML
    then XML::Twig receives the non-breakable space character, and if you
    have &amp; it receives just & (unless you use the keep_encoding option,
    in which case no escaping is done).
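
    A small sketch of the parsing side (again just an illustration;
    parsed_text is a made-up helper name). It uses the numeric reference
    &#160; instead of &nbsp; so that no DTD is needed to define the entity:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# hypothetical helper: parse a document and return the text of its root
sub parsed_text {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($xml);
    return $twig->root->text;
}

my $text = parsed_text('<box>&#160;&amp;</box>');

# list the code points XML::Twig received from the parser:
# the references are gone, only the characters themselves remain
printf "U+%04X\n", ord($_) for split //, $text;
```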

    There are many ways to deal with this, the basic idea (except for the
    first 2 solutions) being to get the unicode character for &nbsp; in the
    string, and then to play with output filters to convert it to an
    entity when sprint is used. Pick the one you like best (all code below
    tested with 5.8.1, when noted also tested with a stock 5.6.1):

    #!/usr/bin/perl -w

    use strict;
    use XML::Twig;

    my $tag= 'box';

    { # the hackish way: turn off XML escapes for the element content
       # works also on 5.6.1

    my $box = new XML::Twig::Elt( $tag => '&nbsp;');
    $box->set_asis( 1);

    my $twig = new XML::Twig();
    $twig->set_root( $box);

    my $xml = $twig->sprint();
    printf STDERR "%-35s: %s\n", "turn off xml escape for the element", $xml;
    }

    { # another hackish way: use the keep_encoding option
       # works also on 5.6.1

    my $box = new XML::Twig::Elt( $tag => '&nbsp;');

    my $twig = new XML::Twig(keep_encoding => 1);
    $twig->set_root( $box);

    my $xml = $twig->sprint();
    printf STDERR "%-35s: %s\n", "use the keep_encoding mode", $xml;
    }

    { # just output the character, unicode-aware browsers
       # will display it properly
       # works also on 5.6.1

    my $box = new XML::Twig::Elt( $tag => "\x{a0}");

    my $twig = new XML::Twig();
    $twig->set_root( $box);

    my $xml = $twig->sprint();
    printf STDERR "%-35s: %s\n", "output character", $xml;
    }

    { # use an Encode output filter that encodes (using decimal
       # character entities) anything outside the pure ascii range

    use Encode;

    my $filter= sub { return encode( "ascii", $_[0], Encode::FB_HTMLCREF) };
    my $twig = new XML::Twig( output_filter => $filter);
    my $box = new XML::Twig::Elt( $tag => "\x{a0}");
    $twig->set_root( $box);

    my $xml = $twig->sprint();
    printf STDERR "%-35s: %s\n", "using html character entities", $xml;
    }

    { # use an Encode output filter that encodes (using hexadecimal
       # character entities) anything outside the pure ascii range

    use Encode;

    my $filter= sub { return encode( "ascii", $_[0], Encode::FB_XMLCREF) };
    my $twig = new XML::Twig( output_filter => $filter);
    my $box = new XML::Twig::Elt( $tag => "\x{a0}");
    $twig->set_root( $box);

    my $xml = $twig->sprint();
    printf STDERR "%-35s: %s\n", "using xml character entities", $xml;
    }

    { # use charnames ':full' to enter the special character by name

    use Encode;
    use charnames ':full';

    my $filter= sub { return encode( "ascii", $_[0], Encode::FB_XMLCREF) };
    my $twig = new XML::Twig( output_filter => $filter);
    my $box = new XML::Twig::Elt( $tag => "\N{NO-BREAK SPACE}");
    $twig->set_root( $box);

    my $xml = $twig->sprint();
    printf STDERR "%-35s: %s\n", "using named entity input", $xml;
    }

    { # use HTML::Entities to get the entity name
       # the second argument to encode_entities ensures that only
       # high-bit characters are escaped, and not <, >, & and ",
       # which are supposed to be output as is (those characters in the
       # content would be escaped by XML::Twig if needed, see below).
       # works also on 5.6.1

    use HTML::Entities;
    use charnames ':full';

    my $filter= sub { return encode_entities( $_[0], "\x80-\xff") };
    my $twig = new XML::Twig( output_filter => $filter);
    my $box = new XML::Twig::Elt( $tag => "\N{NO-BREAK SPACE}");
    $twig->set_root( $box);

    my $xml = $twig->sprint();
    printf STDERR "%-35s: %s\n", "using named entity output", $xml;

    $box = new XML::Twig::Elt( $tag => "< \N{NO-BREAK SPACE} > &");
    $twig->set_root( $box);

    $xml = $twig->sprint();
    printf STDERR "%-35s: %s\n", "same, checking escapes", $xml;

    }
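
    For a quick side-by-side of the two Encode fallbacks used above,
    independent of XML::Twig, something along these lines could be used (a
    sketch only; to_charrefs is a made-up helper name). It works on a copy of
    its argument, because encode may modify the source string when a CHECK
    value is given. FB_HTMLCREF should give a decimal reference and
    FB_XMLCREF a hexadecimal one:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# hypothetical helper: encode to ascii, replacing anything that does not
# fit with the character references chosen by $fallback
sub to_charrefs {
    my ( $string, $fallback ) = @_;    # $string is a copy of the caller's data
    return encode( 'ascii', $string, $fallback );
}

my $nbsp = "\x{a0}";
print to_charrefs( "a${nbsp}b", Encode::FB_HTMLCREF ), "\n";    # decimal reference
print to_charrefs( "a${nbsp}b", Encode::FB_XMLCREF ),  "\n";    # hexadecimal reference
```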

