keep entity references while parsing with XML::Parser
- From: a.heuboeck@xxxxxxxxxxxxx (Alois Heuboeck)
- Date: Thu, 15 Sep 2005 17:26:20 +0100
Hi Perlers,
I'm trying to do the following:
1- take an XML file
2- in one script, replace everything above Unicode #x7F (end of ASCII) with entity references (which can either have "special" names, like &auml;, or be based on the Unicode number, like &#xAE;)
3- then in another script, do some more transformations using XML::DOM and
4- print out resulting XML
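To make step 2 concrete, here is a minimal sketch of the kind of replacement I mean. It is not my actual script: the sub name escape_non_ascii and the %NAMED table are made up for illustration, and the sketch assumes the text has already been decoded into Perl characters. A named reference like &auml; is only usable if the DTD declares it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table of code points that get a "special name" reference.
# Only one entry is shown; a real table would come from the DTD's entity sets.
my %NAMED = (0xE4 => 'auml');

# Hypothetical helper: replace every character above U+007F with either a
# named entity reference or a hexadecimal character reference.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s{([^\x00-\x7F])}{
        my $cp = ord $1;
        exists $NAMED{$cp} ? "&$NAMED{$cp};" : sprintf('&#x%X;', $cp);
    }ge;
    return $text;
}

# U+00AE has no entry in %NAMED, so it becomes a hex reference.
print escape_non_ascii("NetMachanic\x{AE}technical"), "\n";
# U+00E4 is in %NAMED, so it becomes a named reference.
print escape_non_ascii("\x{E4}"), "\n";
```

ASCII characters, including markup such as < and &, pass through unchanged, so this only touches the characters above #x7F.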
My problem is that in the third step, when parsing its input, XML::Parser seems to resolve the references that contain the hex Unicode number; the "special name" references are not resolved.
My input looks somewhat like this:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE TEI.2 SYSTEM "E:/TEI.dtd">
<TEI.2>
<w:t> &auml; NetMachanic&#xAE;technical evaluation </w:t>
<w:t> &#xE2;and LinkPopularity are two tools for organisation. </w:t>
<w:t> &#xE2;&#xE2;&#xE2;&#xE2; </w:t>
<w:t> &#xAE;&#xAE;&#xAE;&#xAE; </w:t>
</TEI.2>
I tried the option NoExpand and also implemented a default handler, which "will be called when an entity reference is seen in text" (http://www.socsci.umn.edu/ssrf/doc/xml/enno-xml-docs/users.erols.com/enno/xml/XML/Parser/Expat.html),
so I have:
--------------------
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

my $infile = "INFILE.xml";
my $dom_parser = new XML::DOM::Parser(
    NoExpand => 1,
    Handlers => {
        Default => \&handle_default,
        Char    => \&handle_char,
    },
);
my $TREE = $dom_parser->parsefile($infile);

# here transform $TREE with XML::DOM

open OUT, ">OUTFILE.xml" or die "cannot write to OUT file";
print OUT $TREE->toString();
close OUT;

sub handle_char {
    my ($parser, $string) = @_;
    my $rec = $parser->recognized_string();
    my $esc = $parser->xml_escape($rec);
    open LOG, ">>log.txt";
    print LOG "\n--\ncall of handle_char()\n";
    print LOG "[$string||$rec//$esc]\n";
}

sub handle_default {
    my ($parser, $string) = @_;
    my $rec = $parser->recognized_string();
    my $esc = $parser->xml_escape($rec);
    open LOG, ">>log.txt";
    print LOG "\n--\ncall of handle_default()\n";
    print LOG "[$string||$rec//$esc]\n";
}
--------------------
Now, my problems:
First, handle_default() is not called for the entity references &#xAE; and &#xE2; but only for &auml;
&#xAE; and &#xE2; trigger handle_char() instead.
Second, the NoExpand option does not do what I thought it would, namely leave the entity references unexpanded.
Finally, the unresolved string in handle_char() can be seen in $rec and $esc; the resolved one is in $string.
But how can I get this out to $TREE? All the textbook examples of handlers I saw just printed out some message.
Another strange thing occurs in the last two <w:t> elements:
the first contains four references to small letter a with circumflex; the second, four references to the REGISTERED TRADEMARK SIGN. What I get (when I don't set the Default and Char handlers) is:
<t> 㢃â </t> for the first and
<t> ®®®® </t> four (R) for the second
In the first case, resolving the reference &#xE2; seems to "eat" some of the following characters (this also occurs when it is followed by normal character text).
Could anyone please give advice? Thanks,
Alois