Re: Serious Perl Regular Expression deficiency?
- From: robic0
- Date: Mon, 26 Dec 2005 19:17:04 -0800
On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:
I'm back on the job.
I'm going to post some new code this week that
complies with the XML spec.
This is the solution to the Comment/CDATA problem
that will be incorporated in the new version:
use strict;
use warnings;
$_ = '
<![CDATA[ <!-- imbed comment --> some text <!-- imbed as well -->]]>
<!--
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
-->
<!-- This is a real comment -->
';
#### This section of the parser deals with
#### circular non-markup embedding issues
#### (one construct inside the other, and so forth).
#### So far just comments & CDATA.
#### Use the general substitution magic.
#### This is valid because neither comments
#### nor CDATA sections can be nested.
my $cnt = 1;
my %root = ();
my %cdata_elements = ();
print "\n";
# -- Comments (done first) --
while (s/(<!--(.*?)-->)/[$cnt]/s) {
$root{$cnt} = $1;
print "$cnt = Questionable comment: $1\n"; $cnt++;
}
print "\n\n",'='x60,"\n\nThe \"Real\" Stuff -->\n\n";
# -- CDATA (done second) --
while (s/<!\[CDATA\[(.*?)\]\]>/[$cnt]/s)
{
# reconstitute cdata element contents
my $cdata_contents = $1;
my $str = '';
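# Consume the contents left to right: either a [N] placeholder
# or a run of other text (a lone "[" is taken as literal text).
while ( $cdata_contents =~ s/\A(?:\[(\d+)\]|(\[|[^\[]+))// )
{
if (defined $1 && exists $root{$1})
{
# placeholder left by the comment pass: restore the original text
$str .= $root{$1};
delete $root{$1};
}
elsif (defined $1)
{
# bracketed number with no stored comment: keep it as literal text
$str .= "[$1]";
}
else
{
$str .= $2;
}
}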
$root{$cnt} = $str;
$cdata_elements{$cnt} = '';
print "\n$cnt = REAL CDATA: $root{$cnt}\n"; $cnt++;
}
# -- Process leftover comments that are real --
# (each() returns the keys in no particular order)
while (my ($key,$val) = each (%root)) {
if (!exists $cdata_elements{$key}) {
# This $root re-assignment is not really necessary
# since $1 will contain the processing text that
# will be processed here, then never used again.
$root{$key} =~ s/<!--(.*?)-->/$1/s;
print "\n$key = REAL COMMENT: $root{$key}\n"; # Or $1
}
}
__END__
1 = Questionable comment: <!-- imbed comment -->
2 = Questionable comment: <!-- imbed as well -->
3 = Questionable comment: <!--
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
-->
4 = Questionable comment: <!-- This is a real comment -->
============================================================
The "Real" Stuff -->
5 = REAL CDATA: <!-- imbed comment --> some text <!-- imbed as well -->
4 = REAL COMMENT: This is a real comment
3 = REAL COMMENT:
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
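For comparison, here is a minimal sketch of a single-pass variant of the same idea. It is not the code from the new version, and the sub name scan_comments_and_cdata is made up for illustration. It relies on the same fact as above (neither construct nests): scanning left to right with one alternation, whichever opener is met first decides the construct, and everything up to its own terminator is plain content. It assumes only comments and CDATA sections matter, and it does not check other XML rules (for example, "--" inside a comment).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the posted parser.
# Scans the text once; the first opener found ("<!--" or "<![CDATA[")
# claims everything up to its matching terminator.
sub scan_comments_and_cdata {
    my ($text) = @_;
    my @found;
    while ($text =~ /<!--(.*?)-->|<!\[CDATA\[(.*?)\]\]>/gs) {
        if (defined $1) {
            push @found, { type => 'comment', body => $1 };
        }
        else {
            push @found, { type => 'cdata', body => $2 };
        }
    }
    return @found;
}

# Same sample text as in the post, built line by line.
my $xml = join "\n",
    '<![CDATA[ <!-- imbed comment --> some text <!-- imbed as well -->]]>',
    '<!--',
    'wasdfvgasvbg <![CDATA[ not really a CDATA ]]>',
    '<tag>at tag in a real comment</tag>',
    '<![CDATA[ not a CDATA ]]>',
    '-->',
    '<!-- This is a real comment -->';

for my $item (scan_comments_and_cdata($xml)) {
    print "$item->{type}: $item->{body}\n";
}
```

Because the regex engine resumes after each match, the comments embedded in the CDATA section and the fake CDATA sections inside the big comment are never looked at as openers, so no placeholder table is needed.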