Re: Serious Perl Regular Expression deficiency?



On Fri, 23 Dec 2005 15:17:21 -0800, robic0 wrote:

I'm back on the job.
I'm going to post some new code this week that
complies with the XML spec.

Here is the solution to the comment/CDATA embedding problem
(each can appear as literal text inside the other).
It will be incorporated in the new version:

use strict;
use warnings;

$_ = '
<![CDATA[ <!-- imbed comment --> some text <!-- imbed as well -->]]>

<!--
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
-->

<!-- This is a real comment -->

';

#### This section of the parser deals with
#### non-markup constructs embedded in each other
#### (a comment inside CDATA, CDATA inside a comment).
#### So far just comments & CDATA.
#### Use the general substitution magic.
#### This is valid because neither comments
#### nor CDATA sections can be nested.

my $cnt = 1;
my %root = ();
my %cdata_elements = ();

print "\n";

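# -- Comments (done first) --
# Replace everything that looks like a comment with a numbered
# placeholder [N] and save the original text under that number.
while (s/(<!--.*?-->)/[$cnt]/s) {
    $root{$cnt} = $1;
    print "$cnt = Questionable comment: $1\n";
    $cnt++;
}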
print "\n\n",'='x60,"\n\nThe \"Real\" Stuff -->\n\n";
# -- CDATA (done second) --
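while (s/<!\[CDATA\[(.*?)\]\]>/[$cnt]/s)
{
    # reconstitute cdata element contents
    my $cdata_contents = $1;
    my $str = '';
    # Consume the contents from the front, one piece at a time:
    # a run of non-bracket text, a [N] token, or a lone bracket.
    while ( $cdata_contents =~ s/^(?:([^\[\]]+)|\[(\d+)\]|([\[\]]))// )
    {
        if (defined $1)
        {
            $str .= $1;
        }
        elsif (defined $2 && exists $root{$2})
        {
            # placeholder for a "comment" that is really CDATA text
            $str .= $root{$2};
            delete $root{$2};
        }
        elsif (defined $2)
        {
            $str .= "[$2]";   # literal [digits], not one of our placeholders
        }
        else
        {
            $str .= $3;       # lone [ or ] in the CDATA text
        }
    }
    $root{$cnt} = $str;
    $cdata_elements{$cnt} = '';

    print "\n$cnt = REAL CDATA: $root{$cnt}\n";
    $cnt++;
}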
# -- Process leftover comments that are real --
# Note: each() returns the keys in no particular order.
while (my ($key,$val) = each (%root)) {
    if (!exists $cdata_elements{$key}) {
        # Strip the comment delimiters. Storing the result back
        # in %root is not strictly necessary, since the stripped
        # text is only printed here and never used again.
        $root{$key} =~ s/<!--(.*?)-->/$1/s;
        print "\n$key = REAL COMMENT: $root{$key}\n";
    }
}


__END__

1 = Questionable comment: <!-- imbed comment -->
2 = Questionable comment: <!-- imbed as well -->
3 = Questionable comment: <!--
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
-->
4 = Questionable comment: <!-- This is a real comment -->


============================================================

The "Real" Stuff -->


5 = REAL CDATA: <!-- imbed comment --> some text <!-- imbed as well -->

4 = REAL COMMENT: This is a real comment

3 = REAL COMMENT:
wasdfvgasvbg <![CDATA[ not really a CDATA ]]>
<tag>at tag in a real comment</tag>
<![CDATA[ not a CDATA ]]>
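
For comparison, here is a rough single-pass sketch of the same idea.
It is not the code that will go into the new version, just an
illustration. Because neither construct nests, whichever opening
delimiter shows up first in the text decides how everything up to its
closing delimiter is read. The helper name scan_nonmarkup and the
sample string are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above).
# Returns a list of [type, body] pairs in document order.
sub scan_nonmarkup {
    my ($text) = @_;
    my @found;
    # The regex engine takes the leftmost match, so the construct that
    # opens first wins and the other one inside it is plain text.
    while ($text =~ /<!--(.*?)-->|<!\[CDATA\[(.*?)\]\]>/gs) {
        if (defined $1) { push @found, ['comment', $1]; }
        else            { push @found, ['cdata',   $2]; }
    }
    return @found;
}

my $doc = '<![CDATA[ <!-- a --> text ]]> <!-- real <![CDATA[ no ]]> -->';
for my $item (scan_nonmarkup($doc)) {
    print "$item->[0]: $item->[1]\n";
}
```

This only finds the two constructs. It does no placeholder
substitution, so the rest of the parser would still need a way to skip
over the matched spans.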

