Re: Serious Perl Regular Expression deficiency?
- From: robic0
- Date: Sat, 24 Dec 2005 12:34:38 -0800
On 23 Dec 2005 20:13:08 -0800, castillo.bryan@xxxxxxxxx wrote:
>robic0 wrote:
>> I don't see a solution to this problem: regular
>> expressions can't exclude a string when
>> processing. They can exclude individual characters
>> fine. I started doing Perl 2 years ago and have
>> run into this nagging problem several times.
>>
>> After extensive reading of the Perl docs on regexes
>> (especially in the last 2 days) I have come to the
>> conclusion that regular expressions have a serious
>> deficiency. This is serious because negating a string
>> is a fundamental piece of basic logic for a search in
>> a touted master search engine, or should be.
>> To a degree it works with a known subset, but it
>> won't work to the degree shown below. This is a
>> serious flaw in regular expressions!
>>
>> I hope you masters can prove me wrong! I really do.
>> If not I would hope that the Perl authors can provide
>> some insight on when this construct can be fixed,
>> aka implemented.
>>
>> Beat this code if you can (you can't). Don't look
>> at the code in this example, look instead at the
>> output.
>> Don't comment on any code syntax because that's not
>> welcome or the point.
>> Instead, refer your comments to the output IDs.
>>
>> If you know of a way Perl regex can do this
>> please reply. I'm almost 99% sure Perl regex
>> can't do this. In fact the 1% is thrown out here
>> to either verify that or prove otherwise.
>>
>
>It's not clear what "this" is. Are you asking if Perl can do a negative
>match on a string, pull out XML comments with a regex, or both?
>
>If you are wondering about a negative string match, look at the perlre
>documentation, specifically negative lookahead and lookbehind
>assertions.
>
>If you want to pull out the contents of XML comments you could do this.
>
>
>sub test_xml_comment_parse {
>    my ($xml) = @_;
>    print "XML\n", '-' x 40, "\n", $xml, "\n", '-' x 40, "\n";
>    while ($xml =~ s/<!--(.*?)-->//ms) {
>        print "Comment [$1]\n";
>    }
>    print "\n", '-' x 40, "\n\n\n";
>}
>
>my $gabage1 = '
><big name="asdf" date="33" >
> asdf
> <!-- howdy folks -->
> <in2>jjjj</in2>
> <!-- and still more -->
> asdfb
></big>
>';
>
>my $gabage2 = '
><big name="asdf" date="33" >
> asdf
> <!-- howdy folks %SYSTEM is down <who cares?> -->
> <in2>jjjj</in2>
> <!-- and still more -->
> asdfb
></big>
>';
>
>test_xml_comment_parse($_) foreach ($gabage1,$gabage2);
>
>output:
>
>XML
>----------------------------------------
>
><big name="asdf" date="33" >
> asdf
> <!-- howdy folks -->
> <in2>jjjj</in2>
> <!-- and still more -->
> asdfb
></big>
>
>----------------------------------------
>Comment [ howdy folks ]
>Comment [ and still more ]
>
>----------------------------------------
>
>
>XML
>----------------------------------------
>
><big name="asdf" date="33" >
> asdf
> <!-- howdy folks %SYSTEM is down <who cares?> -->
> <in2>jjjj</in2>
> <!-- and still more -->
> asdfb
></big>
>
>----------------------------------------
>Comment [ howdy folks %SYSTEM is down <who cares?> ]
>Comment [ and still more ]
>
>----------------------------------------
>
>
>
>There is a problem though. If you need to retrieve data from XML
>documents, you should generally use an XML parser instead of using your
>own regular expressions.
>
>Here is one case where the code I posted above would pull out the text
>"not really a comment", which isn't really a comment.
>
><test_xml>
> <value>
> <![CDATA[ <!-- not really a comment --> ]]>
> </value>
></test_xml>
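Thanks a lot.
Yes, stopping at the first occurrence (the non-greedy ?) does the trick: /<!--(.*?)-->/
And given that nesting is not allowed here, this will do it.
This had worked for me before; I should have stuck with it.
The //m is not really of help here, since the pattern has no ^ or $
anchors for it to affect. The //s is the one that matters, because it
lets . match newlines inside a comment.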
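To make the difference concrete, here is a minimal sketch. The sample string is made up for illustration; it holds two comments, the second spanning two lines.

```perl
use strict;
use warnings;

# Hypothetical sample input: two comments, the second contains a newline.
my $xml = "<a><!-- one --><b/><!-- two\nlines --></a>";

# Non-greedy .*? stops at the first "-->"; /s lets "." match "\n".
my @lazy = $xml =~ /<!--(.*?)-->/gs;

# Greedy .* runs on to the last "-->" and swallows the markup in between.
my @greedy = $xml =~ /<!--(.*)-->/gs;

# Without /s, "." cannot cross the newline, so the second comment is missed.
my @no_s = $xml =~ /<!--(.*?)-->/g;

print scalar(@lazy), " lazy, ", scalar(@greedy), " greedy, ",
      scalar(@no_s), " without /s\n";
```

Only the non-greedy pattern with /s finds both comments; /m plays no part in any of the three.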
I found the XML spec at
http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
I will use that to finish this code.
About the CDATA thing you mentioned: no, that's not really a
problem. The regexes are applied in an order such that all non-markup
items are processed out first.
So in this case all CDATA sections will be removed first, followed by
all comments and any other weird ones like versioning.
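Roughly, that ordering could be sketched like this. The helper name comments_outside_cdata is made up for the example, and the input is the CDATA case from the quoted reply plus one genuine comment.

```perl
use strict;
use warnings;

# Hypothetical helper: strip CDATA sections first, then collect comments.
sub comments_outside_cdata {
    my ($xml) = @_;
    $xml =~ s/<!\[CDATA\[.*?\]\]>//gs;    # step 1: CDATA sections go first
    return $xml =~ /<!--(.*?)-->/gs;      # step 2: comments in what is left
}

my $doc = <<'XML';
<test_xml>
  <value>
    <![CDATA[ <!-- not really a comment --> ]]>
  </value>
  <!-- a real comment -->
</test_xml>
XML

my @found = comments_outside_cdata($doc);
print "Comment [$_]\n" for @found;
```

One caveat with any fixed order: a real comment that happens to contain the text <![CDATA[ would trip the first step, so this is a sketch of the idea and not a replacement for a parser.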
I like the specs, it makes it easy to write the regex.
quote:
[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'
Within a CDATA section, only the CDEnd string is recognized as markup,
so that left angle brackets and ampersands may occur in their literal
form; they need not (and cannot) be escaped using "&lt;" and "&amp;".
CDATA sections cannot nest.
An example of a CDATA section, in which "<greeting>" and "</greeting>"
are recognized as character data, not markup:
<![CDATA[<greeting>Hello, world!</greeting>]]>
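Reading the CDStart, CData and CDEnd productions as a regex, a minimal sketch might look like the following. The variable names are made up, and "." under /s is assumed to be a good enough stand-in for the spec's Char.

```perl
use strict;
use warnings;

# The productions spelled out piece by piece (names are illustrative only).
my $cdstart = qr/<!\[CDATA\[/;
my $cdend   = qr/\]\]>/;
# CData: any run of characters that does not contain "]]>".
# The lookahead is checked before every single character consumed.
my $cdata   = qr/(?:(?!\]\]>).)*/s;
my $cdsect  = qr/$cdstart($cdata)$cdend/;

my $text = 'x <![CDATA[<greeting>Hello, world!</greeting>]]> y';
if ($text =~ $cdsect) {
    print "CData [$1]\n";
}
```

The (?:(?!\]\]>).)* part is the "Char* minus anything containing ]]>" rule written directly; the shorter .*? followed by \]\]> gives the same match here.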
..
..
..
One more thing:
>If you are wondering about a negative string match, look at the perlre
>documentation, specifically negative lookahead and lookbehind
>assertions.
Yes, I looked at it and tried the assertions quite a bit.
In this context, /(.*)(?!string)/s, it doesn't seem to work.
This, however, /(\w*)(?!string)/ seems to work, but only if the
string has certain characters.
I don't know why.
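A likely explanation, as a minimal sketch: the lookahead is tested only once, at the spot where the preceding .* or \w* stops, and the engine is free to pick a stopping point where the test passes. Testing the lookahead before every character avoids that. The sample strings are made up.

```perl
use strict;
use warnings;

my $has = "foo string bar";
my $not = "foo bar";

# (.*) is greedy: it runs to the end of the input, where nothing follows,
# so (?!string) succeeds trivially even though "string" is present.
my $naive = $has =~ /(.*)(?!string)/s ? 1 : 0;

# Tempered form: check the lookahead before each character consumed,
# and anchor both ends so the whole input must be free of "string".
my $free = qr/\A(?:(?!string).)*\z/s;
my $tempered_has = $has =~ $free ? 1 : 0;
my $tempered_not = $not =~ $free ? 1 : 0;

print "naive=$naive tempered_has=$tempered_has tempered_not=$tempered_not\n";
```

For a plain "does not contain" test on a whole string, $text !~ /string/ does the same job more simply; the tempered form is mainly useful inside a larger pattern.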
I won't be on for a couple of days while I install a new RAID array.
Anyway thanks for the help.