Re: Serious Perl Regular Expression deficiency?



On 23 Dec 2005 20:13:08 -0800, castillo.bryan@xxxxxxxxx wrote:

>robic0 wrote:
>> I don't see a solution to this problem that
>> regular expressions can't exclude a string when
>> processing. It can exclude individual characters
>> fine. I started doing Perl 2 years ago and have
>> run into this nagging problem several times.
>>
>> After extensive reading of the Perl docs on re's
>> (especially in the last 2 days) I have come to the
>> conclusion that regular expressions have a serious
>> deficiency. This is serious because the "not string"
>> is a fundamental, basic logic idea in a search from
>> a touted master search engine, or should be.
>> To a degree it works with a known subset, but it
>> won't work to the degree shown below. This is a
>> serious flaw in regular expressions!
>>
>> I hope you masters can prove me wrong! I really do.
>> If not I would hope that the Perl authors can provide
>> some insight on when this construct can be fixed,
>> aka implemented.
>>
>> Beat this code if you can (you can't). Don't look
>> at the code in this example, look instead at the
>> output.
>> Don't comment on any code syntax because that's not
>> welcome or the point.
>> Instead, refer your comments to the output IDs.
>>
>> If you know of a way Perl regex can do this
>> please reply. I'm almost 99% sure Perl regex
>> can't do this. In fact the 1% is thrown out here
>> to either verify that or prove otherwise.
>>
>
>It's not clear what "this" is. Are you asking if Perl can do a negative
>match on a string, pull out XML comments with a regex, or both?
>
>If you are wondering about a negative string match, look at the perlre
>documentation, specifically negative lookahead and lookbehind
>assertions.
>
>If you want to pull out the contents of XML comments you could do this.
>
>
>sub test_xml_comment_parse {
>    my ($xml) = @_;
>    print "XML\n", '-' x 40, "\n", $xml, "\n", '-' x 40, "\n";
>    while ($xml =~ s/<!--(.*?)-->//ms) {
>        print "Comment [$1]\n";
>    }
>    print "\n", '-' x 40, "\n\n\n";
>}
>
>my $gabage1 = '
><big name="asdf" date="33" >
> asdf
> <!-- howdy folks -->
> <in2>jjjj</in2>
> <!-- and still more -->
> asdfb
></big>
>';
>
>my $gabage2 = '
><big name="asdf" date="33" >
> asdf
> <!-- howdy folks %SYSTEM is down <who cares?> -->
> <in2>jjjj</in2>
> <!-- and still more -->
> asdfb
></big>
>';
>
>test_xml_comment_parse($_) foreach ($gabage1,$gabage2);
>
>output:
>
>XML
>----------------------------------------
>
><big name="asdf" date="33" >
> asdf
> <!-- howdy folks -->
> <in2>jjjj</in2>
> <!-- and still more -->
> asdfb
></big>
>
>----------------------------------------
>Comment [ howdy folks ]
>Comment [ and still more ]
>
>----------------------------------------
>
>
>XML
>----------------------------------------
>
><big name="asdf" date="33" >
> asdf
> <!-- howdy folks %SYSTEM is down <who cares?> -->
> <in2>jjjj</in2>
> <!-- and still more -->
> asdfb
></big>
>
>----------------------------------------
>Comment [ howdy folks %SYSTEM is down <who cares?> ]
>Comment [ and still more ]
>
>----------------------------------------
>
>
>
>
>
>
>
>There is a problem though. If you need to retrieve data from xml
>documents, you should generally use an XML parser instead of using your
>own regular expressions.
>
>Here is one case where the code I posted above would pull out the text
>"not really a comment", which isn't really a comment.
>
><test_xml>
> <value>
> <![CDATA[ <!-- not really a comment --> ]]>
> </value>
></test_xml>


Thanks a lot.

Yes, stopping at the first occurrence (the non-greedy ?) does the
trick: /<!--(.*?)-->/
And given that nesting is not allowed here, this will do it.
This had worked for me before; I should have stuck with it.
The //m is not really of help here, since the pattern has no ^ or $
anchors; it is the //s that lets . match across newlines.

I found xml specs from
http://www.w3.org/TR/1998/REC-xml-19980210#sec-cdata-sect
I will use that to finish this code.

About the CDATA thing you mentioned: no, that's not really a
problem. The order of the regexes is such that all non-markup
items are processed out first.

So in this case all CDATA will be removed first, followed by
all comments and any other weird ones like versioning.
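
A minimal sketch of that ordering, assuming simple non-greedy patterns
(this is only an illustration, not the finished code, and the sub name
comments_outside_cdata is made up for the example). CDATA sections are
deleted first, so comment-like text inside one never reaches the comment
pattern:

```perl
use strict;
use warnings;

# Hypothetical helper: remove CDATA sections first, then collect comments.
sub comments_outside_cdata {
    my ($xml) = @_;

    # Step 1: delete every CDATA section. /s lets . cross newlines,
    # and .*? stops at the first ]]> because CDATA cannot nest.
    $xml =~ s/<!\[CDATA\[.*?\]\]>//gs;

    # Step 2: collect the body of each <!-- ... --> that is left.
    my @comments = $xml =~ /<!--(.*?)-->/gs;
    return @comments;
}

my $xml = '
<test_xml>
 <value>
 <![CDATA[ <!-- not really a comment --> ]]>
 </value>
 <!-- a real comment -->
</test_xml>
';

# Prints only: Comment [ a real comment ]
print "Comment [$_]\n" for comments_outside_cdata($xml);
```

One caveat with this order: a genuine comment that happens to contain
the text '<![CDATA[' would be damaged by step 1, so a single
left-to-right pass (or a real parser) is safer for arbitrary input.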

I like the spec; it makes it easy to write the regex.
Quote:

[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'


Within a CDATA section, only the CDEnd string is recognized as markup,
so that left angle brackets and ampersands may occur in their literal
form; they need not (and cannot) be escaped using "&lt;" and "&amp;".
CDATA sections cannot nest.

An example of a CDATA section, in which "<greeting>" and "</greeting>"
are recognized as character data, not markup:

<![CDATA[<greeting>Hello, world!</greeting>]]>
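
As a rough illustration of how the quoted productions map onto a
pattern, here is a sketch. The variable names are borrowed from the
productions, the sub cdata_contents is hypothetical, and '.' under //s
is assumed to be a good enough stand-in for the spec's Char:

```perl
use strict;
use warnings;

# Productions [19] and [21], quoted literally.
my $CDStart = quotemeta '<![CDATA[';
my $CDEnd   = quotemeta ']]>';

# Production [20]: any run of characters that does not contain ']]>'.
# A non-greedy .*? followed directly by CDEnd stops at the first ']]>',
# so the captured text can never contain it.
# Assumption: '.' with /s stands in for the spec's Char.
my $CDSect = qr/$CDStart(.*?)$CDEnd/s;

# Hypothetical helper: return the contents of every CDATA section.
sub cdata_contents {
    my ($xml) = @_;
    return $xml =~ /$CDSect/g;
}

# Prints: CData [<greeting>Hello, world!</greeting>]
print "CData [$_]\n"
    for cdata_contents('<![CDATA[<greeting>Hello, world!</greeting>]]>');
```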

..
..
..
One more thing:
>If you are wondering about a negative string match, look at the perlre
>documentation, specifically negative lookahead and lookbehind
>assertions.

Yes, I looked at it and tried the assertions quite a bit.
In this context, /(.*)(?!string)/s, it doesn't seem to work.
This, however, /(\w*)(?!string)/ seems to work, but only if the
string has certain characters.
Don't know why.
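
A likely explanation, with a sketch: a lookahead is tested only at the
one position where it sits. The greedy .* runs to the end of the input,
nothing follows there, so (?!string) succeeds at once. \w* just stops at
the first non-word character and the assertion is tested there, so any
exclusion it appears to do is a side effect of the data. The usual idiom
is to test the lookahead before every character. The sample text and the
sub name lacks_word below are made up for the example:

```perl
use strict;
use warnings;

my $text = 'keep this string out';

# Greedy .* swallows everything; at end of input nothing follows,
# so (?!string) succeeds and the capture still contains 'string'.
my ($greedy) = $text =~ /(.*)(?!string)/s;
print "greedy:   [$greedy]\n";      # [keep this string out]

# Test the lookahead before every single character instead.
my ($tempered) = $text =~ /^((?:(?!string).)*)/s;
print "tempered: [$tempered]\n";    # [keep this ]

# Hypothetical helper: true when $text does not contain $word at all.
sub lacks_word {
    my ($text, $word) = @_;
    return $text =~ /\A(?:(?!\Q$word\E).)*\z/s ? 1 : 0;
}

print "lacks 'string': ", lacks_word($text, 'string'), "\n";   # 0
```

For a plain yes/no test, $text !~ /string/ or index() is simpler; the
lookahead form earns its keep when the exclusion is one part of a
larger pattern.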

I won't be on for a couple of days while I install a new RAID array.
Anyway thanks for the help.
