Re: regex problem
- From: sln@xxxxxxxxxxxxxxx
- Date: Sun, 14 Jun 2009 18:24:50 -0700
On Sun, 14 Jun 2009 15:26:16 -0400, Charlton Wilbur <cwilbur@xxxxxxxxxxxxxx> wrote:
"BM" == Ben Morrow <ben@xxxxxxxxxxxx> writes:
BM> Quoth Charlton Wilbur <cwilbur@xxxxxxxxxxxxxx>:
>> >>>>> "s" == sstark <sstark@xxxxxxxxxx> writes:
BM> [ sstark's code was
BM> $line =~ s/^$prev//;
BM> ]
s> Why isn't it deleting the value of $prev in $line?
>> Because ^ doesn't do what you think it does, and it only works in
>> the code you have there out of pure luck and coincidence.
BM> Please explain further. ^ means 'match at the beginning of the
BM> string', unless /m is given, in which case it means 'match at
BM> the beginning of any line'. How is this not what the OP thought
BM> it meant?
Because the strings he wants are at the start of the strings he's
looking at purely by accident -- it's not part of his specification.
Charlton
I don't think it's by coincidence.
The caret ^ in this:
/^(.*?href\s*=\s*\")([^\"]+)(\".*)/i
is not actually needed. The search starts at position 0 and .*? only expands as far as
the first instance of the pattern that follows it, so the match takes in everything
from the beginning of the line anyway.
However, he needs /is so that . can match newlines. So the basic principle is sound;
it just doesn't matter whether the ^ is there or not. There is no global modifier
anyway, so the pos() of the match is not remembered in the while and the search is
renewed at position 0 each time.
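
A quick way to check both points is to run the anchored and unanchored forms over the same string. This is only a sketch: the sample string $html is made up for illustration and is not the OP's data, and the closing quote is matched outside the groups as in the fixed-up regex.

```perl
use strict;
use warnings;

# Hypothetical sample input; the first href deliberately straddles a newline.
my $html = "junk\n<a href\n=\"one.html\">1</a> <a href=\"two.html\">2</a>";

my $anchored   = qr/^(.*?href\s*=\s*")([^"]+)"/is;
my $unanchored = qr/(.*?href\s*=\s*")([^"]+)"/is;

# In list context a match returns its captures ($1, $2).
my @with    = $html =~ $anchored;
my @without = $html =~ $unanchored;

# Both forms capture the same prefix and the same first href value.
print "same captures\n" if "@with" eq "@without";
print "first href: $with[1]\n";

# Without /g, a second match attempt starts over at position 0
# and finds the first href again, not the second one.
my ($again) = $html =~ /href\s*=\s*"([^"]+)"/is;
print "again: $again\n";
```

Both forms should report one.html, and the last match should report one.html again rather than two.html, which is why the loop has to remove the matched text from $line to make progress.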
On top of that, the text captured by (.*?href\s*=\s*\") is used as a pattern
in a later substitution regex, and that is where he had a metachar quoting problem.
The line is still the same until he does the substitution, so the substitution would
find exactly what the while regex captured, as its first match, but only if he quotes
the metachars in the capture before using it as a pattern.
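
To make the quoting problem concrete, here is a small sketch. The input line is hypothetical (chosen so that the text before the href contains metachars), and \Q...\E is used as the inline equivalent of quotemeta.

```perl
use strict;
use warnings;

# Hypothetical line: the text before the href contains regex metacharacters.
my $line = '<li>(1) what? <a href="page.html">x</a>';

my ($prev) = $line =~ /^(.*?href\s*=\s*")/is;

# Unquoted: "(1)" becomes a group matching just "1" and "t?" an optional "t",
# so the pattern no longer matches the literal text it was captured from.
(my $raw = $line) =~ s/^$prev//;

# Quoted with \Q...\E: the capture is matched literally and is removed.
(my $quoted = $line) =~ s/^\Q$prev\E//;

print "raw:    $raw\n";
print "quoted: $quoted\n";
```

Here $raw should come out unchanged, while $quoted should have the prefix stripped. When the capture happens to contain no metachars, the unquoted version works too, which is how this kind of bug hides.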
The fact that he does this in a while() statement is misleading until you
read it a little more closely.
What he is in fact trying to do is a cheap buffering method: append file
stream data, line by line, to the buffer ($line), then strip each match off the
front with a substitution, thus avoiding the global modifier /g and a
non-substituting match.
His code doesn't show it, but it would probably have to do something like the code
below for it to work.
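my ($line, $buff) = ('', '');
while (defined ($buff = <DATA>))
{
    $line .= $buff;
    # original: while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){
    while($line =~ /^(.*?href\s*=\s*")([^"]+)"/is) # fixed up
    {
        my $prev = $1;
        my $href = $2;
        $prev = quotemeta $prev; # the fix: escape metachars before reuse as a pattern
        $href = quotemeta $href; # the fix
        $line =~ s/^$prev//;     # strip everything up to and including the opening quote
        $line =~ s/^$href//;     # strip the href value; the closing quote stays
    }
}
# see what's left in $line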
In reality the ^ shouldn't be needed, but it doesn't seem to hurt.
This method is pretty slow, but it's cheap buffering. The alternative is
something like below.
-sln
-------------
use strict;
use warnings;
# /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg
my ($line,$buff) = ('','');   # both start as empty strings
my $count = 1;
while (defined ($_ = <DATA>))
{
    $line = $buff.$_;
    $buff = '';               # cleared so a stale remainder is never reused
    while( $line =~ /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg )
    {
        if (defined $1) {
            print "Pass ".$count++.":\n------------\n";
            print "prev:\n".$1.$2."\n"; # previous, print to file
            print "val:\n".$3."\n";     # href, modify & print to file
            print "end:\n".$2."\n";     # closing quote, print to file
        }
        elsif (defined $4) {
            $buff = $4;                 # remainder, buffer it
        }
    }
}
if (length $buff) {
    print "Pass ".$count++.":\n------------\n";
    print "buff:\n".$buff."\n";         # remaining buffer, print to file
}
__DATA__
<pre>
some junk content
<li>description <a href
="overview_mh.html#overview">(1)</a>,
<a href = 'catalog.html#catalog'>(2)</a>
</pre>