Re: Searching large files with a regex and a list



Channing wrote:

I would like some (constructive) suggestions on some code I'm writing.
My Perl is rusty, and that's reflected in the sample I'm posting. Here
is what I have to tackle: I have multi-gigabyte files to parse for two
different regexes. Both regexes contain a variable that holds a list of
18,000+ numbers. I'm looking for suggestions on what I can do to
speed things up, or at least make things better.


------- Code Begin ---------
#!/usr/bin/perl

my $match=0;
my $nonMatch=0;

open(DN_LIST, "<","big_list");
my @list = <DN_LIST>;
@list=sort(@list);
close(DN_LIST);
foreach (@list)
{
chomp;
s/ //g;
}
@list = join('|',@list);


while (<>)
{
if ( /^123456\d{8}($list[0])/o or /^9876(91|92)\d{24}($list[0])/o )
{
$match++;
}
else
{
$nonMatch++;
}
}

print "Match Count:" . ${match} . "\n";
print "Non-Match Count:" . ${nonMatch} . "\n";

------- Code End ---------

According to the FAQ entry:

perldoc -q "How do I efficiently match many regular expressions at once"

you should compile each pattern once with qr// and then loop over the
compiled patterns, something like this (UNTESTED):

#!/usr/bin/perl
use warnings;
use strict;

my $match    = 0;
my $nonMatch = 0;

open my $dn_list, '<', 'big_list' or die "Cannot open 'big_list': $!";

# Compile one regex per list entry, once, before reading any input.
my @list = map {
    chomp;
    tr/ //d;    # remove embedded spaces
    qr/^(?:123456\d{8}|98769[12]\d{24})$_/;
} <$dn_list>;

close $dn_list;

LINE:
while ( my $line = <> ) {
    for my $regex ( @list ) {
        if ( $line =~ $regex ) {
            $match++;
            next LINE;    # first hit is enough, go to the next input line
        }
    }
    $nonMatch++;
}

print "Match Count:$match\n";
print "Non-Match Count:$nonMatch\n";

__END__
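
A possible variation, offered only as a sketch: the loop above may try up
to 18,000 regexes for every input line, which could be slow on files of
that size. Since the list entries are plain numbers, they can instead be
joined into a single alternation and compiled once. As far as I know,
perl 5.10 and later can optimise a large alternation of literal strings
internally, but I have not timed it on data like yours. The helper name
build_regex and the sample numbers and lines below are made up for
illustration; in the real program the list would come from 'big_list'
and the lines from <>.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: turn the list of numbers into ONE compiled regex.
sub build_regex {
    my @numbers = @_;
    my @clean;
    for my $number (@numbers) {
        ( my $n = $number ) =~ s/\s+//g;    # drop spaces and the newline
        next unless length $n;              # an empty entry would match everything
        push @clean, quotemeta $n;          # treat the entry as literal text
    }
    die "No numbers in list\n" unless @clean;
    my $alternation = join '|', @clean;
    return qr/^(?:123456\d{8}|98769[12]\d{24})(?:$alternation)/;
}

# Made-up sample data standing in for 'big_list' and the input files.
my $regex = build_regex( "555 1234\n", "5559999\n" );

my @lines = (
    '123456' . '0' x 8 . '5551234 rest of record',     # first prefix, listed number
    '987692' . '1' x 24 . '5559999 rest of record',    # second prefix, listed number
    '123456' . '0' x 8 . '4440000 rest of record',     # number not in the list
);

my ( $match, $nonMatch ) = ( 0, 0 );
for my $line (@lines) {
    if ( $line =~ $regex ) { $match++ } else { $nonMatch++ }
}

print "Match Count:$match\n";
print "Non-Match Count:$nonMatch\n";
```

To use it on the real data, pass the lines read from 'big_list' to
build_regex() and replace the for loop with while ( my $line = <> ).
Whether this beats the one-regex-per-entry loop is something to benchmark
on a slice of the real files.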



John
--
use Perl;
program
fulfillment
.


