Re: Searching large files with a regex and a list




Brian McCauley wrote:
> Channing wrote:
>
>> I would like some suggestions (constructive) on some code I'm writing.
>> My Perl is rusty and that's reflected in the sample I'm posting. Here
>> is what I have to tackle. I have Gig files to parse for two different
>> RegEx's. Within those RegEx's there is a variable that is a list of
>> 18,000+ numbers. I'm looking for some suggestions on what I can do to
>> speed things up, or at least make things better.
>>
>> Thanks in advance for your time.
>>
>> ------- Code Begin ---------
>> #!/usr/bin/perl
>>
>> my $match=0;
>> my $nonMatch=0;
>>
>> open(DN_LIST, "<","big_list");
>> my @list = <DN_LIST>;
>> @list=sort(@list);
>> close(DN_LIST);
>> foreach (@list)
>> {
>> chomp;
>> s/ //g;
>> }
>> @list = join('|',@list);
>
> Joining multiple RegEx into one like this is _less_ efficient than
> simply looping over @list, which is why the answer given in the FAQ
> (yes, your question is a FAQ) does not suggest doing so. (It does
> suggest using qr// to precompile the RegEx though...
>
> $_=qr/$_/; # Inside your loop
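
For anyone following along, here is a minimal sketch of how I read that suggestion: compile each entry once with qr// and try the compiled patterns in turn. The sample numbers and the helper name any_match are made up for illustration, and the quotemeta call assumes the entries are literal strings rather than patterns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample entries; the real list would come from big_list.
my @list = ('5551230001', '5551230002', '5551230003');

# Compile each entry once, outside the per-line loop.
# quotemeta treats the entry as a literal string (an assumption here).
my @patterns = map { my $q = quotemeta; qr/$q/ } @list;

# Return 1 if any precompiled pattern matches the line, else 0.
sub any_match {
    my ($line) = @_;
    for my $re (@patterns) {
        return 1 if $line =~ $re;
    }
    return 0;
}

print any_match('xx5551230002yy') ? "match\n" : "no match\n";
print any_match('xx0000000000yy') ? "match\n" : "no match\n";
```

Note that this still tries up to one pattern per list entry for every input line, so with 18,000+ entries the work per line grows with the size of the list.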

Well, I tried a number of the suggestions. The best combination (of
what I tried) is posted below. This took the runtime from 2 hours to
1.5 minutes! In a nutshell, the suggestion to use a hash in place of
the RegEx was the breakthrough. Thanks to all for their time and
contributions to the list!

Regards,

Channing

----- Code Begin -----


#!/usr/bin/perl
use strict;
use warnings;

my $nonMatched = 0;
my $matched    = 0;
my %dnList;
my $dnFile = "big_list";

# Load the numbers into a hash so each lookup is a single key test.
open(my $dn_fh, "<", $dnFile) or die "Cannot open $dnFile: $!";
while (my $dn = <$dn_fh>)
{
    chomp $dn;
    $dn =~ s/ //g;    # strip embedded spaces
    $dnList{$dn} = 1;
}
close($dn_fh);

# Check the record prefix first, then look up the fixed-width field.
while (<>)
{
    if ( ( /^123456/ and exists $dnList{substr($_,14,10)} ) or
         ( /^9876(?:21|99)/ and exists $dnList{substr($_,29,10)} ) )
    {
        $matched++;
    }
    else
    {
        $nonMatched++;
    }
}

print "Matched:$matched\n";
print "Non-Matched:$nonMatched\n";

----- Code Ends -----
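
If you want to try the hash lookup without the big files, here is a small self-contained sketch on in-memory data. The numbers, the sample records and the helper name record_matches are all made up for illustration; the prefixes and field offsets are the ones from the script above. The length checks are an extra guard I am assuming is wanted for short lines, not something the script above does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical numbers standing in for the contents of big_list.
my %dnList;
$dnList{$_} = 1 for ('5551230001', '5551230002');

# Return 1 if the record has a known prefix and its fixed-width
# field is a key in the hash, else 0.
sub record_matches {
    my ($line) = @_;
    return 1 if $line =~ /^123456/ and length($line) >= 24
                and exists $dnList{ substr($line, 14, 10) };
    return 1 if $line =~ /^9876(?:21|99)/ and length($line) >= 39
                and exists $dnList{ substr($line, 29, 10) };
    return 0;
}

# Made-up records: prefix, filler up to the field offset, then the number.
my @records = (
    '123456' . ('A' x 8)  . '5551230001' . 'tail',   # known number at offset 14
    '987699' . ('B' x 23) . '5551230002',            # known number at offset 29
    '123456' . ('A' x 8)  . '0000000000',            # number not in the list
    '555555' . ('C' x 8)  . '5551230001',            # prefix not of interest
);

# grep in scalar context returns the number of matching records.
my $matched = grep { record_matches($_) } @records;
printf "Matched:%d\nNon-Matched:%d\n", $matched, scalar(@records) - $matched;
```

The point of the design is that each input line costs one anchored prefix test plus one hash lookup, no matter how many numbers are in the list.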
