Re: Searching large files with a regex and a list
- From: Bob Walton <see.sig@xxxxxxxxxxxxxxxx>
- Date: Wed, 31 May 2006 03:47:34 GMT
Channing wrote:
....
I would like some suggestions (constructive) on some code I'm writing.....
My Perl is rusty and that's reflected in the sample I'm posting. Here
is what I have to tackle. I have Gig files to parse for two different
RegEx's. Within those RegEx's there is a variable that is a list of
18,000+ numbers. I'm looking for some suggestions on what I can do to
speed things up, or at least make things better.
------- Code Begin ---------
Here you're missing:
#!/usr/bin/perl
use warnings;
use strict;
Both should be in place during development at least.
my $match=0;
my $nonMatch=0;
open(DN_LIST, "<","big_list");
Always check the results of open() for success. Something like:
open DN_LIST,"<","big_list" or
die "Oops, big_list open failed, $!";
my @list = <DN_LIST>;
@list=sort(@list);
close(DN_LIST);
foreach (@list)
{
chomp;
s/ //g;
}
@list = join('|',@list);
While this does what you mean (DWYM), it would be better and clearer as:
my $list = join '|',@list;
The result of join() is a scalar, not an array. Change references to $list[0] below to just $list.
while (<>)
{
if ( /^123456\d{8}($list[0])/o or /^9876(91|92)\d{24}($list[0])/o )
This [untested] might (or might not) go faster with the leading part alternated, as in:
if ( /^((123456\d{8})|(9876(91|92)\d{24}))($list)/o )
Since you're not using the parenthetical groups to set the numbered capture variables ($1, $2, ...), this [untested] might be better still:
if ( /^(?:(?:123456\d{8})|(?:9876(?:91|92)\d{24}))(?:$list)/o )
Beyond that, if the nature of your data is such that the \d{8} and \d{24} bits will always match (that is, you always have that many digits present at those spots in the data, never anything else), then you might consider using substr and eq to test parts of your strings for matches, since your regex then boils down to character by character string matches. Would that be faster? I don't know in your case, but it usually is.
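A minimal sketch of the substr/eq idea, assuming the \d{8} and \d{24} fields really are always digits, so they can be skipped instead of checked. The sub name number_offset is made up for illustration; it only reports where the list number would have to start (6+8 = 14 for the first pattern, 4+2+24 = 30 for the second), and comparing against the list itself is a separate step:

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which the list number should
# start, or nothing if the line has neither fixed prefix.
# Assumption: the \d{8} and \d{24} fields always hold digits, so they are
# skipped here rather than verified.
sub number_offset {
    my ($line) = @_;

    # Pattern 1: "123456" followed by 8 digits -> number starts at 6 + 8.
    return 14 if substr($line, 0, 6) eq '123456';

    # Pattern 2: "9876", then "91" or "92", then 24 digits -> 4 + 2 + 24.
    if (substr($line, 0, 4) eq '9876') {
        my $tag = substr($line, 4, 2);
        return 30 if $tag eq '91' or $tag eq '92';
    }

    return;    # no prefix matched
}
```

Whether this beats the regex on your data is something only a benchmark on a real file will tell you.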
Another possibility is to use the strings in @list as keys to a hash. Then, instead of testing your data string against 18000 possible strings, take the possible strings and see if they are present as keys in the hash. One would have to keep track of and test the possible lengths of strings, but even with that overhead, this approach should be a big winner time-wise -- a few hash lookups instead of 18000 string comparisons.
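Here is a rough [lightly tested] sketch of the hash idea. The sub names build_lookup and has_list_number are invented for illustration, and $offset is wherever the list number is expected to start in the line (14 and 30 for the two patterns above, assuming the digit fields are always present). It mirrors the regex in that the number only has to start at that position, not end the line:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the cleaned-up list into a lookup hash and
# record which string lengths occur in it.
sub build_lookup {
    my (@numbers) = @_;
    my (%seen, %lengths);
    for my $n (@numbers) {
        next if $n eq '';    # an empty entry would match every line
        $seen{$n} = 1;
        $lengths{ length $n } = 1;
    }
    # Lengths sorted ascending so the search below can stop early.
    return (\%seen, [ sort { $a <=> $b } keys %lengths ]);
}

# Hypothetical helper: true if some list entry starts at $offset in $line.
# Costs one hash lookup per distinct entry length, not one per list entry.
sub has_list_number {
    my ($line, $offset, $seen, $lengths) = @_;
    for my $len (@$lengths) {
        last if $offset + $len > length $line;    # line too short from here on
        return 1 if exists $seen->{ substr($line, $offset, $len) };
    }
    return 0;
}
```

You would call build_lookup(@list) once after the chomp loop, then has_list_number($_, $offset, $seen, $lengths) for each line inside the while loop. If all your numbers have the same length, the inner loop collapses to a single lookup.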
{
$match++;
}
else
{
$nonMatch++;
}
}
print "Match Count:" . ${match} . "\n";
print "Non-Match Count:" . ${nonMatch} . "\n";
------- Code End ---------
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl