Re: Searching large files with a regex and a list



Channing wrote:
....
I would like some suggestions (constructive) on some code I'm writing.
My Perl is rusty and that's reflected in the sample I'm posting. Here
is what I have to tackle. I have Gig files to parse for two different
RegEx's. Within those RegEx's there is a variable that is a list of
18,000+ numbers. I'm looking for some suggestions on what I can do to
speed things up, or at least make things better.
....
------- Code Begin ---------
#!/usr/bin/perl

Here you're missing:

use warnings;
use strict;

Both should be in place during development at least.

my $match=0;
my $nonMatch=0;

open(DN_LIST, "<","big_list");

Always check the results of open() for success. Something like:

open DN_LIST,"<","big_list" or
die "Oops, big_list open failed, $!";

my @list = <DN_LIST>;
@list=sort(@list);
close(DN_LIST);
foreach (@list)
{
chomp;
s/ //g;
}
@list = join('|',@list);

While this DWYM, it would be better and clearer as:

my $list = join '|',@list;

The result of join() is a scalar, not an array. Change references to $list[0] below to just $list.


while (<>)
{
if ( /^123456\d{8}($list[0])/o or /^9876(91|92)\d{24}($list[0])/o )

This [untested] might (or might not) go faster with the leading part alternated, as in:

if ( /^((123456\d{8})|(9876(91|92)\d{24}))($list)/o )

Since you're not using the parenthetical groups to assign number variables, this [untested] might be better still:

if ( /^(?:(?:123456\d{8})|(?:9876(?:91|92)\d{24}))(?:$list)/o )

Beyond that, if the nature of your data is such that the \d{8} and \d{24} bits will always match (that is, you always have that many digits present at those spots in the data, never anything else), then you might consider using substr and eq to test parts of your strings for matches, since your regex then boils down to character by character string matches. Would that be faster? I don't know in your case, but it usually is.

Another possibility is to use the strings in @list as keys to a hash. Then, instead of testing your data string against 18000 possible strings, take the possible strings and see if they are present as keys in the hash. One would have to keep track of and test the possible lengths of strings, but even with that overhead, this approach should be a big winner time-wise -- a few hash lookups instead of 18000 string comparisons.

{
$match++;
}
else
{
$nonMatch++;
}
}

print "Match Count:" . ${match} . "\n";
print "Non-Match Count:" . ${nonMatch} . "\n";

------- Code End ---------

--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl
.



Relevant Pages

  • Re: sort unique
    ... given that a hash table is not ... IMO if the vendor's algorithm does something "obvious", ... function to eliminate keys that hash to the same bucket per some ... strings of random lengths, and two strings are ...
    (comp.lang.lisp)
  • The certification password of Internet Explorer 7 and operation of auto complete
    ... About the certification password of Internet Explorer and operation ... By remembering the strings that are input in the following text ... In this registry, there are values whose name is a string of 42 bytes ... We cannot guess the original strings from the hash value, ...
    (Bugtraq)
  • Re: Order-preserving hashes?
    ... Hash: RIPEMD160 ... the number of possible source strings in the domain ... arbitrary length as keys. ... The ration of input space and output space is ...
    (sci.crypt)
  • Re: Traversing a hash with array refs as keys?
    ... Hash keys can only be strings. ... So is your key an array or an array reference? ... As mentioned above keys must be strings. ...
    (comp.lang.perl.misc)
  • Re: Dictionary Keys question
    ... keys are in numerical order. ... A hash is for quickly looking up data. ... complicated hash-function would be used, ... so strings don't end up in order. ...
    (comp.lang.python)