Re: HowTo parse huge Files



On 03/29/2007 07:24 AM, cadetg@xxxxxxxxxxxxxx wrote:
Dear Perl Monks, I am developing at the moment a script which has to
parse 20GB files. The files I have to parse are some logfiles. My
problem is that it takes ages to parse the files. I am doing something
like this:

my %lookingFor;
# keys => different name of one subset
# values => array of one subset

my $fh = new FileHandle "< largeLogFile.log";
[1:] while (<$fh>) {
foreach my $subset (keys %lookingFor) {
foreach my $item (@{$subset}) {
[2:] if (<$fh> =~ m/$item/) {

You are aware that the line marked [2:] reads a fresh line from $fh, and that the line read at [1:] is thrown away, aren't you?


my $writeFh = new FileHandle ">> myout.log";
print $writeFh <$fh>;

You can open the write filehandle once, before the loop, and keep it open until you are done.


}
}
}
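
The two remarks above can be sketched as follows. This is only an illustration under stated assumptions: the sample log lines and the in-memory filehandles are made up so the snippet runs without any files, and the inner loop dereferences $lookingFor{$subset} rather than the key itself, on the guess that this is what was meant. Note too that print $writeFh <$fh> evaluates <$fh> in list context, so it would write every remaining line of the input rather than the one that matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for largeLogFile.log.
my $log = join '', map { "$_\n" }
    'bought wallpaper', 'nothing here', 'new doorknobs', 'still nothing';

my %lookingFor = ( houseware => [qw(wallpaper hangers doorknobs)] );

# In-memory handles keep the sketch self-contained; real code would open
# largeLogFile.log and myout.log here instead.
open my $fh, '<', \$log or die "Cannot open input: $!";
my $out = '';
open my $writeFh, '>', \$out or die "Cannot open output: $!";   # opened once

LINE: while (my $line = <$fh>) {            # the only read from $fh
    foreach my $subset (keys %lookingFor) {
        foreach my $item (@{ $lookingFor{$subset} }) {
            if ($line =~ /\Q$item\E/) {
                print {$writeFh} $line;     # print the line that was tested
                next LINE;                  # at most one copy per input line
            }
        }
    }
}

close $writeFh or die "Cannot close output: $!";
close $fh;
print $out;
```

Each input line is read exactly once into $line, tested, and written at most once; the output handle is opened before the loop and closed after it.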

I've already tried to speed it up by using the regExp flag=>o by doing
something like this:

$isSubSet=buildRegexp(@allSubSets);
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        if (&$isSubSet(<$fh>)) {
            my $writeFh = new FileHandle ">> myout.log";
            print $writeFh <$fh>;
        }
    }
}
sub buildRegexp {
    my @R = @_;
    my $expr = join '||', map { "\$_[0] =~ m/\(To\|is\)\\:\\S\+\\@\$R[$_ +]/io" } ( 0..$#R );
    my $matchsub = eval "sub { $expr }";
    if ($@) { $logger->error("Failed in building regex @R: $@"); return ERROR; }
    $matchsub;
}

I don't know how to optimize this more. Maybe it would be possible to
do something with "map"? I think the "o" flag didn't speed it up at
all. Also I've tried to split the one big file into a few small ones
and use some forks childs to parse each of the small ones. Also this
didn't help.

Thanks a lot for your help!

Cheers
-Marco


It might not be possible to get much faster with such large files, but try this out:

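#!/usr/bin/perl
use strict;
use warnings;
use FileHandle;
use Data::Dumper;
use Alias;    # CPAN module, provides alias()

my %lookingFor = (
    houseware => [qw(wallpaper hangers doorknobs)],
);

# One precompiled regex per subset, built once before the loop.
my %lookingForRx = lookingForRx(%lookingFor);


# Open both handles once, outside the loop.
my $fh = FileHandle->new('< largeLogFile.log')
    or die "Cannot read largeLogFile.log: $!";
my $writeFh = FileHandle->new('> myout.log')
    or die "Cannot write myout.log: $!";

# Read each line exactly once and test that same line.
while (my $line = <$fh>) {
    foreach my $subset (keys %lookingForRx) {
        if ($line =~ /$lookingForRx{$subset}/) {
            print $writeFh $line;
        }
    }
}


$writeFh->close;
$fh->close;

#####################################

# Turn each subset's word list into a compiled alternation, e.g.
# qr/(wallpaper|hangers|doorknobs)/.
sub lookingForRx {
    our (%oldHash, @oldArray);
    local %oldHash = @_;
    local @oldArray;

    my %hash;
    foreach my $subset (keys %oldHash) {
        alias oldArray => $oldHash{$subset};
        my $rx = do { local $" = '|'; "(@oldArray)" };
        $hash{$subset} = qr/$rx/;
    }
    %hash;
}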


__END__

I haven't really tested this other than to make sure it compiles.
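
For readers without the Alias module, the same idea (one precompiled alternation per subset) can be sketched with core Perl only. This is a variation, not the code posted above: the build_rx name, the second "garden" subset and the sample lines are hypothetical, quotemeta is added on the assumption that the items are literal strings rather than patterns, and the loop stops at the first matching subset so that a line is reported only once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %lookingFor = (
    houseware => [qw(wallpaper hangers doorknobs)],
    garden    => [qw(rake hose)],    # hypothetical second subset
);

# Build one compiled alternation per subset; quotemeta escapes any regex
# metacharacters so the items are matched literally.
sub build_rx {
    my %subsets = @_;
    my %rx;
    foreach my $name (keys %subsets) {
        my $alternation = join '|', map { quotemeta($_) } @{ $subsets{$name} };
        $rx{$name} = qr/(?:$alternation)/;
    }
    return %rx;
}

my %rx = build_rx(%lookingFor);

# Hypothetical sample lines standing in for the log file.
my @lines = ("a rake\n", "plain\n", "two hangers\n");
my @matched;
foreach my $line (@lines) {
    foreach my $name (sort keys %rx) {
        if ($line =~ $rx{$name}) {
            push @matched, "$name: $line";
            last;    # stop at the first matching subset
        }
    }
}
print @matched;
```

Whether this is noticeably faster on a 20GB file is not something the sketch demonstrates; it only shows the regexes being compiled once, outside the read loop.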

