Re: HowTo parse huge Files
- From: "Mumia W." <paduille.4060.mumia.w+nospam@xxxxxxxxxxxxx>
- Date: Thu, 29 Mar 2007 16:03:25 GMT
On 03/29/2007 07:24 AM, cadetg@xxxxxxxxxxxxxx wrote:
Dear Perl Monks, I am currently developing a script that has to
parse 20 GB files. The files I have to parse are log files. My
problem is that parsing them takes ages. I am doing something
like this:
my %lookingFor;
# keys => different name of one subset
# values => array of one subset
my $fh = new FileHandle "< largeLogFile.log";
[1:] while (<$fh>) {
foreach my $subset (keys %lookingFor) {
foreach my $item (@{$subset}) {
[2:] if (<$fh> =~ m/$item/) {
You are aware that line 2 reads a new chunk from $fh, and that the chunk read on line 1 is discarded, aren't you?
my $writeFh = new FileHandle ">> myout.log"; print $writeFh <$fh>;
You can open the write filehandle once and keep it open until you are done.
}
}
}
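A minimal sketch of the first comment above, using an in-memory string as a stand-in for the log file (the sample data and variable names are made up for illustration). It shows that every <$fh> in the loop body consumes another line, so the line read by the while condition is never the one being tested:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for largeLogFile.log.
my $log = "line 1\nline 2\nline 3\nline 4\n";

# Buggy pattern: a second <$fh> inside the loop reads the NEXT line.
open my $fh, '<', \$log or die "Cannot open in-memory file: $!";
my @buggy;
while (<$fh>) {
    my $other = <$fh>;          # consumes another line from the file
    last unless defined $other;
    chomp $other;
    push @buggy, $other;
}
close $fh;

# Fixed pattern: read once into a variable and reuse that variable.
open $fh, '<', \$log or die "Cannot open in-memory file: $!";
my @fixed;
while (my $line = <$fh>) {
    chomp $line;
    push @fixed, $line;
}
close $fh;

print "buggy saw: @buggy\n";    # buggy saw: line 2 line 4
print "fixed saw: @fixed\n";    # fixed saw: line 1 line 2 line 3 line 4
```

The same reasoning applies to "print $writeFh <$fh>": in list context it would read and print all remaining lines rather than the one that matched.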
I've already tried to speed it up by using the regexp /o flag, doing
something like this:
$isSubSet=buildRegexp(@allSubSets);
while (<$fh>) {
foreach my $subset (keys %lookingFor) {
if (&$isSubSet(<$fh>)) {
my $writeFh = new FileHandle ">> myout.log";
print $writeFh <$fh>;
}
}
}
sub buildRegexp {
my @R = @_; my $expr = join '||', map { "\$_[0] =~ m/\(To\|is\)\\:\\S
\+\\@\$R[$_ +]/io" } ( 0..$#R );
my $matchsub = eval "sub { $expr }";
if ($@) { $logger->error("Failed in building regex @R: $@"); return ERROR; }
$matchsub;
}
I don't know how to optimize this further. Maybe it would be possible to
do something with "map"? I think the "o" flag didn't speed it up at
all. I've also tried splitting the one big file into a few smaller ones
and forking child processes to parse each of them. That didn't help
either.
Thanks a lot for your help!
Cheers
-Marco
It might not be possible to get much faster with such large files, but try this out:
#!/usr/bin/perl
use strict;
use warnings;
use FileHandle;
use Data::Dumper;
use Alias;
my %lookingFor = (
houseware => [qw(wallpaper hangers doorknobs)],
);
my %lookingForRx = lookingForRx(%lookingFor);
my $fh = FileHandle->new('< largeLogFile.log')
    or die "Cannot open largeLogFile.log: $!";
my $writeFh = FileHandle->new('> myout.log')
    or die "Cannot open myout.log: $!";
while (my $line = <$fh>) {
foreach my $subset (keys %lookingForRx) {
if ($line =~ /$lookingForRx{$subset}/) {
print $writeFh $line;
}
}
}
$writeFh->close;
$fh->close;
#####################################
sub lookingForRx {
our (%oldHash, @oldArray);
local %oldHash = @_;
local @oldArray;
my %hash;
foreach my $subset (keys %oldHash) {
alias oldArray => $oldHash{$subset};
my $rx = do { local $" = '|'; "(@oldArray)" };
$hash{$subset} = qr/$rx/;
}
%hash;
}
__END__
I haven't really tested this other than to make sure it compiles.
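For readers without the non-core Alias module, here is a rough sketch of the same idea using only core Perl: build one precompiled qr// alternation per subset, read each line once, and keep a single output handle open. The sub name build_subset_rx and the sample log lines are hypothetical. Two choices differ from the code above and are assumptions on my part: quotemeta treats each item as a literal string rather than a regex, and "last" writes a line at most once even if several subsets match.

```perl
use strict;
use warnings;

my %lookingFor = (
    houseware => [qw(wallpaper hangers doorknobs)],
);

# Build one precompiled alternation per subset (hypothetical helper).
sub build_subset_rx {
    my (%subsets) = @_;
    my %rx;
    for my $name (keys %subsets) {
        my @items = @{ $subsets{$name} };
        next unless @items;                  # skip empty subsets
        # Assumption: items are literal strings, so escape metacharacters.
        my $alternation = join '|', map { quotemeta($_) } @items;
        $rx{$name} = qr/$alternation/;
    }
    return %rx;
}

my %rx = build_subset_rx(%lookingFor);

# In-memory stand-ins for largeLogFile.log and myout.log (sample data).
my $log = "new wallpaper in stock\nnothing here\nbrass doorknobs shipped\n";
my $out = '';
open my $in,     '<', \$log or die "Cannot open input: $!";
open my $out_fh, '>', \$out or die "Cannot open output: $!";

while (my $line = <$in>) {                   # each line is read exactly once
    for my $name (keys %rx) {
        if ($line =~ $rx{$name}) {
            print {$out_fh} $line;
            last;                            # write each line at most once
        }
    }
}
close $out_fh;
close $in;

print $out;
```

With real files, replace the two in-memory opens with ordinary three-argument opens on the file names. Because qr// compiles each pattern once, the /o flag should not be needed here.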