Re: Strange behavior when working with large files



* bjamin@xxxxxxxxx wrote:
>
> I have been working on a strange problem I've been having. I am reading
> a series of large files (50 mb or so) in one at a time with:
> @lines = <FILE>;
> or (same behavior with each)
> while(<FILE>){
> push(@lines, $_);
> }
> The first time I read a file it will read into the array in about 2
> seconds. The second time I try to read a file in (the same size) it
> takes about 20 seconds. Everything is declared locally inside the loop
> so, everything is leaving scope. I am not sure why it is taking so much
> longer the second time.
>
> I have narrowed the problem down to a few different areas:
>
> 1. It seems that if I read the file into a large scalar by $/ = undef,
> the file gets read faster. So, I assume the slowdown is taking place
> inside the splitting of the lines.

Seems so; on my system I get similar results. If you can narrow your
problem down to a few lines of code, feel free to post that small
program. This makes it easier to reproduce your problem. Just for
testing, I've written such a small script for you.


#!/usr/bin/perl -w
use strict;
use warnings;
use Benchmark;

my $file = '50mb.txt';

# Time four consecutive reads of the same file.
for ( 1 .. 4 ) {
    print timestr( timeit( 1, sub {
        # local $/ = undef;      # enable for slurp mode
        open my $fh, '<', $file or die $!;
        # my @lines = <$fh>;     # alternative to the loop below
        my @lines; push @lines, $_ while <$fh>;
    } ) ), "\n";
}
__END__


The file I'm reading here consists of 1.5 million lines (50 MB
altogether). I get:

4 wallclock secs ( 3.98 usr + 0.14 sys = 4.13 CPU) @ 0.24/s (n=1)
34 wallclock secs (33.89 usr + 0.16 sys = 34.05 CPU) @ 0.03/s (n=1)
27 wallclock secs (26.17 usr + 0.13 sys = 26.30 CPU) @ 0.04/s (n=1)
28 wallclock secs (27.77 usr + 0.20 sys = 27.97 CPU) @ 0.04/s (n=1)

With the localization of $/ enabled (slurp mode), I get:

1 wallclock secs ( 0.77 usr + 0.09 sys = 0.86 CPU) @ 1.16/s (n=1)
0 wallclock secs ( 0.72 usr + 0.17 sys = 0.89 CPU) @ 1.12/s (n=1)
0 wallclock secs ( 0.72 usr + 0.23 sys = 0.95 CPU) @ 1.05/s (n=1)
1 wallclock secs ( 0.70 usr + 0.23 sys = 0.94 CPU) @ 1.07/s (n=1)
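
For illustration, slurp mode on its own can be sketched as below. This is
only a minimal example, not part of the benchmark above: the helper name
slurp_file and the scratch file slurp_demo.txt are made up for the demo.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read a whole file
# into a single scalar instead of an array of lines.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;    # undef record separator: one read returns the whole file
    my $content = <$fh>;
    close $fh;
    return defined $content ? $content : '';
}

# Small scratch file, just for the demonstration.
my $demo = 'slurp_demo.txt';
open my $out, '>', $demo or die "Cannot write $demo: $!";
print {$out} "line $_\n" for 1 .. 3;
close $out or die "Cannot close $demo: $!";

my $text     = slurp_file($demo);
my $newlines = ( $text =~ tr/\n// );    # count newlines without splitting
print length($text), " characters, $newlines newlines\n";

unlink $demo;
```

Because nothing is split into lines here, only one scalar is allocated
per read, whatever the number of newlines in the file.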

With "my @lines = <$fh>" instead of the while loop, I get:

22 wallclock secs (16.13 usr + 5.22 sys = 21.34 CPU) @ 0.05/s (n=1)
36 wallclock secs (35.38 usr + 0.22 sys = 35.59 CPU) @ 0.03/s (n=1)
6 wallclock secs ( 5.58 usr + 0.14 sys = 5.72 CPU) @ 0.17/s (n=1)
37 wallclock secs (36.88 usr + 0.17 sys = 37.05 CPU) @ 0.03/s (n=1)

Curious, I don't know why the third attempt is breaking ranks.

I have run my script with another input file, too; one with considerably
fewer newlines (also 50 MB, approx. 200,000 lines). I get the following
result for the loop:

1 wallclock secs ( 1.34 usr + 0.14 sys = 1.48 CPU) @ 0.67/s (n=1)
12 wallclock secs (11.45 usr + 0.19 sys = 11.64 CPU) @ 0.09/s (n=1)
15 wallclock secs (14.48 usr + 0.19 sys = 14.67 CPU) @ 0.07/s (n=1)
10 wallclock secs (10.45 usr + 0.22 sys = 10.67 CPU) @ 0.09/s (n=1)

And for the version with "my @lines = <$fh>":

3 wallclock secs ( 3.06 usr + 0.33 sys = 3.39 CPU) @ 0.29/s (n=1)
57 wallclock secs (55.86 usr + 0.31 sys = 56.17 CPU) @ 0.02/s (n=1)
60 wallclock secs (59.20 usr + 0.23 sys = 59.44 CPU) @ 0.02/s (n=1)
58 wallclock secs (57.39 usr + 0.22 sys = 57.61 CPU) @ 0.02/s (n=1)

It seems that Perl needs more time the longer the lines are. On that
assumption, I ran the script with a 50 MB file with only one newline in
the middle; there, all attempts need (nearly) the same time:

269 wallclock secs (185.00 usr + 81.86 sys = 266.86 CPU) @ 0.00/s (n=1)
277 wallclock secs (184.42 usr + 87.11 sys = 271.53 CPU) @ 0.00/s (n=1)
276 wallclock secs (183.98 usr + 86.03 sys = 270.02 CPU) @ 0.00/s (n=1)
272 wallclock secs (184.74 usr + 85.03 sys = 269.77 CPU) @ 0.00/s (n=1)
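
To reproduce runs like these, test files of a fixed size but different
line counts are needed. The following is a rough sketch of one way to
generate them; the helper make_test_file and the file name are
hypothetical, and the demo uses tiny sizes so it finishes instantly.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): write a file of about
# $total_bytes bytes, split into $line_count lines of equal length.
sub make_test_file {
    my ( $path, $total_bytes, $line_count ) = @_;
    die "line_count must be positive\n" if $line_count < 1;

    # Length of each line including its trailing newline (at least 1 byte).
    my $line_len = int( $total_bytes / $line_count );
    $line_len = 1 if $line_len < 1;
    my $line = ( 'x' x ( $line_len - 1 ) ) . "\n";

    open my $fh, '>', $path or die "Cannot write $path: $!";
    binmode $fh;    # no CRLF translation on Windows, so the size stays exact
    print {$fh} $line for 1 .. $line_count;
    close $fh or die "Cannot close $path: $!";

    return $line_len * $line_count;    # bytes actually written
}

# Demo with small numbers: 20 lines of 50 bytes each.
my $written = make_test_file( 'demo_lines.txt', 1_000, 20 );
print "wrote $written bytes\n";
```

Calling it with 50 * 1024 * 1024 bytes and 1_500_000, 200_000 or 2 lines
would give files roughly comparable to the three cases above. Note that
with 2 lines the file ends in a newline, which differs slightly from a
single newline in the middle.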

>
> 2. If I try to append to one large array, rather than rewriting to a
> different array, the slowdown does not occur. So it seems Perl has a
> hard time with the memory it already has but it's fine with memory it
> just took from the system?

Right. In my example: If I move the declaration "my @lines" in front of
the for-loop, I get for the first file with 1.5 million lines (just the
for-loop matters):

4 wallclock secs ( 3.02 usr + 0.25 sys = 3.27 CPU) @ 0.31/s (n=1)
3 wallclock secs ( 2.95 usr + 0.31 sys = 3.27 CPU) @ 0.31/s (n=1)
7 wallclock secs ( 2.86 usr + 0.27 sys = 3.13 CPU) @ 0.32/s (n=1)
9 wallclock secs ( 3.11 usr + 0.34 sys = 3.45 CPU) @ 0.29/s (n=1)

Actually this creates an array with 6 million elements. The performance
penalty in the second half is just because my machine has only 512 MB
RAM and needs to swap. Hence the results for the file with only
200,000 lines look much better (no swapping is needed):

1 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
2 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
1 wallclock secs ( 1.03 usr + 0.28 sys = 1.31 CPU) @ 0.76/s (n=1)
1 wallclock secs ( 1.09 usr + 0.23 sys = 1.33 CPU) @ 0.75/s (n=1)
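
The effect of moving the declaration can be seen on a small scale in the
sketch below. It is only an illustration with a made-up five-line scratch
file (reuse_demo.txt), not the benchmark itself: the array is never
emptied, so each pass appends to it.

```perl
use strict;
use warnings;

# Small scratch file, just for the demonstration.
my $file = 'reuse_demo.txt';
open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} "row $_\n" for 1 .. 5;
close $out or die "Cannot close $file: $!";

my @lines;    # declared once, in front of the for loop

for my $pass ( 1 .. 4 ) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    push @lines, $_ while <$fh>;    # appends; earlier passes are kept
    close $fh;
    print "pass $pass: ", scalar(@lines), " elements\n";
}

unlink $file;
```

After four passes the array holds four copies of the file, which is why
the 1.5 million line file ends up as 6 million elements.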

>
> 3. The problem does not seem to happen in Linux, but I'm working on
> Windows.

I have run this on Windows XP SP2 with ActiveState's Perl 5.8.6.

>
> Any suggestions for a workaround? Has anyone else seen this? Thanks in
> advance.

I have no suggestions for a workaround ;-(

Yes, I have seen it now ;-)

But: is it really necessary to read in the whole file? Would you compare
the first line with the last one in the worst case? Perhaps you could
give your algorithm a second thought.
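
As a rough sketch of what that could look like, the example below
processes a file one line at a time and keeps only running totals, so no
large array is built. The helper scan_file, the statistics it collects
and the scratch file are assumptions for the demo; whether this fits
depends on what the real algorithm needs.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): scan a file once and
# keep only a line count and the length of the longest line.
sub scan_file {
    my ($path) = @_;
    my ( $count, $longest ) = ( 0, 0 );

    open my $fh, '<', $path or die "Cannot open $path: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        $count++;
        $longest = length $line if length $line > $longest;
    }
    close $fh;

    return ( $count, $longest );
}

# Small scratch file, just for the demonstration.
my $demo = 'scan_demo.txt';
open my $out, '>', $demo or die "Cannot write $demo: $!";
print {$out} "$_\n" for 'a', 'bbb', 'cc';
close $out or die "Cannot close $demo: $!";

my ( $count, $longest ) = scan_file($demo);
print "$count lines, longest has $longest characters\n";

unlink $demo;
```

Only one line is held in memory at a time, so memory use no longer grows
with the size of the file.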

regards,
fabian