Re: File size too big for perl processing



In article
<df436918-985a-408e-a89e-f61b1cf779fc@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
Cheez <danieldharkness@xxxxxxxxx> wrote:

Hi, I posted this to perl.beginners as well and will make sure
comments go to both groups.

I have a big file of 16-letter words that I am using as "bait" to
capture larger words in a raw data file. I loop through all of the
rawdata with a single word for 1) matches and 2) to associate the raw
data with the word. I then go to the next line in the word list and
repeat.

hashsequence16.txt is the 16-letter word file (203MB)

Hmm. How many 16-letter words are in this file? I see from your code
that each line holds a word and a frequency count. At an estimated
25 bytes per line, 203MB represents about 8 million words.
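
The estimate above can be reproduced with a couple of lines of
arithmetic. This is only a sketch: the 25 bytes per line is the guess
already stated, and reading "203MB" as 203 * 10**6 bytes is an
assumption.

```perl
use strict;
use warnings;

# Assumption: "203MB" means 203 * 10**6 bytes.
my $file_bytes     = 203 * 1_000_000;
# Assumption: about 25 bytes per line (word, tab, count, newline).
my $bytes_per_line = 25;

my $words = $file_bytes / $bytes_per_line;
printf "about %.1f million words\n", $words / 1e6;
```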

rawdata.txt is the raw data file (93MB)

I have a counter in the code to tell me how long it's taking to
process... 9500 hours or so to complete... I definitely have time to
pursue other alternatives.

Scripting with perl is a hobby and not a vocation so I apologize in
advance for ugly code. Any suggestions/comments would be greatly
appreciated.

Thanks,
Dan

========================


You should have

use strict;
use warnings;

in your program. This is very important if you wish to get help from
this newsgroup.

print "**fisher**";

$flatfile = "newrawdata.txt";
# 95MB in size

$datafile = "hashsequence16.txt";
# 203MB in size

my $filesize = -s "hashsequence16.txt";
# for use in processing time calculation

open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";
open (SEQFILE, ">fishersearch.txt") || die "Can't open '$seqparsed': $!\n";

You should be using lexically-scoped file handle variables, the
3-argument version of open, and 'or' instead of '||'.
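
A minimal sketch of what that could look like. It is not the posted
script: open_or_die is a hypothetical helper name, and the scratch
file exists only so the example runs without the real data files.
'or' has lower precedence than '||', so the die still fires when the
parentheses around open's arguments are left out.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper (not in the posted script): open $path in $mode,
# return a lexical file handle, or die naming the file that failed.
sub open_or_die {
    my ($mode, $path) = @_;
    open(my $fh, $mode, $path)
        or die "Can't open '$path': $!\n";
    return $fh;
}

# Scratch file for the demonstration only. In the posted script the
# paths would be $flatfile, $datafile and "fishersearch.txt".
my $dir     = tempdir(CLEANUP => 1);
my $scratch = "$dir/scratch.txt";

my $out = open_or_die('>', $scratch);
print {$out} "example line\n";
close($out) or die "Can't close '$scratch': $!\n";

my $in    = open_or_die('<', $scratch);
my @lines = <$in>;
close($in);
```

Because each handle is an ordinary lexical variable, it is closed
automatically when it goes out of scope, and a mistyped handle name is
caught by 'use strict'.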


@preparse = <FILE>;
@hashdata = <FILE2>;

Well, at least you have enough RAM to read both files into memory.
That helps. If you apply the chomp operator to these arrays, you can
save yourself some repetitive processing later:

chomp(@preparse);
chomp(@hashdata);


close(FILE);
close(FILE2);


for my $list1 (@hashdata) {
# iterating through hash16 data



$finish++;

if ($finish ==10 ) {
# line counter

$marker = $marker + $finish;

$finish =0;

$left = $filesize - $marker;

printf "$left\/$filesize\n";
# this prints every 17 seconds
}

When you are asking for help, it is best to leave out irrelevant
details such as periodic progress printing. They don't help anybody
help you.


($line, $freq) = split(/\t/, $list1);

for my $rawdata (@preparse) {
# iterating through rawdata

$rawdata=~ s/\n//;

No need for this if you chomp the arrays after reading.


if ($rawdata =~ m/$line/) {
# matching hash16 word with rawdata line

my $first_pos = index $rawdata,$line;

You first use a regex to find if $line appears in $rawdata, then use
index to find out where it appears. Just test the return value from
index to see if the substring appears. It will be -1 if it does not.
This will give you a significant speed-up.
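
A sketch of the inner loop using index alone. The sample strings are
made up for illustration; the variable names follow the posted script.
Note that index does a literal substring search, which gives the same
result as the regex as long as the words contain only letters.

```perl
use strict;
use warnings;

# Made-up sample data standing in for @preparse and one word.
my @preparse = ('xxABCDEFGHIJKLMNOPyy', 'no match here');
my $line     = 'ABCDEFGHIJKLMNOP';

my @hits;
for my $rawdata (@preparse) {
    # index returns the 0-based offset of $line, or -1 if absent,
    # so one call answers both "does it match?" and "where?".
    my $first_pos = index($rawdata, $line);
    next if $first_pos < 0;
    push @hits, "$first_pos\t$rawdata";
}
print "$_\n" for @hits;
```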


print SEQFILE "$first_pos\t$rawdata\n";
# printing to info to new file

}

}

print SEQFILE "PROCESS\t$line\n";
# printing hash16 word and "process"

}

You only make one pass through FILE2, so you can save some memory by
processing that file one line at a time instead of reading it into
the @hashdata array. It looks like you could also swap the order of
the for loops and make only one pass through FILE instead, but that
may take more memory.
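
A sketch of the line-at-a-time version of the outer loop. The data
here is made up, and an in-memory file handle stands in for the word
file so the example runs on its own. The real script would open
$datafile instead and print to the output file rather than collecting
lines in @report.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the two files.
my @preparse = ('xxABCDEFGHIJKLMNOPyy', 'ABCDEFGHIJKLMNOPzz', 'nothing');
my $words    = "ABCDEFGHIJKLMNOP\t3\nQRSTUVWXYZABCDEF\t1\n";

# In-memory handle for the demonstration only.
open(my $word_fh, '<', \$words)
    or die "Can't open in-memory word list: $!\n";

my @report;
while (my $entry = <$word_fh>) {    # one word-list line at a time
    chomp $entry;
    my ($line, $freq) = split /\t/, $entry;
    for my $rawdata (@preparse) {
        my $first_pos = index($rawdata, $line);
        push @report, "$first_pos\t$rawdata" if $first_pos >= 0;
    }
    push @report, "PROCESS\t$line";
}
close($word_fh);
print "$_\n" for @report;
```

Only the raw data stays in memory this way. The word list is never
held as a whole, which is where the saving comes from.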

It is difficult to see why this program will take 9500 hours to run.
Make the above changes and try again. Without your data files or a look
at some sample data, it is difficult for anyone to really help you.

--
Jim Gibson