Re: File size too big for perl processing
- From: Jim Gibson <jimsgibson@xxxxxxxxx>
- Date: Mon, 30 Jun 2008 12:06:41 -0700
In article
<df436918-985a-408e-a89e-f61b1cf779fc@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
Cheez <danieldharkness@xxxxxxxxx> wrote:
Hi, I posted this to perl.beginners as well and will make sure
comments go to both groups.
I have a big file of 16-letter words that I am using as "bait" to
capture larger words in a raw data file. I loop through all of the
rawdata with a single word for 1) matches and 2) to associate the raw
data with the word. I then go to the next line in the word list and
repeat.
hashsequence16.txt is the 16-letter word file (203MB)
Hmm. How many 16-letter words are in this file? I see from your code
that each line contains a word and a frequency count. At an estimated
25 bytes per line, that comes to roughly 8 million words.
rawdata.txt is the raw data file (93MB)
I have a counter in the code to tell me how long it's taking to
process... 9500 hours or so to complete... I definitely have time to
pursue other alternatives.
Scripting with perl is a hobby and not a vocation so I apologize in
advance for ugly code. Any suggestions/comments would be greatly
appreciated.
Thanks,
Dan
========================
You should have
use strict;
use warnings;
in your program. This is very important if you wish to get help from
this newsgroup.
print "**fisher**";
$flatfile = "newrawdata.txt";
# 95MB in size
$datafile = "hashsequence16.txt";
# 203MB in size
my $filesize = -s "hashsequence16.txt";
# for use in processing time calculation
open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
open(FILE2, "$datafile") || die "Can't open '$datafile': $!\n";
open (SEQFILE, ">fishersearch.txt") || die "Can't open 'fishersearch.txt': $!\n";
You should be using lexically-scoped file handle variables, the
3-argument version of open, and 'or' instead of '||'.
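As a rough sketch of that advice, the opens could be written along the
following lines. The file names are the ones from the post; the helper
open_or_die and the handle names $raw_fh, $word_fh and $out_fh are made
up here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: open $path in $mode ('<' to read, '>' to write)
# and return a lexically scoped file handle, or die with the reason.
sub open_or_die {
    my ($mode, $path) = @_;
    open(my $fh, $mode, $path)
        or die "Can't open '$path': $!\n";
    return $fh;
}

# Usage with the file names from the post (commented out so the sketch
# runs even when those files are not present):
# my $raw_fh  = open_or_die('<', 'newrawdata.txt');
# my $word_fh = open_or_die('<', 'hashsequence16.txt');
# my $out_fh  = open_or_die('>', 'fishersearch.txt');
```

The low-precedence 'or' matters most when the parentheses around the
arguments to open are left off: in that case '||' binds to the last
argument rather than to the result of open, so the die never fires.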
@preparse = <FILE>;
@hashdata = <FILE2>;
Well, at least you have enough memory to read both files in. That
helps. If you apply the chomp operator to these arrays, you can save
yourself some repetitive processing later:
chomp(@preparse);
chomp(@hashdata);
close(FILE);
close(FILE2);
for my $list1 (@hashdata) {
# iterating through hash16 data
$finish++;
if ($finish ==10 ) {
# line counter
$marker = $marker + $finish;
$finish =0;
$left = $filesize - $marker;
printf "$left\/$filesize\n";
# this prints every 17 seconds
}
When you are asking for help, it is best to leave out irrelevant
details such as periodic printing statements. It doesn't help anybody
help you.
($line, $freq) = split(/\t/, $list1);
for my $rawdata (@preparse) {
# iterating through rawdata
$rawdata=~ s/\n//;
No need for this if you chomp the arrays after reading.
if ($rawdata =~ m/$line/) {
# matching hash16 word with rawdata line
my $first_pos = index $rawdata,$line;
You first use a regex to find if $line appears in $rawdata, then use
index to find out where it appears. Just test the return value from
index to see if the substring appears. It will be -1 if it does not.
This will give you a significant speed-up.
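A minimal sketch of the index-only test follows. The two sample strings
are made up; only the variable names $rawdata, $line and $first_pos
come from the posted code.

```perl
use strict;
use warnings;

my $rawdata = 'xxxxABCDEFGHIJKLMNOPyyyy';   # made-up raw data line
my $line    = 'ABCDEFGHIJKLMNOP';           # made-up 16-letter word

# index() returns the 0-based position of the first occurrence of
# $line in $rawdata, or -1 if it does not occur, so a single call
# answers both "does it match?" and "where?".
my $first_pos = index($rawdata, $line);
if ($first_pos >= 0) {
    print "$first_pos\t$rawdata\n";
}
```

Note that index does a literal substring search, whereas m/$line/
treats the contents of $line as a pattern. For words made up only of
letters the two agree.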
print SEQFILE "$first_pos\t$rawdata\n";
# printing info to new file
}
}
print SEQFILE "PROCESS\t$line\n";
# printing hash16 word and "process"
}
You only make one pass through FILE2, so you can save some memory by
processing the contents of this file one line at a time, instead of
reading it into the @hashdata array. It looks like you could also swap
the order of the for loops and only make one pass through FILE,
instead, but that may take more memory.
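Reading the word list one line at a time might look something like the
sketch below. It is only an illustration under assumptions: the
subroutine name search_words and the in-memory sample data are
invented, and the output format simply mirrors the print statements in
the posted code.

```perl
use strict;
use warnings;

# Hypothetical routine: read the word list one line at a time from
# $word_fh, search every raw data line for each word, and write the
# results to $out_fh in the same format as the posted program.
sub search_words {
    my ($word_fh, $rawdata, $out_fh) = @_;
    while (my $entry = <$word_fh>) {
        chomp $entry;
        my ($word, $freq) = split /\t/, $entry;
        next unless defined $word && length $word;   # skip blank lines
        for my $raw (@$rawdata) {
            my $pos = index($raw, $word);
            print {$out_fh} "$pos\t$raw\n" if $pos >= 0;
        }
        print {$out_fh} "PROCESS\t$word\n";
    }
    return;
}

# Small in-memory demonstration; the data below is made up.
my $words   = "ABCD\t3\nWXYZ\t1\n";
my @rawdata = ('xxABCDxx', 'ABCDyy', 'nothing here');
my $output  = '';

open(my $word_fh, '<', \$words)  or die "Can't open word list: $!\n";
open(my $out_fh,  '>', \$output) or die "Can't open output: $!\n";
search_words($word_fh, \@rawdata, $out_fh);
close($word_fh);
close($out_fh);

print $output;
```

Only the raw data is held in memory here. The word list is never
stored as an array, which is where the memory saving comes from.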
It is difficult to see why this program would take 9500 hours to run.
Make the above changes and try again. Without your data files, or at
least some sample data, it is difficult for anyone to really help you.
--
Jim Gibson