Re: File size too big for perl processing



Cheez <danieldharkness@xxxxxxxxx> wrote:
Hi, I posted this to perl.beginners as well and will make sure
comments go to both groups.

I have a big file of 16-letter words that I am using as "bait" to
capture larger words in a raw data file. I loop through all of the
rawdata with a single word for 1) matches and 2) to associate the raw
data with the word. I then go to the next line in the word list and
repeat.

hashsequence16.txt is the 16-letter word file (203MB)

How many lines? (It seems the 16-letter part is only the first column
of the file, so the count is not simply 203MB / 17 bytes.)

rawdata.txt is the raw data file (93MB)

How many lines is it?


I have a counter in the code to tell me how long it's taking to
process... 9500 hours or so to complete... I definitely have time to
pursue other alternatives.

Scripting with perl is a hobby and not a vocation so I apologize in
advance for ugly code.

As a hobbyist, you should have the leisure to make it less ugly, while
someone working against the clock might not!


open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";

Wrong variable in the second die: it should report $datafile, not $flatfile.


@preparse = <FILE>;
@hashdata = <FILE2>;

Do you have a lot of memory, or is your system swapping? If it is swapping,
that alone will slow things down dramatically. In that case, changing the
outer foreach to a while (<FILE2>) loop, so that @hashdata is never held in
memory, might make things better.


close(FILE);
close(FILE2);

for my $list1 (@hashdata) {
$finish++;
if ($finish ==10 ) {
$marker = $marker + $finish;
$finish =0;
$left = $filesize - $marker;

$filesize is in bytes, while $marker is in lines. This isn't gonna give
meaningful information.
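
A minimal sketch of a progress report whose units agree, assuming the file is read line by line from a filehandle. The helper name percent_done and the in-memory demo data are made up for illustration; tell() gives the current read position in bytes, which can be compared against the size in bytes. (If the lines stay in an array instead, comparing $marker against scalar(@hashdata) would do the same job in lines.)

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of a file consumed so far, based on the
# read position (bytes) and the total size (bytes), so the units agree.
sub percent_done {
    my ($fh, $total_bytes) = @_;
    return 0 unless $total_bytes;
    return 100 * tell($fh) / $total_bytes;
}

# Demo on an in-memory file so the sketch runs without real data.
my $data = join '', map { "line $_\n" } 1 .. 100;
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!\n";
my $total = length $data;

while (my $line = <$fh>) {
    # Report every 10 lines; $. is the current input line number.
    printf "%.0f%% done\n", percent_done($fh, $total) unless $. % 10;
}
close($fh);
```

For a real file, the total would come from -s $datafile rather than length.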


printf "$left\/$filesize\n";
# this prints every 17 seconds
}

($line, $freq) = split(/\t/, $list1);

for my $rawdata (@preparse) {
$rawdata=~ s/\n//;

This substitution only needs to be done once, not once for every element of
@hashdata. Put "chomp @preparse" outside of the loops.
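
Roughly like this, as a sketch only; the sample strings and the bait list are invented so that it runs on its own:

```perl
use strict;
use warnings;

# Invented sample data standing in for the slurped raw data file.
my @preparse = ("ACGTACGT\n", "TTGACCA\n", "GGGCCC\n");

# Strip the trailing newline from every element once, up front...
chomp @preparse;

# ...so the nested loops never have to touch it again.
for my $bait (qw(ACG GAC)) {
    for my $rawdata (@preparse) {
        print "$bait found in $rawdata\n" if index($rawdata, $bait) >= 0;
    }
}
```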


if ($rawdata =~ m/$line/) {

In my test case, I had to add \Q before $line, otherwise the odd
special character in it caused regex syntax errors.
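
A small sketch of what \Q does, using a made-up bait string that contains regex metacharacters (not taken from the real data):

```perl
use strict;
use warnings;

# Hypothetical bait containing regex metacharacters.
my $line    = 'AB(C+D';
my $rawdata = 'xxAB(C+Dyy';

# Without \Q, m/$line/ would die at run time: the "(" is unmatched.
# \Q...\E (the same thing as quotemeta) makes every character literal.
my $found = $rawdata =~ m/\Q$line\E/ ? 1 : 0;
print "matched literally\n" if $found;
```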

my $first_pos = index $rawdata,$line;

On success, you are doing the search twice. If success is rare, then
of course this is not important speedwise. Get rid of one or the other;
I'd prefer to get rid of the regex and do only the index.
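
Something along these lines, as a sketch; the helper name first_pos and the sample strings are invented. index() returns -1 when the substring is absent, so a single call both tests for a match and gives the position. It also compares literally, so special characters in $line need no quoting.

```perl
use strict;
use warnings;

# Hypothetical helper: one search per pair. index() returns the 0-based
# position of the first occurrence, or -1 when there is none.
sub first_pos {
    my ($rawdata, $line) = @_;
    my $pos = index $rawdata, $line;
    return $pos >= 0 ? $pos : undef;
}

my $pos = first_pos('xxHELLOyy', 'HELLO');
print "found at $pos\n" if defined $pos;
```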


Anyway, I'd write it to load hashdata into a hash (surprise!), and then
probe a 16-byte sliding window of each rawdata line against that hash.

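# Assumes FILE, FILE2 and SEQFILE are already open, as in the original.
my %hashdata;
while (<FILE2>) {
    chomp;
    my ($t) = split /\t/;      # keep only the 16-letter word in column 1
    $hashdata{$t} = undef;     # only the key matters; we test with exists
}
close(FILE2);

while (my $rawdata = <FILE>) {
    chomp $rawdata;
    # Probe every 16-byte window of the line against the hash.
    foreach (0 .. length($rawdata) - 16) {
        if (exists $hashdata{substr $rawdata, $_, 16}) {
            print SEQFILE "$_\t$rawdata\n";
        }
    }
}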

The whole thing takes about a minute on files of about the size you
specified.
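
For anyone who wants to try the idea without the big files, here is a self-contained toy version. It is only a sketch: the helper name window_hits, the 4-letter words and the sample string are all invented, and the window width is a parameter instead of the fixed 16.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, word] for every known word of
# length $k that occurs in $rawdata, using one hash probe per offset.
sub window_hits {
    my ($rawdata, $words, $k) = @_;
    my @hits;
    for my $offset (0 .. length($rawdata) - $k) {
        my $window = substr $rawdata, $offset, $k;
        push @hits, [$offset, $window] if exists $words->{$window};
    }
    return @hits;
}

# Toy data with 4-letter words standing in for the 16-letter ones.
my %words = map { $_ => undef } qw(ACGT GGCC);
for my $hit (window_hits('TTACGTAGGCCA', \%words, 4)) {
    print "$hit->[0]\t$hit->[1]\n";
}
```

The speedup comes from doing about one hash lookup per character of raw data, instead of scanning all of the raw data once for every word in the list.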

Xho



