Re: Performance Improvement of complex data structure (hash of hashes of hashes)

From: Scott Gilpin (sgilpin_at_gmail.com)
Date: 08/25/04


Date: 25 Aug 2004 09:49:29 -0700

Anno Siegel wrote:
> Scott Gilpin <sgilpin@gmail.com> wrote in comp.lang.perl.misc:
> > Hi everyone -
> >
> > I'm trying to improve the performance (runtime) of a program that
> > processes large files. The output of the processing is some fixed
> > number of matrices (that can vary between invocations of the
program),
> > each of which has a different number of rows, and the same number
of
> > columns. However, the number of rows and columns may not be known
> > until the last row of the original file is read. The original file
> > contains approximately 100 millon rows. Each individual matrix has
> > between 5 and 200 rows, and between 50 and 10000 columns. The data
> > structure I'm using is a hash of hashes of hashes that stores this
> > info. N is the total number of columns, M1 is the total number of
> > rows in matrix #1, M2 is the total number of rows in matrix 2, etc,
> > etc. The total number of matrices is between 3 and 15.
>
> [...]
>
> > Here is the code that I'm using to build up this data structure.
I'm
> > running perl version 5.8.3 on solaris 8 (sparc processor). The
system
> > is not memory bound or cpu bound - this program is really the only
> > thing that runs. There are several gigabytes of memory, and this
> > program doesn't grow bigger than around 100 MB. Right now the run
time
> > for the following while loop with 100 million rows of data is about
6
> > hours. Any small improvements would be great.
>
> It shouldn't take that long, unless the data structure blows up way
> beyond 100 MB.
>
> > ## loop to process each row of the original data
> > while(<INDATA>)
> > {
> > chomp($_);
> >
> >
> > ## Each row is delimited with |
> > my @original_row = split(/\|/o,$_);
> >
> > ## The cell value and the column name are always in the same
> > position
> > my $cell_value = $original_row[24];
> > my $col_name = $original_row[1];
> >
> > ## Add this column name to the list of ones we've seen
> > $columns_seen{$col_name}=1;
>
> Where is this used?
>
> > ## For each matrix, loop through and increment the
> > row/column value
> > foreach my $matrix (@matrixList)
>
> Where is @matrixList set?
>
> > {
> >
> > ## positionHash tells the position of the value for
> > ## this matrix in the original data row
> > my $row_name = $original_row[$positionHash{$matrix}];
>
> Where is %positionHash set?
>
> > $matrix_values{$matrix}{$row_name}{$col_name} +=
> > $cell_value;
> > }
> >
> > } ## end while
>
> This code isn't runnable. How are we to improve code we can't run?

Thanks for your reply. I apologize for not being more clear in my
original post. I've included the entire code to produce the desired
output. The first 10 lines of the input file are:

928219|7|6|MI|2
928219|9|5|MO|1
928219|11|5|CA|41
928219|8|6|MA|1
928219|5|5|WY|3
701396|10|7|QC|8
701396|17|1|MI|1
928219|0|3|CA|2
701396|13|1|CA|2
928219|1|1|CA|2

The header is:

col_name|matrix1_rows|matrix2_rows|matrix3_rows|cell_values

The source code is:

#!/usr/local/bin/perl5.8.3

use strict;

## The list of matrices actually varies between invocations
## of the program - anywhere from 3 - 15
my @matrixList = ("matrix1", "matrix2", "matrix3");

## Position hash has the row positions of the values for each matrix
my %positionHash = (matrix1 => 1, matrix2 => 2, matrix3 => 3);

## Keep track of the columns we've seen so far
my %columns_seen = ();

## hash of hashes of hashes - matrix => rows => columns
my %matrix_values = ();

open (INDATA, "data.txt") ||
die "can't open data.txt";

while(<INDATA>) {

chomp($_);

## Each row is variable width, delimited with |
my @original_row = split(/\|/,$_);

## The cell value and the column name are always in the same
## position
my $cell_value = $original_row[4];
my $col_name = $original_row[0];

## Add this column name to the list of ones we've seen
$columns_seen{$col_name}=1;

## For each matrix, loop through and increment the
## row/column value
foreach my $matrix (@matrixList) {

my $row_name = $original_row[$positionHash{$matrix}];
$matrix_values{$matrix}{$row_name}{$col_name} += $cell_value;
}

} ## end while

## The following code runs very quicky compare to the
## while loop above (2 mins vs. 6 hrs)
## I'm only including it to produce the desired output

## Create a header row with column names that is the same
## for all matrices
my $header = "";

foreach my $col_name (sort keys %columns_seen) {
$header = $header . "," . "$col_name";
}

## Create a file for each separate matrix
foreach my $matrix (@matrixList) {

## Open output file
my $OUT_FILE = $matrix . ".csv";
open (OUTFILE, ">$OUT_FILE") || die "can't open $OUT_FILE";

## Now we create the first line of file.
## Starting with the matrix name and a comma.
## Then printing out the column names.

my $firstline = $matrix . "$header";
print OUTFILE "$firstline\n";

## Loop for each row in the matrix
foreach my $row_name (keys(%{$matrix_values{$matrix} } )) {

my $line = $row_name;

## Loop for each column in the matrix
foreach my $col_name (sort keys %columns_seen) {

my $cell_value;
if ($matrix_values{$matrix}{$row_name}{$col_name}) {
$cell_value =
$matrix_values{$matrix}{$row_name}{$col_name};
} else {
$cell_value = ".";
}
$line = $line . ",$cell_value";
}
print OUTFILE "$line\n";

}
close OUTFILE;
}
>
> To make it runnable, I had to realize that %positionHash is nowhere
> set and come up with a plausible one. Same for @matrixList. I had
> to find that %columns_seen is nowhere used, and discard it. Then I
> had to generate a set of input data of for the program to run with.
> It would have been your job to do that, and you are far better
equipped
> to do it.
>
> That said, I don't see room for fundamental improvement. Apparently
> each "cell value" contributes to all matrices in the same column,
> but in lines that are determined indirectly (though %positionHash).
>
> You program does that without doing any excessive extra work. There
> may be re-arrangements of the data structure with corresponding code
> adaptions that make it marginally faster, but I wouldn't expect
> anything better than 10%.
>
> > I tried using DProf & dprofpp, but that didn't reveal anything
> > interesting.
>
> It can't. DProf works on subroutine basis, but your code doesn't
> use any subroutines.
>
> > I also tried setting the initial size of each hash
using
> > 'keys', but this didn't show any improvement. I could only
initialize
> > the hash of hashes - and not the third level of hashes (since I
don't
> > know the values in the second hash until they are read in from the
> > file). I know that memory allocation in C is expensive, as is
> > re-hashing - I suspect that's what's taking up a lot of the time.
>
> One thing to observe is whether program speed deteriorates over
> time. Just print something to stderr every so-many records and
> watch the rhythm. If it gets slower with time, the problem is most
> likely memory related. If it doesn't, you're none the wiser.

The runtime scales linearly with the number of rows, so
I don't believe it to be a memory issue.

>
> Anno



Relevant Pages

  • Re: Fast string operations
    ... > is why people use unsafe code now and then to use pointer arithmetic to ... > loop over arrays without all the unnecessary bounds checking. ... don't return the trimmed string". ... The customer perceives this as a memory leak. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Cost of calling a standard library function
    ... It accesses/reads memory using esi 4 ... > safly move it within the cache, without having to go via ebx. ... try it the same thing on a different earlier CPU, ... should check it out...for "tight inner loop" stuff, ...
    (alt.lang.asm)
  • Re: Restoring a NorthStar Horizon, problems with SRAM board
    ... A-D FDC is a memory mapped IO device so I would expect some behavior ... loop read/write functions - which puts the system into a two instruction ... SRAM on the Vector Graphics Flashwriter video card located at $F000 ...
    (comp.sys.northstar)
  • Re: Fast string operations
    ... worried about string performance. ... > is why people use unsafe code now and then to use pointer arithmetic to loop ... > I'm aware that looping has to occur one way or the other, but with Trim, ... The customer perceives this as a memory leak. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Bimini Twist in 50lb Power Pro
    ... > fishing I do not normally choose to use any loop to loop connection ... > my bass taper fly line and continuous mono leader. ... Does Power Pro retain memory? ... I've never experienced any memory with PP, whether using it in fresh water ...
    (rec.outdoors.fishing.bass)