Confused about benchmarks calculating averages of large arrays.



At the end of this message is a program that calculates benchmarks for
3 different methods of calculating the mean of large arrays of data.
20 arrays containing 200,000 floats each are summed and averaged, per-
element, producing an array of 200,000 floats as output.

The difference between the 3 methods is the order in which operations
are applied.

Method 1: Add all of array 0 to output, then all of array 1, then all
of array 2, etc (this is done 20 times). Then, divide output by 20.

Method 2: Compute sum of first element of each array, then second
element of each array, one element at a time (this is done 200,000
times). Then, divide output by 20.

Method 3: Similar to method 1 but add array/20.0 each time, and don't
do separate division at the end.

The results of the benchmarks are confusing to me and I am wondering
if somebody could clear up why. The reason I am testing this is
because I have an application where I must compute the means of many
large arrays, and I'm trying to squeeze a few extra CPU cycles out.

The code below leaves as many optimizations up to the compiler as
possible. It does not explicitly use any special instruction sets
(e.g. not SSE-optimized), all arrays are accessed with [] operator,
etc.

Here are the results on an 2.16 Ghz Intel T2600 with lots of RAM. The
values are "iterations / second" (number of complete means per
second). The 3 values correspond to the compiler commands "gcc -O0",
"gcc -O2", and "gcc -O2 -funroll-loops", respectively (method 3 is, of
course, the slowest, performing far more divisions than any of the
other methods):

Method 1: 35.1, 96.2, 101.7
Method 2: 35.1, 54.0, 98.2
Method 3: 14.3, 14.4, 14.3

Now, here are the results using arrays of doubles instead of floats:

Method 1: 32.2, 43.7, 44.6
Method 2: 29.5, 42.7, 70.4
Method 3: 14.1, 14.3, 14.3

Here are my questions, if anybody has any insights:

1) Looking at floats; why does method 2 perform so much worse than
method 1? Both are performing the same number of operations (I think),
the biggest difference is the order the input data is accessed in
(method 1 accesses inputs[0][0], inputs[0][1], inputs[0][2], ...,
method 2 does it like inputs[0][0], inputs[1][0], inputs[2][0], ...).
I suspect it has something to do with loop overhead, since -funroll-
loops eliminated any difference between the two methods.

2) Why are the relative results for doubles so different? I expected
the timings to be slower, of course. But unlike floats, -O2 alone did
not create a significant difference between method 1 and 2. With both
floats and doubles only method 2 was significantly affected by loop
unrolling, but with doubles method 2 is nearly twice as fast as method
1 -- whereas with floats they perform roughly the same. What is going
on?

3) Why is method 3 unaffected by data type or compiler optimizations?
I suspect it is because the overhead of the extra divisions is so
large that it makes other speedups negligible, but I am not sure.

4) How much of this might be unique to my machine? I.e. if I use
method 1 with floats and method 2 with doubles, am I probably going to
get the best performance on any machine (again without explicitly
using any special instruction sets).

I know very little about the intricacies of optimizing code as far as
taking advantage of hardware goes, so a lot of this doesn't make much
sense to me.

The program is below.

Thanks a lot,
Jason

===== BEGIN =====

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef float elem_t;

#define ELEMS 200000
#define BUFS 20
#define ITERS 100

elem_t inputs[BUFS][ELEMS];
elem_t output[ELEMS];
elem_t dummy = 0.0f; /* so compiler doesn't optimize unused stuff
away. */

/*----- timer code appropriate for windows --------------*/

#include <windows.h>

LARGE_INTEGER tfreq, ttick;

void tick (void) {
QueryPerformanceFrequency(&tfreq);
QueryPerformanceCounter(&ttick);
}

void tock (const char *name) {
double dt;
LARGE_INTEGER now;
QueryPerformanceCounter(&now);
dt = (double)(now.QuadPart - ttick.QuadPart) /
(double)tfreq.QuadPart;
printf("%s: %f sec, %f iters/sec, %f sec/iters\n",
name, dt, (double)ITERS/dt, dt/(double)ITERS);
}

/*-------------------------------------------------------*/

void average1 (void) {
int b, n;

/* calculate sum, one input at a time */
memcpy(output, inputs[0], sizeof(output));
for (b = 1; b < BUFS; ++ b)
for (n = 0; n < ELEMS; ++ n)
output[n] += inputs[b][n];

/* then calculate mean */
for (n = 0; n < ELEMS; ++ n)
output[n] /= (elem_t)BUFS;

dummy += output[0];

}

void average2 (void) {
int b, n;

/* calculate sum, one element at a time */
memcpy(output, inputs[0], sizeof(output));
for (n = 0; n < ELEMS; ++ n)
for (b = 1; b < BUFS; ++ b)
output[n] += inputs[b][n];

/* then calculate mean */
for (n = 0; n < ELEMS; ++ n)
output[n] /= (elem_t)BUFS;

dummy += output[0];

}

void average3 (void) {
int b, n;

/* calculate mean */
for (n = 0; n < ELEMS; ++ n)
output[n] = (inputs[0][n] / (elem_t)BUFS);
for (b = 1; b < BUFS; ++ b)
for (n = 0; n < ELEMS; ++ n)
output[n] += (inputs[b][n] / (elem_t)BUFS);

dummy += output[0];

}

/*-----------------------------------*/

void timeit (const char *name, void(*fn)(void) ) {
int n;

tick();
for (n = 0; n < ITERS; ++ n)
fn();
tock(name);

}

int main (void) {
int b, e, r;

for (b = 0; b < BUFS; ++ b)
for (e = 0; e < ELEMS; ++ e)
inputs[b][e] = 20.0f * (elem_t)rand() / (elem_t)RAND_MAX - 10.0f;

for (r = 0; r < 3; ++ r) {
timeit("average1", average1);
timeit("average2", average2);
timeit("average3", average3);
}

printf("%f\n", dummy);
return 0;

}
.



Relevant Pages