Re: Improving multi-dimension array access performance



Rajorshi Biswas wrote:

We have a large Fortran code base which uses quite a lot of four
dimensional arrays. Upon profiling the performance, we've found that
the access times on these arrays are quite significant.

For instance, assume we have an array: arr(3,200,200,200). Our
question is:

a) Is there any well-known way of optimizing access to
such arrays in Fortran?

First, make sure that loops are nested such that the inner loops
correspond to the leftmost subscripts, if possible.
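
A minimal sketch of that nesting for a four dimensional array shaped like arr(3,200,200,200). Fortran stores arrays in column-major order, so the leftmost subscript is the one that is contiguous in memory. The module and function names here are made up for illustration and are not from the original code base.

```fortran
module loop_order_sketch
  implicit none
contains
  ! Sum every element while walking memory contiguously: the leftmost
  ! subscript (c) changes fastest, the rightmost (k) changes slowest.
  function sum_in_memory_order(arr) result(total)
    real, intent(in) :: arr(:,:,:,:)
    real :: total
    integer :: c, i, j, k
    total = 0.0
    do k = 1, size(arr, 4)
      do j = 1, size(arr, 3)
        do i = 1, size(arr, 2)
          do c = 1, size(arr, 1)
            total = total + arr(c, i, j, k)
          end do
        end do
      end do
    end do
  end function sum_in_memory_order
end module loop_order_sketch
```

Swapping the order of the four do statements gives the same sum but strides through memory, which is the slow case measured later in this post.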

In cases where it isn't possible, such as matrix multiply, which goes
through arrays in different directions, doing it in blocks can help.
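
As a rough sketch of what "doing it in blocks" can look like for matrix multiply: the j and k loops are split into tiles so that a block of columns of a is reused while it is still in cache. The block size of 32 and all names below are assumptions for illustration; a good block size depends on the machine and should be measured.

```fortran
module blocked_matmul_sketch
  implicit none
  integer, parameter :: bs = 32   ! assumed block size, tune per machine
contains
  ! c = a * b, with a(m,p), b(p,n), c(m,n), processed in blocks of
  ! bs columns of b (jj) by bs columns of a (kk).
  subroutine blocked_matmul(a, b, c)
    real, intent(in)  :: a(:,:), b(:,:)
    real, intent(out) :: c(:,:)
    integer :: m, n, p, i, j, k, jj, kk
    m = size(a, 1)
    p = size(a, 2)
    n = size(b, 2)
    c = 0.0
    do jj = 1, n, bs
      do kk = 1, p, bs
        do j = jj, min(jj + bs - 1, n)
          do k = kk, min(kk + bs - 1, p)
            ! innermost loop runs over the leftmost subscript
            do i = 1, m
              c(i, j) = c(i, j) + a(i, k) * b(k, j)
            end do
          end do
        end do
      end do
    end do
  end subroutine blocked_matmul
end module blocked_matmul_sketch
```

The result should agree with the intrinsic matmul(a, b) up to rounding. In real code a tuned BLAS routine is usually the better choice; this only shows the shape of the technique.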

Loop unrolling can also help.
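
For an array whose first extent is only 3, one possible form of unrolling is to write the three accesses out by hand instead of looping over them. This is a sketch with hypothetical names, and it assumes the first extent really is 3. Compilers often unroll short loops on their own at higher optimization levels, so it is worth timing both versions.

```fortran
module unroll_sketch
  implicit none
contains
  ! Per-component sums of arr(3,:,:,:), with the length-3 leftmost
  ! loop written out explicitly.
  subroutine component_sums(arr, s)
    real, intent(in)  :: arr(:,:,:,:)   ! first extent assumed to be 3
    real, intent(out) :: s(3)
    real :: s1, s2, s3
    integer :: i, j, k
    s1 = 0.0
    s2 = 0.0
    s3 = 0.0
    do k = 1, size(arr, 4)
      do j = 1, size(arr, 3)
        do i = 1, size(arr, 2)
          s1 = s1 + arr(1, i, j, k)
          s2 = s2 + arr(2, i, j, k)
          s3 = s3 + arr(3, i, j, k)
        end do
      end do
    end do
    s = [s1, s2, s3]
  end subroutine component_sums
end module unroll_sketch
```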

In many cases, the solutions aren't so general but require understanding
the specific details of the problem.

The cost of the subscript calculation should be approximately
proportional to the number of subscripts, and on most modern machines
it should be so much faster than the array access itself (especially
a store) that the actual number of subscripts isn't the problem. Any
operation on a 24 million element array will be slow if not done in
the right order.
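
To make the subscript calculation concrete, here is a sketch of the column-major position computed in nested (Horner) form, which costs one multiply and one add per extra subscript. The function name is hypothetical; a compiler generates the equivalent arithmetic itself and can often hoist most of it out of the inner loop.

```fortran
module subscript_sketch
  implicit none
contains
  ! 1-based position in memory of element subs(:) in an array with the
  ! given extents, using Fortran's column-major layout.
  pure function linear_index(subs, extents) result(pos)
    integer, intent(in) :: subs(:), extents(:)
    integer :: pos, d
    ! start from the rightmost (slowest varying) subscript
    pos = subs(size(subs)) - 1
    do d = size(subs) - 1, 1, -1
      pos = pos * extents(d) + (subs(d) - 1)
    end do
    pos = pos + 1
  end function linear_index
end module subscript_sketch
```

For example, linear_index([6,5,4,3,2,1], [10,10,10,10,10,10]) gives 12346, which is why the test program further down prints y(12346) next to x(6,5,4,3,2,1).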

b) If we "unroll" the array arr into arr1(of dimensions 200x200x200),
arr2 and arr3, would the speed of access improve dramatically?

If you access all three in the same loop, the change won't be dramatic.

If by "unroll" you mean separating both the arrays and the loops, it
might be.

We are using Intel Fortran compilers, in case that matters.

Can you post the set of nested loops accessing the array?

Otherwise, testing with the following program shows that
accessing a six dimensional array takes about twice as long as
accessing a one dimensional array.

Accessing the six dimension array in the wrong order (the
second set of nested loops) takes three times as long as the
right order, or six times as long as the single loop.

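      ! x and y hold the same 1,000,000 values as 6-D and 1-D arrays
      real x(10,10,10,10,10,10), y(1000000), xx, yy
      integer i1, i2, i3, i4, i5, i6, i
      integer*8 rdtsc, t0, t1, t2, t3
      call random_number(x)
      y = transfer(x, y)
      xx = 0
      yy = 0
      ! right order: leftmost subscript varies in the innermost loop
      t0 = rdtsc(0)
      do i1 = 1, 10
        do i2 = 1, 10
          do i3 = 1, 10
            do i4 = 1, 10
              do i5 = 1, 10
                do i6 = 1, 10
                  xx = xx + x(i6,i5,i4,i3,i2,i1)
                enddo
              enddo
            enddo
          enddo
        enddo
      enddo
      ! single loop over the one dimensional copy
      t1 = rdtsc(0)
      do i = 1, 1000000
        yy = yy + y(i)
      enddo
      t2 = rdtsc(0)
      print *, xx, yy
      print *, y(12346), x(6,5,4,3,2,1)
      print *, t1-t0, t2-t1
      xx = 0
      yy = 0
      ! wrong order: leftmost subscript varies in the outermost loop
      t0 = rdtsc(0)
      do i1 = 1, 10
        do i2 = 1, 10
          do i3 = 1, 10
            do i4 = 1, 10
              do i5 = 1, 10
                do i6 = 1, 10
                  xx = xx + x(i1,i2,i3,i4,i5,i6)
                enddo
              enddo
            enddo
          enddo
        enddo
      enddo
      ! single loop again, this time running backwards
      t1 = rdtsc(0)
      do i = 1000000, 1, -1
        yy = yy + y(i)
      enddo
      t2 = rdtsc(0)
      print *, xx, yy
      print *, y(12346), x(6,5,4,3,2,1)
      print *, t1-t0, t2-t1
      end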

rdtsc.s contains:

.file "rdtsc.f"
.text
.p2align 4,,15
.globl rdtsc_
.type rdtsc_, @function
rdtsc_:
rdtsc
ret
.size rdtsc_, .-rdtsc_

(assuming an IA32 processor)
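
On other processors, or where linking an assembly file is inconvenient, a portable substitute for the timer could be built on the standard system_clock intrinsic. This is only a sketch with hypothetical names; it reports wall-clock seconds rather than cycle counts, and its resolution depends on the compiler and operating system.

```fortran
module timer_sketch
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit integer kind
contains
  ! Wall-clock seconds since an arbitrary origin, or -1 if the
  ! processor reports that no clock is available.
  function wall_seconds() result(t)
    double precision :: t
    integer(i8) :: count, rate
    call system_clock(count, rate)
    if (rate > 0) then
      t = dble(count) / dble(rate)
    else
      t = -1.0d0
    end if
  end function wall_seconds
end module timer_sketch
```

Usage would follow the same pattern as in the program above: store t0 = wall_seconds() before a loop nest and print wall_seconds() - t0 after it, with t0 declared double precision.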

-- glen
