Re: Code speedup
From: Carlie J. Coats (carlie_at_jyarborough.com)
Date: 05/11/04
- Next message: James Van Buskirk: "Re: function prototypes"
- Previous message: James Van Buskirk: "Re: maxval,product, etc with mask all false"
- Next in thread: Gerry Thomas: "Re: Code speedup"
- Reply: Gerry Thomas: "Re: Code speedup"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 11 May 2004 01:29:50 GMT
Richard Maine wrote:
> lindahl@pbm.com (Greg Lindahl) writes:
>
>>Your intuition can often be wrong...
>
> I'd agree with that observation here. It isn't to hard to force
> statically-sized arrays to be allocated adjacently, using COMMON
> or other tricks...but my intuition says that is unlikely to
> have much effect on speed in most cases...[snip...]
What _is_ true is that for microprocessor based shared memory systems,
for programs that simultaneously address several large similarly
dimensioned arrays, it *can* yield a substantial performance
improvement to force alignment with padding that minimizes cache-tag
overlap. For example, if you have arrays
REAL, DIMENSION(4096,4096):: A, B, C, D
REAL, EXTERNAL:: F
and your code does something like the following:
!$OMP PARALLEL DO SHARED(A, B, C, D ), PRIVATE( I, J )
DO J = 1, 4096
DO I = 1, 4096
A(I,J) = F( B(I,J), C(I,J), D(I,J) )
END DO
END DO
Then , on 4 processors, with "chunk" scheduling, so that the first
processor processes J in the range [1:1024], etc., the processors
will be trying to simultaneously update elements of A that are
exactly 16 megabytes apart. Cache-tag addresses for A(I,J) are
taken from a segment of middle-order bits in the address of A(I,J),
so almost certainly A(I,J), A(I,J+1024), A(I,J+2048), and A(I,J+3072)
share the same cache tag. Consequently, the processors will "fight"
over which of these is allowed to be in cache at any one time.
In this case, padding the J-dimension of A may change this behavior
so that the contention no longer happens. In more complex (i.e.,
realistic) cases with several arrays on the LHS, it is an interesting
application of the Chinese Ramainder Theorem from Number Theory
to compute what dimensioning for padding-arrays is necessary to
minimize the cache-tag contention.
In extreme cases, this may affect performance by a factor of 10.
Some compilers will perform the first of these optimizations if
requested; I don't know of any that will do the second...
-- Carlie J. Coats, Jr. carlie.coats@baronams.com Environmental Modeling Center phone: (919)248-9241 Baron Advanced Meteorological Systems fax: (919)248-9245 3021 Cornwallis Road P. O. Box 110064 Research Triangle Park, N. C. 27709-5064 USA "My opinions are my own, and I've got *lots* of them!"
- Next message: James Van Buskirk: "Re: function prototypes"
- Previous message: James Van Buskirk: "Re: maxval,product, etc with mask all false"
- Next in thread: Gerry Thomas: "Re: Code speedup"
- Reply: Gerry Thomas: "Re: Code speedup"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|