Re: Code speedup

From: Carlie J. Coats (carlie_at_jyarborough.com)
Date: 05/11/04


Date: Tue, 11 May 2004 01:29:50 GMT

Richard Maine wrote:
> lindahl@pbm.com (Greg Lindahl) writes:
>
>>Your intuition can often be wrong...
>
> I'd agree with that observation here. It isn't to hard to force
> statically-sized arrays to be allocated adjacently, using COMMON
> or other tricks...but my intuition says that is unlikely to
> have much effect on speed in most cases...[snip...]

What _is_ true is that for microprocessor based shared memory systems,
for programs that simultaneously address several large similarly
dimensioned arrays, it *can* yield a substantial performance
improvement to force alignment with padding that minimizes cache-tag
overlap. For example, if you have arrays

    REAL, DIMENSION(4096,4096):: A, B, C, D
    REAL, EXTERNAL:: F

and your code does something like the following:

!$OMP PARALLEL DO SHARED(A, B, C, D ), PRIVATE( I, J )
     DO J = 1, 4096
     DO I = 1, 4096
         A(I,J) = F( B(I,J), C(I,J), D(I,J) )
     END DO
     END DO

Then , on 4 processors, with "chunk" scheduling, so that the first
processor processes J in the range [1:1024], etc., the processors
will be trying to simultaneously update elements of A that are
exactly 16 megabytes apart. Cache-tag addresses for A(I,J) are
taken from a segment of middle-order bits in the address of A(I,J),
so almost certainly A(I,J), A(I,J+1024), A(I,J+2048), and A(I,J+3072)
share the same cache tag. Consequently, the processors will "fight"
over which of these is allowed to be in cache at any one time.

In this case, padding the J-dimension of A may change this behavior
so that the contention no longer happens. In more complex (i.e.,
realistic) cases with several arrays on the LHS, it is an interesting
application of the Chinese Ramainder Theorem from Number Theory
to compute what dimensioning for padding-arrays is necessary to
minimize the cache-tag contention.

In extreme cases, this may affect performance by a factor of 10.
Some compilers will perform the first of these optimizations if
requested; I don't know of any that will do the second...

-- 
Carlie J. Coats, Jr.                        carlie.coats@baronams.com
Environmental Modeling Center                    phone: (919)248-9241
Baron Advanced Meteorological Systems              fax: (919)248-9245
3021 Cornwallis Road                                 P. O. Box 110064
Research Triangle Park, N. C.  27709-5064                         USA
"My opinions are my own, and I've got *lots* of them!"


Relevant Pages

  • Re: pointer conversion
    ... Arrays have to be contiguous in memory. ... A struct foo, on a machine with sizeof int == 4, will be occupy 5 ... I.E. there are 3 padding bytes per item. ...
    (comp.lang.c)
  • Re: byte order
    ... casting objects to arrays of bytes, and guarantees that such an array ... The standard also gives the possibility of integer types containing ... padding bits, and it is not specified how such bits can be detected. ... code that cares about endianness, so it's not really an issue. ...
    (comp.lang.lisp)
  • Re: byte order
    ... casting objects to arrays of bytes, and guarantees that such an array ... The standard also gives the possibility of integer types containing ... padding bits, and it is not specified how such bits can be detected. ... Bound variables, free programmers. ...
    (comp.lang.lisp)
  • Re: multidimensional arrays and cute tricks
    ... >>now you can see that the memcpy was originally tried, ... C has arrays whose elements are in turn arrays. ... > struct, and that struct would need padding between adjacent array ... the padding must be part of the struct type. ...
    (comp.lang.c)