Re: Sun Studio Express 3 compilers available for download



ejko123@xxxxxxxxx wrote:
robert.corbett@xxxxxxx wrote:
I am unaware of any effective algorithms for optimizing a general
array expression. A superoptimizer approach might work, but I'm
guessing that most people would like their optimized compilations
to finish before the heat death of the universe.

This is probably much more than can be adequately discussed here,
but I'd like to understand why this is hard. Is it hard because of
the register allocation problem, the array temporary problem, or
something else? Are there some good review papers on the subject?

--Eric

I'll take a swing at listing 3 or 4 problems in no particular order.

1) In production codes, essentially all array operations are performed
in subroutines on arrays with passed in dimensions. The compiler has
no size information to help it decide what to do. For small arrays,
loop unrolling is often the best choice. For large arrays, some sort
of cache friendly blocking algorithm is often the best choice. IN
general, in a particular subroutine, there are only a few array sizes.
Given something like
REAL :: X(:), Y(:), Z(:), T(:)
the odds are that X, Y, Z and T have the same size. But, the compiler
doesn't know this (or at least has to work hard to prove it) so code like
X = 0
Y = 0
is likely to generate 2 separate DO loops, rather than one.

2) Dependency analysis, whether or not the RHS needs to be computed
into a temporary array or can go directly into the LHS is hard.
Substrings are usually written as (J:K), not (1:100). Array
expressions of the form
A(1:10) = A(2:11) + 1
or
A(2:11) = A(1:10) + 1
can always be compiled with a temporary array. But, they often can
be optimized into DO I = 1,10 or DO I = 11,2,-1 loop without using
any temporary storage. For relatively simple expressions, use of
temporaries more or less doubles execution times. If the array is large
and the compiler can't resolve the dependencies, it's likely to overflow
the cache with temporary arrays.

3) Many arrays are multidimensional. Given something like
A(J:K,L:M) = 0
the compiler is likely to generate code equivalent to
DO I = L,M
DO N = J,K
A(N,I) = 0
ENDDO**2

If the dimensions of A are [1000000000000000,2] that's good; if they're
[2,100000000000000000000] the DO overhead in the inner loop dominates
everything.

3.5) Maybe the compiler can prove that J:K,L:M runs over the entire
array, in which case it can use a single equivalent DO loop with no
need to worry about loop order.

4) 2 and 3 work together. Given an example like
A(I:J,K:L) = A(M:N,K:L) + 1
should the compiler try to run the first subscript in the inner loop,
which is cache friendly but probably needs a temporary, or should it
run the second subscript in the inner loop which needs no temps, but
isn't necessarily cache friendly.

5) Within a sequence of a few lines, most variables are used more than
once. Variables on the RHS are often used in a few different
expressions. LHS variables usually appear on the right hand side within
the next few lines (why compute anything if you're never going to use
it ;) ) To generate good code, compilers try to combine adjacent lines
into a single equivalent nest of DO loops. This allows things to be
kept in registers, but makes the dependency analysis problems worse.
Should you combine three array assignments into a single loop and
(maybe) get better cache or register usage even if it means you need
to generate a temporary? What's the best loop ordering if multiple
assignments are present? Can you even combine two assignments into
a single loop? First the compiler has to prove that the two left
hand sides have the same size or shape.

Anyhow, that's my guess at some of the potential problems.

*** Hendrickson
.