Re: Question about intel_VEC_memcpy
- From: Tim Prince <tprince@xxxxxxxxxxxxxxxxxx>
- Date: Tue, 19 Aug 2008 09:29:27 -0700
Paul van Delst wrote:
Tim Prince wrote:
Paul van Delst wrote:
Hello,

We are profiling some code on a linux cluster using Intel 10.0 and the first
couple of lines we are seeing are:

Each sample counts as 0.01 seconds.
     %cumulative     self              self    total
  time   seconds  seconds    calls   s/call   s/call  name
 76.81    103.64   103.64                             __intel_VEC_memcpy
 13.24    121.50    17.86   209720     0.00     0.00  crtm_atmabsorption_mp_crtm_compute_atmabsorption_
  2.20    124.47     2.97                             exp.J
  1.64    126.68     2.21   209720     0.00     0.00  crtm_atmoptics_mp_crtm_combine_atmoptics_
  1.14    128.22     1.54                             log.J
  1.03    129.61     1.39   209720     0.00     0.00  crtm_rtsolution_mp_crtm_compute_rtsolution_
  0.73    130.59     0.99 20114380     0.00     0.00  crtm_planck_functions_mp_crtm_planck_radiance_
...etc...

Can anyone knowledgable (SteveL? :o) provide a bit of info about what this
procedure does and how we can avoid its heavy use. I realise that last request
is unrealistic - but I'm just looking for rules of thumb, nothing too specific.
The code in question uses structures heavily with all of their components being
pointers (as they have to be allocatable).
My current working theory is that we are allocating all our structures in such a
way as to cause memory fragmentation so the final compiled executable has to hunt
all over the (memory) map to find the data it needs to actually do calculations.
All suggestions (code changes and compiler switches) welcome.

As its name implies, the function would be a replacement for C memcpy(), and would simply copy data. A possible reason for heavy usage would be excessive use of temporary arrays, particularly if these are large enough to incur cache misses. Syntax such as array assignment and matmul() is highly productive of temporaries, some of which could be avoided by better optimization in the compiler.
We have highlighted some of these areas (particularly matmul usage) in the code. And we recently introduced a feature in our code that does do routine array assignment.
How do other compilers do?
If you are willing to work with a current version of ifort, and to submit a case to Intel support, there is likely to be scope for improvement.
Oh, I'm sure the problem is in our code, or in the switches we're using to compile, not the intel compiler. If I gave that impression, I apologise. A much earlier version of the code that was purely array based was ~7x faster (same compiler and platform, run in the same test suite). Basically, once you subtract the time for the memcpy in the newer version, the times were comparable.
A just-off-the-press run with g95 (don't know which version, but assume 0.9) ran twice as fast as the intel executable, so I think we need to look a bit closer at the intel compiler switches we're using. Currently we have a very simple set:
FC_FLAGS= -c \
          -O2 \
          -convert big_endian \
          -warn errors \
          -free \
          -assume byterecl
and
FL_FLAGS= -static-libcxa \
          -o
For g95 our compile switches are
FC_FLAGS= -c \
          -O2 \
          -fendian=big \
          -ffast-math \
          -ffree-form \
          -fno-second-underscore \
          -funroll-loops \
          -malign-double \
          -std=f95
which are a bit more aggressive so I don't think the intel/g95 comparison I mentioned above is fair (to the intel result).
Those options are reasonably comparable.
As you have identified matmul() usage as a possible problem, I will comment on that:
If the matmul result is assigned directly to an array, e.g.
result = matmul(arg1,arg2)
and arg1 and arg2 don't involve sparsity (explicit strides, etc.),
an optimizing compiler ought not to make a hidden temporary array, in my opinion. If it does so, I would suggest a problem report.
In the case where matmul is used in an expression, e.g.
result = result + matmul(arg1,arg2)*scalar
a compiler can't avoid the allocation of a temporary array for the intermediate result. In this case, if the matrix is at all large (20x20 or more), BLAS ?GEMM (called directly, not via a matmul wrapper such as the one in gfortran) is a better choice. The matmul() temporary array can slow it down significantly.
I don't know whether blas95 would be efficient; it may be, particularly with interprocedural optimization.
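To make the ?GEMM suggestion concrete, here is a minimal sketch of the accumulate case, assuming double precision data and a reference BLAS available at link time (for example, building with something like -lblas). The program name, the matrix size n, and the array names are hypothetical; only the DGEMM call itself follows the standard BLAS interface, where C := alpha*A*B + beta*C. With beta set to 1, the update is done in place in the result array, so no temporary for the matrix product should be needed.

```fortran
program gemm_accumulate
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! double precision, to match DGEMM
  integer, parameter :: n = 64             ! hypothetical matrix size
  real(dp) :: a(n,n), b(n,n), c_expr(n,n), c_gemm(n,n)
  real(dp) :: scalar
  external :: dgemm                        ! BLAS routine, Fortran 77 interface

  call random_number(a)
  call random_number(b)
  call random_number(c_expr)
  c_gemm = c_expr                          ! same starting values for both forms
  scalar = 0.5_dp

  ! Expression form: the matmul() result usually goes through a temporary array
  c_expr = c_expr + matmul(a, b)*scalar

  ! Direct BLAS call: c_gemm := scalar*a*b + 1.0*c_gemm, updated in place
  call dgemm('N', 'N', n, n, n, scalar, a, n, b, n, 1.0_dp, c_gemm, n)

  ! The two results should agree to within rounding error
  print '(a,es10.3)', 'max |difference| = ', maxval(abs(c_expr - c_gemm))
end program gemm_accumulate
```

Whether this is actually faster than the matmul() expression depends on the compiler, the BLAS implementation, and the matrix size, so it is worth timing both forms on the real problem.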