Re: Question about intel_VEC_memcpy



Paul van Delst wrote:
Tim Prince wrote:
Paul van Delst wrote:
Hello,

We are profiling some code on a linux cluster using Intel 10.0 and the first
couple of lines we are seeing are:

Each sample counts as 0.01 seconds.
     % cumulative    self              self    total
  time   seconds  seconds    calls   s/call   s/call  name
 76.81    103.64   103.64                             __intel_VEC_memcpy
 13.24    121.50    17.86   209720     0.00     0.00  crtm_atmabsorption_mp_crtm_compute_atmabsorption_
  2.20    124.47     2.97                             exp.J
  1.64    126.68     2.21   209720     0.00     0.00  crtm_atmoptics_mp_crtm_combine_atmoptics_
  1.14    128.22     1.54                             log.J
  1.03    129.61     1.39   209720     0.00     0.00  crtm_rtsolution_mp_crtm_compute_rtsolution_
  0.73    130.59     0.99 20114380     0.00     0.00  crtm_planck_functions_mp_crtm_planck_radiance_
...etc...

Can anyone knowledgeable (SteveL? :o) provide a bit of info about what this
procedure does and how we can avoid its heavy use? I realise that last request
is unrealistic - but I'm just looking for rules of thumb, nothing too specific.

The code in question uses structures heavily with all of their components being
pointers (as they have to be allocatable).

My current working theory is that we are allocating all our structures in such a
way as to cause memory fragmentation so the final compiled executable has to hunt
all over the (memory) map to find the data it needs to actually do calculations.

All suggestions (code changes and compiler switches) welcome.

As its name implies, the function would be a replacement for C memcpy(), and would simply copy data. A possible reason for heavy usage would be excessive use of temporary arrays, particularly if these are large enough to incur cache misses. Syntax such as array assignment and matmul() tends to generate many temporaries, some of which could be avoided by better optimization in the compiler.

We have highlighted some of these areas (particularly matmul usage) in the code. And we recently introduced a feature in our code that does do routine array assignment.

How do other compilers do?
If you are willing to work with a current version of ifort, and to submit a case to Intel support, there is likely to be scope for improvement.

Oh, I'm sure the problem is in our code, or in the switches we're using to compile, not the intel compiler. If I gave that impression, I apologise. A much earlier version of the code that was purely array based was ~7x faster (same compiler and platform, run in the same test suite). Basically, once you subtract the time for the memcpy in the newer version, the times were comparable.

A just-off-the-press run with g95 (don't know which version, but assume 0.9) ran twice as fast as the intel executable, so I think we need to look a bit closer at the intel compiler switches we're using. Currently we have a very simple set:

FC_FLAGS= -c \
-O2 \
-convert big_endian \
-warn errors \
-free \
-assume byterecl

and

FL_FLAGS= -static-libcxa \
-o

For g95 our compile switches are

FC_FLAGS= -c \
-O2 \
-fendian=big \
-ffast-math \
-ffree-form \
-fno-second-underscore \
-funroll-loops \
-malign-double \
-std=f95

which are a bit more aggressive so I don't think the intel/g95 comparison I mentioned above is fair (to the intel result).

Those options are reasonably comparable.
As you have identified matmul() usage as a possible problem, I will comment on that:
If the matmul result is assigned directly to an array, e.g.
result = matmul(arg1,arg2)
and arg1 and arg2 don't involve sparsity (explicit strides, etc.),
an optimizing compiler ought not to make a hidden temporary array, in my opinion. If it does so, I would suggest a problem report.
In the case where matmul is used in an expression, e.g.
result = result + matmul(arg1,arg2)*scalar
a compiler can't avoid the allocation of a temporary array for the intermediate result. In this case, if the matrix is at all large (20x20 or more), BLAS ?GEMM (called directly, not via a matmul wrapper such as the one in gfortran) is a better choice. The matmul() temporary array can slow it down significantly.
I don't know whether blas95 would be efficient; it may be, particularly with interprocedural optimization.
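
As a rough sketch of the direct ?GEMM call suggested above: DGEMM computes C := alpha*A*B + beta*C in place, so passing alpha = scalar and beta = 1 performs the same update as result = result + matmul(arg1,arg2)*scalar without a separate intermediate product array. The program name, the 64x64 size, the random test data and the reference array are assumptions made up for illustration, and the example assumes double precision and that a BLAS library is available at link time (e.g. -lblas, or MKL with ifort).

```fortran
program gemm_update_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 64          ! hypothetical matrix size
  real(dp) :: arg1(n,n), arg2(n,n), result(n,n), reference(n,n)
  real(dp) :: scalar
  external :: dgemm                     ! reference BLAS routine, resolved at link time

  call random_number(arg1)
  call random_number(arg2)
  call random_number(result)
  scalar = 0.5_dp

  ! Expression form: the compiler needs an intermediate array for matmul().
  reference = result + matmul(arg1, arg2)*scalar

  ! Direct BLAS form: result := scalar*arg1*arg2 + 1.0*result, updated in place.
  call dgemm('N', 'N', n, n, n, scalar, arg1, n, arg2, n, 1.0_dp, result, n)

  ! The two forms should agree to within rounding error.
  print *, 'max abs difference:', maxval(abs(result - reference))
end program gemm_update_sketch
```

Whether this actually beats the matmul() expression depends on the compiler, the BLAS in use and the matrix size, so it is worth timing both forms on the real problem sizes.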