Performance problem when using 'movaps' SSE instruction.

From: AndersX (aslx_at_private.dk)
Date: 05/18/04


Date: Tue, 18 May 2004 17:21:44 -0400

I am developing a floating-point DSP application, which I hoped could be
optimized a lot by using the SSE instruction set.

Unfortunately I ran into the following problem:

When the instruction 'movaps' reads from a memory location that has been
modified by the instructions before it, it executes about ten times slower
than "expected" (which is 2-5 cycles).

There seems to be an "unknown" latency between the time a memory location
is written and the point at which 'movaps' can read that location
quickly.

The algorithm cannot be fully vectorized, so data is modified just
before it goes into the SSE processing unit.

The following piece of code DEMONSTRATES the problem:

When 'M' is set to zero, the memory location that is read by 'movaps' is
modified just before 'movaps' is executed. This causes the execution speed
to drop by a factor of 6. If 'M' is set to 96, the latency disappears.

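void vector(float* fp,int n)
{
  int N4 = n/4;   // number of 16-byte (4-float) blocks; must be >= 1
  int M = 0;      // byte offset of the dummy write relative to the load

  _asm {

    mov ecx,N4                      // loop counter
    mov eax,DWORD PTR [fp]          // eax = load address (must be 16-byte aligned)
    mov edi,eax
    add edi,[M]                     // edi = store address = eax + M
  LA:
    mov DWORD PTR [edi],0           // dummy write
    movaps xmm0, XMMWORD PTR [eax]  // 16-byte load; overlaps the write when M = 0
    add eax,16
    add edi,16
    loop LA                         // decrement ecx, repeat while ecx != 0
  }
}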
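
For readers without a compiler that accepts '_asm' blocks, the same
store-then-load pattern can be sketched with SSE intrinsics. This is not
the code from the post: the names 'store_then_load' and 'seconds_for' are
made up for illustration, and the timing method (clock()) is an assumption.
The parameter 'offset_bytes' plays the role of 'M'. A likely explanation
for the slowdown, offered here as a hypothesis rather than a measured
fact, is that a 4-byte store followed by an overlapping 16-byte load
cannot be satisfied by store-to-load forwarding, so the load has to wait
for the store to reach the cache.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <xmmintrin.h>   /* SSE intrinsics: _mm_load_ps, _mm_add_ps, ... */

/* Hypothetical helper (not from the original post).
   Repeats the pattern of the inline-asm loop: a 4-byte dummy write at
   fp + offset_bytes, followed by a 16-byte aligned load at fp.
   fp must be 16-byte aligned and the buffer must hold at least
   n + offset_bytes / 4 floats. Returns the sum of everything loaded so
   the compiler cannot discard the loads. */
static float store_then_load(float *fp, int n, int offset_bytes)
{
    assert(((uintptr_t)fp & 15u) == 0);   /* aligned loads need this */

    volatile char *store_base = (volatile char *)fp + offset_bytes;
    __m128 acc = _mm_setzero_ps();

    for (int i = 0; i < n / 4; ++i) {
        /* dummy write: one float (4 bytes) set to zero */
        *(volatile float *)(store_base + 16 * i) = 0.0f;
        /* 16-byte aligned load of four floats */
        acc = _mm_add_ps(acc, _mm_load_ps(fp + 4 * i));
    }

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Hypothetical timing wrapper: CPU seconds for `repeats` passes. */
static double seconds_for(float *fp, int n, int offset_bytes, int repeats)
{
    volatile float sink = 0.0f;           /* keeps the calls alive */
    clock_t t0 = clock();
    for (int r = 0; r < repeats; ++r)
        sink += store_then_load(fp, n, offset_bytes);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

To compare the two cases, call seconds_for(buf, n, 0, repeats) and
seconds_for(buf, n, 96, repeats) on the same 16-byte-aligned buffer, with
the buffer at least 24 floats longer than n so the offset write stays in
bounds. The ratio between the two timings depends on the CPU and the
compiler settings, so it may not match the factor reported above.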

I hope someone can help on this, since this "bottleneck"
completely ruins my plans for optimizing the algorithm.


