Performance problem when using 'movaps' SSE instruction.

From: AndersX (aslx_at_private.dk)
Date: 05/18/04


Date: Tue, 18 May 2004 17:21:44 -0400

I am developing a floating-point DSP application, which I hoped could be
optimized a lot by using the SSE instruction set.

Unfortunately I ran into the following problem:

When the instruction 'movaps' reads from a memory location that has been
modified by the instructions before it, it executes about ten times slower
than "expected" (which is 2-5 cycles).

There seems to be an "unknown" latency between the time a memory location
is written and the point where 'movaps' can read that location at full
speed.

The algorithm cannot be fully vectorized, so the data is modified just
before it goes into the SSE processing unit.

The following piece of code demonstrates the problem.

When 'M' is set to zero, the memory location that is read by 'movaps' is
modified just before 'movaps' is executed. This makes the loop run about
6 times slower. If M is set to 96, the latency disappears.

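void vector(float* fp,int n)   // fp must be 16-byte aligned for movaps
{
  int N4 = n/4;   // number of 4-float (16-byte) vectors to read
  int M = 0;      // byte offset of the dummy write relative to the read

  _asm {

    mov ecx,N4                     // loop counter
    mov eax,DWORD PTR [fp]         // eax = read pointer
    mov edi,eax
    add edi,[M]                    // edi = write pointer = fp + M bytes
  LA:
    mov DWORD PTR [edi],0          // dummy write (4 bytes)
    movaps xmm0, XMMWORD PTR [eax] // aligned 16-byte read
    add eax,16
    add edi,16
    loop LA                        // decrement ecx, repeat while non-zero
  }
}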
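
For anyone who wants to try this without MSVC inline assembly, the same
experiment can probably be sketched with SSE intrinsics. This is only a
rough equivalent, not the code that was measured: the name vector_intrin
and the byte_offset parameter (standing in for 'M') are made up for
illustration, and the compiler may schedule the loop differently from the
hand-written version. The loaded values are summed so that the loads are
not optimized away.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_add_ps, ... */

/* Hypothetical intrinsics version of the loop above (an assumption, not
 * the original code).
 *   fp          : 16-byte aligned buffer
 *   n           : number of floats to read (only n/4 full vectors are used)
 *   byte_offset : plays the role of M; must be a multiple of 4, and the
 *                 buffer must be large enough for writes at that offset
 * Returns the sum of all loaded floats so the loads stay live. */
float vector_intrin(float *fp, int n, size_t byte_offset)
{
    int n4 = n / 4;
    __m128 acc = _mm_setzero_ps();

    /* volatile so the compiler emits one 4-byte store per iteration */
    volatile float *wp = (volatile float *)((char *)fp + byte_offset);

    for (int i = 0; i < n4; ++i) {
        wp[4 * i] = 0.0f;                    /* dummy 4-byte write       */
        __m128 v = _mm_load_ps(fp + 4 * i);  /* aligned 16-byte read     */
        acc = _mm_add_ps(acc, v);            /* accumulate loaded values */
    }

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Timing a call with byte_offset = 0 against one with byte_offset = 96 (for
example with clock() around many repetitions) should show whether the
same slowdown appears on a given compiler and CPU.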

I hope someone can help on this, since this "bottleneck"
completely ruins my plans for optimizing the algorithm.


