Re: SSE2

From: Phil Carmody (thefatphil_demunged_at_yahoo.co.uk)
Date: 01/26/04


Date: Mon, 26 Jan 2004 19:20:09 +0000 (UTC)


"Matt Taylor" <para@tampabay.rr.com> writes:
> "Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
> news:87y8rw156s.fsf_-_@nonospaz.fatphil.org...
> > Ooops, I was hallucinating about the behaviour of the PMULUDQ operation,
> > I'll have to re-think my memory layout, but the question still stands -
> > should I try to use or avoid memory operands?
>
> It depends on the code. If you have a free register and can schedule a load
> without pushing back the rest of your code, it's worthwhile. The code you
> posted looks like it's going to execute for a long time in a loop. In that
> case, the extra registers would be more valuable for unrolling the code, and
> the lower latency of 1 iteration would be meaningless.

A loop of at least 100. I think I'll parallelise by 4, and might
unroll it a bit, although I'm quite happy writing
  111|11111555|55555
   22|22222266|666666
    3|33333337|7777777
     |44444444|88888888
     ^--------/
style loops. Some loops will be many thousands.

I'm glad you think extra parallelisation is probably more valuable, and
that I can probably overcome latency by brute-force attempts at
increasing throughput.

> I think this is probably easier to manage MMX, and even if you don't get
> vector operation, it will be vastly better than integer multiplies.
>
> movd mm0, [a64] ; 0/2/1
> pmuludq mm0, [b64] ; 2/10/1
> movd [c64], mm0 ; 10/2/1

Ah, each snippet of SSE2 code is prefixed immediately with a
fild/fmul/fistp pre-calculation (hence even more need to cover
latencies). I didn't think MMX and FP code could coexist so closely
together. (And yes, for some applications I need the 80-bit FPU,
but for others I could possibly be persuaded to drop down to the
64-bit SSE2 one). (No, I can't do all the FPU stuff as a pre-calculation,
unfortunately.)

Two thirds of the pmuludqs are only needed for their low 32 bits, and
only one third need the full 64-bit result. However, I can't find an
instruction that maximises 32*32->32 throughput. It seems that the
two 32*32->64 multiplies per pmuludq are as good as it gets, but at
least it does provide two of them.
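For illustration, here is a minimal sketch in C with SSE2 intrinsics (not code from this thread) of what "two 32*32->64s per pmuludq" looks like, and of one common way to get four low-32 products out of two pmuludqs plus shuffles. The function names are hypothetical, and the shuffle sequence is an assumption about how one might do it, not a claim that it is the fastest on any particular CPU.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; _mm_mul_epu32 maps to PMULUDQ */
#include <stdint.h>

/* One PMULUDQ: multiplies dwords 0 and 2 of each input, giving two
   full 64-bit products (the odd dwords are ignored). */
static __m128i mul_2x32_to_64(__m128i a, __m128i b)
{
    return _mm_mul_epu32(a, b);
}

/* Four 32*32->32 (low half) products from two PMULUDQs. */
static __m128i mul_4x32_low(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                   /* a0*b0, a2*b2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                 _mm_srli_epi64(b, 32));  /* a1*b1, a3*b3 */
    /* Gather the low dword of each 64-bit product into dwords 0 and 1. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    /* Interleave: p0, p1, p2, p3. */
    return _mm_unpacklo_epi32(even, odd);
}

/* Hypothetical array wrappers so the routines can be called from plain C. */
static void mul4_low_u32(const uint32_t a[4], const uint32_t b[4],
                         uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mul_4x32_low(va, vb));
}

static void mul2_full_u64(const uint32_t a[4], const uint32_t b[4],
                          uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mul_2x32_to_64(va, vb));
}
```

The low-32 variant costs two multiplies, two shifts, two shuffles and an unpack for four results, so whether it beats simply taking the low halves of the 64-bit products depends on how the data is laid out in memory.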

What do the ; n/n/n comments mean? issue cycle/latency/refractory period?

I don't like halving the throughput by looking only at 64-bit MMX.
However, if SSE2, MMX and FP can all happily co-exist, then I might
try to have the MMX, FPU, and int units take on some of the mults if
it appears that the SSE2 unit is leaving the other units idle.

> Intel doesn't document the latencies of most of these instructions, but I
> think it's 12 cycles total. That's less than a single imul (14-18). With
> full unrolling, you can get a large number of pmuludqs running in parallel
> with better throughput than imul. I looked at an SSE 2 version, but SSE 2
> latency isn't as good as MMX latency, and I don't see any easy way to get
> better throughput.

With 4 of my blocks running in parallel, the latency is effectively
quartered. Do you suspect that SSE2 latency is twice MMX latency?
If so then you might be right (although some combination would
probably be best).

When I've got code that actually works, I'll post it here for forensics.
(May be a while, I don't have a machine I can test on!)

Does anyone have any good ideas about
  if(a<0) a+=b;
for 64 bit values, such that a is already in [63-0] of XMMn, and b isn't?
(I might just let the 32-bit int unit do this stage rather than idling.)
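One possible branch-free approach, sketched here in C with SSE2 intrinsics under stated assumptions: SSE2 has no 64-bit arithmetic shift or 64-bit signed compare, so the sketch builds an all-ones/all-zeros mask from the sign of the high dword (psrad, then pshufd to copy it over the low dword), ANDs it with b, and adds with paddq. The function names are hypothetical, and b is assumed to be loadable into an XMM register (for example from memory with movq).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* For each 64-bit lane: if (a < 0) a += b, with no branch. */
static __m128i add_if_negative_epi64(__m128i a, __m128i b)
{
    /* PSRAD: every dword becomes 0 or 0xFFFFFFFF from its own top bit. */
    __m128i sign = _mm_srai_epi32(a, 31);
    /* PSHUFD 0xF5: copy each lane's high-dword mask over its low dword,
       so the whole 64-bit lane reflects the sign of the 64-bit value. */
    sign = _mm_shuffle_epi32(sign, _MM_SHUFFLE(3, 3, 1, 1));
    /* PAND + PADDQ: add b only where the mask is all ones. */
    return _mm_add_epi64(a, _mm_and_si128(b, sign));
}

/* Hypothetical scalar wrapper working on the low 64 bits only. */
static int64_t add_if_negative_i64(int64_t a, int64_t b)
{
    __m128i va = _mm_loadl_epi64((const __m128i *)&a);  /* MOVQ load */
    __m128i vb = _mm_loadl_epi64((const __m128i *)&b);
    int64_t r;
    _mm_storel_epi64((__m128i *)&r, add_if_negative_epi64(va, vb));
    return r;
}
```

This adds a shift, a shuffle and an AND ahead of the paddq, so whether it is preferable to handing this step to the integer unit would need measuring on real hardware.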

Thanks for your valuable input Matt,

Phil

-- 
Unpatched IE vulnerability: protocol control chars
Description: Circumventing content filters
Reference: http://badwebmasters.net/advisory/012/
Exploit: http://badwebmasters.net/advisory/012/test2.asp

