Re: SSE2

From: Phil Carmody (thefatphil_demunged_at_yahoo.co.uk)
Date: 01/27/04


Date: Tue, 27 Jan 2004 17:13:19 +0000 (UTC)


"Matt Taylor" <para@tampabay.rr.com> writes:
> > I don't like halving the throughput by looking only at 64-bit MMX.
> > However, if SSE2, MMX and FP can all happily co-exist, then I might
> > try to have MMX, FPU, and int units taking on some of the mults if
> > it appears that the SSE2 unit is making the other units idle.
>
> SSE 2 is a possibility although it has higher latency. I usually favor MMX
> because the underlying implementation favors it. Perhaps with SSE 2 it may
> be worthwhile to pack/unpack to do 4 simultaneous 32x32 multiplies using
> pmullw/pmulhuw.

According to http://www.cen.uiuc.edu/~cjiang/reference/index.htm
pmullw and pmulhuw do 4 (MMX) or 8 (SSE2) 16*16->16 multiplies,
not 32*32->32 ones. One of the things that I'm trying to do is minimise
the number of data-movement instructions (unpacks, shuffles, as well as
movs), and forming a 32-bit result from 16*16->16 multiplies would, I
think, be too much effort.
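
To make the cost concrete, here is a rough scalar sketch in C of what one
32-bit lane would need if only 16*16->16 multiplies were available. The
names lane_pmullw, lane_pmulhuw and mul32_from_16 are made up for this
illustration and only model a single lane of the instructions; it is not
code from the thread, and it leaves out the unpacks and shuffles a real
SIMD version would also need.

```c
#include <stdint.h>

/* Scalar stand-ins for ONE 16-bit lane of pmullw / pmulhuw (hypothetical names). */
static uint16_t lane_pmullw(uint16_t x, uint16_t y)
{
    return (uint16_t)((uint32_t)x * y);          /* low 16 bits of the product */
}

static uint16_t lane_pmulhuw(uint16_t x, uint16_t y)
{
    return (uint16_t)(((uint32_t)x * y) >> 16);  /* high 16 bits, unsigned */
}

/* Low 32 bits of a*b, built only from 16-bit lane results. */
uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint16_t al = (uint16_t)a, ah = (uint16_t)(a >> 16);
    uint16_t bl = (uint16_t)b, bh = (uint16_t)(b >> 16);

    uint32_t lo = lane_pmullw(al, bl);           /* bits 0..15 of al*bl  */
    uint32_t hi = lane_pmulhuw(al, bl);          /* bits 16..31 of al*bl */

    /* Cross terms: only their low 16 bits survive the shift by 16. */
    uint32_t cross = (uint32_t)lane_pmullw(ah, bl) + lane_pmullw(al, bh);

    return lo | ((hi + cross) << 16);
}
```

That is four 16-bit multiplies plus shifts and adds for every 32-bit
product, before any data movement is counted.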

> The best way to find out really is to time both sequences.

Alas, I don't own a machine on which I can run such sequences.

> > When I've got code that actually works, I'll post it here for forensics.
> > (May be a while, I don't have a machine I can test on!)
> >
> > Does anyone have any good ideas about
> > if(a<0) a+=b;
> > for 64 bit values, such that a is already in [63-0] of XMMn, and b isn't?
> > (I might just let the 32-bit int unit do this stage rather than idling.)
>
> Ah, if only there were a pcmpltq. If you already have a in an SSE register,
> I would keep it there. The latency of adc combined with moving data between
> register files is much worse than a pcmpltd with appropriate logic to handle
> 64-bits.
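
One way the "stay in the SSE register" idea could look, sketched with
SSE2 intrinsics in C. It assumes b has first been loaded into the low
64 bits of another XMM register (a movq), and it smears the sign bit
with an arithmetic shift rather than using a compare. The names
cond_add_epi64 and cond_add are hypothetical, and nothing here has been
timed.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* if (a < 0) a += b;  applied to each 64-bit lane (hypothetical helper). */
static __m128i cond_add_epi64(__m128i a, __m128i b)
{
    /* pshufd: copy the high dword of each qword into both of its dword slots */
    __m128i hi   = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 3, 1, 1));
    /* psrad: all ones where the 64-bit value is negative, zero otherwise */
    __m128i mask = _mm_srai_epi32(hi, 31);
    /* pand + paddq: add b only in the lanes where the mask is set */
    return _mm_add_epi64(a, _mm_and_si128(mask, b));
}

/* Scalar wrapper so the sketch can be exercised from ordinary C. */
int64_t cond_add(int64_t a, int64_t b)
{
    __m128i va = _mm_loadl_epi64((const __m128i *)&a);  /* movq */
    __m128i vb = _mm_loadl_epi64((const __m128i *)&b);  /* movq */
    int64_t r;
    _mm_storel_epi64((__m128i *)&r, cond_add_epi64(va, vb));
    return r;
}
```

That comes to four instructions on a once b is in a register, with no
trip through the integer register file.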

OK, I'll see if I can scrape some magical cure-all instruction out of
the depths of that reference. I remember in the past seeing rants about
how Intel's SIMD instruction sets were half-baked, and had lots of
gaping holes, and that AltiVec was far more complete. Perhaps they were
right; I keep looking for instructions that don't exist. Why, for example,
does a 4-way float*float instruction exist, but not a 4-way int*int?
The latter would be like a gift from above presently...
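
For what it's worth, SSE2 does provide pmuludq, which multiplies the two
even-numbered unsigned 32-bit lanes into two 64-bit products. A 4-way
32*32->32 multiply can be pieced together from two of those plus
shuffles. The sketch below, using C intrinsics, is one possible
arrangement; mullo_epi32_sse2 and mul4_u32 are hypothetical names, and
the sequence is untimed and not taken from the thread.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Low 32 bits of each of the four 32-bit products (hypothetical helper). */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    /* pmuludq on lanes 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);
    /* psrldq by 4 bytes moves lanes 1 and 3 into positions 0 and 2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));

    /* pshufd: gather the low dword of each 64-bit product into slots 0 and 1 */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));

    /* punpckldq: interleave back into lane order 0, 1, 2, 3 */
    return _mm_unpacklo_epi32(even, odd);
}

/* Array wrapper so the sketch can be exercised from ordinary C. */
void mul4_u32(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mullo_epi32_sse2(va, vb));
}
```

Two multiplies and five data-movement instructions for four products,
so the shuffle overhead is still considerable.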

Phil

-- 
Unpatched IE vulnerability: document.domain parent DNS resolver
Description: Improper duality check leading to firewall breach 
Published: July 29 2002
Reference: http://online.securityfocus.com/archive/1/284908/2002-07-27/2002-08-02/0

