Re: SSE2 half as fast as it should be?



Andrew,
You cannot always mix SSE2 and MMX instructions so easily,
because they operate on different register sets.
The 64-bit MMX registers are mm0-mm7, and the
128-bit XMM registers are xmm0-xmm7
(xmm0-xmm15 on 64-bit processors).
You can find out which instructions work on which registers
in the technical documentation, see
http://developer.intel.com/design/Pentium4/documentation.htm
under Manuals: Instruction Set Reference.
(On 64-bit processors you also have 16 64-bit general-purpose registers.)
So it is a bit of a puzzle which registers you may use with which
instructions, or perhaps there is even more than one option.
Maarten.

<spamtrap@xxxxxxxxxx> wrote in message
news:1145659260.924414.48660@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Mostly AND, XOR, and OR. Occasional sequences that are about half
shifts and half adds or subtracts. It is bit-vector
processing (dot product, AND and population count). I'm beginning to
realize the best method will be to interleave instructions across MMX,
SSE2/3 and x86 because the bottleneck will be the 3 instructions per
cycle decode rate (though dispatch rate is 6).
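
For readers unfamiliar with the operation, a bit-vector dot product as described above (AND the two vectors, then count the surviving bits) can be sketched in portable C. This is only an illustration, not the MMX/SSE2 code under discussion: the names popcount64 and bitvec_dot are made up for this sketch, and the population count is the usual bit-parallel reduction rather than anything tuned for the Pentium 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 64-bit word with the well-known
   bit-parallel reduction (pairs, then nibbles, then bytes). */
unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply sums all byte counts into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Dot product of two bit vectors stored as n 64-bit words
   (hypothetical helper): AND the words, count what is left. */
uint64_t bitvec_dot(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += popcount64(a[i] & b[i]);
    return total;
}
```

The same loop structure applies whichever register set holds the words; only the width of each AND changes.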

For much of the time it will end up doing something like:

t=0 64-bit x86 (EM64T) instruction
t=0 SSE2 instruction (throughput 2, latency 2)
t=0.5 MMX instruction (throughput 1, latency 2)
t=1 (64-bit x86 instruction)
t=1.5 MMX instruction
t=2 64-bit x86 instruction
t=2 SSE2 instruction
t=2.5 MMX instruction
.
etc.
Of course this stuff only works if SSE, x86, and MMX registers do not
affect each other's throughputs, but from what I read I think they are
independent.
I end up with 1 free instruction every other clock cycle, which will be
for loads and stores, etc.
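
The slot counting behind that schedule can be checked with a few lines of C. This is only bookkeeping on the numbers quoted in the post (3 decodes per cycle, one x86 and one MMX instruction per cycle, one SSE2 instruction every 2 cycles); the function name free_decode_slots is hypothetical, and the sketch says nothing about whether real hardware sustains these rates.

```c
/* Decode slots left over in a window of `cycles` clock cycles.
   decode_per_cycle: instructions decoded per cycle (3 in the post).
   x86_per_cycle, mmx_per_cycle: instructions of each kind per cycle.
   sse2_interval: cycles between SSE2 instructions (2 in the post). */
int free_decode_slots(int cycles, int decode_per_cycle,
                      int x86_per_cycle, int mmx_per_cycle,
                      int sse2_interval)
{
    int slots = cycles * decode_per_cycle;
    int used  = cycles * x86_per_cycle
              + cycles * mmx_per_cycle
              + cycles / sse2_interval;
    return slots - used;
}
```

With the post's figures, free_decode_slots(2, 3, 1, 1, 2) gives 6 - 5 = 1, matching the one free slot every other cycle.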

I'm hoping to bit-vector dot product 64 bits' worth of data every 10
instructions (using Harley's method) for each instruction set,
bottlenecking at the micro-op instruction decode rate of 3 per clock
cycle. If possible, 3.8 GHz CPU * (3 instructions / 10 instructions) *
64 bits = 72.96 Gbps dot product rate. It is actually a matrix
multiply so there are no cache problems.
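
Two parts of this estimate can be sketched in C. The first function just reproduces the arithmetic above. The second part assumes that "Harley's method" means the carry-save-adder population count attributed to Robert Harley in Hacker's Delight; that reading is an assumption, and the names dot_rate_gbps, pop64, csa and bitvec_dot_csa are made up for this sketch. It is portable C, not the 10-instruction sequence the post is aiming for.

```c
#include <stddef.h>
#include <stdint.h>

/* Estimated rate in Gbit/s: clock (GHz) times the fraction of a word
   finished per cycle, times the bits per word. */
double dot_rate_gbps(double clock_ghz, double decode_per_cycle,
                     double instr_per_word, double bits_per_word)
{
    return clock_ghz * (decode_per_cycle / instr_per_word) * bits_per_word;
}

/* Plain bit-parallel population count of one 64-bit word. */
unsigned pop64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Carry-save adder: adds three words bit position by bit position.
   *low receives the sum bits (weight 1), *high the carries (weight 2). */
void csa(uint64_t *high, uint64_t *low, uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t u = a ^ b;
    *high = (a & b) | (u & c);
    *low  = u ^ c;
}

/* Bit-vector dot product that folds two ANDed words per step into
   `ones` and counts only the carries, so each pair of words costs
   one full population count instead of two. */
uint64_t bitvec_dot_csa(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t total = 0, ones = 0, twos = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        csa(&twos, &ones, ones, a[i] & b[i], a[i + 1] & b[i + 1]);
        total += pop64(twos);
    }
    total = 2 * total + pop64(ones);   /* carries have weight 2 */
    if (i < n)                         /* odd word left over */
        total += pop64(a[i] & b[i]);
    return total;
}
```

Calling dot_rate_gbps(3.8, 3.0, 10.0, 64.0) reproduces the 72.96 figure, up to floating-point rounding.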

Please let me know if you know of any throughput dependencies between
x86, SSE2 and MMX. Also, let me know if any 64-bit instructions
actually decode to 2 micro ops, because that would really destroy
things. If this is a unique endeavor then hopefully it will get
published.

Thanks,
AndrewF


