Re: SSE2 half as fast as it should be?



Mostly AND, XOR, and OR, with occasional sequences that are about half
shifts and half adds or subtracts. It is bit-vector processing (dot
product, AND and population count). I'm beginning to realize the best
method will be to interleave instructions across MMX, SSE2/3 and x86,
because the bottleneck will be the decode rate of 3 instructions per
cycle (though the dispatch rate is 6).
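For reference, here is a minimal portable C sketch of the operation being discussed, assuming "bit-vector dot product" means AND followed by population count over 64-bit words. The names popcount64 and bitvec_dot are hypothetical, and this is plain scalar code, not the interleaved MMX/SSE2 version.

```c
#include <stddef.h>
#include <stdint.h>

/* Population count of one 64-bit word using the usual SWAR bit trick
   (pairs, then nibbles, then bytes, then a multiply to sum the bytes). */
unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Dot product of two bit vectors stored as n 64-bit words (assumed layout):
   AND matching words, then count the set bits in the result. */
uint64_t bitvec_dot(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += popcount64(a[i] & b[i]);
    return sum;
}
```

The inner loop is almost entirely AND, shift, add and subtract, which is the instruction mix described above.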

For much of the time it will end up doing something like:

t=0   64-bit x86 (EM64T) instruction
t=0   SSE2 instruction (throughput 2, latency 2)
t=0.5 MMX instruction (throughput 1, latency 2)
t=1   64-bit x86 instruction
t=1.5 MMX instruction
t=2   64-bit x86 instruction
t=2   SSE2 instruction
t=2.5 MMX instruction
etc.
Of course this only works if the SSE, x86, and MMX registers do not
affect each other's throughput, but from what I have read I think they
are independent.
That schedule issues 5 instructions every 2 clock cycles against 6
decode slots, so I end up with 1 free instruction every other clock
cycle, which will be used for loads, stores, etc.

I'm hoping to compute the bit-vector dot product of 64 bits' worth of
data every 10 instructions (using Harley's method) in each instruction
set, bottlenecked at the micro-op decode rate of 3 per clock cycle. If
that is possible: 3.8 GHz CPU * (3 instructions per cycle / 10
instructions) * 64 bits = 72.96 Gbps dot product rate. It is actually a
matrix multiply, so there are no cache problems.
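A rough sketch of the carry-save-adder idea, assuming "Harley's method" here refers to the CSA-based population count attributed to Robert Harley. The post does not say which variant is meant, so this is an illustration and not the exact 10-instruction sequence. The names csa, popcount64 and bitvec_dot_csa are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* SWAR population count of one 64-bit word. */
unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Carry-save adder: adds three words bit position by bit position.
   *lo receives the per-bit sum, *hi the per-bit carry (weight 2).
   Five logical operations: XOR, AND, AND, OR, XOR. */
void csa(uint64_t *hi, uint64_t *lo, uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t u = a ^ b;
    *hi = (a & b) | (u & c);
    *lo = u ^ c;
}

/* Bit-vector dot product that folds two ANDed words per step into a
   running "ones" word, so only one popcount is needed per pair of words
   (plus one for the leftover "ones" at the end). */
uint64_t bitvec_dot_csa(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t total = 0, ones = 0, twos;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        csa(&twos, &ones, ones, a[i] & b[i], a[i + 1] & b[i + 1]);
        total += 2 * (uint64_t)popcount64(twos);  /* carries count double */
    }
    total += popcount64(ones);
    for (; i < n; i++)                            /* odd trailing word */
        total += popcount64(a[i] & b[i]);
    return total;
}
```

The CSA step itself is only AND, XOR and OR, so it maps directly onto MMX and SSE2 logical instructions. Deeper CSA trees (fours, eights) would cut the number of popcounts further at the cost of more registers.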

Please let me know if you know of any throughput dependencies between
x86, SSE2 and MMX. Also, let me know if any 64-bit instructions
actually decode to 2 micro-ops, because that would really destroy
things. If this is a unique endeavor then hopefully it will get
published.

Thanks,
AndrewF
