Re: SSE2 half as fast as it should be?
- From: spamtrap@xxxxxxxxxx
- Date: 21 Apr 2006 15:41:00 -0700
Mostly AND, XOR, and OR. Occasionally there are sequences where about
half the instructions are shifts and half are adds or subtracts. It is bit-vector
processing (dot product, AND and population count). I'm beginning to
realize the best method will be to interleave instructions across MMX,
SSE2/3 and x86 because the bottleneck will be the 3 instructions per
cycle decode rate (though dispatch rate is 6).
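For reference, the operation being discussed (bit-vector dot product as AND followed by population count) can be sketched in plain C as below. This is only a scalar baseline under my own assumptions: the names popcount64 and bitvec_dot are made up for illustration, and the popcount is the common shift-and-mask reduction, not the MMX/SSE2 code being tuned in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Population count of one 64-bit word using the usual
   shift-and-mask (SWAR) reduction. Illustrative helper name. */
unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           /* 2-bit sums */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           /* 8-bit sums */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);                 /* add the bytes */
}

/* Dot product of two bit-vectors stored as n_words 64-bit words:
   AND the words, then count the bits that survive. */
uint64_t bitvec_dot(const uint64_t *a, const uint64_t *b, size_t n_words)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n_words; i++)
        sum += popcount64(a[i] & b[i]);
    return sum;
}
```

Calling bitvec_dot(a, b, n) on two arrays of n words gives the number of bit positions set in both vectors.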
For much of the time it will end up doing something like:
t=0 64-bit x86 (EM64T) instruction
t=0 SSE2 instruction (throughput 2, latency 2)
t=0.5 MMX instruction (throughput 1, latency 2)
t=1 64-bit x86 instruction
t=1.5 MMX instruction
t=2 64-bit x86 instruction
t=2 SSE2 instruction
t=2.5 MMX instruction
..
etc.
Of course this stuff only works if sse, x86, and mmx registers do not
affect each other's throughputs, but from what I read I think they are
independent.
I end up with 1 free instruction every other clock cycle, which will be
for loads, stores, etc.
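The slot counting behind that claim can be checked with a small C sketch. It assumes the schedule repeats every 2 cycles exactly as listed above and that decode is capped at 3 instructions per cycle; the struct and function names are hypothetical, chosen just for this illustration.

```c
#include <stddef.h>

/* One issue slot in the repeating 2-cycle pattern listed above:
   issue time (in clock cycles within the period) and the unit used. */
struct slot {
    double t;
    const char *unit;
};

static const struct slot schedule[] = {
    { 0.0, "x86-64" },
    { 0.0, "SSE2"   },
    { 0.5, "MMX"    },
    { 1.0, "x86-64" },
    { 1.5, "MMX"    },
};

enum {
    PERIOD_CYCLES    = 2,  /* the pattern repeats at t=2 */
    DECODE_PER_CYCLE = 3   /* assumed decode limit from the post */
};

/* Number of instructions issued in one period of the pattern. */
int instructions_per_period(void)
{
    return (int)(sizeof schedule / sizeof schedule[0]);
}

/* Decode slots left unused in one period (available for loads/stores). */
int free_decode_slots_per_period(void)
{
    return PERIOD_CYCLES * DECODE_PER_CYCLE - instructions_per_period();
}
```

With 5 instructions scheduled against a budget of 6 decode slots per 2 cycles, one slot is left over per period.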
I'm hoping to bit-vector dot product 64 bits' worth of data every 10
instructions (using Harley's method) for each instruction set,
bottlenecking at the micro-op instruction decode rate of 3 per clock
cycle. If possible, 3.8 GHz CPU * (3 instructions per cycle / 10
instructions) * 64 bits = 72.96 Gbps dot product rate. It is actually a
matrix multiply so there are no cache problems.
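The estimate is just clock rate times (decoded instructions per cycle / instructions per 64-bit word) times bits per word. A tiny C sketch of that arithmetic, with a hypothetical function name and the figures from this post passed in as parameters:

```c
/* Estimated dot-product rate in Gbit/s, assuming the decoder is the
   only bottleneck. All inputs are the poster's assumed figures. */
double dot_rate_gbps(double clock_ghz,
                     double decode_per_cycle,
                     double instr_per_word,
                     double bits_per_word)
{
    double words_per_cycle = decode_per_cycle / instr_per_word;
    return clock_ghz * words_per_cycle * bits_per_word;
}
```

dot_rate_gbps(3.8, 3, 10, 64) reproduces the 72.96 figure. It is an upper bound: any instruction that decodes to more than one micro-op lowers it.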
Please let me know if you know of any throughput dependencies between
x86, sse2 and mmx. Also, let me know if any 64-bit instructions
actually decode to 2 micro ops, because that would really destroy
things. If this is a unique endeavor then hopefully it will get
published.
Thanks,
AndrewF