Re: How much does it take to execute MMX instruction?
- From: spamtrap@xxxxxxxxxx
- Date: 11 Jul 2006 00:58:00 -0700
Unfortunately, there is a good basis for my suspicions :(
We have the following setup: an unrolled loop with lots of nops at the
beginning, followed by our MMX code. At the beginning and at the end of
the loop we execute RDTSC to capture the starting and final timestamps,
and we iterate through the loop 64536 times. Such a large number of
iterations removes many side effects such as first-time loading of data
into the cache, pre-loop context, etc. We run the setup and average the
result. This way we get accurate enough instruction timings. About
five days of experiments taught us nothing about how instruction
timings actually behave :( Two consecutive instructions may
execute very fast or very slow depending on some well-hidden CPU
features. For now our loop executes in 60 cycles; swapping the places
of unrelated instructions may increase this timing to 120. And the
most intriguing part is that the same piece of code 10 instructions
later does not react to instruction swapping in any way.
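The measurement setup described above can be sketched in C roughly as
follows. This is only an illustration, not the code used in our
experiments: it assumes GCC or Clang on x86 (for `__rdtsc()` from
`<x86intrin.h>`), and `read_ticks`, `measure`, `timing_stats` and
`sample_kernel` are made-up names. Instead of a plain average it records
the minimum and median over many runs, which tend to be less sensitive
to interrupts and cache misses than a mean.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter (the RDTSC instruction). */
uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Assumed fallback for non-x86 builds: nanoseconds from the C11 clock. */
uint64_t read_ticks(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

typedef struct { uint64_t min, median, max; } timing_stats;

/* qsort comparator for 64-bit unsigned tick counts. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Time kernel(arg) `runs` times and summarize the tick counts.
   Returns all zeros if runs is 0 or allocation fails. */
timing_stats measure(void (*kernel)(void *), void *arg, size_t runs) {
    timing_stats s = {0, 0, 0};
    uint64_t *t = malloc(runs * sizeof *t);
    if (t == NULL || runs == 0) { free(t); return s; }
    for (size_t i = 0; i < runs; i++) {
        uint64_t start = read_ticks();
        kernel(arg);
        t[i] = read_ticks() - start;
    }
    qsort(t, runs, sizeof *t, cmp_u64);
    s.min = t[0];
    s.median = t[runs / 2];
    s.max = t[runs - 1];
    free(t);
    return s;
}

/* Hypothetical stand-in for the MMX loop: sums 1024 32-bit words. */
void sample_kernel(void *arg) {
    volatile uint32_t *p = arg;
    uint32_t sum = 0;
    for (int i = 0; i < 1024; i++) sum += p[i];
    p[0] = sum;  /* keep the result live so the loop is not removed */
}
```

Note that RDTSC is not a serializing instruction, so for very short
sequences a serializing instruction such as CPUID is commonly placed
before it; the sketch leaves that out for brevity.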
I have both programming and hardware experience. I tried to analyse CPU
performance according to the most probable hardware implementation. I
was able to track down only the most basic timings. The same instruction
with the same dependencies has VERY different timings depending on
preceding history, CPU features and so on. Worst-case estimates from
Intel do not help, since code based on such assumptions is still
very inefficient in terms of timing.
You may think that such precise optimization is not necessary for
today's CPUs, but we really need to be able to perform tons of
calculations on a 400MB array of data per second. With such a load, the
innermost loop taking 10 cycles more than expected decreases our
throughput by 10-15% at once.
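To put rough numbers on this, here is a small back-of-the-envelope
sketch. It assumes the 3GHz clock and the 60-cycle loop mentioned
elsewhere in this post, and takes 1MB as 10^6 bytes; the function names
are made up for illustration.

```c
/* Cycles available per input byte at a given clock and data rate. */
double cycles_per_byte(double clock_hz, double bytes_per_sec) {
    return clock_hz / bytes_per_sec;
}

/* Fraction of throughput lost when a loop of base_cycles
   takes extra_cycles more per iteration. */
double throughput_loss(double base_cycles, double extra_cycles) {
    return extra_cycles / (base_cycles + extra_cycles);
}
```

Under these assumptions, cycles_per_byte(3e9, 400e6) gives 7.5 cycles
per byte, and throughput_loss(60, 10) gives about 0.14, i.e. roughly a
14% drop, which is in line with the 10-15% figure above.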
BTW, switching to SSE3 did not produce a 100% throughput increase. The
processing chunk got two times bigger, but the processing time also
doubled. The only difference was in some memory operations: with SSE3
they were a little bit faster. We are doing our experiments on a 3GHz
P4 with HyperThreading, family 15, model 4, stepping 3.
Sure, it is possible to develop the code by checking the inner loop
timing after adding every instruction to the program, but the most
effective way is to understand what is going on behind the scenes. I
know that the P4 has four execution ports, and I know that instructions
may get executed out of order. Now I know that the U- and V-pipes are no
longer relevant for NetBurst, but the whole Intel Optimization Manual in
conjunction with Agner Fog's optimization manual did not bring me any
closer to understanding what is going on inside the CPU. Instruction
timings vary significantly just because they do so :(
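One common way to untangle such effects on an out-of-order core is to
time the same work as a single dependency chain (bounded by instruction
latency) and as several independent chains (bounded by throughput). A
minimal sketch in C, with made-up function names; it only shows the two
loop shapes, and an optimizing compiler may transform either loop, so
the generated assembly should be checked before trusting any timing.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the previous one. */
uint32_t sum_one_chain(const uint32_t *p, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += p[i];
    return s;
}

/* Four accumulators: four independent chains that an
   out-of-order core can overlap. */
uint32_t sum_four_chains(const uint32_t *p, size_t n) {
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; i++) s0 += p[i];  /* leftover elements */
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum. If the second one is noticeably
faster per element, that suggests the first was limited by the latency
of the dependent adds rather than by the execution ports.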
BTW, for different architectures VTune generates completely different
code, which in turn is also not the best in terms of timings. With our
hand-tuned version we were able to slightly outperform the code
reordered with VTune.
With best regards,
Vladimir S. Mirgorodsky
jukka@xxxxxxxxxxxx wrote:
Another issue is mixed optimization for Pentium 4 and for Pentium M.
Pentium M, in general, has a latency one clock cycle less than Pentium
4. This way code optimized for Pentium 4 will execute on Pentium M
almost two times slower, because of broken instruction pairing.
Dude, pairing is relevant on the ("original") Pentium
microarchitecture; the Pentium 4 uses the NetBurst microarchitecture.
Pairing and such are obsolete concepts on that implementation of the
x86.
The Pentium M is based more on the Pentium Pro microarchitecture than
anything else; they have totally different setups for decoding and
issuing instructions to the ALU implementation internally.
These three are all different, and the U/V-pipe setup is relevant only
to the first, ancient implementation. It just doesn't apply anymore
these days. Pick which of these is relevant for your software, then
write code with that in mind, or find a compromise which works
reasonably on all. But that's not something where assembly is the best
pick performance-wise, as you're not simply using that as the metric
anymore.