How long does it take to execute an MMX instruction?



Hello group,

I need to develop a highly optimized MMX-based image processing
algorithm. In the Intel Optimization Manual I found worst-case
instruction timings, but it appears that the actual timing may vary
from one execution to the next. That may not be a significant problem
if you are not trying to squeeze every bit of performance out of your
application. If extreme performance is the primary goal, however, you
need to use every available technique to speed up the calculation. The
main gain comes from pairing instructions in the U and V execution
pipes, and here is the contradiction I don't know how to overcome. Any
instruction with a memory operand may incur a penalty of one or two
cycles even on an L1 cache hit. Let's say we plan the instruction
pairing on the assumption that the data will arrive one cycle late:

    movq  mm0,Variable   ; 1U
    paddw mm3,mm2        ; 1V
    paddw mm6,mm7        ; 2U
    psrlw mm5,3          ; 2V
    paddw mm3,mm0        ; 3U - pitfall
    psrlw mm2,3          ; 3V

This snippet only demonstrates the issue. The subsequent code depends
heavily on the value of mm0. If the delay is longer than the one
planned cycle, the instruction marked 3U stalls for an additional clock
cycle, which wrecks the whole calculation chain: cycle 4 may have its
own pair of instructions, which may not pair with the delayed 3U
addition. I understand that planning for the worst-case latency may
help, but data that arrives early, combined with out-of-order
execution, will cause the same kind of problem.
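
To make the stall concrete, here is a rough Python sketch of the
schedule above. It is not a cycle-accurate simulator of any real CPU.
It assumes that both instructions of a pair issue together and strictly
in order, that every MMX ALU operation has a latency of one cycle, and
that the load latency is the only thing that varies. The names Instr,
schedule and snippet are invented for this illustration.

```python
from collections import namedtuple

# One instruction: display text, destination register, source registers,
# and latency in cycles (result usable at issue_cycle + latency).
Instr = namedtuple("Instr", "text dest srcs latency")


def schedule(pairs):
    """Return the issue cycle of each U/V pair.

    Simplifying assumption: a pair issues one cycle after the previous
    pair at the earliest, and only once every source register of both
    instructions is ready. Otherwise the whole pair stalls.
    """
    ready = {}           # register -> first cycle its value can be read
    cycle = 0
    issue_cycles = []
    for pair in pairs:
        cycle += 1
        needed = max(
            (ready.get(src, 0) for ins in pair for src in ins.srcs),
            default=0,
        )
        cycle = max(cycle, needed)   # stall until all inputs are ready
        issue_cycles.append(cycle)
        for ins in pair:
            ready[ins.dest] = cycle + ins.latency
    return issue_cycles


def snippet(load_latency):
    """The six instructions from the post, grouped into three pairs."""
    return [
        (Instr("movq mm0,Variable", "mm0", (), load_latency),
         Instr("paddw mm3,mm2", "mm3", ("mm3", "mm2"), 1)),
        (Instr("paddw mm6,mm7", "mm6", ("mm6", "mm7"), 1),
         Instr("psrlw mm5,3", "mm5", ("mm5",), 1)),
        (Instr("paddw mm3,mm0", "mm3", ("mm3", "mm0"), 1),
         Instr("psrlw mm2,3", "mm2", ("mm2",), 1)),
    ]


for latency in (1, 2, 3, 4):
    print("load latency", latency, "-> issue cycles", schedule(snippet(latency)))
```

In this model a load latency of 2 (the planned case: one cycle plus one
cycle of penalty) keeps the pairs on cycles 1, 2 and 3, while a latency
of 3 pushes the third pair to cycle 4, which is the stall described
above.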

Another issue is optimizing for both the Pentium 4 and the Pentium M.
In general, latencies on the Pentium M are one clock cycle shorter than
on the Pentium 4. As a result, code optimized for the Pentium 4 will
run almost two times slower on the Pentium M because the instruction
pairing is broken.

Is there any bullet-proof strategy that would help overcome the issue
described above?

With best regards,
Vladimir S. Mirgorodsky
