Re: SSE2 half as fast as it should be?



spamtrap@xxxxxxxxxx wrote:
Of course this stuff only works if sse, x86, and mmx registers do not
affect each other's throughputs, but from what I read I think they are
independent.

Please let me know if you know of any throughput dependencies between
x86, sse2 and mmx. Also, let me know if any 64-bit instructions
actually decode to 2 micro ops, because that would really destroy
things. If this is a unique endeavor then hopefully it will get
published.


You should have a look at Agner Fog's "How to optimize for the Pentium family ..." ( http://www.agner.org/assem/ ). There he explains the sharing of execution units between MMX and SSE(2/3), instruction latencies, and lots of other stuff.

Personally I do not think it is worth the effort to split execution between MMX and SSE registers. The register sizes differ (64 vs. 128 bits), which makes it hard to keep data aligned as required and makes a second execution path necessary. The result is code that is much harder to maintain.
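To make the alignment point concrete, here is a minimal sketch in C. The helper names `is_aligned` and `bytes_to_alignment` are made up for illustration. It assumes you want 16-byte alignment for aligned SSE2 loads such as movdqa, and 8-byte natural alignment for 64-bit MMX operands. A pointer that is fine for the MMX path is not necessarily fine for the SSE2 path, so a mixed loop has to track both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment` bytes.
   `alignment` must be a power of two (8 for MMX operands, 16 for SSE2). */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: number of leading bytes that must be handled by a
   scalar (or unaligned) prologue before p reaches the requested alignment. */
static size_t bytes_to_alignment(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t misalignment = (size_t)((uintptr_t)p & (alignment - 1));
    return (alignment - misalignment) & (alignment - 1);
}
```

With these, an address ending in 0x8 passes `is_aligned(p, 8)` but fails `is_aligned(p, 16)`, which is exactly the case where the two register files need different handling.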

Using too much code within a loop will also render all your careful decoder-throughput optimizations useless. Once decoded, the micro-ops are kept in the trace cache. If you use too many instructions in a loop, this cache gets thrashed, and the loop has to be decoded again in each and every iteration.

These highly elaborate optimizations will also make your code highly dependent on a specific processor. Optimum performance is also affected by cache sizes, the size of a cache line and so on. All of this you would have to keep in mind when writing your code. Whenever you use your code on a different machine, well, you're going to have a problem.
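One way to soften that dependency is to pass the cache parameters in rather than hard-coding them. The sketch below is only an illustration: the function name `pick_block_bytes` and the "use half of the cache" rule are assumptions, not anything taken from a manual. How you obtain the cache size and line size for the machine at hand is left open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning helper: pick a working-block size in bytes that is a
   multiple of the cache-line size and fits into half of the given cache.
   The factor of one half is an arbitrary assumption for illustration. */
static size_t pick_block_bytes(size_t cache_bytes, size_t line_bytes)
{
    assert(line_bytes != 0);
    size_t budget = cache_bytes / 2;
    size_t block = budget - budget % line_bytes; /* round down to whole lines */
    return block >= line_bytes ? block : line_bytes; /* never below one line */
}
```

For an 8 KB cache with 64-byte lines this yields 4096, and for a 16 KB cache with 128-byte lines it yields 8192, so the same code adapts without being retuned by hand.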


greetings,
andre
