LDDQU vs. MOVDQU



LDDQU is the unaligned 128-bit load op introduced in SSE3; it has the
same interface as the older MOVDQU load of SSE2.

Questions:

1. Is the difference purely an implementation detail? If so, why have
a distinctly different op rather than "upgrade" the older op when SSE3
came out?
2. All the (sparse) online docs say not to use LDDQU in a store-load
forwarding situation, and to use MOVDQU instead. I presume that if the
intent is pure streaming, i.e. reading from x and storing into a
distinctly different y (fire and forget), then LDDQU is the appropriate
op?
3. The (sparse) online docs also say that LDDQU works better across
cache lines because it is 2 aligned loads + a realign, rather than 2
partial loads like MOVDQU. Why?
4. When I use LDDQU in a streaming sequential load, do I end up
with double the number of memory accesses (due to the implicit 2
aligned loads), or is the Intel wizardry savvy enough to factor out the
repeated loads?
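
To make questions 3 and 4 concrete, here is a rough software model in C of
"2 aligned loads + a realign". It is only a sketch under stated assumptions:
the names (load_aligned, realign, stream_naive, stream_reuse, aligned_loads)
are made up for illustration, "aligned" means a multiple of 16 bytes from the
start of the buffer, and the buffer is assumed to be padded by one extra
block. It says nothing about what the hardware actually does; it only counts
how many aligned blocks each strategy would read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK = 16 };              /* width of one 128-bit load, in bytes */

/* Hypothetical counter: number of aligned 16-byte blocks read so far. */
static size_t aligned_loads = 0;

/* Model of one aligned load: read block number block_index of buf. */
static void load_aligned(const uint8_t *buf, size_t block_index,
                         uint8_t out[BLOCK])
{
    memcpy(out, buf + block_index * BLOCK, BLOCK);
    ++aligned_loads;
}

/* Realign: take 16 bytes starting `shift` bytes into the pair lo:hi. */
static void realign(const uint8_t lo[BLOCK], const uint8_t hi[BLOCK],
                    size_t shift, uint8_t out[BLOCK])
{
    assert(shift < BLOCK);
    memcpy(out, lo + shift, BLOCK - shift);   /* tail of the low block  */
    memcpy(out + (BLOCK - shift), hi, shift); /* head of the high block */
}

/* Naive streaming: every unaligned load reads both blocks it touches,
   so n loads cost 2 * n aligned block reads. */
static void stream_naive(const uint8_t *buf, size_t off, size_t n,
                         uint8_t *dst)
{
    for (size_t i = 0; i < n; ++i) {
        uint8_t lo[BLOCK], hi[BLOCK];
        size_t pos = off + i * BLOCK;
        load_aligned(buf, pos / BLOCK, lo);
        load_aligned(buf, pos / BLOCK + 1, hi);
        realign(lo, hi, pos % BLOCK, dst + i * BLOCK);
    }
}

/* Streaming with reuse: the high block of one step is the low block of
   the next, so n loads cost only n + 1 aligned block reads. */
static void stream_reuse(const uint8_t *buf, size_t off, size_t n,
                         uint8_t *dst)
{
    uint8_t lo[BLOCK], hi[BLOCK];
    size_t block = off / BLOCK, shift = off % BLOCK;
    load_aligned(buf, block, lo);
    for (size_t i = 0; i < n; ++i) {
        load_aligned(buf, block + i + 1, hi);
        realign(lo, hi, shift, dst + i * BLOCK);
        memcpy(lo, hi, BLOCK);                /* carry hi over as next lo */
    }
}
```

In this model, streaming 4 vectors from byte offset 5 costs 8 block reads
with stream_naive and 5 with stream_reuse. Question 4 is whether the
hardware (or its caches) effectively gives me the second behaviour for free.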

I'm implementing cross-platform unaligned SIMD loads in macstl and want
to do The Right Thing (TM).
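
For reference, this is roughly the kind of wrapper I have in mind, as a
minimal sketch in C rather than the actual macstl code. copy16_unaligned is
a hypothetical name, and the __SSE3__ / __SSE2__ macros are the GCC/Clang
style feature tests, which is an assumption about the compiler. It uses
_mm_lddqu_si128 when SSE3 is enabled at compile time, _mm_loadu_si128 when
only SSE2 is, and plain memcpy elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE3__)
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 (also pulls in SSE2) */
#elif defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_storeu_si128 */
#endif

/* Hypothetical helper: copy 16 bytes from a possibly unaligned src to a
   possibly unaligned dst, picking the load op at compile time. */
static void copy16_unaligned(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE3__)
    /* SSE3 path: LDDQU, same signature as the SSE2 unaligned load. */
    __m128i v = _mm_lddqu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
#elif defined(__SSE2__)
    /* SSE2 path: MOVDQU. */
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
#else
    /* Portable fallback for targets without SSE2/SSE3. */
    memcpy(dst, src, 16);
#endif
}
```

Because the two intrinsics take the same argument and return the same type,
swapping one for the other is a one-line change; the open question is only
when that swap is a win.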

http://www.pixelglow.com/macstl/

Cheers
Glen Low, Pixelglow Software
www.pixelglow.com
