Re: How to cast using MSVC++ Intrinsics



Gerd Isenberg wrote:
Yes, 16-byte-aligned data.
For 8-byte-aligned integer data I previously thought two MOVQ and one
PUNPCKLQDQ instruction was best, but I never considered MOVLPD/MOVHPD for
integers.

movq tends to be a more expensive solution as it has to zero the high quadword. movsd (SSE2) would be nice but I couldn't get it to work.
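For reference, the two load strategies being compared can be sketched with SSE2 intrinsics roughly as below. The function names are made up for illustration, and whether a given compiler emits exactly movq/punpcklqdq or movlpd/movhpd for them is an assumption to check in the generated assembly. The assert only documents the 8-byte alignment assumed in this thread.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Variant A: two MOVQ-style loads merged with PUNPCKLQDQ.
   Each _mm_loadl_epi64 zeroes the upper quadword of its result. */
static __m128i load_2x64_movq(const int64_t *lo, const int64_t *hi)
{
    __m128i l = _mm_loadl_epi64((const __m128i *)lo);
    __m128i h = _mm_loadl_epi64((const __m128i *)hi);
    return _mm_unpacklo_epi64(l, h);
}

/* Variant B: MOVLPD/MOVHPD-style loads into a double vector,
   then a free reinterpretation of the bits as integers. */
static __m128i load_2x64_movlhpd(const int64_t *lo, const int64_t *hi)
{
    __m128d v = _mm_setzero_pd();
    assert(((uintptr_t)lo & 7) == 0 && ((uintptr_t)hi & 7) == 0);
    v = _mm_loadl_pd(v, (const double *)lo);  /* fills low quadword  */
    v = _mm_loadh_pd(v, (const double *)hi);  /* fills high quadword */
    return _mm_castpd_si128(v);               /* no instruction, type change only */
}

/* Returns nonzero when all 128 bits of a and b match. */
static int same_bits(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

Variant B never interprets the data as floating point; it only moves bits, so the integer values come through unchanged.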

Interesting.
Are we talking about SSE2 in 32-bit mode or 64-bit mode?

I haven't done a lot of testing in 64-bit mode. I don't think it matters as the documentation doesn't make a distinction. I have a Windows 64 Beta on a partition on my notebook and need to get a release version from a friend with MSDN. I plan to get started on some development stuff with it one of these days. I have a new PowerMac G5 which I'm also playing with and am getting up to speed on Altivec.

A union of m128 and m128i?

I think that it was done with pointers. I don't have the code anymore as I found a few other approaches that are useful.
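A union along the lines asked about might look like the sketch below. The type name vec128 and the function name are hypothetical, and this is not the code that was lost; it only shows the idea of filling the float view with half loads (movlps/movhps style) and reading the same bits back through the integer view.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE header */
#include <stdint.h>

/* Hypothetical union giving float and integer views of one XMM value. */
typedef union {
    __m128  f;
    __m128i i;
} vec128;

/* Load four 32-bit integers with two 64-bit float half loads. */
static __m128i load_ints_via_float(const int32_t *p)
{
    vec128 v;
    assert(((uintptr_t)p & 7) == 0);  /* documents the 8-byte alignment assumed here */
    v.f = _mm_setzero_ps();
    v.f = _mm_loadl_pi(v.f, (const __m64 *)p);        /* low two ints  */
    v.f = _mm_loadh_pi(v.f, (const __m64 *)(p + 2));  /* high two ints */
    return v.i;                                       /* same bits, integer type */
}
```

Where the compiler provides them, the cast intrinsics such as _mm_castps_si128 express the same reinterpretation without going through a union.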

Namely that MSVC++ tosses in a movaps
to load the SIMD register from the stack on a quadword floating
point load. I wasn't able to find a way to get rid of this cost.
The other problem is that MSVC++ intrinsics generate lousy code.

I had similar problems with MSVC++ 6. But the 2005 beta seems much better IMHO (32-bit mode).

I was able to get around the extra movaps by loading the high quadword as well. What would be nice is to have init to 0 or -1 instructions. If you put in a pxor or pcmpeq, you'll get a warning and a movaps for initialization.
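The intrinsics headers do have ways to request these two constants without naming an uninitialized variable, sketched below. The wrapper names are invented for illustration. Whether a particular compiler version lowers them to a bare pxor or pcmpeqd on one register, with no extra movaps, is an assumption that has to be checked in the generated assembly.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* All 128 bits clear. */
static __m128i all_zeros(void) { return _mm_setzero_si128(); }

/* All 128 bits set (every 32-bit lane equals -1). */
static __m128i all_ones(void)  { return _mm_set1_epi32(-1); }

/* Usage check: movemask collects the top bit of each of the 16 bytes. */
static void check_constants(void)
{
    assert(_mm_movemask_epi8(all_zeros()) == 0x0000);
    assert(_mm_movemask_epi8(all_ones())  == 0xFFFF);
}
```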

VS2005 is much better but I think that GCC does a better job at
autovectorization. And there are more architecture choices.

Hmm... I had some SSE2 intrinsic routines where all eight available XMM
registers were used in a rather optimal way, in 32-bit mode.

If you have code that needs them, then it uses them. I'm writing some very short routines to optimize some open source code so what I'm doing ranges from a few instructions to a few hundred instructions.

Inline assembler has its issues too in that there are many constructs
that you can't pass to inline.

In 64-bit mode it might be OK, since some floats/doubles may still
reside in other registers. Have you noticed SOG 5.16 Interleave Loads
and Stores:

I'm not doing any computation here. All I want to do is make a copy of an object instance. So I just read from memory and then write to memory. So I can use any type that I want to for some of these routines.

Rationale

When using SSE and SSE2 instructions to perform loads and stores, it is
best to interleave them in the following pattern: Load, Store, Load,
Store, Load, Store, etc. This enables the processor to maximize the
load/store bandwidth.

I'm partial to load, load, store, store or load, load, load, store, store, store, as the mov instructions can have latencies of one, two, three or four cycles with quadwords. If you do a load, store, load, store, the first store may have to wait for one or more cycles before it can start to execute. I generally try to schedule instructions so that the data is there, if it comes from L1, by the time I have an instruction that needs to use it.
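The two schedules under discussion, applied to a plain 16-byte-aligned copy, might be written as below. The function names are hypothetical, and the source order is only a hint: an optimizing compiler is free to reorder the loads and stores, so the schedule that actually runs has to be confirmed in the assembly output.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Interleaved schedule from the quoted guide: load, store, load, store. */
static void copy_interleaved(__m128i *dst, const __m128i *src, size_t n)
{
    assert(n % 2 == 0);  /* sketch handles an even count of 16-byte blocks */
    for (size_t i = 0; i < n; i += 2) {
        __m128i a = _mm_load_si128(src + i);
        _mm_store_si128(dst + i, a);
        __m128i b = _mm_load_si128(src + i + 1);
        _mm_store_si128(dst + i + 1, b);
    }
}

/* Grouped schedule: load, load, store, store. The second load gives
   the first one time to complete before its store needs the data. */
static void copy_grouped(__m128i *dst, const __m128i *src, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        __m128i a = _mm_load_si128(src + i);
        __m128i b = _mm_load_si128(src + i + 1);
        _mm_store_si128(dst + i, a);
        _mm_store_si128(dst + i + 1, b);
    }
}
```

Both functions produce the same copy; only the order in which the memory operations are issued differs, which is what a timing comparison would measure.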

Thanks for pointing out the movdqa problem, which I was not aware of.

I can't wait to get a dual-core low-power A64 system but I suspect that I'll have to wait until next year. Already used up this year's computing budget on the PowerMac.
