Re: How to cast using MSVC++ Intrinsics
- From: Michael Moy <spamtrap@xxxxxxxxxx>
- Date: Wed, 11 May 2005 04:40:21 +0000 (UTC)
Gerd Isenberg wrote:
Yes, 16-byte-aligned data. For 8-byte-aligned integer data I previously thought two MOVQ instructions and one PUNPCKLQDQ were best, but I never considered MOVLPD/MOVHPD for integers.
movq tends to be a more expensive solution as it has to zero the high quadword. movsd (SSE2) would be nice but I couldn't get it to work.
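For readers following along, the two load strategies being compared might look roughly like this with SSE2 intrinsics. This is only a sketch: the function names are made up for illustration, `_mm_castpd_si128` may not exist in the compilers discussed in this thread, and the instructions named in the comments are what a compiler would typically emit, not a guarantee.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Route 1: two MOVQ loads joined by PUNPCKLQDQ.
   Each MOVQ zeroes the upper quadword of its destination register. */
static __m128i load_2x64_movq(const void *lo, const void *hi)
{
    __m128i a = _mm_loadl_epi64((const __m128i *)lo);
    __m128i b = _mm_loadl_epi64((const __m128i *)hi);
    return _mm_unpacklo_epi64(a, b);
}

/* Route 2: MOVLPD + MOVHPD, then reinterpret the bits as integers.
   The zero initialisation only exists to give the intrinsics a defined input. */
static __m128i load_2x64_movlhpd(const void *lo, const void *hi)
{
    __m128d v = _mm_setzero_pd();
    v = _mm_loadl_pd(v, (const double *)lo);
    v = _mm_loadh_pd(v, (const double *)hi);
    return _mm_castpd_si128(v);
}

/* Hypothetical helper: 1 if all 128 bits match. */
static int same_bits(__m128i x, __m128i y)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}

/* Usage sketch: both routes should fill the register identically. */
static void check_loads(void)
{
    int64_t q[2] = { 0x0123456789abcdefLL, -2 };
    assert(same_bits(load_2x64_movq(&q[0], &q[1]),
                     load_2x64_movlhpd(&q[0], &q[1])));
}
```

Which route is cheaper depends on the CPU and on what the compiler does with the initialisation, so it is worth checking the generated assembly.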
Interesting. Are we talking about SSE2 in 32-bit mode or 64-bit mode?
I haven't done a lot of testing in 64-bit mode. I don't think it matters as the documentation doesn't make a distinction. I have a Windows 64 Beta on a partition on my notebook and need to get a release version from a friend with MSDN. I plan to get started on some development stuff with it one of these days. I have a new PowerMac G5 which I'm also playing with and am getting up to speed on Altivec.
A union of m128 and m128i?
I think that it was done with pointers. I don't have the code anymore as I found a few other approaches that are useful.
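Since the original code is gone, here is a guess at what the two obvious ways of reinterpreting a float register as an integer register look like. The union name `xmm_bits` and the function names are hypothetical, and the cast intrinsic `_mm_castps_si128` may not have been available in the MSVC++ versions discussed here.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE __m128 type */

/* Hypothetical union overlaying the float and integer views. */
typedef union {
    __m128  f;
    __m128i i;
} xmm_bits;

/* Reinterpret through the union (write one member, read the other). */
static __m128i as_int_union(__m128 v)
{
    xmm_bits u;
    u.f = v;
    return u.i;
}

/* Reinterpret with the cast intrinsic, which generates no instructions. */
static __m128i as_int_cast(__m128 v)
{
    return _mm_castps_si128(v);
}

/* Usage sketch: both routes should yield the same bit pattern. */
static void check_casts(void)
{
    __m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128i eq = _mm_cmpeq_epi8(as_int_union(v), as_int_cast(v));
    assert(_mm_movemask_epi8(eq) == 0xFFFF);
}
```

The union route is the one that can push a compiler into a round trip through the stack, so the cast intrinsic is usually the safer bet where it exists.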
Namely, MSVC++ tosses in a movaps to load the SIMD register from the stack on a quadword floating-point load. I wasn't able to find a way to get rid of this cost. The other problem is that the MSVC++ intrinsics generate lousy code.
I had similar problems with MSVC++ 6, but the 2005 beta seems much better IMHO (32-bit mode).
I was able to get around the extra movaps by loading the high quadword as well. What would be nice is to have init to 0 or -1 instructions. If you put in a pxor or pcmpeq, you'll get a warning and a movaps for initialization.
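As a sketch of the "init to 0 or -1" idea: the intrinsics below are standard SSE2, and compilers commonly turn them into a PXOR or PCMPEQD of a register with itself, though that is an assumption about code generation and not something the intrinsics promise. The function names are made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* All bits clear; usually compiled to PXOR xmm, xmm. */
static __m128i all_zeros(void)
{
    return _mm_setzero_si128();
}

/* All bits set; many compilers emit PCMPEQD xmm, xmm for this,
   others load a constant from memory. */
static __m128i all_ones(void)
{
    return _mm_set1_epi32(-1);
}

/* Usage sketch: check the low dword and the per-byte sign bits. */
static void check_constants(void)
{
    assert(_mm_cvtsi128_si32(all_zeros()) == 0);
    assert(_mm_movemask_epi8(all_zeros()) == 0);
    assert(_mm_cvtsi128_si32(all_ones()) == -1);
    assert(_mm_movemask_epi8(all_ones()) == 0xFFFF);
}
```

Writing `_mm_cmpeq_epi32(x, x)` on an uninitialised `x` is the variant that triggers the warning mentioned above, which is why the sketch uses `_mm_set1_epi32(-1)` instead.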
VS2005 is much better but I think that GCC does a better job at autovectorization. And there are more architecture choices.
Hmm... I had some SSE2 intrinsic routines where all eight available XMM registers were used in a rather optimal way, in 32-bit mode.
If you have code that needs them, then it uses them. I'm writing some very short routines to optimize some open source code so what I'm doing ranges from a few instructions to a few hundred instructions.
Inline assembler has its issues too in that there are many constructs that you can't pass to inline.
In 64-bit mode it might be OK, since some floats/doubles may still reside in other registers. Have you noticed SOG 5.16, Interleave Loads and Stores:
I'm not doing any computation here. All I want to do is make a copy of an object instance. So I just read from memory and then write to memory. So I can use any type that I want to for some of these routines.
Rationale
When using SSE and SSE2 instructions to perform loads and stores, it is best to interleave them in the following pattern: Load, Store, Load, Store, Load, Store, etc. This enables the processor to maximize the load/store bandwidth.
I'm partial to load, load, store, store or load, load, load, store, store, store as the mov instructions can have latencies of one, two, three or four with quadwords. If you do a load, store, load, store, the first store may have to wait for one or more cycles before it can start to execute. I generally try to schedule instructions so that the data is there, if from L1, by the time I have an instruction that needs to use the data.
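To make the two orderings concrete, here is a minimal sketch of an aligned block copy written both ways. The function names are hypothetical, the buffers are assumed to be 16-byte aligned with a length that is a multiple of 32 bytes, and an optimising compiler is free to reschedule the loads and stores, so the source order is only a hint.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Load, store, load, store: the pattern from the quoted guide. */
static void copy_interleaved(void *dst, const void *src, size_t n_bytes)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n_bytes % 32 == 0);
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n_bytes / 16; i += 2) {
        __m128i a = _mm_load_si128(s + i);
        _mm_store_si128(d + i, a);
        __m128i b = _mm_load_si128(s + i + 1);
        _mm_store_si128(d + i + 1, b);
    }
}

/* Load, load, store, store: both loads are issued before the first
   store, giving the first load time to complete. */
static void copy_grouped(void *dst, const void *src, size_t n_bytes)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n_bytes % 32 == 0);
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n_bytes / 16; i += 2) {
        __m128i a = _mm_load_si128(s + i);
        __m128i b = _mm_load_si128(s + i + 1);
        _mm_store_si128(d + i, a);
        _mm_store_si128(d + i + 1, b);
    }
}
```

Both functions copy the same bytes; only timing on the target CPU can say which ordering is faster, so inspect the generated code and measure.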
Thanks for pointing out the movdqa problem, which I was not aware of.
I can't wait to get a dual-core low-power A64 system but I suspect that I'll have to wait until next year. Already used up this year's computing budget on the PowerMac.