Re: Memset me up Scotty.
From: Iman Habib (pixelpajasREMOVETHIS_at_hotmail.com)
Date: 02/25/04
- Previous message: Clax86: "Having trouble posting?"
- In reply to: Matt Taylor: "Re: Memset me up Scotty."
- Next in thread: Matt Taylor: "Re: Memset me up Scotty."
- Reply: Matt Taylor: "Re: Memset me up Scotty."
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 25 Feb 2004 09:31:04 +0000 (UTC)
Hey Matt, or anyone else who can help me out for that matter!
Could you help me out here, please?
I copy-pasted and adjusted your Duff's Device asm version
to try it out, but it is not quite behaving as it should. =/
There is a number or something wrong somewhere.
I have been staring myself blind at the code but
cannot find the problem. =(
Here is the function, snipped out of my memtest.asm file:
-------------8<-----------8<--------------
;desc. nasm source file: memtest.asm
bits 32
global _mmx_memset
section .text
align 16
;;; void __cdecl mmx_memset(void *d, int val, int len);
_mmx_memset:
    push ebp
    mov ebp, esp
    pushad                      ; save all general registers
    mov edx, [ebp+8]            ; destination
    movd mm0, [ebp+12]          ; 32-bit value to fill with
    mov eax, [ebp+16]           ; length of array (in dwords, not bytes)
    punpckldq mm0, mm0          ; copy the low dword into the high dword
    lea ecx, [eax-1]            ; ecx = len - 1
    xor ecx, 7                  ; ecx = 8 - len
    and ecx, 7                  ; ecx = (8 - len) & 7
    add eax, ecx                ; round len up
    jmp [starttbl+ecx*4]
l0:
    movntq [edx+eax*8-8], mm0
l1:
    movntq [edx+eax*8-16], mm0
l2:
    movntq [edx+eax*8-24], mm0
l3:
    movntq [edx+eax*8-32], mm0
l4:
    movntq [edx+eax*8-40], mm0
l5:
    movntq [edx+eax*8-48], mm0
l6:
    movntq [edx+eax*8-56], mm0
l7:
    movntq [edx+eax*8-64], mm0
    dec eax
    jnz l0
    emms
    popad                       ; restore the saved registers
    pop ebp
    ret
starttbl:
    dd l0
    dd l1
    dd l2
    dd l3
    dd l4
    dd l5
    dd l6
    dd l7
-------------8<-----------8<--------------
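To check the numbers, a plain C model of the entry arithmetic may help. This is only a sketch, not the asm itself: entry_offset, rounded_count and stores_done are made-up names, and the model assumes that the count n is in 64-bit words (one per movntq), that n is at least 1, and that the counter drops by 8 on every pass through the eight stores.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index into starttbl, i.e. how many of the eight
   stores are skipped on the first pass. Mirrors the lea/xor/and lines. */
static uint32_t entry_offset(uint32_t n)
{
    assert(n > 0);
    return ((n - 1) ^ 7u) & 7u;        /* equals (8 - n) & 7 */
}

/* Hypothetical helper: n rounded up to a multiple of 8 (the add line). */
static uint32_t rounded_count(uint32_t n)
{
    return n + entry_offset(n);
}

/* Model of the loop: total number of stores executed for n qwords,
   ASSUMING the counter is reduced by 8 after each pass. */
static uint32_t stores_done(uint32_t n)
{
    uint32_t count = rounded_count(n);
    uint32_t done = 8 - entry_offset(n);   /* first, partial pass */
    count -= 8;
    while (count != 0) {                   /* remaining full passes */
        done += 8;
        count -= 8;
    }
    return done;
}
```

For n = 13 the model gives an entry offset of 3 and a rounded count of 16, so the first pass does 5 stores and the second does 8. The totals only add up to n under the two assumptions stated above.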
"Matt Taylor" <para@tampabay.rr.com> wrote in message
news:VOo_b.82400$Po1.44398@twister.tampabay.rr.com...
> "Iman Habib" <pixelpajasREMOVETHIS@hotmail.com> wrote in message
> news:c1cdtf$1ft6or$1@ID-168056.news.uni-berlin.de...
> > Hi guys..
> >
> > I'm trying to pull a fast memset routine out of my magic hat
> > for a toy 3D engine of mine.
> >
> > And to be honest.. I suck at assembly optimizations. =...(
> > The routine I have managed to make is about twice as fast as
> > a regular "rep stosd" 32-bit memset on my AMD Athlon XP.
> > But I am still not content, as I have a gut feeling that it is
> > possible to make it faster.
>
> The K7 Optimization Manual will be a help for starters:
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
>
> There is a section on memcpy() which may help you optimize your memset().
>
> > So I'll let you guys poke at my memset code
> > and see if you can find more places to optimize. =)
> >
> > Or even better.. some of you may have links to webpages that have better code
> >
> > cheers
> > //iman
> >
> > -----------------8<----------------8<--------------------
> >
> > inline void memset32mmx(unsigned int *dest, unsigned int c, unsigned int len)
> > {
> > unsigned int apa[2];
> > apa[0] = apa[1] = c;
> >
> > if(len < 2) { // I know I can remove the code here, remake it, put it in the next
> > _asm { // asm block and make it a bit faster, but it won't be significant. Do it later.
> > mov eax,c
> > mov edi,dest
> > mov ecx,len
> > cld
> > rep stosd
> > }
> > return;
> > }
> >
> > _asm {
> > mov edx, [dest]
> > mov eax, len
> > mov ecx, eax
> > shr eax, 1 //len/2
> > and ecx, 1 //len%2
> > movd mm1, c
>
> This movd appears to do nothing. Also, rather than tying up an extra
> register, I would do this:
>
> mov edx, [dest]
> mov ecx, [len]
> mov eax, [c]
> shr ecx, 1
> cmovnc eax, [edx+ecx*8]
> mov [edx+ecx*8], eax
>
> This assumes that you can write 1 element beyond the end which may not be
> true. If not, use a jnc as you do inside your loop. The jnc has the added
> benefit of not needing an extra register.
>
> > movq mm0, [apa]
>
> You would probably be better off doing this:
>
> movq mm0, [c]
> punpckldq mm0, mm0
>
> This does the same thing, and it avoids possible STLF penalties. (I don't
> know if this is far enough away from the initialization of apa to matter.)
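
As a plain C model of what that movq/punpckldq pair is meant to leave in mm0, and of what one 8-byte store then writes, something like the following might do. It is a sketch only: broadcast32 and store_pair are made-up names, and an ordinary memcpy stands in for movntq.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the 32-bit fill value copied into both halves
   of a 64-bit word, which is what punpckldq mm0, mm0 produces. */
static uint64_t broadcast32(uint32_t c)
{
    uint64_t lo = c;
    return (lo << 32) | lo;
}

/* Hypothetical helper: one 8-byte store that fills two adjacent dwords.
   Both halves are equal, so byte order does not matter here. */
static void store_pair(uint32_t *dest, uint32_t c)
{
    assert(dest != NULL);
    uint64_t q = broadcast32(c);
    memcpy(dest, &q, sizeof q);
}
```

The point of the register-only version is that the fill value never has to be written to apa[] and read back as a 64-bit load.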
>
> > l:
> > movntq [edx], mm0
> > add edx, 8
> > dec eax
> > jnz l
>
> First rewrite the loop like this:
>
> l:
> movntq [edx+eax*8-8], mm0
> dec eax
> jnz l
>
> Now unroll:
>
> lea ecx, [eax-1] ; ecx = len - 1
> xor ecx, 7 ; ecx = 8 - len
> and ecx, 7 ; ecx = (8 - len) & 7
> add eax, ecx ; round len up
> jmp [starttbl+ecx*4]
>
> l0:
> movntq [edx+eax*8-8], mm0
> l1:
> movntq [edx+eax*8-16], mm0
> l2:
> movntq [edx+eax*8-24], mm0
> l3:
> movntq [edx+eax*8-32], mm0
> l4:
> movntq [edx+eax*8-40], mm0
> l5:
> movntq [edx+eax*8-48], mm0
> l6:
> movntq [edx+eax*8-56], mm0
> l7:
> movntq [edx+eax*8-64], mm0
> dec eax
> jnz l0
>
> starttbl:
> dd l0
> dd l1
> dd l2
> dd l3
> dd l4
> dd l5
> dd l6
> dd l7
>
> It isn't really possible to do this in inline assembly, but you can express
> this in plain C. See http://www.azillionmonkeys.com/qed/case2.html for
> information on Duff's Device. Duff's Device is used for loop unrolling in C.
> Here we are unrolling a loop in assembly. The only problem now is that
> registers are not preserved outside of _asm blocks.
>
> Within the last several years, various compilers have added support for the
> MMX intrinsics. These intrinsics allow you to use MMX instructions and reap
> the benefits of an optimizing compiler and a high-level language. Your
> original loop would be written this way:
>
> __m64 mc = _mm_set1_pi32(c);
>
> for(unsigned int i = 0; i < (len / 2); i++)
> _mm_stream_pi(((__m64 *) dest) + i, mc);
>
> if (len & 1)
> dest[len - 1] = c;
>
> You can use MMX intrinsics with Duff's Device to get output approximately
> similar to what I wrote above. The only other way to unroll like that is to
> move to a real assembler.
>
> > test ecx, ecx
> > je q
> > sub edx, 4
> > movntq [edx], mm0
> > q:
> >
> > // sfence
> > emms
> > }
> > }
>
> Athlon can write 16 bytes to the cache each cycle. Right now you're using
> about 25% of that bandwidth. You can't get a 400% speed improvement because
> of the memory bus, but you probably have a bit of headroom left.
>
> -Matt
>
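
The Duff's Device suggestion in the quoted reply could be sketched in plain C roughly like this. It is an illustration under assumptions, not the code from the linked page: memset32_duff is a made-up name, and it uses ordinary 32-bit stores instead of movntq, so it only shows the control flow (jump into the middle of an unrolled loop, then run full passes of eight).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Fills len 32-bit words at dest with the value c. */
static void memset32_duff(uint32_t *dest, uint32_t c, size_t len)
{
    assert(dest != NULL || len == 0);
    if (len == 0)
        return;

    size_t passes = (len + 7) / 8;   /* passes through the unrolled body */

    switch (len % 8) {               /* enter the loop part-way through  */
    case 0: do { *dest++ = c;        /* each case falls through to the next */
    case 7:      *dest++ = c;
    case 6:      *dest++ = c;
    case 5:      *dest++ = c;
    case 4:      *dest++ = c;
    case 3:      *dest++ = c;
    case 2:      *dest++ = c;
    case 1:      *dest++ = c;
            } while (--passes > 0);
    }
}
```

A call such as memset32_duff(buf, 0x00FF00FF, 19) would do three stores on the first pass and eight on each of the next two. The switch plays the role of the jump table, and the pass counter plays the role of the rounded-up length.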