Re: Memset me up Scotty.

From: Iman Habib (pixelpajasREMOVETHIS_at_hotmail.com)
Date: 02/25/04

  • Next message: Matt Taylor: "Re: Memset me up Scotty."
    Date: Wed, 25 Feb 2004 09:31:04 +0000 (UTC)
    
    

    Hey Matt, or anyone else who can help me out, for that matter!
    Could you help me out here, please?

    I copy-pasted and adjusted your Duff's Device asm version
    to try it out, but it is not quite behaving as it should. =/

    There is a wrong number or something somewhere.
    I have been staring myself blind at the code but
    cannot find the problem. =(

    Here is the function, snipped out of my memtest.asm file:

    -------------8<-----------8<--------------
    ;desc. NASM source file: memtest.asm

    bits 32

    global _mmx_memset

    section .text

    align 16

    ;;; void __cdecl mmx_memset(void *d, int val, int len);
    _mmx_memset:

        push ebp
        mov ebp,esp
        pushad

        mov edx, [ebp+8] ; destination
        movd mm0, [ebp+12] ; 32bit value to fill with
        mov eax, [ebp+16] ; length of array (in dwords, not bytes)

        punpckldq mm0, mm0

        lea ecx, [eax-1] ; ecx = len - 1
        xor ecx, 7 ; ecx = 8 - len
        and ecx, 7 ; ecx = (8 - len) & 7
        add eax, ecx ; round len up
        jmp [starttbl+ecx*4]

    l0:
        movntq [edx+eax*8-8], mm0
    l1:
        movntq [edx+eax*8-16], mm0
    l2:
        movntq [edx+eax*8-24], mm0
    l3:
        movntq [edx+eax*8-32], mm0
    l4:
        movntq [edx+eax*8-40], mm0
    l5:
        movntq [edx+eax*8-48], mm0
    l6:
        movntq [edx+eax*8-56], mm0
    l7:
        movntq [edx+eax*8-64], mm0

        dec eax
        jnz l0

        emms
        popad
        pop ebp
        ret

    starttbl:
     dd l0
     dd l1
     dd l2
     dd l3
     dd l4
     dd l5
     dd l6
     dd l7

    -------------8<-----------8<--------------
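One way to hunt for the wrong number is to model the address arithmetic in plain C and count which quadword slots get written. The sketch below is only a model under stated assumptions, not the routine itself: `simulate` and its parameters are hypothetical names, `nq` is the length in 8-byte quadwords, `step` is the amount taken off eax at the bottom of the loop, and slot indices are relative to edx in 8-byte units.

```c
#include <assert.h>

/* Hypothetical model of the unrolled loop. Returns the number of stores
   that land outside slots [0, nq); hits[] (nq entries) counts the stores
   that land on each in-range slot. */
static int simulate(int nq, int step, int *hits)
{
    assert(nq >= 1 && step >= 1);    /* the model does not cover nq == 0 */

    int out_of_range = 0;
    int skip = ((nq - 1) ^ 7) & 7;   /* lea / xor / and: (8 - nq) & 7 */
    int eax = nq + skip;             /* add eax, ecx: round up to a multiple of 8 */
    int first = skip;                /* jmp [starttbl+ecx*4]: enter at l<skip> */

    for (int i = 0; i < nq; i++)
        hits[i] = 0;

    do {
        for (int k = first; k < 8; k++) {   /* l<k>: [edx+eax*8-8*(k+1)] */
            int slot = eax - (k + 1);
            if (slot < 0 || slot >= nq)
                out_of_range++;
            else
                hits[slot]++;
        }
        first = 0;                   /* later passes start at l0 */
        eax -= step;                 /* step 1 models dec eax, step 8 models sub eax, 8 */
    } while (eax > 0);               /* jnz l0, modelled as > 0 so the model terminates */

    return out_of_range;
}
```

In this model, `simulate(13, 8, hits)` touches each of the 13 slots exactly once, while `simulate(13, 1, hits)` reports stores outside the buffer. That suggests two things worth checking in the asm: each pass issues eight stores, so the counter probably needs `sub eax, 8` rather than `dec eax`, and the addressing scales eax by 8, so eax would need to hold a quadword count (len/2) rather than a dword count.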

    "Matt Taylor" <para@tampabay.rr.com> wrote in message
    news:VOo_b.82400$Po1.44398@twister.tampabay.rr.com...
    > "Iman Habib" <pixelpajasREMOVETHIS@hotmail.com> wrote in message
    > news:c1cdtf$1ft6or$1@ID-168056.news.uni-berlin.de...
    > > Hi guys..
    > >
    > > I'm trying to pull out a fast memset routine out of my magic hat
    > > for a toy 3D engine of mine.
    > >
    > > And to be honest.. I suck at assembly optimizations. =...(
    > > The routine I have managed to make is about twice as fast as a
    > > regular "rep stosd" 32-bit memset on my AMD Athlon XP.
    > > But I am still not content as I have a gut feeling that it is
    > > possible to make it faster.
    >
    > The K7 Optimization Manual will be a help for starters:
    > http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
    >
    > There is a section on memcpy() which may help you optimize your memset().
    >
    > > So I'll let you guys poke at my memset code
    > > and see if you can find more places to optimize. =)
    > >
    > > Or even better.. some of you may have links to webpages that have better code
    > >
    > > cheers
    > > //iman
    > >
    > > -----------------8<----------------8<--------------------
    > >
    > > inline void memset32mmx(unsigned int *dest, unsigned int c, unsigned int len)
    > > {
    > > unsigned int apa[2];
    > > apa[0] = apa[1] = c;
    > >
    > > if(len < 2) { // I know I can remove the code here.. remake it, put it in the next
    > > _asm { // asm block and make it a bit faster.. but it won't be significant.. do it later
    > > mov eax,c
    > > mov edi,dest
    > > mov ecx,len
    > > cld
    > > rep stosd
    > > }
    > > return;
    > > }
    > >
    > > _asm {
    > > mov edx, [dest]
    > > mov eax, len
    > > mov ecx, eax
    > > shr eax, 1 //len/2
    > > and ecx, 1 //len%2
    > > movd mm1, c
    >
    > This movd appears to do nothing. Also, rather than tying up an extra
    > register, I would do this:
    >
    > mov edx, [dest]
    > mov ecx, [len]
    > mov eax, [c]
    > shr ecx, 1
    > cmovnc eax, [edx+ecx*8]
    > mov [edx+ecx*8], eax
    >
    > This assumes that you can write 1 element beyond the end which may not be
    > true. If not, use a jnc as you do inside your loop. The jnc has the added
    > benefit of not needing an extra register.
    >
    > > movq mm0, [apa]
    >
    > You would probably be better off doing this:
    >
    > movd mm0, [c]
    > punpckldq mm0, mm0
    >
    > This does the same thing, and it avoids possible STLF penalties. (I don't
    > know if this is far enough away from the initialization of apa to matter.)
    >
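In scalar terms, loading the fill value and then executing `punpckldq mm0, mm0` leaves the same 32-bit pattern in both halves of the 64-bit register. A minimal C sketch of that end result, where the helper name `broadcast32` is made up for illustration:

```c
#include <stdint.h>

/* Duplicate a 32-bit value into both halves of a 64-bit word, which is
   the state mm0 is expected to be in after the punpckldq. */
static uint64_t broadcast32(uint32_t c)
{
    return ((uint64_t)c << 32) | c;
}
```

Each 8-byte store of that word then fills two adjacent 32-bit elements at once.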
    > > l:
    > > movntq [edx], mm0
    > > add edx, 8
    > > dec eax
    > > jnz l
    >
    > First rewrite the loop like this:
    >
    > l:
    > movntq [edx+eax*8-8], mm0
    > dec eax
    > jnz l
    >
    > Now unroll:
    >
    > lea ecx, [eax-1] ; ecx = len - 1
    > xor ecx, 7 ; ecx = 8 - len
    > and ecx, 7 ; ecx = (8 - len) & 7
    > add eax, ecx ; round len up
    > jmp [starttbl+ecx*4]
    >
    > l0:
    > movntq [edx+eax*8-8], mm0
    > l1:
    > movntq [edx+eax*8-16], mm0
    > l2:
    > movntq [edx+eax*8-24], mm0
    > l3:
    > movntq [edx+eax*8-32], mm0
    > l4:
    > movntq [edx+eax*8-40], mm0
    > l5:
    > movntq [edx+eax*8-48], mm0
    > l6:
    > movntq [edx+eax*8-56], mm0
    > l7:
    > movntq [edx+eax*8-64], mm0
    > dec eax
    > jnz l0
    >
    > starttbl:
    > dd l0
    > dd l1
    > dd l2
    > dd l3
    > dd l4
    > dd l5
    > dd l6
    > dd l7
    >
    > It isn't really possible to do this in inline assembly, but you can express
    > this in plain C. See http://www.azillionmonkeys.com/qed/case2.html for
    > information on Duff's Device. Duff's Device is used for loop unrolling in C.
    > Here we are unrolling a loop in assembly. The only problem now is that
    > registers are not preserved outside of _asm blocks.
    >
    > Within the last several years, various compilers have added support for the
    > MMX intrinsics. These intrinsics allow you to use MMX instructions and reap
    > the benefits of an optimizing compiler and a high-level language. Your
    > original loop would be written this way:
    >
    > __m64 mc = _mm_set1_pi32(c);
    >
    > for(unsigned int i = 0; i < (len / 2); i++)
    > _mm_stream_pi(((__m64 *) dest) + i, mc);
    >
    > if (len & 1)
    > dest[len - 1] = c;
    >
    > You can use MMX intrinsics with Duff's Device to get output roughly
    > similar to what I wrote above. The only other way to unroll like that is to
    > move to a real assembler.
    >
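For reference, the plain C form of Duff's Device looks roughly like the sketch below. It is an illustration under assumptions, not the routine from this thread: `fill32_duff` is a hypothetical name, it counts in 32-bit elements, and ordinary stores stand in for `_mm_stream_pi` so that the example stays portable.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill len 32-bit words at dest with c, unrolled 8x with Duff's Device.
   The switch jumps into the middle of the loop body so that the first
   pass handles the len % 8 leftover stores. */
static void fill32_duff(uint32_t *dest, uint32_t c, size_t len)
{
    if (len == 0)
        return;                       /* the do-while would otherwise run once */

    size_t passes = (len + 7) / 8;    /* number of trips through the loop */

    switch (len % 8) {
    case 0: do { *dest++ = c;
    case 7:      *dest++ = c;
    case 6:      *dest++ = c;
    case 5:      *dest++ = c;
    case 4:      *dest++ = c;
    case 3:      *dest++ = c;
    case 2:      *dest++ = c;
    case 1:      *dest++ = c;
            } while (--passes > 0);
    }
}
```

The loop counter here counts passes of eight stores, not individual stores, which is the same bookkeeping the assembly version has to get right.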
    > > test ecx, ecx
    > > je q
    > > sub edx, 4
    > > movntq [edx], mm0
    > > q:
    > >
    > > // sfence
    > > emms
    > > }
    > > }
    >
    > Athlon can write 16 bytes to the cache each cycle. Right now you're using
    > about 25% of that bandwidth. You can't get a 400% speed improvement because
    > of the memory bus, but you probably have a bit of headroom left.
    >
    > -Matt
    >

