Re: Cost of calling a standard library function

From: The Half A Wannabee
Date: Wed, 3 Mar 2004 16:04:05 +0100

"C" <blackmarlin@asean-mail.com> wrote in message
news:33d97ee5.0403030441.2fa1780c@posting.google.com...
> "The Half A Wannabee" <ShakainZulu_AT(Do you See me now?#0)_HotIQForReal.NET.com>
> wrote in message news:<4044eefd$1@news.broadpark.no>...
>
> > HMMM!!!! I am not sure I understand these timings!
> > It would almost seem that ebp is more costly to push
> > than other registers, and that edi esi is more costly
> > to push than eax edx ??????? No, I am too tired (or
> > stupid). Maybe a more advanced asm programmer would
> > care to comment ? Randy, this is your call....? OR
> > better, Wolfgang, Betov or somebody ;-)
>
> EBP is not actually more costly to push, but there can
> be circumstances where it generates stalls making it
> appear more costly. I assume you are creating and
> destroying a stack frame with something like...
>
> PUSH EBP
> MOV EBP, ESP
> ; your code
> MOV ESP, EBP
> POP EBP ; { AGI stall }
> RET

No need to assume. The code was in my mail; here it is again (below).
This is Beth's solution for an inline copyrect (no call), so NO stack frame
(due to a call) is present at all. (I am a fudging poet, and now I am a lun)

call TPerformanceCounter_TimeStampRemark TestRemark5
    push edi
       mov ecx 1_000_000
       @TestLoop3:
          push edi esi ebp
            mov esi ARect
            mov edi BRect

            mov eax D$esi + TRect_Left
            mov ebx D$esi + TRect_Top
            mov ebp D$esi + TRect_Right
            mov edx D$esi + TRect_Bottom

            mov D$edi + TRect_Left eax
            mov D$edi + TRect_Top ebx
            mov D$edi + TRect_Right ebp
            mov D$edi + TRect_Bottom edx

          pop ebp esi edi
       sub ecx 1
       jnc @TestLoop3
    pop edi
    call TPerformanceCounter_TimeStampRemark TestRemark2

Below I repeat the code that was 6000 ticks faster than the above. It is
still Beth's code, only now ebp does not get pushed/popped, but is saved away
in a memory variable.
(My computer generates 3579545 ticks per second.)

[MemoryStack: &NULL]
call TPerformanceCounter_TimeStampRemark TestRemark7
    push edi
       mov ecx 1_000_000
       @TestLoop5:
          mov D$MemoryStack ebp
          push edi esi
            mov esi ARect
            mov edi BRect

            mov eax D$esi + TRect_Left
            mov ebx D$esi + TRect_Top
            mov ebp D$esi + TRect_Right
            mov edx D$esi + TRect_Bottom

            mov D$edi + TRect_Left eax
            mov D$edi + TRect_Top ebx
            mov D$edi + TRect_Right ebp
            mov D$edi + TRect_Bottom edx

          pop esi edi
          mov ebp D$MemoryStack
       sub ecx 1
       jnc @TestLoop5
    pop edi

So no special AGI stall should occur because of this? As I see it, it
"proves" that push/pop is more costly than a move to/from memory. This makes
sense, since push allocates memory and pop deallocates it. But still, one
would hope the special processor instruction would do it faster.

But this is not the weirdest thing. The weird thing is that the code
below, where all push/pops are turned into memory moves (3 stores and 3
loads, 6 moves in total, replacing the original 3 pushes and 3 pops), is not
faster by 12000 ticks, but only by less than 2000. So while changing push ebp
to mov mem32 ebp gained us 6000 ticks in total, replacing the edi AND esi
pushes gained us only another 2000 ticks.
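
To put those tick counts in perspective, here is a minimal sketch (in Python, not RosAsm) that converts a tick difference into time per loop iteration. The 3579545 ticks-per-second figure comes from the post. Which iteration count the 6000 and 2000 tick savings belong to is an assumption, since the timings were in the previous post, so both counts that appear in the listings are shown. The function name is made up for illustration.

```python
TICKS_PER_SECOND = 3_579_545  # counter frequency quoted in the post


def ticks_to_ns_per_iteration(ticks, iterations, ticks_per_second=TICKS_PER_SECOND):
    """Convert a tick difference measured over a whole loop into ns per iteration."""
    seconds = ticks / ticks_per_second
    return seconds * 1e9 / iterations


# Assumption: the savings apply to one of the two loop counts used in the post.
for saved_ticks in (6000, 2000):
    for iterations in (1_000_000, 100_000_000):
        ns = ticks_to_ns_per_iteration(saved_ticks, iterations)
        print(f"{saved_ticks} ticks over {iterations} iterations = {ns:.4f} ns per iteration")
```

One tick of this counter is roughly 279 ns, so if the loop ran 1,000,000 times, a 6000 tick saving works out to about 1.7 ns per iteration. With 100,000,000 iterations it would be a hundred times smaller.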

[NumberOfTests = 1_000_000]
and (in the final results)
[NumberOfTests = 100_000_000]

call TPerformanceCounter_TimeStampRemark TestRemark8
    push edi
       mov ecx NumberOfTests
       @TestLoop6:
          mov D$MemoryStack ebp
          mov D$MemoryStack + 4 edi
          mov D$MemoryStack + 8 esi

            mov esi ARect
            mov edi BRect

            mov eax D$esi + TRect_Left
            mov ebx D$esi + TRect_Top
            mov ebp D$esi + TRect_Right
            mov edx D$esi + TRect_Bottom

            mov D$edi + TRect_Left eax
            mov D$edi + TRect_Top ebx
            mov D$edi + TRect_Right ebp
            mov D$edi + TRect_Bottom edx

          mov esi D$MemoryStack + 8
          mov edi D$MemoryStack + 4
          mov ebp D$MemoryStack
       sub ecx 1
       jnc @TestLoop6
    pop edi

    call TPerformanceCounter_TimeStampRemark TestRemark2

But the weirdest thing of all is that MY code, which uses eax and edx and
pushes/pops them, is still faster, and that this code gains nothing, or
very very little, from converting those push/pops to memory moves...

call TPerformanceCounter_TimeStampRemark TestRemark6
       mov ecx NumberOfTests
       @TestLoop4:
          push eax edx
            mov eax ARect
            mov edx BRect

            ;left
            mov ebx D$eax + TRect_Left | mov D$edx + TRect_Left ebx
            ;top
            mov ebx D$eax + TRect_Top | mov D$edx + TRect_Top ebx
            ;right
            mov ebx D$eax + TRect_Right | mov D$edx + TRect_Right ebx
            ;bottom
            mov ebx D$eax + TRect_Bottom | mov D$edx + TRect_Bottom ebx

          pop edx eax
       sub ecx 1
       jnc @TestLoop4
    call TPerformanceCounter_TimeStampRemark TestRemark2

And below, the final code, which uses memory moves instead of the stack:

call TPerformanceCounter_TimeStampRemark TestRemark9
       mov ecx NumberOfTests
       @TestLoop7:
          mov D$MemoryStack eax
          mov D$MemoryStack + 4 edx
            mov eax ARect
            mov edx BRect

            ;left
            mov ebx D$eax + TRect_Left | mov D$edx + TRect_Left ebx
            ;top
            mov ebx D$eax + TRect_Top | mov D$edx + TRect_Top ebx
            ;right
            mov ebx D$eax + TRect_Right | mov D$edx + TRect_Right ebx
            ;bottom
            mov ebx D$eax + TRect_Bottom | mov D$edx + TRect_Bottom ebx

          mov edx D$MemoryStack + 4
          mov eax D$MemoryStack
       sub ecx 1
       jnc @TestLoop7
    call TPerformanceCounter_TimeStampRemark TestRemark2

All this info was given in the previous post, along with the timings, so go
back and recap if you need to.

But the funniest thing is that the fastest code was the first hack I
wrote, which Randall (lol lol lol) told me was inefficient. He is a great
teacher, don't you think ? ;-)

Of course, this is not proof. More tests would have to be made, and it would
have to be attacked from several directions to be considered proof. But it's
still funny that such things happen. It means (just as M. Abrash said in his
book) that measuring actual code is _absolutely_ needed, and that counting
Intel cycle timings is a bloody pointless waste of your time.
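
The "measure, don't count cycles" advice can be sketched as a tiny harness. This is an illustration in Python, not the TPerformanceCounter code used above. The helper name `best_of` and the two rectangle-copy candidates are hypothetical, and taking the fastest of several runs is one common way to reduce noise, not the method used for the timings in this thread.

```python
import time


def best_of(func, repeats=5, iterations=50_000):
    """Call `func` `iterations` times per run, over `repeats` runs; return the fastest run in ns."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        for _ in range(iterations):
            func()
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


# Two hypothetical ways to copy a four-field rectangle.
a_rect = {"left": 1, "top": 2, "right": 3, "bottom": 4}
b_rect = {}


def copy_field_by_field():
    b_rect["left"] = a_rect["left"]
    b_rect["top"] = a_rect["top"]
    b_rect["right"] = a_rect["right"]
    b_rect["bottom"] = a_rect["bottom"]


def copy_with_update():
    b_rect.update(a_rect)


for candidate in (copy_field_by_field, copy_with_update):
    print(candidate.__name__, best_of(candidate), "ns")
```

Which candidate wins depends on the machine and the interpreter version, which is exactly the point: the numbers have to be measured, not predicted.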

But then again, when all is said and done, this stuff is not the best
optimization; it is mostly nitpicking, for curiosity. It probably has zero
interest, as the next CPU from AMD may break it completely.

About RosAsm macros. For fun, I was thinking about creating my own private
stack and redefining the push / pop macros in RosAsm ! They can most easily
be redefined like this (I use only DWORDs whenever possible) :

[Push| add CustomStackPointer 4 | mov D$CustomStackPointer #1 ]
[Pop| mov #1 D$CustomStackPointer | sub CustomStackPointer 4]

Or something, maybe that was wrong actually! I will do this one of these
days, and then TIME the creation and destruction of several million
objects, strings and memory allocations, and then see if it makes any
difference to the timings. If it turns out that a custom memory stack is
faster than the normal stack....hehe, then I will start to laugh....
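
For reference, here is a toy model of what the two macros appear to be aiming for, written in Python rather than RosAsm. It assumes the intent is a private stack that grows upward: push advances the pointer and then stores at the slot it points to, pop loads from that slot and then steps the pointer back. The class and its names are hypothetical. If I read the RosAsm syntax right, the macros as written would store into the pointer variable itself instead of the slot it points to, which may be the "wrong" part.

```python
class PrivateStack:
    """Toy model of a private DWORD stack that grows upward."""

    def __init__(self, size_in_dwords):
        self.memory = [0] * size_in_dwords  # each slot stands in for one DWORD
        self.pointer = -1                   # index of the top item; -1 means empty

    def push(self, value):
        # Push: advance the pointer first, then store where it now points.
        if self.pointer + 1 >= len(self.memory):
            raise OverflowError("private stack is full")
        self.pointer += 1
        self.memory[self.pointer] = value & 0xFFFFFFFF  # keep it DWORD-sized

    def pop(self):
        # Pop: load from where the pointer points, then step the pointer back.
        if self.pointer < 0:
            raise IndexError("private stack is empty")
        value = self.memory[self.pointer]
        self.pointer -= 1
        return value


stack = PrivateStack(16)
for register_value in (0x11111111, 0x22222222, 0x33333333):  # stand-ins for edi, esi, ebp
    stack.push(register_value)
print([hex(stack.pop()) for _ in range(3)])  # values come back in reverse push order
```

Unlike the hardware stack, this model checks for overflow and underflow. A real macro pair would most likely skip those checks to stay cheap, which is part of what the timing experiment would have to weigh.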

The code you write looks nice.


