Re: Cost of calling a standard library function
From: The Half A Wannabee
Date: 03/03/04
- Next message: Jim Carlock: "Re: The Great Debate V. What have changed ?"
- Previous message: Beth: "Re: Cost of calling a standard library function"
- In reply to: C: "Re: Cost of calling a standard library function"
- Next in thread: C: "Re: Cost of calling a standard library function"
- Reply: C: "Re: Cost of calling a standard library function"
- Reply: Beth: "Re: Cost of calling a standard library function"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 3 Mar 2004 16:04:05 +0100
"C" <blackmarlin@asean-mail.com> wrote in message
news:33d97ee5.0403030441.2fa1780c@posting.google.com...
> "The Half A Wannabee" <ShakainZulu_AT(Do you See me now?#0)_HotIQForReal.NET.com> wrote in message
> news:<4044eefd$1@news.broadpark.no>...
>
> > HMMM!!!! I am not sure I understand these timings!
> > It would almost seem that ebp is more costly to push
> > than other registers, and that edi esi is more costly
> > to push than eax edx ??????? No, I am too tired (or
> > stupid). Maybe a more advanced asm programmer would
> > care to comment? Randy, this is your call....? OR
> > better, Wolfgang, Betov or somebody ;-)
>
> EBP is not actually more costly to push, but there can
> be circumstances where it generates stalls, making it
> appear more costly. I assume you are creating and
> destroying a stack frame with something like...
>
> PUSH EBP
> MOV EBP, ESP
> ; your code
> MOV ESP, EBP
> POP EBP ; { AGI stall }
> RET
No need to assume. The code was in my mail; here it is again (below).
This is Beth's solution for an inline copyrect (no call), so NO stack frame
(due to a call) is present at all. (I am fudging poet, and now I am a lun)
call TPerformanceCounter_TimeStampRemark TestRemark5
push edi
mov ecx 1_000_000
@TestLoop3:
push edi esi ebp
mov esi ARect
mov edi BRect
mov eax D$esi + TRect_Left
mov ebx D$esi + TRect_Top
mov ebp D$esi + TRect_Right
mov edx D$esi + TRect_Bottom
mov D$edi + TRect_Left eax
mov D$edi + TRect_Top ebx
mov D$edi + TRect_Right ebp
mov D$edi + TRect_Bottom edx
pop ebp esi edi
sub ecx 1
jnc @TestLoop3
pop edi
call TPerformanceCounter_TimeStampRemark TestRemark2
Below I repeat the code that was 6000 ticks faster than the above. It is still
Beth's code, only now ebp does not get pushed/popped, but is saved away in a
memory variable.
(my computer generates 3579545 ticks per second).
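To put the tick counts in perspective, here is a small Python sketch (not part of the original test code) that converts a tick difference into wall-clock time. It assumes the 3579545 ticks-per-second counter rate quoted above and a loop count of about one million; the function names are made up for illustration.

```python
TICKS_PER_SECOND = 3_579_545  # counter rate quoted in the post
ITERATIONS = 1_000_000        # approximate loop count used in the tests


def ticks_to_seconds(ticks, ticks_per_second=TICKS_PER_SECOND):
    """Convert a performance-counter tick count to seconds."""
    return ticks / ticks_per_second


def nanoseconds_per_iteration(ticks, iterations=ITERATIONS):
    """Average time per loop iteration in nanoseconds (hypothetical helper)."""
    return ticks_to_seconds(ticks) / iterations * 1e9


if __name__ == "__main__":
    saved = 6000  # tick difference reported between the two loops
    print(f"{ticks_to_seconds(saved) * 1e3:.3f} ms over the whole run")
    print(f"{nanoseconds_per_iteration(saved):.3f} ns per iteration")
```

Under these assumptions a 6000-tick difference is roughly 1.7 ms over the whole run, which is under 2 ns per iteration, so the effects being compared here are small.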
[MemoryStack: &NULL]
call TPerformanceCounter_TimeStampRemark TestRemark7
push edi
mov ecx 1_000_000
@TestLoop5:
mov D$MemoryStack ebp
push edi esi
mov esi ARect
mov edi BRect
mov eax D$esi + TRect_Left
mov ebx D$esi + TRect_Top
mov ebp D$esi + TRect_Right
mov edx D$esi + TRect_Bottom
mov D$edi + TRect_Left eax
mov D$edi + TRect_Top ebx
mov D$edi + TRect_Right ebp
mov D$edi + TRect_Bottom edx
pop esi edi
mov ebp D$MemoryStack
sub ecx 1
jnc @TestLoop5
pop edi
So no special AGI stall should occur because of this? As I see it, it
"proves" that push/pop is more costly than a move to/from memory. This makes
sense, since push also has to adjust esp (in effect allocating stack space),
and pop has to release it again. But still, one should hope the special
processor instruction would do it faster.
But this is not the weirdest thing. The weird thing is that the code
below, where all push/pops are turned into memory moves (3 stores and 3 loads
instead of the original 3 pushes and 3 pops), is not faster by another 12000
ticks, but only by less than 2000. So while changing push ebp to mov mem32 ebp
got us 6000 ticks in total, replacing the edi AND esi pushes got us
only another 2000 ticks.
[NumberOfTests = 1_000_000]
and (in the final results)
[NumberOfTests = 100_000_000]
call TPerformanceCounter_TimeStampRemark TestRemark8
push edi
mov ecx NumberOfTests
@TestLoop6:
mov D$MemoryStack ebp
mov D$MemoryStack + 4 edi
mov D$MemoryStack + 8 esi
mov esi ARect
mov edi BRect
mov eax D$esi + TRect_Left
mov ebx D$esi + TRect_Top
mov ebp D$esi + TRect_Right
mov edx D$esi + TRect_Bottom
mov D$edi + TRect_Left eax
mov D$edi + TRect_Top ebx
mov D$edi + TRect_Right ebp
mov D$edi + TRect_Bottom edx
mov esi D$MemoryStack + 8
mov edi D$MemoryStack + 4
mov ebp D$MemoryStack
sub ecx 1
jnc @TestLoop6
pop edi
call TPerformanceCounter_TimeStampRemark TestRemark2
But the weirdest thing of all is that MY code, which uses eax and edx and
pushes/pops them, is still faster, and that this code gains nothing, or
very little, from converting those push/pops to memory moves...
call TPerformanceCounter_TimeStampRemark TestRemark6
mov ecx NumberOfTests
@TestLoop4:
push eax edx
mov eax ARect
mov edx BRect
;left
mov ebx D$eax + TRect_Left | mov D$edx + TRect_Left ebx
;top
mov ebx D$eax + TRect_Top | mov D$edx + TRect_Top ebx
;right
mov ebx D$eax + TRect_Right | mov D$edx + TRect_Right ebx
;bottom
mov ebx D$eax + TRect_Bottom | mov D$edx + TRect_Bottom ebx
pop edx eax
sub ecx 1
jnc @TestLoop4
call TPerformanceCounter_TimeStampRemark TestRemark2
And below, the final code, which uses memory moves instead of the stack:
call TPerformanceCounter_TimeStampRemark TestRemark9
mov ecx NumberOfTests
@TestLoop7:
mov D$MemoryStack eax
mov D$MemoryStack + 4 edx
mov eax ARect
mov edx BRect
;left
mov ebx D$eax + TRect_Left | mov D$edx + TRect_Left ebx
;top
mov ebx D$eax + TRect_Top | mov D$edx + TRect_Top ebx
;right
mov ebx D$eax + TRect_Right | mov D$edx + TRect_Right ebx
;bottom
mov ebx D$eax + TRect_Bottom | mov D$edx + TRect_Bottom ebx
mov edx D$MemoryStack + 4
mov eax D$MemoryStack
sub ecx 1
jnc @TestLoop7
call TPerformanceCounter_TimeStampRemark TestRemark2
All this info was given in the previous post, along with the timings, so go
back and recap if you need to.
But the funniest thing is that the fastest code was the first hack I
wrote, which Randall (lol lol lol) told me was inefficient. He is a great
teacher, don't you think? ;-)
Of course, this is not proof. More tests would have to be made, and it would
have to be attacked from several directions to be considered proof. But it's
still funny that such things happen. It means (just as M. Abrash said in his
book) that measuring actual code is _absolutely_ needed, and that counting
Intel cycle timings is a bloody pointless waste of your time.
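In the spirit of measuring rather than counting cycles, here is a minimal Python sketch of a harness that times two variants of the same copy and keeps the best run of each. It does not reproduce the assembly timings above; the two copy functions and `best_time` are hypothetical stand-ins, and `timeit.repeat` is the only library call involved.

```python
import timeit


def copy_fields(src, dst):
    """Copy the four rectangle fields one at a time."""
    dst["left"] = src["left"]
    dst["top"] = src["top"]
    dst["right"] = src["right"]
    dst["bottom"] = src["bottom"]


def copy_update(src, dst):
    """Copy all fields with a single dict.update call."""
    dst.update(src)


def best_time(func, number=100_000, repeat=5):
    """Call func `number` times per run, do `repeat` runs, return the fastest run in seconds."""
    return min(timeit.repeat(func, number=number, repeat=repeat))


if __name__ == "__main__":
    a = {"left": 0, "top": 0, "right": 640, "bottom": 480}
    b = dict.fromkeys(a, 0)
    for variant in (copy_fields, copy_update):
        seconds = best_time(lambda: variant(a, b))
        print(f"{variant.__name__}: {seconds:.6f} s for 100000 calls")
```

Taking the minimum over several runs is one common way to reduce noise from other processes; which variant wins can differ between machines and interpreter versions, which is the whole point of measuring.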
But then again, when all is said and done, the best optimizer is not
this stuff; this is mostly nitpicking, for curiosity. It is probably of zero
interest, as the next CPU from AMD may break it completely.
About RosAsm macros: for fun, I was thinking about creating my own private
stack and redefining the push / pop macros in RosAsm! They can most easily
be redefined like this (I use only DWORDs whenever possible):
[Push| add CustomStackPointer 4 | mov D$CustomStackPointer #1 ]
[Pop| mov #1 D$CustomStackPointer | sub CustomStackPointer 4]
Or something like that; maybe that was wrong, actually! I will do this one of
these days, and then TIME the creation and destruction of several million
objects, strings and memory allocations, and see if it makes any
difference to the timings. If it turns out that a custom memory stack is
faster than the normal stack....hehe, then I will start to laugh....
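The push/pop ordering those two macros describe can be modelled outside RosAsm. Below is a minimal Python sketch, not RosAsm code, assuming the stack grows upward, every slot is one DWORD, and the macros are meant to store through the pointer rather than into the pointer variable itself. The class name `PrivateStack` and its bounds checks are additions for illustration.

```python
import struct


class PrivateStack:
    """Model of an upward-growing private stack of 32-bit values."""

    def __init__(self, size_bytes=64):
        self.memory = bytearray(size_bytes)
        self.pointer = 0  # byte offset of the current top slot; 0 means empty

    def push(self, value):
        """Advance the pointer by 4, then store the DWORD at the new top."""
        if self.pointer + 8 > len(self.memory):
            raise OverflowError("private stack is full")
        self.pointer += 4
        struct.pack_into("<I", self.memory, self.pointer, value & 0xFFFFFFFF)

    def pop(self):
        """Load the DWORD at the top, then move the pointer back by 4."""
        if self.pointer == 0:
            raise IndexError("private stack is empty")
        (value,) = struct.unpack_from("<I", self.memory, self.pointer)
        self.pointer -= 4
        return value


if __name__ == "__main__":
    stack = PrivateStack()
    for register_value in (0x11111111, 0x22222222, 0x33333333):
        stack.push(register_value)
    while stack.pointer:
        print(hex(stack.pop()))
```

With "add then store" for push and "load then subtract" for pop, values come back in reverse order and the pointer returns to where it started. The slot at offset 0 is never used in this model, which is a side effect of incrementing before the store.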
The code you write looks nice.