Re: Cost of calling a standard library function
From: The Half A Wannabee
Date: 03/02/04
- Next message: wolfgang kern: "Re: Best integer to string routines"
- Previous message: Bx.C: "Re: Best integer to string routines"
- In reply to: Beth: "Re: Cost of calling a standard library function"
- Next in thread: C: "Re: Cost of calling a standard library function"
- Reply: C: "Re: Cost of calling a standard library function"
- Reply: Beth: "Re: Cost of calling a standard library function"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 2 Mar 2004 21:25:39 +0100
"Beth" <BethStone21@hotmail.NOSPICEDHAM.com> wrote in message
news:lM01c.2285$GQ.452@newsfe1-win...
> The Half A Wannabee wrote:
> > Beth wrote:
> > Thank you for a very thorough and nice explanation Beth!
> Yeah, fair enough...like I said, I was being "illustrative" of what to
> try out and stuff, rather than it being 100% "cut and paste" "as is"
> code...I'm not particularly sure how to _read_ your timer results
> below, let alone think I know all about how that works ;)
I was a bit tired when I wrote this. I may have made some errors.
http://www.szmyggenpv.com/downloads/BethRect.Zip
Well, I handed it to you on a silver platter. Didn't you see the link? Well,
whatever.
>> Testing 1_000_000 copyrects ala Beth_ 063208
> Are these results directly comparable to the previous results you had?
Very good question. The truth is, I didn't notice :-) When I studied the
function you wrote, where you had rearranged the flow and the usage of the
registers, I found it so beautifully written that I automatically
assumed it would perform faster than my own code. Both procedures are listed
again further down, to avoid the confusion of having to look back at the
other posts. And now I see the reason (I think), and what you forgot (maybe)
when you made that streamlined procedure. It reads memory through esi four
instructions in a row, and then writes through edi four instructions in a
row. If you look at my proc, you'll see (as I do now; I did _not_ do it
intentionally, it was just pure luck) that it writes to ebx, then reads from
it, and so on, while your code keeps reading memory through the esi register
four times in a row.
But (and now I am just speculating wildly) since the memory for both
rectangles is in the cache by now, the processor probably "understands" that
it doesn't need ebx to carry the value, and can safely move it within the
cache without having to go via ebx. Maybe this is a totally wrong assumption,
but if it is, maybe this should be implemented?
So why can't it do that with the way you wrote your instructions? Because you
use the same register to address another _memory_ location in the instruction
immediately following. But this is just speculation on my part. It may be
dead wrong.
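To make the two access patterns concrete, here is a rough C rendering of both orderings. It is only a sketch under assumptions: the `Rect` struct and the function names are stand-ins for TRect and the RosAsm procs, not names from the zip file, and a C compiler is free to reorder these loads and stores, so this shows the shape of the two orderings rather than reproducing the timings.

```c
#include <stdint.h>

/* Hypothetical C mirror of the TRect layout used in the post. */
typedef struct {
    int32_t left, top, right, bottom;
} Rect;

/* Interleaved order: load one field, store it, then move on to the
   next one (the shape of CopyRect, one scratch register). */
void copy_rect_interleaved(Rect *dst, const Rect *src)
{
    int32_t t;
    t = src->left;   dst->left   = t;
    t = src->top;    dst->top    = t;
    t = src->right;  dst->right  = t;
    t = src->bottom; dst->bottom = t;
}

/* Batched order: four loads first, then four stores
   (the shape of CopyRectBeth, four scratch registers). */
void copy_rect_batched(Rect *dst, const Rect *src)
{
    int32_t a = src->left;
    int32_t b = src->top;
    int32_t c = src->right;
    int32_t d = src->bottom;
    dst->left   = a;
    dst->top    = b;
    dst->right  = c;
    dst->bottom = d;
}
```

Both functions produce the same result; only the order of the memory accesses differs, which is the part being speculated about.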
(Later:) Okay, speculation will not help!!! I have now tested my register
arrangement in an inline function, and now they are about the same. Mine
is insignificantly faster, but that is no more than variations in current or
in the workload of the OS.
(Later:)
Obs: Another error crept in. In your version of the inline approach using
ebp, I had failed to push/pop ebp _within_ the loop. That is needed. So
now my inline approach is the best. And that was the real reason why your
inline procedure using ebp was so much faster before: the extra push/pop of
ebp is what makes the difference. Below is the code for all the procs.
1_000_000: (CopyRect)
Beth with stack__________: 63283
Beth NoStackCall_Eax_____: 39005
Beth inline without call_: 36066
Beth inline using ebp____: 36457
Mine inline______________: 27076
;Parameters are pointers to rectangles
Proc CopyRect:
Arguments @DestRect @SourceRect
mov eax D@SourceRect
mov edx D@DestRect
;left
mov ebx D$eax + TRect_Left | mov D$edx + TRect_Left ebx
;top
mov ebx D$eax + TRect_Top | mov D$edx + TRect_Top ebx
;right
mov ebx D$eax + TRect_Right | mov D$edx + TRect_Right ebx
;bottom
mov ebx D$eax + TRect_Bottom | mov D$edx + TRect_Bottom ebx
EndP
;Parameters are pointers to rectangles
Proc CopyRectBeth:
Arguments @DestRect @SourceRect
mov esi D@SourceRect
mov edi D@DestRect
mov eax D$esi + TRect_Left
mov ebx D$esi + TRect_Top
mov ecx D$esi + TRect_Right
mov edx D$esi + TRect_Bottom
mov D$edi + TRect_Left eax
mov D$edi + TRect_Top ebx
mov D$edi + TRect_Right ecx
mov D$edi + TRect_Bottom edx
EndP
;Parameters are pointers to rectangles
CopyRectBeth_v2:
;Arguments @DestRect @SourceRect
mov eax D$esi + TRect_Left
mov ebx D$esi + TRect_Top
mov ecx D$esi + TRect_Right
mov edx D$esi + TRect_Bottom
mov D$edi + TRect_Left eax
mov D$edi + TRect_Top ebx
mov D$edi + TRect_Right ecx
mov D$edi + TRect_Bottom edx
ret
[TestRemark1 : 'Beth with stack_____________________ :' 0]
[TestRemark3 : 'Beth NoStackCall_Eax________________ :' 0]
[TestRemark4 : 'Beth inline without call_____________:' 0]
[TestRemark5 : 'Beth inline using ebp/pop____________:' 0]
[TestRemark6 : 'Mine inline _________________ :' 0]
[TestRemark7 : 'Beth inline ebp only to memorystack :' 0]
[TestRemark8 : 'Beth with full MemoryStack__________ :' 0]
[TestRemark9 : 'Mine with full MemoryStack__________ :' 0]
[TestRemark2 : 'TimeStamp result____________________ :' 0]
;********************************************************************
call TPerformanceCounter_Create | mov D$PerformanceCounter edi
call TPerformanceCounter_Start
call TPerformanceCounter_TimeStampRemark TestRemark1
push edi
mov ecx 1_000_000
@TestLoop:
push ecx edi esi
call CopyRectBeth ARect BRect
pop esi edi ecx
sub ecx 1
jnc @TestLoop
mov eax edi
pop edi
call TPerformanceCounter_TimeStampRemark TestRemark2
call TPerformanceCounter_TimeStampRemark TestRemark3
push edi
mov ecx 1_000_000
@TestLoop1:
push ecx edi esi
mov esi ARect
mov edi BRect
mov eax CopyRectBeth_v2
call eax
pop esi edi ecx
sub ecx 1
jnc @TestLoop1
pop edi
call TPerformanceCounter_TimeStampRemark TestRemark2
call TPerformanceCounter_TimeStampRemark TestRemark4
push edi
mov ecx 1_000_000
@TestLoop2:
push ecx edi esi
mov esi ARect
mov edi BRect
mov eax D$esi + TRect_Left
mov ebx D$esi + TRect_Top
mov ecx D$esi + TRect_Right
mov edx D$esi + TRect_Bottom
mov D$edi + TRect_Left eax
mov D$edi + TRect_Top ebx
mov D$edi + TRect_Right ecx
mov D$edi + TRect_Bottom edx
pop esi edi ecx
sub ecx 1
jnc @TestLoop2
pop edi
call TPerformanceCounter_TimeStampRemark TestRemark2
call TPerformanceCounter_TimeStampRemark TestRemark5
push edi
mov ecx 1_000_000
@TestLoop3:
push edi esi ebp
mov esi ARect
mov edi BRect
mov eax D$esi + TRect_Left
mov ebx D$esi + TRect_Top
mov ebp D$esi + TRect_Right
mov edx D$esi + TRect_Bottom
mov D$edi + TRect_Left eax
mov D$edi + TRect_Top ebx
mov D$edi + TRect_Right ebp
mov D$edi + TRect_Bottom edx
pop ebp esi edi
sub ecx 1
jnc @TestLoop3
pop edi
call TPerformanceCounter_TimeStampRemark TestRemark2
call TPerformanceCounter_TimeStampRemark TestRemark6
mov ecx 1_000_000
@TestLoop4:
push eax edx
mov eax ARect
mov edx BRect
;left
mov ebx D$eax + TRect_Left | mov D$edx + TRect_Left ebx
;top
mov ebx D$eax + TRect_Top | mov D$edx + TRect_Top ebx
;right
mov ebx D$eax + TRect_Right | mov D$edx + TRect_Right ebx
;bottom
mov ebx D$eax + TRect_Bottom | mov D$edx + TRect_Bottom ebx
pop edx eax
sub ecx 1
jnc @TestLoop4
call TPerformanceCounter_TimeStampRemark TestRemark2
call TPerformanceCounter_Stop &FALSE
push edi esi
mov eax D$gDesktopDir
push eax
call FileManager_ComposeFullFileNameEAX D$eax + TString_Pchar PerformanceFileName
push eax
Call TPerformanceCounter_SaveToTextFile D$eax + TString_Pchar
pop eax
call StringManager_DisposeString
pop eax
call StringManager_DisposeString
pop esi edi
Call TPerformanceCounter_Destroy
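A side note on the test loops above: as far as I can tell, `sub ecx 1` followed by `jnc` only leaves the loop when the subtraction borrows, that is, when ecx was already 0. If that reading is right, each loop body runs once more than the count loaded into ecx. All variants are affected equally, so the comparison itself is not disturbed. Below is a small C model of the flag logic; the function name is hypothetical and this is a model, not RosAsm code.

```c
#include <stdint.h>

/* Counts how often the body of
       mov ecx N / @Loop: body / sub ecx 1 / jnc @Loop
   executes. JNC falls through only when SUB sets the carry flag,
   which happens when ecx was 0 before the subtraction. */
uint32_t sub_jnc_iterations(uint32_t n)
{
    uint32_t ecx = n;
    uint32_t count = 0;
    int carry;
    do {
        count++;               /* the loop body would run here */
        carry = (ecx < 1u);    /* carry flag after "sub ecx 1" */
        ecx -= 1u;             /* wraps to 0xFFFFFFFF on borrow */
    } while (!carry);          /* jnc: repeat while there was no borrow */
    return count;
}
```

With the count used in the tests, `sub_jnc_iterations(1000000u)` gives 1000001.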
> Mind you, I don't know how this timer thing works exactly or what
> units we're measuring here (TSC units?) or whatever...hence,
Darling, it was in the link. A zip file with your _name_ on it, and you
couldn't see it? It carried all the code, and a full exe to run it. (By the
time you read this, I will have uploaded a modified version, the FULL current
one, to the exact same link, as an update.)
> _IF_ the
> results aren't directly comparable, then the above doesn't
> apply...perhaps put them all into a program to get comparable results?
I only made a test for _your_ code, but all the code _is_ in the exe.
Right-clicking on the rectangle code will bring you right to the place where
all (now 5) versions of copyrect are located. Two versions are inline, so it's
actually 7 or 8 versions.
>
> Anyway, just doing a "Sherlock" on the results...even if the results
> are better overall, there could still be "issues" with it, if some of
> the results aren't going down as expected in the right places...
>
> > Testing 1_000_000 copyrects ala Beth no stack__________ :063228 003
> > TimeStamp result_______________________________________ :102296 039068
>
Sometimes, Beth, I don't know if I am totally clueless and you are just
hinting at something (which is most likely), or if you're just tired. Or maybe
that would be me. Sometimes, Beth, I have this sneaking suspicion that you
downloaded the code and read it like the back of your hand, took a long yawn,
wrote a novel, surfed 100 political sites on the net, read all the
newspapers in the world, especially the German ones, and then, between two
bites of a chocolate bar, wrote this mail with your left hand while your
right hand was typing out the specifications of LuxAsm and ConvInc ;-)
> And this result - a 61% increase - I think starts to explain why one
> point I always make about HLLs and calling conventions is that bloody
> stack!! Especially because, if you think over the logic, as it's not a
> recursive procedure, then stuffing ESI and EDI onto the stack only to
> pull them straight back off of it is a physically _redundant_
> action...you're copying something onto the stack only to take it back
> off again and stick it into a register (which, worse, could be the
> exact same register we copied from, meaning that we're copying a value
> via the stack back to where it already is, anyway!! ;)...
>
> HLLs do the stack thing because it's "recursive safe"; _IF_ this
> procedure were to be recursed then the stack makes perfect sense as a
> means to keep track of things...although, generally, this is usually
> _NOT_ the case with a large amount of code...hence, we actually tend
> to have what, in the main, is an inappropriate "default" (but the
> reasons for that make perfect sense that, really, for a HLL compiler,
> it _couldn't_ feasibly be any other way ;)...
Well, Delphi does not create a stack frame unless more than 3 registers are
used. In the case of a method (as in objects) it will only have 2 free
registers at any time, and creates a stack frame if more than 2 registers are
being used to pass parameters. There are also a few exceptions and extra
rules on top of these.
> This is a grand problem with "generalisations" like this...the stack
> method is always "safe" for all types...if it's recursive, the stack
> is made to measure (even if you didn't use the stack for the parameter
> passing itself then it would be a natural choice for PUSHing and
> POPping your variables so that they don't conflict with each other
> ;)...if it's NOT recursive, then the method _still_ works, it's just
> there's a whole bunch of code in your program that's effectively one
> big "NOP", if you will...it's spending all this time copying values
> from registers to stack and back again, making you wonder "can't we
> just leave it in the register in the first place...or, at most, use a
> 'mov' instruction to get it from one register to another one, expected
> by the procedure?"...
I have taken your advice, and for those procs where I need the most speed I
have removed the stack, as in the case of memory allocation and such. But I
have more to do here. I will have to look over all of my code and try
to use mov x, y / mov y, x instead of push/pops. But it's nice to use the
stack sometimes, as it provides extra undeclared space. Maybe a memory
variable is just as fast? I haven't tried that. I will now try it for your
ebp-inline code and see which is faster.
OMG: Using a memorystack... is faster!!! OMG!!! It wins about 6000 TICS over
push/pop ebp!!!
1_000_000 : (copyrects)
Beth with stack_____________________ :64107
Beth NoStackCall_Eax________________ :39066
Beth inline without call____________ :36048
Beth inline using ebp/pop___________ :36187
Beth inline ebp only to memorystack :30011
Mine inline _________________ :27079
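For reading the table above, the differences can be worked out with two small helpers. This is a sketch in C with hypothetical function names; the inputs are simply the tick counts from the table, and the unit of those ticks is whatever TPerformanceCounter reports.

```c
#include <stdint.h>

/* Ticks saved by the faster of two variants. */
uint32_t ticks_saved(uint32_t slow, uint32_t fast)
{
    return slow - fast;
}

/* How much longer the slow variant takes, as a whole percentage of
   the fast one (rounded down). Assumes slow >= fast and fast > 0. */
uint32_t percent_slower(uint32_t slow, uint32_t fast)
{
    return (uint32_t)(((uint64_t)(slow - fast) * 100u) / fast);
}
```

`ticks_saved(36187, 30011)` is 6176, which is the "6000 ticks" won by parking ebp in a memory variable instead of pushing it. `percent_slower(64107, 39066)` is 64, in the same region as the 61% Beth quotes for dropping the stack parameters.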
Here is the code for your inline version, with ebp stored away in a memory variable:
[MemoryStack: &NULL]
call TPerformanceCounter_TimeStampRemark TestRemark7
push edi
mov ecx 1_000_000
@TestLoop5:
mov D$MemoryStack ebp
push edi esi
mov esi ARect
mov edi BRect
mov eax D$esi + TRect_Left
mov ebx D$esi + TRect_Top
mov ebp D$esi + TRect_Right
mov edx D$esi + TRect_Bottom
mov D$edi + TRect_Left eax
mov D$edi + TRect_Top ebx
mov D$edi + TRect_Right ebp
mov D$edi + TRect_Bottom edx
pop esi edi
mov ebp D$MemoryStack
sub ecx 1
jnc @TestLoop5
pop edi
call TPerformanceCounter_TimeStampRemark TestRemark2
Now this was REALLY interesting. Thanks Beth!!! I would never have thought
of that if it were not for you and this discussion. It's so great to have
this conversation. I take back all the bad things I said about you (in case
I did). This is really, really fruitful. Now let's replace all the push/pops
with memorystacks and time again:
1_000_000:
Beth with stack_____________________ :016 016
TimeStamp result____________________ :063265 063249
Beth NoStackCall_Eax________________ :063269 004
TimeStamp result____________________ :102397 039128
Beth inline without call_____________:102401 004
TimeStamp result____________________ :138930 036529
Beth inline using ebp/pop____________:138934 004
TimeStamp result____________________ :175019 036085
Beth inline ebp only to memorystack :175022 003
TimeStamp result____________________ :205012 029990
Beth with full MemoryStack__________ :205016 004
TimeStamp result____________________ :233203 028187
Mine inline _________________ :233207 004
TimeStamp result____________________ :260716 027509
Mine with full MemoryStack__________ :260720 004
TimeStamp result____________________ :287803 027083
SAME THING SECOND RUN (There are variations)
Beth with stack_____________________ :014 014
TimeStamp result____________________ :063432 063418
Beth NoStackCall_Eax________________ :063436 004
TimeStamp result____________________ :102464 039028
Beth inline without call_____________:102468 004
TimeStamp result____________________ :138537 036069
Beth inline using ebp/pop____________:138541 004
TimeStamp result____________________ :174651 036110
Beth inline ebp only to memorystack :174654 003
TimeStamp result____________________ :205577 030923
Beth with full MemoryStack__________ :205581 004
TimeStamp result____________________ :232645 027064
Mine inline _________________ :232648 003
TimeStamp result____________________ :259661 027013
Mine with full MemoryStack__________ :259664 003
TimeStamp result____________________ :286737 027073
A LAST TIME WITH 100_000_000 iterations : I have simplified the output, just
the results
100_mill iterations:
Beth with stack_____________________ :014 014
TimeStamp result____________________ :006325569 006325555
Beth NoStackCall_Eax________________ :006325574 005
TimeStamp result____________________ :010241554 003915980
Beth inline without call_____________:010241561 007
TimeStamp result____________________ :013853006 003611445
Beth inline using ebp/pop____________:013853013 007
TimeStamp result____________________ :017463246 003610233
Beth inline ebp only to memorystack :017463251 005
TimeStamp result____________________ :020478069 003014818
Beth with full MemoryStack__________ :020478077 008
TimeStamp result____________________ :023238995 002760918
Mine inline _________________ :023239000 005
TimeStamp result____________________ :025946079 002707079
Mine with full MemoryStack__________ :025946083 004
TimeStamp result____________________ :028657013 002710930
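A note on how these listings appear to be laid out (an assumption from the numbers, not something taken from the TPerformanceCounter source): the first column is the running counter value at each remark, and the second column is the difference from the line before. The sketch below, with a hypothetical function name, recomputes the second column from the first.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts cumulative counter readings into per-interval deltas.
   out[i] = stamps[i + 1] - stamps[i]; out must have room for n - 1
   entries. Returns the number of deltas written. */
size_t stamp_deltas(const uint32_t *stamps, size_t n, uint32_t *out)
{
    size_t i;
    if (n < 2) {
        return 0;
    }
    for (i = 0; i + 1 < n; i++) {
        out[i] = stamps[i + 1] - stamps[i];
    }
    return n - 1;
}
```

Fed with the first four readings of the first 1_000_000 run (16, 63265, 63269, 102397), it gives 63249, 4 and 39128, matching the second column there.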
HMMM!!!! I am not sure I understand these timings! It would almost seem
that ebp is more costly to push than other registers, and that edi esi are
more costly to push than eax edx??????? No, I am too tired (or stupid). Maybe
a more advanced asm programmer would care to comment? Randy, this is your
call...? Or better, Wolfgang, Betov or somebody ;-)
>
> In terms of "safety", this stuff makes sense and HLL compilers and HLL
> calling conventions basically _must_ do something like this, as it's
> simply unacceptable - it would lead to broken code that could not
> actually be fixed without manually editing the ASM output for the
> code - for the compiler to generate "unsafe" code...this can be
> understood (you don't have to like it to understand why it's done,
> after all ;) why the automated solution prefers not to do this...
>
> But in terms of how much code is actually recursive to need this
> default? Generally speaking, recursion is an exception and not the
> rule...so, we understand why we have this "default" perhaps but, in
> actual practice, it's just adding a "big NOP" onto all the procedures
> that, on average, doesn't need it...and seeing as even for this simple
> procedure with only two parameters, we've won a 61% increase then you
> kind of start to get the picture...most especially because in HLL
> programming, there's an awful lot _more_ calling procedures involved
> (practically _everything_ gets done by calling procedures, as HLL
> programming with libraries is very often "jigsaw puzzle programming",
> as I like to call it...you're just putting the "jigsaw pieces"
> together to make the picture you want but you're usually not the one
> creating most of the actual "jigsaw pieces" :)...
Yes! To a certain degree this is true, although it's even worse for
object-oriented languages. They spend even more overhead getting the objects
into FOV. I must sometimes do the same: the way I chose to go about it with
RosAsm, I have chosen edi as the object register. But unlike with HLLs, I can
much more easily structure my code, where needed, to make better use of
them. Say you call some object's method in a loop: most likely, if there
is more than one object (and even if there isn't), edi or esi (= the object's
dedicated register) gets reloaded every time through the loop, and worse, it
can even happen many times within the same loop, for the same object. In
fact, because of the deterministic nature of compilers, they are completely
unable to cope with this in a good manner, and OOP writers must hand-code in
basm or casm to cope with it. But this is so much trouble that only people of
asm-writer material will know about it, and those writers will not do it
everywhere, as that would destroy the so-called "ease of use" these tools
claim to offer. This problem is imo much worse for OOP languages.
Using assembly, even a beginner like myself can see what's going on, move
the pointer assignments outside the loop, and keep them steadily
pointing at the object that needs to be addressed, or even better, discard
the loop totally and use another approach. As said, it can be done in a HLL,
but then the whole point of writing in a HLL will cease to have its
"supposed" advantage. Knowing that information hiding is advocated by OOP,
what do you know about what's happening _inside_ a method? Maybe the method
is itself a loop, where multiple objects get reloaded again and again? You
know, it's not for nothing that Betov is so strict about his monofile
architecture. It's just that you have to suffer a long, long time until you
get angry enough and willing enough to admit that the "easy" way is really
the hard way.
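The reload-in-the-loop pattern and the hoisted version can be sketched in C like this. Everything here is hypothetical illustration: `Obj`, `slot` and the function names are made up, and `volatile` is used only to force the reload that the paragraph describes. Whether a given compiler hoists such a load by itself depends on what it can prove about the code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object with one field; stands in for whatever the
   object register (edi in the post) points at. */
typedef struct {
    int32_t value;
} Obj;

/* Reloads the object pointer from memory on every pass. */
int64_t sum_reloading(Obj *volatile *slot, size_t n)
{
    int64_t total = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        Obj *self = *slot;       /* pointer reloaded each iteration */
        total += self->value;
    }
    return total;
}

/* Loads the pointer once before the loop and keeps it in a local. */
int64_t sum_hoisted(Obj *volatile *slot, size_t n)
{
    Obj *self = *slot;           /* single load, outside the loop */
    int64_t total = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        total += self->value;
    }
    return total;
}
```

Both return the same sum; the second simply keeps the pointer steady for the whole loop, which is what moving the assignment outside the loop means here.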
I will answer the rest of your post another time. Hope you don't mind. I am
tired :-) It was very nice to read it.
>
> This is a simple win for ASM here; We don't have to obey any "ivory
> tower" calling conventions designed with "genericism" to every single
> procedure possible...it is an acceptable thing here to snip this stuff
> out...and, as I've noted before when I was talking about how an OS or
> library should always take the _lowest-level_ interface possible, this
> does NOT stop HLLs from using these procedures...a set of "wrappers"
> can be made which _does_ obey the "calling conventions" and drops the
> parameters from the stack - per the HLL convention - into the right
> registers and calls the actual low-level procedure...and this is what
> I mean when I say "portability is a _higher level_ concern"...the
> procedure should simply do its job with the least "fuss"
> possible..._IF_ some "portability" stuff is required with some HLL
> "calling convention" then it's possible to simply build a higher level
> set of "wrappers" that do this "portability" work, calling into the
> real function...of course, then the "stack" stuff _does_ add these
> "big NOPs" and lesser performance but then you're coding a HLL, you
> want portability, etc....in other words, you're choosing to pay the
> price for that stuff...and you would be paying it using specially
> written HLL versions of the functions, anyway...extra cost for this
> approach? One extra "CALL" instruction (which is an unconditional
> branch that the CPU should be "absorbing" the cost with its caching
> and so forth...both wrapper and procedure are user code - same
> privilege - so there won't be any "transition" costs as you might get
> with a OS API function ;)...the parameter stuff - what gets moved
> where - actually balances up to what would basically be involved with
> a procedure that has the HLL stuff hard-coded into it...but doing it
> this way provides the _choice_: Don't care for "portability"? Then
> improve your performance with an "INT 80h" style call, to borrow Linux
> as an example of this method...do care? Important to your program?
> Then call into the C "wrapper" instead...both does the same thing -
> exact same code actually does the work - but there's a _choice_ into
> how you want to access it...
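The wrapper idea in the paragraph above might look roughly like this in C. It is a sketch under assumptions: C cannot pin parameters to registers, so the "argument block" convention of the wrapper is a made-up stand-in for a stack-based HLL calling convention, and all names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical C mirror of TRect. */
typedef struct {
    int32_t left, top, right, bottom;
} Rect;

/* Low-level entry point: does the work and nothing else. */
void copy_rect_core(Rect *dst, const Rect *src)
{
    dst->left   = src->left;
    dst->top    = src->top;
    dst->right  = src->right;
    dst->bottom = src->bottom;
}

/* Wrapper with a generic argument-block convention (hypothetical):
   args[0] is the destination, args[1] the source. It only unpacks
   the parameters and forwards to the core routine. */
void copy_rect_wrapper(void *const args[2])
{
    Rect *dst = (Rect *)args[0];
    const Rect *src = (const Rect *)args[1];
    copy_rect_core(dst, src);
}
```

A caller that does not care about the generic convention calls `copy_rect_core` directly; one that does goes through `copy_rect_wrapper` and pays for one extra call. The same code does the actual work either way.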
>
> The issue here is, in a sense, a confusion between "value" and
> "variable"...a variable has a value but the two aren't quite the same
> thing...passing things via the stack is thinking in "variable"
> terms...realising that you can avoid it in many, many cases is
> thinking in "value" terms, so to speak...symbolically, the variable is
> more than merely its value so it has to be copied via
> stacks...logically and physically, though, the value _IS_ the
> variable...it's NOT the memory address that the value is stored at for
> recall, it's NOT the register you load it into, it's NONE of these
> things...but it takes a different mindset to see it: the _VALUE_ is
> the "variable"...you're just "storing" the value at a memory address
> for later recall (not enough registers to store everything there so
> pop it into RAM :)...you're just "manipulating" the _value_ when it's
> in a register and opcodes are applied to it...
>
> It's another one of these places where "abstraction" can temporarily
> blind people...as long as the _values_ are all making sense with
> regard to program logic, then _how_ this gets done is actually not
> particularly important...but when you look at a "variable" in its
> symbolic context, then the _storage_ is made the most important
> thing...symbolically, "VarA" is the contents (MASM "offset" style) or
> address (NASM _actually consistent_ "square brackets" style ;) of the
> _storage_...
>
> The difference here is to recognise the "arithmetic" under our
> "algebra", so to speak...algebra is similarly a "symbolic" abstraction
> of the arithmetic...but a symbolic view - though it's capable of
> providing perspectives that couldn't even be seen without the
> abstraction (like Pythagoras' theorem, straight line equations, etc.
> ;) - can sometimes mislead from the arithmetic underneath...for
> example, in algebra, you can't reduce "2x + 2y" because _symbolically_
> we can mix our abstraction...arithmetically, though, if "x = 2" and "y
> = 3" then nothing whatsoever stops us adding them together...and if "x
> = 0" then the "2x" bit doesn't actually exist arithmetically (anything
> multiplied by zero is zero so that term simply "vanishes" arithmetically
> :), anyway...
>
> So, am I saying "don't use algebra"? ABSOLUTELY NOT! Its power and
> flexibility and the clarity that the abstraction can bring are
> incredibly useful...their power has proved itself over the centuries,
> I think, that there's _NO DOUBT_ about this power at all...so, does
> this mean "never look at the arithmetic"? Ah, also "No!" but this is
> one of the problems with all types of abstraction - be that HLLs, be
> that device drivers, be that algebra, be that _ANY_ abstraction - that
> you abstract away the underlying "irrelevant details" to aid your
> focus...that power is important and often amazing (arguably, it is
> this sole ability that makes human intelligence what it is...we're all
> geniuses at this basic ability to amazing levels, such as, for
> example, being able to comprehend squiggly shapes on your monitor and
> actually converting that into English - which, for non-native English
> speakers, might be being converted _via_ some other "default" language
> in their minds - and then into abstract thoughts in your brain about
> what I'm saying here :)...
>
> But, in tribute to the first female Oscar nomination for directing
> happening at the Oscars, as was the name of her film, some things can
> often get "Lost in Translation"...fundamentally, ASM can often improve
> a program in completely non-trivial "leaps and bounds" on speed, size,
> algorithm, etc. and, ultimately, the lone way it manages to do this is
> because it does NOT "translate" so suffers no "loss"...and,
> amazingly, that's all it ultimately is...which is why this point
> _isn't_ as trivial as it may first appear...you know, "value?
> Variable? Symbolic this? Logical that? What on Earth is she going on
> about? There's no difference...or only a subtle inconsequential
> difference, at most"...and, in a sense, that's right..._EXCEPT_ the
> "inconsequential" part..."small" or "simple" does not necessarily mean
> "inconsequential"...
>
> "When I see the tremendous consequences that can come from small
> things I am tempted to think...that there are NO small things"
> [ Bruce Barton ]
>
> > Testing 1_000_000 copyrects inline without call _______ :102300 004
> > TimeStamp result_______________________________________ :138317 036017
>
> Hey, that demonstrates just how well the whole cache / pipeline /
> prefetch stuff actually works on modern CPUs...the "CALL" is being in
> large part "absorbed" here by the stuff on modern processors that
> pre-fetches code ahead of time ("CALL" is a branch but it's
> unconditional that it can grab it ahead of time, no problems :)...I
> suppose I am thinking in an "out of date" way (hey, I was programming
> 286s and 386s and other non-PC stuff before "pipeline" was properly
> implemented!! ;) to suggest that it would make that great a difference
> on modern CPUs...
>
> The architectural improvements made by CPU manufacturers have,
> unsurprisingly, concentrated on "absorbing" as much penalty from this
> stuff as feasible...which is actually incredibly useful because with
> most things being programmed in HLLs and modern OS architectures
> moving towards _everything_ being done via API procedures, this stuff
> is most certainly needed...look how much we're losing on the parameter
> passing stuff via the stack already (though, as parameters by their
> nature can be any amount, any type, any order, etc. there's not much
> CPU manufacturers can do to generalise optimising this
> stuff...although, if they did, then the improvement would, once again,
> prove to be dramatic :)...so, as much as can be stripped off the
> actual "CALL" is welcome...
>
> One interesting "educational" thing you could do is change the "CALL
> CopyRect" to a "mov eax, offset CopyRect; CALL [ eax ]" (register not
> important, just an example and the syntax "generic" :)...the idea
> being to force the CPU to not be able to "prefetch" the code because
> it doesn't know where it is until the "MOV" instruction loads in the
> correct address...it's not just the extra instruction and the
> indirection which should prove bothersome to the performance here...we
> should also begin to see how much is "absorbed" by CPU "pipeline"
> architecture...note that this only works to try to cut out the
> "pipeline" on the CALL itself, the rest is still working that
> way...but older machines had no form of "cache" or "pipelining" at all
> on anything and the difference that this "factory production line"
> idea makes is easy to underestimate...for instance, _ALL_ instructions
> would take multiple clocks (simply a case that a clock was wasted
> fetching, another one decoding, another few if it needed some RAM or
> something, etc. and this was before the instruction was even executed
> :) and the concept of "1 clock" or even "for free" instructions was
> totally unheard of...
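The direct versus indirect call experiment suggested above can be sketched in C with a function pointer. This is an illustration only, with hypothetical names; the `volatile` pointer is there so the compiler has to fetch the target at run time, loosely like loading the address into a register before the call. Whether the indirection costs anything measurable depends on the CPU's branch prediction, so no timing is claimed here.

```c
#include <stdint.h>

/* Hypothetical C mirror of TRect. */
typedef struct {
    int32_t left, top, right, bottom;
} Rect;

typedef void (*CopyFn)(Rect *, const Rect *);

void copy_rect(Rect *dst, const Rect *src)
{
    *dst = *src;
}

/* Direct calls: the call target is known when the code is built. */
void run_direct(Rect *dst, const Rect *src, uint32_t n)
{
    uint32_t i;
    for (i = 0; i < n; i++) {
        copy_rect(dst, src);
    }
}

/* Indirect calls: the target is read from a volatile pointer on
   every iteration and called through it. */
void run_indirect(Rect *dst, const Rect *src, uint32_t n)
{
    CopyFn volatile fn = copy_rect;
    uint32_t i;
    for (i = 0; i < n; i++) {
        fn(dst, src);
    }
}
```

Timing `run_direct` against `run_indirect` with the same iteration count would be the C counterpart of comparing a plain call with a call through eax.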
>
> There is some truth to "Look at how much better CPUs are today! Look
> at those amazing CPU speeds! There's no need to optimise
> anymore!"...the part where it falls down is the "no need to optimise
> anymore" bit...if you don't still try to optimise, then these
> improvements are about making your unoptimised code run like optimised
> code used to do...which is a great thing...until you consider that if
> you _do_ optimise, then all the CPU improvements are now going
> directly into a _better program_...kind of like saying: "the invention
> of the car rather than walking everywhere has greatly improved things
> that now I can drive my car with _no effort_ (at 3mph, about the same
> speed I would walk :) to my destination!"...umm, yeah, you _can_ do
> what you would have done walking but with less effort, that's true
> enough...but you're kind of missing the point that your car can go on
> a motorway (US: highway, Germany: Autobahn, etc. :) and travel at
> 70mph or more and, in the same time it would have taken to walk, you
> can cover immense distances...or, alternatively, go the same distance
> at a faster speed, to get somewhere in less time and so on...it might
> take a half-day to walk to a fairly distant town but a jet plane can
> get you to the other side of the world(!!) in the same amount of
> time...you're kind of slightly _missing the point_ of why these
> "optimisations" were made if all you think they are about is allowing
> you to write exactly the same program but with less effort, as you can
> "bloat" it to extreme levels - hundreds of MBs of RAM, GBs of hard
> drive space, etc. - and still happily market it as...ooh...an
> "operating system" or something...naming no names, as it's so obvious
> who we're talking about, I'd be highly patronising to you to say ;)...
>
> > Testing 1_000_000 copyrects inline using epb _______ :138321 004
> > TimeStamp result_______________________________________ :165393 027072
>
> The reason for this here should be slightly obvious; Without
> "borrowing" EBP, then ECX gets used to a double purpose and we have to
> shuffle two values back and forth in ECX for every iteration...leave
> ECX as the counter and just "borrow" a "spare" register and then we
> can use one register each for the two purposes...no "shuffling"
> required...so all the time spent on that? Totally vanishes and we get
> that time back to actually processing things rather than "management
> overhead"...
>
> > So the last version is the fastest, as expected, but the important
> > improvement seem to have been made from removing the stackframe, while the
> > call removal bought us very little.
>
> The "CALL removal" doesn't buy much on modern processors because the
> things actually "absorb" a lot of the cost with caches / pipelining /
> prefetching...this doesn't make as dramatic a difference as it used
> to...but, then again, it is _still_ an "improvement" even if not a
> dramatic one like the others...CPU optimisation is very good these
> days but then if you don't give it anything that requires
> "optimisation" in the first place then that's better, even if not by
> as great an amount as other things...also, if we're looking at this
> relative to the older machines which didn't do this stuff, then even
> this apparently "small" difference is proportionally impressive
> because the whole CPU is already running perhaps a hundred or so times
> faster than those machines...hence, a "clock cycle" is, in real-world
> times, a much smaller unit of measurement...
>
> Or, in other words, what we've saved here - taking software _and
> hardware_ over time into account - is enough time to do some pretty
> amazing _real-world_ things...because, in the end, this is the
> ultimate measure: the _practical_ one of what can get done usefully by
> the machine in the least amount of resources...
>
> That is to say, what we've accomplished in optimising this code with
> simple little tricks and optimisations might, indeed, be equivalent to
> the _entire processing power_ of some of those much older
> non-pipelined machines...and, hey, they were _perfectly capable_ of
> doing some mightily impressive stuff even with such limited resources
> (I mean, you can get to the Moon with only 32KB of RAM!! ;)...
>
> > The use of ebp gave us quite a bit.
>
> Avoids pointless shuffling of registers when EBP can be used...note
> that to make EBP "free" for this kind of thing, you'll have to say
> bye-bye to those "stack frames" as they use EBP...but saying bye-bye
> to stack frames was also the thing making the most dramatic
> difference, anyway, so it's clear which direction to head here, right?
> ;)
>
> Beth :)
>
>
>