Re: Cost of calling a standard library function
From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 03/02/04
- Previous message: The Half A Wannabee: "The Great Debate V. What have changed ?"
- Maybe in reply to: hutch--: "Re: Cost of calling a standard library function"
- Next in thread: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Reply: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 2 Mar 2004 14:16:42 -0000
The Half A Wannabee wrote:
> Beth wrote:
> Thank you for a very thorough and nice explanation Beth! the strange
thing
> is, even though I have read all this before, its very existing to
read
> about. Almost sexy. I have set up 4 versions of you code and timed
them, but
> for the last two version I left the push edi esi calls, as theese
are
> absolutely required in reality, so it would be unfair to leave that
out, as
> the code would be artificial without them.
Yeah, fair enough...like I said, I was being "illustrative" of what to
try out and stuff, rather than it being 100% "cut and paste" "as is"
code...I'm not particularly sure how to _read_ your timer results
below, let alone think I know all about how that works ;)
> here are the results : The important number is in the right column
at the
> lines that says result.
>
> Testing 1_000_000 copyrects ala Beth___________________ :017 017
> TimeStamp result_______________________________________ :063225
063208
Are these results directly comparable to the previous results you had?
Because, in fact, this is a higher number than your version in the
previous set of results...hence, _if_ this is directly comparable,
then I wouldn't have expected that...in which case, this suggests I
_may_ have fallen foul of some "optimisation rule" and am getting a
penalty because of it...so, that _would_ be something to look into -
if, as I say, these results are directly comparable - because there
might be something that could be done to improve things...and that's
even possibly true on the results below which are doing better still
because, though faster from dropping the stack and using EBP, they
could _still_ be getting this penalty, if there is one...
Mind you, I don't know how this timer thing works exactly or what
units we're measuring here (TSC units?) or whatever...hence, _IF_ the
results aren't directly comparable, then the above doesn't
apply...perhaps put them all into a program to get comparable results?
Anyway, just doing a "Sherlock" on the results...even if the results
are better overall, there could still be "issues" with it, if some of
the results aren't going down as expected in the right places...
> Testing 1_000_000 copyrects ala Beth no stack__________ :063228 003
> TimeStamp result_______________________________________ :102296
039068
And this result - a 61% increase - I think starts to explain why one
point I always make about HLLs and calling conventions is that bloody
stack!! Especially because, if you think over the logic, as it's not a
recursive procedure, then stuffing ESI and EDI onto the stack only to
pull them straight back off of it is a physically _redundent_
action...you're copying something onto the stack only to take it back
off again and stick it into a register (which, worse, could be the
exact same register we copied from, meaning that we're copying a value
via the stack back to where it already is, anyway!! ;)...
HLLs do the stack thing because it's "recursive safe"; _IF_ this
procedure were to be recursed then the stack makes perfect sense as a
means to keep track of things...although, generally, this is usually
_NOT_ the case with a large amount of code...hence, we actually tend
to have what, in the main, is a inappropriate "default" (but the
reasons for that make perfect sense that, really, for a HLL compiler,
it _couldn't_ feasibly be any other way ;)...
This is a grand problem with "generalisations" like this...the stack
method is always "safe" for all types...if it's recursive, the stack
is made to measure (even if you didn't use the stack for the parameter
passing itself then it would be a natural choice for PUSHing and
POPping your variables so that they don't conflict with each other
;)...if it's NOT recursive, then the method _still_ works, it's just
there's a whole bunch of code in your program that's effectively one
big "NOP", if you will...it's spending all this time copying values
from registers to stack and back again, making you wonder "can't we
just leave it in the register in the first place...or, at most, use a
'mov' instruction to get it from one register to another one, expected
by the procedure?"...
In terms of "safety", this stuff makes sense and HLL compilers and HLL
calling conventions basically _must_ do something like this, as it's
simply unacceptable - it would lead to broken code that could not
actually be fixed without manually editing the ASM output for the
code - for the compiler to generate "unsafe" code...this can be
understood (you don't have to like it to understand why it's done,
after all ;) why the automated solution prefers not to do this...
But in terms of how much code is actually recursive to need this
default? Generally speaking, recursion is an exception and not the
rule...so, we understand why we have this "default" perhaps but, in
actual practice, it's just adding a "big NOP" onto all the procedures
that, on average, doesn't need it...and seeing as even for this simple
procedure with only two parameters, we've won a 61% increase then you
kind of start to get the picture...most especially because in HLL
programming, there's an awful lot _more_ calling procedures involved
(practically _everything_ gets done by calling procedures, as HLL
programming with libraries is very often "jigsaw puzzle programming",
as I like to call it...you're just putting the "jigsaw pieces"
together to make the picture you want but you're usually not the one
creating most of the actual "jigsaw pieces" :)...
This is a simple win for ASM here; We don't have to obey any "ivory
tower" calling conventions designed with "genericism" to every single
procedure possible...it is an acceptable thing here to snip this stuff
out...and, as I've noted before when I was talking about how an OS or
library should always take the _lowest-level_ interface possible, this
does NOT stop HLLs from using these procedures...a set of "wrappers"
can be made which _does_ obey the "calling conventions" and drops the
parameters from the stack - per the HLL convention - into the right
registers and calls the actual low-level procedure...and this is what
I mean when I say "portability is a _higher level_ concern"...the
procedure should simply do its job with the least "fuss"
possible..._IF_ some "portability" stuff is required with some HLL
"calling convention" then it's possible to simply build a higher level
set of "wrappers" that do this "portability" work, calling into the
real function...of course, then the "stack" stuff _does_ add these
"big NOPs" and lesser performance but then you're coding a HLL, you
want portability, etc....in other words, you're choosing to pay the
price for that stuff...and you would be paying it using specially
written HLL versions of the functions, anyway...extra cost for this
approach? One extra "CALL" instruction (which is an unconditional
branch that the CPU should be "absorbing" the cost with its caching
and so forth...both wrapper and procedure are user code - same
priviledge - so there won't be any "transition" costs as you might get
with a OS API function ;)...the parameter stuff - what gets moved
where - actually balances up to what would basically be involved with
a procedure that has the HLL stuff hard-coded into it...but doing it
this way provides the _choice_: Don't care for "portability"? Then
improve your performance with an "INT 80h" style call, to borrow Linux
as an example of this method...do care? Important to your program?
Then call into the C "wrapper" instead...both does the same thing -
exact same code actually does the work - but there's a _choice_ into
how you want to access it...
The issue here is, in a sense, a confusion between "value" and
"variable"...a variable has a value but the two aren't quite the same
thing...passing things via the stack is thinking in "variable"
terms...realising that you can avoid it in many, many cases is
thinking in "value" terms, so to speak...symbolically, the variable is
more than merely its value so it has to be copied via
stacks...logically and physically, though, the value _IS_ the
variable...it's NOT the memory address that the value is stored at for
recall, it's NOT the register you load it into, it's NONE of these
things...but it takes a different mindset to see it: the _VALUE_ is
the "variable"...you're just "storing" the value at a memory address
for later recall (not enough registers to store everything there so
pop it into RAM :)...you're just "manipulating" the _value_ when it's
in a register and opcodes are applied to it...
It's another one of these places where "abstraction" can temporarily
blind people...as long as the _values_ are all making sense with
regard to program logic, then _how_ this gets done is actually not
particularly important...but when you look at a "variable" in its
symbolic context, then the _storage_ is made the most important
thing...symbolically, "VarA" is the contents (MASM "offset" style) or
address (NASM _actually consistent_ "square brackets" style ;) of the
_storage_...
The difference here is to recognise the "arithmetic" under our
"algebra", so to speak...algebra is similarly a "symbolic" abstraction
of the arithmetic...but a symbolic view - though it's capable of
providing perspectives that couldn't even be seen without the
abstraction (like Pythogoras' theorem, straight line equations, etc.
;) - can sometimes mislead from the arithmetic underneath...for
example, in algebra, you can't reduce "2x + 2y" because _symbolically_
we can mix our abstraction...arithmetically, though, if "x = 2" and "y
= 3" then nothing whatsoever stops us adding them together...and if "x
= 0" then the "2x" bit doesn't actually exist arithmetically (anything
multiplied by zero is zero so that term simply "vanish" arithmetically
:), anyway...
So, am I saying "don't use algebra"? ABSOLUTELY NOT! Its power and
flexibility and the clarity that the abstraction can bring are
incredibly useful...their power has proved itself over the centuries,
I think, that there's _NO DOUBT_ about this power at all...so, does
this mean "never look at the arithmetic"? Ah, also "No!" but this is
one of the problems with all types of abstraction - be that HLLs, be
that device drivers, be that algebra, be that _ANY_ abstraction - that
you abstract away the underlying "irrelevent details" to aid your
focus...that power is important and often amazing (arguably, it is
this sole ability that makes human intelligence what it is...we're all
geniuses at this basic ability to amazing levels, such as, for
example, being able to comprehend squiggly shapes on your monitor and
actually converting that into English - which, for non-native English
speakers, might be being converted _via_ some other "default" language
in their minds - and then into abstract thoughts in your brain about
what I'm saying here :)...
But, in tribute to the first female Oscar nomination for directing
happening at the Oscars, as was the name of her film, some things can
often get "Lost in Translation"...fundamentally, ASM can often improve
a program in completely non-trivial "leaps and bounds" on speed, size,
algorithm, etc. and, ultimately, the lone way it manages to do this is
because it does NOT "translation" so suffers no "loss"...and,
amazingly, that's all it ultimately is...which is why this point
_isn't_ as trivial as it may first appear...you know, "value?
Variable? Symbolic this? Logical that? What on Earth is she going on
about? There's no difference...or only a subtle inconsequential
difference, at most"...and, in a sense, that's right..._EXCEPT_ the
"inconsequential" part..."small" or "simple" does not necessarily mean
"inconsequential"...
"When I see the tremendous consequences that can come from small
things I am tempted to think...that there are NO small things"
[ Bruce Barton ]
> Testing 1_000_000 copyrects inline without call _______ :102300 004
> TimeStamp result_______________________________________ :138317
036017
Hey, that demonstrates just how well the whole cache / pipeline /
prefetch stuff actually works on modern CPUs...the "CALL" is being in
large part "absorbed" here by the stuff on modern processors that
pre-fetches code ahead of time ("CALL" is a branch but it's
unconditional that it can grab it ahead of time, no problems :)...I
suppose I am thinking in an "out of date" way (hey, I was programming
286s and 386s and other non-PC stuff before "pipeline" was properly
implemented!! ;) to suggest that it would make that great a difference
on modern CPUs...
The architectural improvements made by CPU manufacturers have,
unsurprisingly, concentrated on "absorbing" as much penalty from this
stuff as feasible...which is actually incredibly useful because with
most things being programmed in HLLs and modern OS architectures
moving towards _everything_ being done via API procedures, this stuff
is most certainly needed...look how much we're losing on the parameter
passing stuff via the stack already (though, as parameters by their
nature can be any amount, any type, any order, etc. there's not much
CPU manufacturers can do to generalise optimising this
stuff...although, if they did, then the improvement would, once again,
prove to be dramatic :)...so, as much as can be stripped off the
actual "CALL" is welcome...
One interesting "educational" thing you could do is change the "CALL
CopyRect" to a "mov eax, offset CopyRect; CALL [ eax ]" (register not
important, just an example and the syntax "generic" :)...the idea
being to force the CPU to not be able to "prefetch" the code because
it doesn't know where it is until the "MOV" instruction loads in the
correct address...it's not just the extra instruction and the
indirection which should prove bothersome to the performance here...we
should also begin to see how much is "absorbed" by CPU "pipeline"
architecture...note that this only works to try to cut out the
"pipeline" on the CALL itself, the rest is still working that
way...but older machines had no form of "cache" or "pipelining" at all
on anything and the difference that this "factory production line"
idea makes is easy to underestimate...for instance, _ALL_ instructions
would take multiple clocks (simply a case that a clock was wasted
fetching, another one decoding, another few if it needed some RAM or
something, etc. and this was before the instruction was even executed
:) and the concept of "1 clock" or even "for free" instructions was
totally unheard of...
There is some truth to "Look at how much better CPUs are today! Look
at those amazing CPU speeds! There's no need to optimise
anymore!"...the part where it falls down is the "no need to optimise
anymore" bit...if you don't still try to optimise, then these
improvements are about making your unoptimised code run like optimised
code used to do...which is a great thing...until you consider that if
you _do_ optimise, then all the CPU improvements are now going
directly into a _better program_...kind of like saying: "the invention
of the car rather than walking everywhere has greatly improved things
that now I can drive my car with _no effort_ (at 3mph, about the same
speed I would walk :) to my destination!"...umm, yeah, you _can_ do
what you would have done walking but with less effort, that's true
enough...but you're kind of missing the point that your car can go on
a motorway (US: highway, Germany: Autobahn, etc. :) and travel at
70mph or more and, in the same time it would have taken to walk, you
can cover immense distances...or, alternatively, go the same distance
at a faster speed, to get someone in less time and so on...it might
take a half-day to walk to a fairly distant town but a jet plane can
get you to the other side of the world(!!) in the same amount of
time...you're kind of slightly _missing the point_ of why these
"optimisations" were made if all you think they are about is allowing
you to write exactly the same program but with less effort, as you can
"bloat" it to extreme levels - hundreds of MBs of RAM, GBs of hard
drive space, etc. - and still happily market it as...ooh...an
"operating system" or something...naming no names, as it's so obvious
who we're talking about, I'd be highly patronising to you to say ;)...
> Testing 1_000_000 copyrects inline using epb _______ :138321 004
> TimeStamp result_______________________________________ :165393
027072
The reason for this here should be slightly obvious; Without
"borrowing" EBP, then ECX gets used to a double purpose and we have to
shuffle two values back and forth in ECX for every iteration...leave
ECX as the counter and just "borrow" a "spare" register and then we
can use one register each for the two purposes...no "shuffling"
required...so all the time spent on that? Totally vanishes and we get
that time back to actually processing things rather than "management
overhead"...
> So the last version is the fastest, as expected, but the important
> improvement seem to have been made from removing the stackframe,
while the
> call removal bought us very little.
The "CALL removal" doesn't buy much on modern processors because the
things actually "absorb" a lot of the cost with caches / pipelining /
prefetching...this doesn't make as dramatic a difference as it used
to...but, then again, it is _still_ an "improvement" even if not a
dramatic one like the others...CPU optimisation is very good these
days but then if you don't give it anything that requires
"optimisation" in the first place then that's better, even if not by
as a great amount as other things...also, if we're looking at this
relative to the older machines which didn't do this stuff, then even
this apparently "small" difference is proportionally impressive
because the whole CPU is already running perhaps a hundred or so times
faster than those machines...hence, a "clock cycle" is, in real-world
times, a much smaller unit of measurement...
Or, in other words, what we've saved here - taking software _and
hardware_ over time into account - is enough time to do some pretty
amazing _real-world_ things...because, in the end, this is the
ultimate measure: the _practical_ one of what can get done usefully by
the machine in the least amount of resources...
That is to say, what we've accomplished in optimising this code with
simple little tricks and optimisations might, indeed, be equivalent to
the _entire processing power_ of some of those much older
non-pipelined machines...and, hey, they were _perfectly capable_ of
doing some mightily impressive stuff even with such limited resources
(I mean, you can get to the Moon with only 32KB of RAM!! ;)...
> The use of ebp gave us quite a bit.
Avoids pointless shuffling of registers when EBP can be used...note
that to make EBP "free" for this kind of thing, you'll have to say
bye-bye to those "stack frames" as they use EBP...but saying bye-bye
to stack frames was also the thing making the most dramatic
difference, anyway, so it's clear which direction to head here, right?
;)
Beth :)
- Previous message: The Half A Wannabee: "The Great Debate V. What have changed ?"
- Maybe in reply to: hutch--: "Re: Cost of calling a standard library function"
- Next in thread: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Reply: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]