Re: Cost of calling a standard library function

From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 03/03/04


Date: Wed, 3 Mar 2004 21:34:34 -0000

The Half A Wannabee wrote:
> So there should occur no special AGI stall because of this? As I see
> it, it "proves" that push/pop is more costly than move to/from memory.
> This makes sense, since push allocates memory, and pop deallocates it.
> But still one should hope the special processor instruction would do
> it faster.

Not actually true; PUSH and POP don't "allocate" and "deallocate" in
that sense, they simply move ESP down and up as appropriate, copying
the value between a register and [ ESP ]...the stack itself is just
one long bit of "scratch pad" memory that the CPU can use any old
way...the whole stack is "allocated" in one go and it's simply the
natural "FILO" (first in, last out) structure of the stack that means
things are always naturally kept in order...

Hence, all the CPU does is, basically:

PUSH eax = SUB ESP, 4; MOV [ esp ], eax
POP eax = MOV eax, [ esp ]; ADD ESP, 4

The "allocation" and "deallocation" isn't any literal call to the OS
to ask for memory...it really is just a case of adding and subtracting
numbers from ESP...that's, in fact, how you'd create your "local
variables" in a program...stick in a "SUB ESP, SizeOfVariables" to
"allocate" the room for the variables and then "ADD ESP,
SizeOfVariables" to "deallocate" it at the end...
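
To make the ESP arithmetic concrete, here's a toy model in Python (not ASM, it's only there to show the bookkeeping)...the names "ToyStack", "alloc_locals" and "free_locals" are made up for this sketch, it assumes 32-bit values and a little-endian layout, and it does no bounds checking at all...

```python
class ToyStack:
    """Toy model of a downward-growing 32-bit stack (hypothetical names)."""

    def __init__(self, size=64):
        self.mem = bytearray(size)  # the whole stack region is reserved in one go
        self.esp = size             # ESP starts at the top and moves down

    def push(self, value):
        self.esp -= 4                                  # SUB ESP, 4
        self.mem[self.esp:self.esp + 4] = value.to_bytes(4, "little")  # MOV [ESP], value

    def pop(self):
        value = int.from_bytes(self.mem[self.esp:self.esp + 4], "little")  # MOV reg, [ESP]
        self.esp += 4                                  # ADD ESP, 4
        return value

    def alloc_locals(self, nbytes):
        self.esp -= nbytes          # SUB ESP, SizeOfVariables
        return self.esp             # lowest address of the local variable area

    def free_locals(self, nbytes):
        self.esp += nbytes          # ADD ESP, SizeOfVariables
```

Nothing here ever asks anyone for memory...every "allocation" is just a subtraction on "esp" and every "deallocation" is just an addition.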

Oh, if you're wondering, the stack is "upside-down" in
memory...hence, you allocate by moving ESP _down_, not up (which is
probably what most people would expect)...it's a deliberate thing so
that the stack can potentially share memory with other data working
upwards...putting it in a simplistic way, the stack can be put at the
top of memory and the rest of the data at the bottom, then things can
be allocated moving towards the centre...if they meet? "Out of memory
error!!"...it's slightly more complex than that in practice with
virtual memory and stuff...but that's the basic idea of why it's
"upside-down"...of course, to the computer itself, there is no "up" or
"down" really and it honestly doesn't care which way it works, so
there's no problem with the stack "growing" downwards rather than
upwards ;)...
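
The "grow towards the centre" idea can be sketched like this, again as a toy in Python...the class and method names are invented for illustration, and it deliberately ignores virtual memory and everything else a real OS does...

```python
class ToyMemory:
    """One flat block: data grows up from 0, the stack grows down from the top."""

    def __init__(self, size):
        self.heap_top = 0        # first free address above the data
        self.stack_ptr = size    # current stack pointer (moves downward)

    def grow_heap(self, nbytes):
        if self.heap_top + nbytes > self.stack_ptr:
            raise MemoryError("data ran into the stack")
        self.heap_top += nbytes

    def grow_stack(self, nbytes):
        if self.stack_ptr - nbytes < self.heap_top:
            raise MemoryError("stack ran into the data")
        self.stack_ptr -= nbytes

    def free_bytes(self):
        # Whatever is left in the middle can go to either side.
        return self.stack_ptr - self.heap_top
```

The point of the layout is that neither side needs a fixed limit decided in advance...the free space in the middle is shared until the two pointers meet.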

> But this is not the weirdest thing. The weird thing is that the code
> below, where all push/pops are made into memory moves, for a total of
> 3/6 memory moves compared to originally 3 pushes, is not faster by
> 12000 ticks, but only by less than 2000. So while changing push ebp to
> mov mem32 ebp got us 6000 ticks in total, the replacements of edi AND
> esi pushes got us only another 2000 ticks.

I tried to warn you that this stuff ain't "clear-cut" by any
means...it's _rarely_ ever a simple case of "this instruction adds
this amount" or "take this instruction away to get this amount
back"...this is especially true on modern CPU architectures with
their increasing use of parallelism and "out of order" execution and
so forth...it's even possible to get "free" instructions (effectively
_instant_, taking no clocks at all, so to speak :)...the truth is that
these instructions aren't really "free" but are occurring alongside
other instructions _at the same time_, so they take no additional time
and come out measured as taking zero clocks, in a manner of speaking,
which seems impossible...but that's the "parallelism" for you...the
stuff is, of course, actually taking time to do, but because more than
one thing happens at the same time without waiting for other things,
all those things that run simultaneously end up seemingly taking
no additional time at all(!!)...

And one factor that comes into this is that the surrounding
instructions can also affect how long an instruction takes...all that
"AGI stall" and "partial stall" stuff means that an instruction could
be quick or slow depending on what instructions precede it and what
"dependencies" we have between the instructions (because for running
these things at the same time, they shouldn't "depend" on each other
because that would mean one instruction _waiting_ for the other to
complete before it can run, so they can't be "parallelised" :)...
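
A very rough way to see why dependencies matter is to compare "add up every latency" against "let independent things overlap"...the sketch below is a toy in Python with made-up latencies, it assumes unlimited execution units (no real CPU has that), and the function names are just for illustration...

```python
def serial_cycles(instrs):
    """Old-style cycle counting: just add up every latency."""
    return sum(latency for _, latency, _ in instrs)


def overlapped_cycles(instrs):
    """Each instruction starts as soon as everything it depends on is done.

    instrs is a list of (name, latency, deps) in program order, where deps
    names earlier instructions whose results this one needs.
    """
    finish = {}
    for name, latency, deps in instrs:
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + latency
    return max(finish.values(), default=0)


# Three instructions that don't depend on each other...
independent = [("a", 1, []), ("b", 1, []), ("c", 1, [])]
# ...versus three where each one waits for the one before it.
chain = [("a", 1, []), ("b", 1, ["a"]), ("c", 1, ["b"])]
```

In this model the independent sequence "costs" 3 when counted serially but only 1 when overlapped, so two of the instructions look "free"...the dependent chain costs 3 either way because each step has to wait for the last one.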

And this stuff has gotten more and more complex with
each chip...with "out of order" execution, it really does throw a
whole bunch of instructions into a "pool" and then says "right, what's
the best order to run these instructions in that ensures that it all
runs as expected - we can't have the mathematics done out of order,
getting the wrong results - but the least amount of delays"...these
latest OOE CPUs have, in fact, an "on-the-fly optimiser" built-into
the execution unit itself (which, by the way, is even more complicated
by the fact that the "pool of instructions" is NOT necessarily the
machine instructions you typed into the source because the CPUs are
actually RISC at their core - only the really simple instructions -
and the more "CISC-y" instructions are actually broken down into
"micro-ops" and it's _those_ that get thrown into the "pool" and may
be executed "out of order"...for example, you see how I wrote "PUSH"
and "POP" as a bunch of simpler instructions? Well, the CPU kind of
automatically does that to your code as it executes...the "core" of
the CPU is now RISC and _only_ contains these "simple" instructions
and merely "translates" the instructions in the source code to a bunch
of simpler instructions...which is kind of cool for the "out of order"
stuff because it's possible that the broken down simpler instructions
have more scope to be "overlapped" than what we would see only looking
at the source code instructions :)...

Oh, yeah...the good old notion of "1:1" really has died an awful long
time ago...assembly language, though, still stands where it always did
because though there are these "micro-ops", this is all internal to
the hardware and you can't change it or program the chip in micro-ops
from software...the ASM we see _is_ still the lowest level available
to _software_...

[ snip ]
> All this info was given in the previous post, and also the timings, so
> go back and recap if you need to.
>
> But the funniest thing is that the fastest code was the first hack I
> wrote, that Randall (lol lol lol) told me was inefficient. He is a
> great teacher, don't you think? ;-)

Actually, Randy meant that the method _you_ were taking was not
efficient rather than the actual code itself...it really is
"beginner's luck" alone that the first thing you hacked out ended up
working well...it's an inefficient _approach_ in the general
case...you just happened to "fluke" things here...you may not be so
lucky next time and Randy's advice was actually "don't depend on this
always working"...that your _approach_ to the problem is the
"inefficient" thing, not necessarily the code itself ;)

> Of course, this is not proof. More tests would have to be made, and it
> would have to be attacked from several directions to be considered
> proof. But it's still funny that such things happen; it means (just as
> M. Abrash said in his book) that measuring actual code is _absolutely_
> needed, and that counting Intel cycle timings is a bloody pointless
> waste of your time.

Oh, yeah...on modern chips, with their parallelism and complexity,
"counting cycles" is not at all a useful thing to do any more...well,
except maybe as a "really rough guide" as to what to expect from
different sequences...up to the Pentium, you could actually roughly do
pretty well just counting up the clocks listed in the manuals...but
then things got "parallel" and "complicated" that it just doesn't work
that way anymore...
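
The "measure it, don't count it" method looks roughly like this...this is a sketch in Python using the standard "time.perf_counter_ns" wall-clock timer rather than any cycle counter, the function name "best_time_ns" is invented here, and taking the minimum over several repeats is just one common way to cut down on noise from the rest of the system...

```python
import time


def best_time_ns(fn, repeats=5, loops=10000):
    """Run fn() 'loops' times per batch and return the fastest batch in ns."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        for _ in range(loops):
            fn()
        elapsed = time.perf_counter_ns() - start
        # Keep the quickest batch; slower ones usually include outside noise.
        if best is None or elapsed < best:
            best = elapsed
    return best
```

You'd time two variants of the same routine with identical "repeats" and "loops" and compare the results...the numbers only mean something relative to each other on the same machine, which is rather the point being made above.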

> But then again, when all this is said and done, the best optimizer is
> not this stuff; this is mostly nitpicking, for curiosity. It probably
> has zero interest, as the next CPU from AMD may break it completely.
Probably; I've seen all the enhancements and some are pretty radically
different from the 286 onwards...now you could say "oh, it always
changes so there's no point keeping up at all"...but then, how you
going to "machine think" when you're ignoring the machine? You don't
have to nit-pick down to individual clock cycles but you kind of have
to keep up with broad ideas like "oh, look, they've added a
cache...hey, the FPU is built-in now...ah, I see they've begun to use
out-of-order execution" and move the style you code with slowly
towards these things...

I mean, what else is an ASM coder's job? Kind of like a fashion
designer or hairdresser saying: "Hey, these fashions just keep on
changing! Who cares what the new black is? I'm just going to stick
with these '70s hairstyles and clothes forever!"...

Sure, you don't need to know everything in exacting detail that you
can personally create your own CPU blindfolded using ordinary
household items or anything silly like that...but it's kind of
implicit in this ASM game that you keep up with at least the rough
changes of the CPUs...but, note, this really isn't as hard as it
sounds...unlike learning all the new changes in the Java library when
it changes versions, when a new CPU comes out it tends to just add one
set of new instructions - MMX, SSE - and some different kind of
improved architecture which you can basically get the idea of, like
pipelining, pairing, out-of-order execution, etc....

But ignoring the machine completely and then also saying that you're
"machine thinking" and are a great ASM coder?

Will you stop listening to Rene! He's only saying that this stuff
"isn't important" now to cover his embarrassment that RosAsm didn't do
what it was supposed to do...

You, of course, don't have to obsess over this...but ignoring it
completely? Well, why exactly are you using ASM rather than, say, C? C
takes care of all this stuff for you that you needn't worry about the
machine at all with pretty reasonable results, considering...

AMD _will_ break this stuff next time, guaranteed...but you've got to
at least keep slightly abreast of developments in a basic way...or,
really, if you totally don't care about the machine or how it works
then what you doing in this newsgroup and using ASM? It's like
there'll be new fashions next season but what hairdresser who calls
themselves "good" stays just giving everyone '80s hairstyles
regardless of what the fashion is or what clothes shop just sells '70s
fashions _forever_?

I mean, come on...unlike perhaps any other area, the only requirement
is to read the manuals when a new chip comes out to see what's been
added and changed...and these are _real_ improvements rather than the
kind where Microsoft release a new "library" every two weeks, calling
it a "technology improvement", making everyone re-learn and forcing
them to follow along and sign up to "Microsoft certificate"
nonsense...the typical "turn buying a house into buying your weekly
groceries" thing...keep selling someone the same thing over and over
but change the logo around so that they actually think that there's
some "fundamental new technology" or something...nah, just re-arranged
the functions, added on a parameter or two here and there, renamed
everything and then released it as "version 2"...then wait for the
money to roll in from HLL developers who're locked into this trap and
don't realise that they ain't paying for anything new at all...

On the other hand, the ASM coder just needs the new processor
manual...learns some new "extension" like MMX or SSE...re-adjusts how
they think of optimising because of new architectural
improvements...there, job done...

It's no big deal...if you think it is, then consider all the coders
here who used to work on Z80 or 6502 chips...played around with 68K
Motorolas...that's a complete change altogether to something entirely
different often...yet, such ASM coders can still be found here talking
about how to "nit-pick" optimise on x86s...and when the x86
architecture finally dies - if it ever does - then onward and upward
we go...you can't not re-learn things in the computer industry or you
will get left behind because the hardware keeps moving forward...be
thankful that ASM makes the _least_ demands as you only need to learn
how the _hardware_ changes when there is actually _real_
improvement...the HLL people have to keep buying into libraries that
change all the time and then for some new language to be "the new
black" and things can change a hundred times in the HLL world without
there having, in fact, been a _single technological improvement in the
hardware at all_...now those guys really are dogs chasing after their
own tails...at least ASM coders only chase after food - because that
bit is "non-optional", what with it being the means to stay alive -
not their own tail...

> About RosAsm macros. For fun, I was thinking about creating my own
> private stack, and redefining the push / pop macros in RosAsm! They can
> most easily be redefined like this (I use only DWORDS, whenever
> possible):
>
> [Push| add CustomStackPointer 4 | mov D$CustomStackPointer #1 ]
> [Pop| mov #1 D$CustomStackPointer | sub CustomStackPointer 4]
>
> Or something, maybe that was wrong actually! I will do this one of
> these days, and then TIME the creation and destruction of several
> millions of objects, strings and memory allocations, and then see if it
> makes any difference to the timings. If it turns out that a custom
> memorystack is faster than the normal stack....hehe, then I will start
> to laugh....

Me too; And then when you create all these macros, designed to be
easily modified so that you can make these changes just by tweaking
one macro at the top and it automatically cascades throughout your
program...then when you realise that something like this "memorystack"
might help out your code, you can have fun with "specific assembly"
hand-coding all the changes in all of your programs rather than change
in one place and re-compile...

Hey, don't get me wrong...I've just spent my time lecturing you on the
(mild) importance of nit-picking, after all...that's a kind of
"specific assembly"...but, unlike Rene, I'm not saying "always use
macros!", "never use macros!" or whatever...I'm saying: "keep your
options open"...what makes the most sense for you to do in any
particular situation is NOT something that can be pre-decided at
all...the _application_ defines these things...writing some crap "map
maker" personal utility for that "text adventure game" thing? Heck,
why am I even using ASM for that? What a total waste of time! Nah,
just knock together something that only just works and holds together
in BASIC, if it's something just for yourself to do a small task and
you really don't care if the graphics get a bit messy on the screen
and you have to click on another window and back onto your window to
"workaround" the problem...on the other hand, if you're writing the
software graphics engine for a cutting edge computer game that's to
make DOOM III look stupid with its impressive graphics that is totally
a commercial "for public consumption" thing that you _can't_ get away
with a less than professional standard (ignore Microsoft...they
_ain't_ a role model in how they write _everything_ like it was some
crap five minute personal utility with all the weird "oh, always pick
that option first before the other one or it crashes...can't be
bothered to fix it yet!" work-around stuff everywhere...that's NOT
acceptable when you're doing a _professional_ job and _providing a
service_ to the public ;) then, like, get ready to really, really
"nit-pick" that code...

I repeat, what makes ASM is the _Liberty_...all those options to code
up a bunch of HLL macros and treat it like programming C...or to
nit-pick right down to a clock cycle or two on a specific
architecture...to add "portability" when you need it and to hit the
ejector seat button on it when you don't and laugh watching it
comically fly off into the air with a tiny little parachute silhouette
appearing way off on the horizon, much like that small mushroom cloud
every time Wile E. Coyote falls down that exact same ravine! Splat! :)

The power comes totally from _Liberty_...to be able to let the
_application_ that you're writing define the terms...so that when
nit-picking allows a program to run ten times faster and make some new
type of 3D graphics like real-time raytracing work, then ASM is there
for you...always able to take the machine right to its limits...but,
on the other hand, you can take a more relaxed attitude and not give
the slightest crap about size or speed using HLLisms and macros
everywhere for that "who cares?" little utility program that won't see
the light of day in public, anyway...and, yeah, sometimes it'll make
more sense to just use C instead...

What's "wrong" here is not which thing you obsess over but the
_obsession_ itself...ASM's main benefit is that everything is exposed
and you're at Liberty to choose _whatever makes the most sense for
your application_ at that point...keeping your options _open_ is what
makes the most sense in general..._thirst_ for that knowledge to never
be tied down to _anything_...ever...make yourself able to _deliver_
regardless of what it might be that you're called on to deliver...

You want "philosophy", Wannabee? I'm Obi-wan Kenobi, Jedi master, when
it comes to philosophy! Why you want to go along with a cheap,
inconsistent imitator like Rene? ;)

Beth :)