Re: Cost of calling a standard library function

From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 03/02/04

  • Next message: Daniel Ellard,,,: "Re: asm reference for 390/z-series machines?"
    Date: Tue, 2 Mar 2004 14:16:42 -0000
    
    

    The Half A Wannabee wrote:
    > Beth wrote:
    > Thank you for a very thorough and nice explanation Beth! the strange
    thing
    > is, even though I have read all this before, its very existing to
    read
    > about. Almost sexy. I have set up 4 versions of you code and timed
    them, but
    > for the last two version I left the push edi esi calls, as theese
    are
    > absolutely required in reality, so it would be unfair to leave that
    out, as
    > the code would be artificial without them.

    Yeah, fair enough...like I said, I was being "illustrative" of what to
    try out and stuff, rather than it being 100% "cut and paste" "as is"
    code...I'm not particularly sure how to _read_ your timer results
    below, let alone think I know all about how that works ;)

    > here are the results : The important number is in the right column
    at the
    > lines that says result.
    >
    > Testing 1_000_000 copyrects ala Beth___________________ :017 017
    > TimeStamp result_______________________________________ :063225
    063208

    Are these results directly comparable to the previous results you had?
    Because, in fact, this is a higher number than your version in the
    previous set of results...hence, _if_ this is directly comparable,
    then I wouldn't have expected that...in which case, this suggests I
    _may_ have fallen foul of some "optimisation rule" and am getting a
    penalty because of it...so, that _would_ be something to look into -
    if, as I say, these results are directly comparable - because there
    might be something that could be done to improve things...and that's
    even possibly true on the results below which are doing better still
    because, though faster from dropping the stack and using EBP, they
    could _still_ be getting this penalty, if there is one...

    Mind you, I don't know how this timer thing works exactly or what
    units we're measuring here (TSC units?) or whatever...hence, _IF_ the
    results aren't directly comparable, then the above doesn't
    apply...perhaps put them all into a program to get comparable results?

    Anyway, just doing a "Sherlock" on the results...even if the results
    are better overall, there could still be "issues" with it, if some of
    the results aren't going down as expected in the right places...

    > Testing 1_000_000 copyrects ala Beth no stack__________ :063228 003
    > TimeStamp result_______________________________________ :102296
    039068

    And this result - a 61% increase - I think starts to explain why one
    point I always make about HLLs and calling conventions is that bloody
    stack!! Especially because, if you think over the logic, as it's not a
    recursive procedure, then stuffing ESI and EDI onto the stack only to
    pull them straight back off of it is a physically _redundent_
    action...you're copying something onto the stack only to take it back
    off again and stick it into a register (which, worse, could be the
    exact same register we copied from, meaning that we're copying a value
    via the stack back to where it already is, anyway!! ;)...

    HLLs do the stack thing because it's "recursive safe"; _IF_ this
    procedure were to be recursed then the stack makes perfect sense as a
    means to keep track of things...although, generally, this is usually
    _NOT_ the case with a large amount of code...hence, we actually tend
    to have what, in the main, is a inappropriate "default" (but the
    reasons for that make perfect sense that, really, for a HLL compiler,
    it _couldn't_ feasibly be any other way ;)...

    This is a grand problem with "generalisations" like this...the stack
    method is always "safe" for all types...if it's recursive, the stack
    is made to measure (even if you didn't use the stack for the parameter
    passing itself then it would be a natural choice for PUSHing and
    POPping your variables so that they don't conflict with each other
    ;)...if it's NOT recursive, then the method _still_ works, it's just
    there's a whole bunch of code in your program that's effectively one
    big "NOP", if you will...it's spending all this time copying values
    from registers to stack and back again, making you wonder "can't we
    just leave it in the register in the first place...or, at most, use a
    'mov' instruction to get it from one register to another one, expected
    by the procedure?"...

    In terms of "safety", this stuff makes sense and HLL compilers and HLL
    calling conventions basically _must_ do something like this, as it's
    simply unacceptable - it would lead to broken code that could not
    actually be fixed without manually editing the ASM output for the
    code - for the compiler to generate "unsafe" code...this can be
    understood (you don't have to like it to understand why it's done,
    after all ;) why the automated solution prefers not to do this...

    But in terms of how much code is actually recursive to need this
    default? Generally speaking, recursion is an exception and not the
    rule...so, we understand why we have this "default" perhaps but, in
    actual practice, it's just adding a "big NOP" onto all the procedures
    that, on average, doesn't need it...and seeing as even for this simple
    procedure with only two parameters, we've won a 61% increase then you
    kind of start to get the picture...most especially because in HLL
    programming, there's an awful lot _more_ calling procedures involved
    (practically _everything_ gets done by calling procedures, as HLL
    programming with libraries is very often "jigsaw puzzle programming",
    as I like to call it...you're just putting the "jigsaw pieces"
    together to make the picture you want but you're usually not the one
    creating most of the actual "jigsaw pieces" :)...

    This is a simple win for ASM here; We don't have to obey any "ivory
    tower" calling conventions designed with "genericism" to every single
    procedure possible...it is an acceptable thing here to snip this stuff
    out...and, as I've noted before when I was talking about how an OS or
    library should always take the _lowest-level_ interface possible, this
    does NOT stop HLLs from using these procedures...a set of "wrappers"
    can be made which _does_ obey the "calling conventions" and drops the
    parameters from the stack - per the HLL convention - into the right
    registers and calls the actual low-level procedure...and this is what
    I mean when I say "portability is a _higher level_ concern"...the
    procedure should simply do its job with the least "fuss"
    possible..._IF_ some "portability" stuff is required with some HLL
    "calling convention" then it's possible to simply build a higher level
    set of "wrappers" that do this "portability" work, calling into the
    real function...of course, then the "stack" stuff _does_ add these
    "big NOPs" and lesser performance but then you're coding a HLL, you
    want portability, etc....in other words, you're choosing to pay the
    price for that stuff...and you would be paying it using specially
    written HLL versions of the functions, anyway...extra cost for this
    approach? One extra "CALL" instruction (which is an unconditional
    branch that the CPU should be "absorbing" the cost with its caching
    and so forth...both wrapper and procedure are user code - same
    priviledge - so there won't be any "transition" costs as you might get
    with a OS API function ;)...the parameter stuff - what gets moved
    where - actually balances up to what would basically be involved with
    a procedure that has the HLL stuff hard-coded into it...but doing it
    this way provides the _choice_: Don't care for "portability"? Then
    improve your performance with an "INT 80h" style call, to borrow Linux
    as an example of this method...do care? Important to your program?
    Then call into the C "wrapper" instead...both does the same thing -
    exact same code actually does the work - but there's a _choice_ into
    how you want to access it...

    The issue here is, in a sense, a confusion between "value" and
    "variable"...a variable has a value but the two aren't quite the same
    thing...passing things via the stack is thinking in "variable"
    terms...realising that you can avoid it in many, many cases is
    thinking in "value" terms, so to speak...symbolically, the variable is
    more than merely its value so it has to be copied via
    stacks...logically and physically, though, the value _IS_ the
    variable...it's NOT the memory address that the value is stored at for
    recall, it's NOT the register you load it into, it's NONE of these
    things...but it takes a different mindset to see it: the _VALUE_ is
    the "variable"...you're just "storing" the value at a memory address
    for later recall (not enough registers to store everything there so
    pop it into RAM :)...you're just "manipulating" the _value_ when it's
    in a register and opcodes are applied to it...

    It's another one of these places where "abstraction" can temporarily
    blind people...as long as the _values_ are all making sense with
    regard to program logic, then _how_ this gets done is actually not
    particularly important...but when you look at a "variable" in its
    symbolic context, then the _storage_ is made the most important
    thing...symbolically, "VarA" is the contents (MASM "offset" style) or
    address (NASM _actually consistent_ "square brackets" style ;) of the
    _storage_...

    The difference here is to recognise the "arithmetic" under our
    "algebra", so to speak...algebra is similarly a "symbolic" abstraction
    of the arithmetic...but a symbolic view - though it's capable of
    providing perspectives that couldn't even be seen without the
    abstraction (like Pythogoras' theorem, straight line equations, etc.
    ;) - can sometimes mislead from the arithmetic underneath...for
    example, in algebra, you can't reduce "2x + 2y" because _symbolically_
    we can mix our abstraction...arithmetically, though, if "x = 2" and "y
    = 3" then nothing whatsoever stops us adding them together...and if "x
    = 0" then the "2x" bit doesn't actually exist arithmetically (anything
    multiplied by zero is zero so that term simply "vanish" arithmetically
    :), anyway...

    So, am I saying "don't use algebra"? ABSOLUTELY NOT! Its power and
    flexibility and the clarity that the abstraction can bring are
    incredibly useful...their power has proved itself over the centuries,
    I think, that there's _NO DOUBT_ about this power at all...so, does
    this mean "never look at the arithmetic"? Ah, also "No!" but this is
    one of the problems with all types of abstraction - be that HLLs, be
    that device drivers, be that algebra, be that _ANY_ abstraction - that
    you abstract away the underlying "irrelevent details" to aid your
    focus...that power is important and often amazing (arguably, it is
    this sole ability that makes human intelligence what it is...we're all
    geniuses at this basic ability to amazing levels, such as, for
    example, being able to comprehend squiggly shapes on your monitor and
    actually converting that into English - which, for non-native English
    speakers, might be being converted _via_ some other "default" language
    in their minds - and then into abstract thoughts in your brain about
    what I'm saying here :)...

    But, in tribute to the first female Oscar nomination for directing
    happening at the Oscars, as was the name of her film, some things can
    often get "Lost in Translation"...fundamentally, ASM can often improve
    a program in completely non-trivial "leaps and bounds" on speed, size,
    algorithm, etc. and, ultimately, the lone way it manages to do this is
    because it does NOT "translation" so suffers no "loss"...and,
    amazingly, that's all it ultimately is...which is why this point
    _isn't_ as trivial as it may first appear...you know, "value?
    Variable? Symbolic this? Logical that? What on Earth is she going on
    about? There's no difference...or only a subtle inconsequential
    difference, at most"...and, in a sense, that's right..._EXCEPT_ the
    "inconsequential" part..."small" or "simple" does not necessarily mean
    "inconsequential"...

    "When I see the tremendous consequences that can come from small
    things I am tempted to think...that there are NO small things"
    [ Bruce Barton ]

    > Testing 1_000_000 copyrects inline without call _______ :102300 004
    > TimeStamp result_______________________________________ :138317
    036017

    Hey, that demonstrates just how well the whole cache / pipeline /
    prefetch stuff actually works on modern CPUs...the "CALL" is being in
    large part "absorbed" here by the stuff on modern processors that
    pre-fetches code ahead of time ("CALL" is a branch but it's
    unconditional that it can grab it ahead of time, no problems :)...I
    suppose I am thinking in an "out of date" way (hey, I was programming
    286s and 386s and other non-PC stuff before "pipeline" was properly
    implemented!! ;) to suggest that it would make that great a difference
    on modern CPUs...

    The architectural improvements made by CPU manufacturers have,
    unsurprisingly, concentrated on "absorbing" as much penalty from this
    stuff as feasible...which is actually incredibly useful because with
    most things being programmed in HLLs and modern OS architectures
    moving towards _everything_ being done via API procedures, this stuff
    is most certainly needed...look how much we're losing on the parameter
    passing stuff via the stack already (though, as parameters by their
    nature can be any amount, any type, any order, etc. there's not much
    CPU manufacturers can do to generalise optimising this
    stuff...although, if they did, then the improvement would, once again,
    prove to be dramatic :)...so, as much as can be stripped off the
    actual "CALL" is welcome...

    One interesting "educational" thing you could do is change the "CALL
    CopyRect" to a "mov eax, offset CopyRect; CALL [ eax ]" (register not
    important, just an example and the syntax "generic" :)...the idea
    being to force the CPU to not be able to "prefetch" the code because
    it doesn't know where it is until the "MOV" instruction loads in the
    correct address...it's not just the extra instruction and the
    indirection which should prove bothersome to the performance here...we
    should also begin to see how much is "absorbed" by CPU "pipeline"
    architecture...note that this only works to try to cut out the
    "pipeline" on the CALL itself, the rest is still working that
    way...but older machines had no form of "cache" or "pipelining" at all
    on anything and the difference that this "factory production line"
    idea makes is easy to underestimate...for instance, _ALL_ instructions
    would take multiple clocks (simply a case that a clock was wasted
    fetching, another one decoding, another few if it needed some RAM or
    something, etc. and this was before the instruction was even executed
    :) and the concept of "1 clock" or even "for free" instructions was
    totally unheard of...

    There is some truth to "Look at how much better CPUs are today! Look
    at those amazing CPU speeds! There's no need to optimise
    anymore!"...the part where it falls down is the "no need to optimise
    anymore" bit...if you don't still try to optimise, then these
    improvements are about making your unoptimised code run like optimised
    code used to do...which is a great thing...until you consider that if
    you _do_ optimise, then all the CPU improvements are now going
    directly into a _better program_...kind of like saying: "the invention
    of the car rather than walking everywhere has greatly improved things
    that now I can drive my car with _no effort_ (at 3mph, about the same
    speed I would walk :) to my destination!"...umm, yeah, you _can_ do
    what you would have done walking but with less effort, that's true
    enough...but you're kind of missing the point that your car can go on
    a motorway (US: highway, Germany: Autobahn, etc. :) and travel at
    70mph or more and, in the same time it would have taken to walk, you
    can cover immense distances...or, alternatively, go the same distance
    at a faster speed, to get someone in less time and so on...it might
    take a half-day to walk to a fairly distant town but a jet plane can
    get you to the other side of the world(!!) in the same amount of
    time...you're kind of slightly _missing the point_ of why these
    "optimisations" were made if all you think they are about is allowing
    you to write exactly the same program but with less effort, as you can
    "bloat" it to extreme levels - hundreds of MBs of RAM, GBs of hard
    drive space, etc. - and still happily market it as...ooh...an
    "operating system" or something...naming no names, as it's so obvious
    who we're talking about, I'd be highly patronising to you to say ;)...

    > Testing 1_000_000 copyrects inline using epb _______ :138321 004
    > TimeStamp result_______________________________________ :165393
    027072

    The reason for this here should be slightly obvious; Without
    "borrowing" EBP, then ECX gets used to a double purpose and we have to
    shuffle two values back and forth in ECX for every iteration...leave
    ECX as the counter and just "borrow" a "spare" register and then we
    can use one register each for the two purposes...no "shuffling"
    required...so all the time spent on that? Totally vanishes and we get
    that time back to actually processing things rather than "management
    overhead"...

    > So the last version is the fastest, as expected, but the important
    > improvement seem to have been made from removing the stackframe,
    while the
    > call removal bought us very little.

    The "CALL removal" doesn't buy much on modern processors because the
    things actually "absorb" a lot of the cost with caches / pipelining /
    prefetching...this doesn't make as dramatic a difference as it used
    to...but, then again, it is _still_ an "improvement" even if not a
    dramatic one like the others...CPU optimisation is very good these
    days but then if you don't give it anything that requires
    "optimisation" in the first place then that's better, even if not by
    as a great amount as other things...also, if we're looking at this
    relative to the older machines which didn't do this stuff, then even
    this apparently "small" difference is proportionally impressive
    because the whole CPU is already running perhaps a hundred or so times
    faster than those machines...hence, a "clock cycle" is, in real-world
    times, a much smaller unit of measurement...

    Or, in other words, what we've saved here - taking software _and
    hardware_ over time into account - is enough time to do some pretty
    amazing _real-world_ things...because, in the end, this is the
    ultimate measure: the _practical_ one of what can get done usefully by
    the machine in the least amount of resources...

    That is to say, what we've accomplished in optimising this code with
    simple little tricks and optimisations might, indeed, be equivalent to
    the _entire processing power_ of some of those much older
    non-pipelined machines...and, hey, they were _perfectly capable_ of
    doing some mightily impressive stuff even with such limited resources
    (I mean, you can get to the Moon with only 32KB of RAM!! ;)...

    > The use of ebp gave us quite a bit.

    Avoids pointless shuffling of registers when EBP can be used...note
    that to make EBP "free" for this kind of thing, you'll have to say
    bye-bye to those "stack frames" as they use EBP...but saying bye-bye
    to stack frames was also the thing making the most dramatic
    difference, anyway, so it's clear which direction to head here, right?
    ;)

    Beth :)


  • Next message: Daniel Ellard,,,: "Re: asm reference for 390/z-series machines?"
  • Quantcast