Re: Cost of calling a standard library function
From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 03/03/04
- Next message: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Previous message: Herbert Kleebauer: "Re: A (mild-mannered) defense of RosAsm"
- In reply to: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Next in thread: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Reply: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Reply: C: "Re: Cost of calling a standard library function"
- Reply: Frank Kotler: "Re: Cost of calling a standard library function"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 3 Mar 2004 14:57:30 -0000
The Half A Wannabee wrote:
> Beth wrote:
> > Are these results directly comparable to the previous results you
had?
>
> Very good question. The truth is, I didnt notice :-) Because when I
studied
> the function you worte, where you had rearranged the float and usage
of the
> registers, I found it so beautifully written, that I automatically
> considered it to perform faster then my own code. Here are the two
again, to
> avoid the confusion of having to look past to the other posts: And
now I see
> the reason (I think) , and what you forgot (maybe) when you made
that
> streamlined procedure. It accesses/reads memory using esi 4
instructions in
> a row, and then edi four instructions in a row. If you look at my
proc,
> you'll see (as I do now (= I did _not_ do it intentionally, it was
just pure
> luck ))- you will see that it writes to ebx, the reads from it, etc,
while
> your code keeps reading the memory through esi register 4 times in a
row.
> But (and now I am just speculating widely) while the memory for both
the
> rectangles are in the cache by now, the processor is probably
> "understanding" that it doesnt need ebx to carry the value, while it
can
> safly move it within the cache, without having to go via ebx. Maybe
this is
> a totally wrong assumtion, but if it is, this should maybe be
implemented?
> So why cant it do it with the way you wrote your instructions ?
Because you
> use the same register to adress another _memory_ location in the
instuction
> imidiatly following, using the same register to spesify diffrent
memory. But
> this is just speculation on my part. It may be dead wrong.
>
> (Later : ) Okey speculation will not help !!! I have now tested my
register
> arrangements, in an inline function, and now they are about the
same. Mine
> is insignificantly faster, but that is no more than variations in
current or
> workload of the OS.
Then, basically, your code was not generating any kind of extra
"stalls" and my change made no practical difference there at all...BUT
try it the same thing on a different earlier CPU, that might not be
the case...the modern CPU improvements have all been about "out of
order" stuff that this is exactly the effect that's designed to
fix...because the CPU is actually automatically attempting to optimise
your code "on-the-fly"...and though you use the same register all the
time, what could be happening here is that it's "register renaming"
automatically to do what I did without having to actually change the
code...the CPU changes things around on-the-fly instead...
So, if we tried this across more than one machine, then my changes
might help with an earlier CPU that ain't so clever with "on-the-fly
optimisation"...in which case, though it makes no difference on your
machine, my code could be preferable if it turns out to be better
(actually performing the optimisation your CPU automatically does
on-the-fly manually :) on the older CPUs...when it comes to this kind
of thing, the results of only one machine don't always give the full
picture...so things that race away on one CPU actually could be awful
on another different CPU...
[ snip ]
> > This is a grand problem with "generalisations" like this...the
stack
> > method is always "safe" for all types...if it's recursive, the
stack
> > is made to measure (even if you didn't use the stack for the
parameter
> > passing itself then it would be a natural choice for PUSHing and
> > POPping your variables so that they don't conflict with each other
> > ;)...if it's NOT recursive, then the method _still_ works, it's
just
> > there's a whole bunch of code in your program that's effectively
one
> > big "NOP", if you will...it's spending all this time copying
values
> > from registers to stack and back again, making you wonder "can't
we
> > just leave it in the register in the first place...or, at most,
use a
> > 'mov' instruction to get it from one register to another one,
expected
> > by the procedure?"...
>
> I have took your advice, and for those procs where I need most
speed, I have
> removed the stack. Like in the case of allocation memory and stuff.
But I
> have more to do with this. I will have to look over all of my code,
and try
> to use mov x, y mov y,x instead of push/pops. But its nice to use
the stack
> sometimes, as it provide extra undeclared space, but maybe a memory
variable
> is just as fast ? Havent tried that. I will now try that for your
ebp-inline
> code and see whats faster.
>
> OMG: Using a memorystack....is faster !!! OMG!!! It wins 6000 TICS
on
> push/pop ebp !!!
On the other hand, it's sometimes impossible to pre-declare space and
the stack or memory allocation works out best for that...and consider
writing recursive functions that could go to any arbitrary depth
calling themselves over and over _without_ using a stack...a total
nightmare and if you can't really predict how deep it'll recurse ahead
of time then this "memorystack" thing won't be at all feasible to
use...
This is kind of the underlying point of optimisation...it's NOT
clear-cut...you can improve things just by changing the order...or
even improve performance by _adding_ instructions (turning a CISC
style sequence into a RISC style sequence of more simpler
instructions...the modern CPUs are, in fact, totally RISC "under the
hood" these days and only present the "CISC" stuff for "backwards
compatibility"...hence, they often don't do as well as you might
think...plus, the CPU manufacturers kind of think of them as
"obselete" that these instructions rarely get any special optimisation
treatment like the other instructions :), not by taking them away...
Optimisation is about getting the "right fit" with the machine and,
often, the only hard and fast rule is that there's no hard and fast
rules...it's quite possible that something to avoid in one situation -
using the stack - is something you _should_ be using elsewhere...
This, of course, is a really unsatisfactory answer...it means you
can't just "follow rules" but have to try things out and see what
happens and think through what will be the best thing for your
particular code case by case...unfortunately, it's also how it
is...don't shoot the messenger for the message, as they say ;)...
[ snip ]
> here is the code for you inline using ebp away in a memoryvariable:
>
> [MemoryStack: &NULL]
> call TPerformanceCounter_TimeStampRemark TestRemark7
> push edi
> mov ecx 1_000_000
> @TestLoop5:
> mov D$MemoryStack ebp
> push edi esi
First, why not also see what happens if you also put EDI and ESI into
your "memorystack"?
Second, this stuff is technically "invariant code" and can be thrown
outside the loop so that it's not continually done all the
time...although, whether that's appropriate depends on what you're
testing...it might be "invariant code" here but wouldn't be
"invariant" in a different context...
> mov esi ARect
> mov edi BRect
In fact, this is also "invariant" for the purposes of your
test...although, clearly, in a real-life situation, we'd be passing
different RECT structures to the code...
Which brings in the question of whether this "test" is actually a
reasonable, representative test...that is, because you're using the
same RECTs all the time, then they are sure to be cached...but in a
real-life situation where there's all different RECTs being used
everywhere then how the code stands up when the stuff isn't cached is
another question this "test" isn't really answering...
> mov eax D$esi + TRect_Left
> mov ebx D$esi + TRect_Top
> mov ebp D$esi + TRect_Right
> mov edx D$esi + TRect_Bottom
>
> mov D$edi + TRect_Left eax
> mov D$edi + TRect_Top ebx
> mov D$edi + TRect_Right ebp
> mov D$edi + TRect_Bottom edx
>
> pop esi edi
> mov ebp D$MemoryStack
> sub ecx 1
Why "sub ecx 1" rather than "dec ecx"? It would be shorter (just a
single byte opcode :) and doesn't need the access for the immediate of
"1" stored inside the instruction (stored as byte or dword? We'd know
if we used Herbert's syntax or AT&T...not that it should matter too
much but for size...although, remember, size _can_ effect speed due to
how the cache works with "lines" and things like that...it's NOT
clear-cut that smaller always means faster because of the weird way
CPUs work but you shouldn't totally ignore this consideration and
should check it out...for "tight inner loop" stuff, going further than
you can see with the ASM source itself into actually considering how
it'll fall into "cache lines" and so forth is one place where you can
push out better optimisation...and, yeah, it could literally amount to
delibrately _aligning_ the top of your loop - padding with some "NOPs"
which only get run once before the loop that they amount to nothing
significant - and then the code sits inside the cache better and the
CPU prefers to grab the aligned stuff and jumping to an aligned
label...that's why my code example I posted for HLA before stuck
"align(4);" before every procedure, just to be sure that alignments
were always correct...having "NOPs" between the procedures where the
CPU will never actually go is a few bytes "bloat" that could help with
the performance...RosAsm does automatic alignment on data but, well,
just check its what you want and expect...the issue here is that if
you make "unaligned" accesses on an Intel chip, what actually happens
is that it makes two accesses and then stitches them together "under
the hood"...at the source level, you don't really see a problem and
code will work unaligned but it can often work much better when
aligned...do some byte counting, NOP inserting and get that "start of
loop" thing to be on a multiple of 4...or, get more pedantic and think
about the "cache lines" and try to see if you can get the whole thing
into one or two "lines", starting at the start of a "line"...which are
32 bytes big, if I recall correctly :)...
Actually, let's not push the point but is there a _code_ alignment
directive in RosAsm to help with this or are we forced to always count
the bytes and insert the NOPs by hand every time?
Ah-ha! Rene didn't think of it originally but, in fairness, he did
listen when others mentioned the importance so this is now available
in RosAsm for this kind of thing...
>From "B_U_ASM":
-------------- 8< ---------------
I first didn't think of implementing any Align feature for code as
this seemed to me an 'out of purpose' idea. But some other programmers
told me that it was an absolute must have in an assembler. So, here is
it:
align 16 ; (decimal only parameter)
>>> inserts the needed number of NOPs to reach the desired boundary.
If NOPs number If the NOPs number is greater than 4, the two first
NOPs are replaced by a short JMP.
Intel documentation gives only two cases for using Alignment:
A loop entry label should be 16-byte aligned when it is less than 8
bytes away from that boundary.
A label that follows an unconditional branch or function call should
be 16-bytes aligned when it is less than 8 bytes away from that
boundary.
-------------- >8 ---------------
As we can see at the bottom, Intel themselves point out that their
chips work better when labels are aligned to certain rules...this
makes it fit in with how the CPUs fetch instructions and work with the
caches, one presumes...
> jnc @TestLoop5
Actually, why "sub" followed by "jnc" for the loop?
I'd naturally think "dec" followed by "jnz" instead..."dec ecx" is
just a one byte opcode and will set the zero flag when it hits zero
that "jz" and "jnz" can go directly after them...not that this will
probably make any difference except that "dec" is shorter that it
might just happen to make a small improvement, perhaps...
Of course, this "loop" is kind of false...for "test purposes" rather
than how the code would actually appear in real code...because,
otherwise, there's also the question of whether you could "unroll" the
loop a bit...that is, copy the code out twice and then only loop half
as many times...or "unroll" it to copy the code out four times and
then only loop a quarter of the times...means you can only iterate the
loop on multiples of how many times you copy the code out...but the
idea here being that it can run two or four or more times before
hitting the loop code (which has a conditional branch in it and that
stuff messes with caches and "branch prediction" so do it as least
often as possible...even, if possible, though not possible on loops,
you can generally "do a Terje" and convert something that has a
conditional branch in it into code that doesn't branch at all but uses
clever arithmetic tricks to be functionally equivalent to something
with branches in it...no, I have no idea _how_ Terje always manages
these things...I wish I did because then I could be the "optimisation
guru" but, instead, I defer to his expertise on the matter...I'm
satisfied that I actually ended up having the same basic idea as him
on a "spinning bitmap" problem...thinking like Terje does about
optimisation? What an honour!! But I'm quitting while I'm equal on
that score ;)..
> pop edi
>
> call TPerformanceCounter_TimeStampRemark TestRemark2
>
>
> Now this was REALLY interesting. Thanks Beth !!! I would have never
thought
> of that if it wore not for you, and this discussion. Its so great to
have
> this conversation. I take back all the bad things I said about you
(in case
> I did). This is really really fruitful. Now lets replace all the
push/pops
> with memorystacks and time again
Don't worry; Everyone jumps to conclusions they shouldn't and
sometimes gets the wrong idea about things (Rene's being held to
account for what he says, as he does talk complete crap from time to
time...mind you, don't we all? Just Rene's crap could be dangerous in
misleading people so it has to be challenged and so forth :)...you
were already "forgiven" before you apologised, anyway...but always
nice to hear nevertheless :)
Have a look here at Paul Hsieh's optimisation stuff:
http://www.azillionmonkeys.com/qed/asm.html
And also read the "article" bit about "HLLs vs. ASM" because it's
interesting...as well as noting that Paul refers you to Randy's "Great
Debate" text at the end (broken link, as Randy moved things :)...if
you listen to Rene, then you could get the impression that Randy isn't
respected and often referred to by many ASM programmers as a
"reference" to quote and look up...
Beth :)
- Next message: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Previous message: Herbert Kleebauer: "Re: A (mild-mannered) defense of RosAsm"
- In reply to: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Next in thread: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Reply: The Half A Wannabee: "Re: Cost of calling a standard library function"
- Reply: C: "Re: Cost of calling a standard library function"
- Reply: Frank Kotler: "Re: Cost of calling a standard library function"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]