Re: Stack and performance



c0d1f1ed wrote:
Hi all,

I am totally puzzled by a weird phenomenon in my code. My debug
version is twice as fast as my release version (Visual C++)! What's
even stranger: adding an offset to the stack pointer solves the
problem. Adding a 64-byte offset results in serious performance
degradation again.

Is it possible that a function executes two times slower when the
stack starts at another address? The function I'm using is quite big
and takes 99% of execution time. The stack frame is ~600 bytes and
16-byte aligned. My Pentium M has 64-byte L1 cache lines but it's
8-way associative and 32 kB so I see no reason for cache thrashing
effects.

Any possible explanations?

Thanks,

Nicolas Capens


Two things come to mind, though they might not really matter:

If the data on the stack is moved around so that the important stuff fits comfortably in the fewest cache lines, that could account for some performance boost. It may be that the stack frame of the important function starts out in an (almost) cache-aligned state in the debug version, while in the release version it starts out with an important piece of data being the last thing in an otherwise unused cache line. Other cache issues might also benefit from rearranging the declarations or some of the other code. I'm not sure how conscious VC++ is of cache issues, but even so it could be missing something that rearranging would cure.

Similar issues may arise from virtual memory and paging. Paging and cache issues seem to jump up and bite hard without much warning in some instances. If either of these is the case, both versions of the program should perform similarly when other nontrivial programs are ACTIVE at the same time, because they will cause paging and cache evictions that should make the other effects non-issues.
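To make the cache-line point concrete, here is a rough sketch in C that counts how many 64-byte lines a ~600-byte frame touches depending on where it starts. The function names are made up for illustration, and the numbers only describe layout; whether the layout explains the slowdown still has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines touched by `size` bytes starting at address `start`.
   `line` is the cache line size in bytes (64 for the Pentium M in question). */
size_t lines_spanned(uintptr_t start, size_t size, size_t line)
{
    assert(size > 0 && line > 0);
    uintptr_t first = start / line;              /* index of first line touched */
    uintptr_t last  = (start + size - 1) / line; /* index of last line touched  */
    return (size_t)(last - first + 1);
}

/* Offset of an address within its cache line. Useful for printing where a
   local really lands, e.g. line_offset((uintptr_t)&some_local, 64). */
size_t line_offset(uintptr_t addr, size_t line)
{
    return (size_t)(addr % line);
}
```

With a 16-byte aligned frame of 600 bytes, start offsets of 0, 16 and 32 within a line give 10 lines, while an offset of 48 gives 11. Printing line_offset() for a local in both builds would show whether the two versions really start at different positions within a line.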


The other thing is the frame pointer (ebp). It is almost always used when compiling for debugging, but is sometimes used as an extra general-purpose register when compiling without debug settings. I'm not the most knowledgeable on such things, and I didn't think large offsets from the stack pointer were a big issue on x86, though on some architectures they are. You could force the use of the frame pointer by calling alloca(0), provided the compiler isn't smart enough to optimise it away, or you pass it a 0 that the compiler can't determine is 0 (an external global variable set to 0, or something like that).
Test and see.
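A minimal sketch of the alloca trick, assuming a compiler that provides alloca (or _alloca in <malloc.h> on Visual C++). The names g_zero, STACK_ALLOC and sum_values are hypothetical, and sum_values is only a stand-in for the real hot function. Nothing guarantees that a given compiler keeps a frame pointer because of this, so check the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _MSC_VER
#include <malloc.h>
#define STACK_ALLOC _alloca
#else
#include <alloca.h>
#define STACK_ALLOC alloca
#endif

/* Hypothetical global: always 0, but volatile, so the compiler has to read
   it at run time and cannot fold the allocation size to a constant. */
volatile size_t g_zero = 0;

/* Hypothetical stand-in for the big function that takes 99% of the time. */
int sum_values(const int *data, int n)
{
    assert(n >= 0);

    /* Variable-sized stack allocation. Compilers typically address locals
       through a frame pointer after this, but that is not guaranteed. */
    void *unused = STACK_ALLOC(g_zero);
    (void)unused;

    int total = 0;
    for (int i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```

If the release build speeds up with this in place, the frame pointer (or the stack layout change that comes with it) is a likely suspect; if nothing changes, the cache-line theory is the better one to chase.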


Nathan
