Re: How much tuning does regular lisp compilers do?
- From: Bakul Shah <bakul+usenet@xxxxxxxxxxxxx>
- Date: Sat, 30 Aug 2008 16:33:21 -0700
Rob Warnock wrote:
CMUCL-19c always allocates objects on 8-byte boundaries, as you can see....
The cache line size of the Athlon is 64 bytes, so there should be only 8 cases we need to distinguish. Let's match them up against "fast" and
"slow" execution:
This is very interesting: When the offset of the start of the function...
is a multiple of 16, the code runs slow; when the offset is 8 *plus*
a multiple of 16, the code runs fast.
So functions which were fast just after being compiled stay fast...
upon repeated execution, and functions which were slow stay slow.
From the disassembly above, we know that the code of the inner loop of
the DOTIMES is 11 bytes long. There are only two possible alignments
of interest, "16*n" and "(16*n)+8", but why is the "(16*n)+8" alignment
the *fast* one?!?
So for the "fast" cases the inner loop code never crosses a 16-byte...
boundary, and for the "slow" cases it does. But why is a 16-byte
boundary important, and not the 64-byte cache line size? A possible
answer for that I dug out of an old copy of Microprocessor Report
on the AMD K7, which notes that the K7 pipeline fetches 16 bytes of
instructions from the primary I-cache per cycle. *AHA!* All becomes clear!
[Yes, I know I'm running on a K8, but I'm guessing it's the same.]
As further confirmation, if one now forces a GC [CMUCL has a copying GC],
then all those functions will get moved around, and the alignments will
all change, and with them the run-times under TIME. But what does *not*
change is the pairing between the new function alignments and which
functions now run "fast" or "slow":
Here again, function starting offsets of "16*n" are "slow" and offsets
of "(16*n)+8" are "fast", even though the "same" functions have different
offsets/speeds than they did before the relocation (due to the GC).
You can repeat this sequence all day long, and you'll get the same
correlation of speed with alignment, even though for any single function
*its* alignment (and thus speed) may well vary from one GC to the next.
This makes benchmarking Lisp programs with copying garbage collectors
very "interesting", to say the least. ;-} ;-}
Looks like you are saying
a) loops should be aligned on a 16 byte boundary
b) A copying GC should maintain function alignment modulo 16
That shouldn't be too hard to fix, right?!
.
- Follow-Ups:
- Re: How much tuning does regular lisp compilers do?
- From: Rob Warnock
- Re: How much tuning does regular lisp compilers do?
- References:
- Re: How much tuning does regular lisp compilers do?
- From: Rob Warnock
- Re: How much tuning does regular lisp compilers do?
- From: Rob Warnock
- Re: How much tuning does regular lisp compilers do?
- Prev by Date: Re: An Acceptable Lisp
- Next by Date: Re: How much tuning does regular lisp compilers do?
- Previous by thread: Re: How much tuning does regular lisp compilers do?
- Next by thread: Re: How much tuning does regular lisp compilers do?
- Index(es):