Re: How much tuning does regular lisp compilers do?



rpw3@xxxxxxxx (Rob Warnock) writes:

verec <verec@xxxxxxx> wrote:
+---------------
| rpw3@xxxxxxxx (Rob Warnock) wrote:
| [a very detailed explanation of his Lisp on Atlon cache aligment
| experiments]
|
| That's certainly more than what I expected, but then begs the
| question of how realistic such improved "cached aligned" loops
| might ne in the real world:
|
| Given the byzantine number of bytes per opcode requirement of
| the x86 ISA, it is very likely that most loops will need more
| than 16 bytes of code between the branch-back and the top of
| the loop, thus incurring a systematic "16 bytes fetch cycle"
| miss each and every time through the loop.
+---------------

In my conversations with people who *are* experienced compiler
writers [I'm not, just an amateur there], I've been told that
for algorithmic kernels keeping loops 16-byte-aligned is an
easy way to get a small but measurable improvement. I would
expect that on modern x86 (and x86_64) machines that the penalty
for misalignment would be only one or two cycles. Whether that
is important enough to you (generic "you") to affect your buying
decision about which compiler to use is something you'd have
to answer for yourself. But compiler writers *do* worry about
such things.

It's not only about time to load the cache line, but how many you'll
be loading; if a loop runs within a cache line but starts in the
middle of one, then two cache lines are consumed, and that leaves one
less physical line in the cache for swapping out when the LRU takes
over (presumably these two lines are recently used, and so will not be
flushed themselves). So a misaligned loop makes the whole machine
slower, in a sense, by giving it a smaller available cache.

But to answer the unstated related question: yes, there have
been cases historically where the compiler had to pay attention
to page crossings as well as cache line crossings. [E.g., the
infamous early MIPS R4000 bug that showed up if a jump was the
last instruction on a page, "fixed" with a change to the linker!!
The SGI compiler included a "-no_jump_at_eop" option in "ld"
for a while.]

We refused to buckle to this kind of hardware bug; it took us a year
to get SGI to even admit that they had a bug (they had cleaimed to
have fixed the bug with software, but that fix counted on the code
being at a fixed location on the page). And because we had some large
customers behind us, we finally got them to satisfy the bug issue by
noting that the number of people with that particular lot of R4000s
who also were using lisp were in the single-digits, because most of
them had already moved on, so we got them to agree to replace the
chips of anyone who requested such a replacement in a certain way.

--
Duane Rettig duane@xxxxxxxxx Franz Inc. http://www.franz.com/
555 12th St., Suite 1450 http://www.555citycenter.com/
Oakland, Ca. 94607 Phone: (510) 452-2000; Fax: (510) 452-0182
.



Relevant Pages

  • Re: Processor question
    ... I'm sure the compiler is awfully naieve though, ... Count exactly how many instructions it has to execute to get around the loop once. ... Old x86 code has to compute 32 bit operations as two 16 bit native code ops so it will be slower. ... Cache structure really matters when you are handling bulk data that is large compared to the cache sizeof the processor. ...
    (sci.electronics.design)
  • Re: hot path optimizations in uma_zalloc() & uma_zfree()
    ... > I suppose the reason of first gain lies in increasing of cpu cache hits. ... > separate buckets. ... I ran ministat against your tests with 1000 sockets loop and there isn't a lot ...
    (freebsd-hackers)
  • Re: hot path optimizations in uma_zalloc() & uma_zfree()
    ... > I suppose the reason of first gain lies in increasing of cpu cache hits. ... > separate buckets. ... I ran ministat against your tests with 1000 sockets loop and there isn't a lot ...
    (freebsd-hackers)
  • [PATCH][RFC] fast file mapping for loop
    ... are done once they hit page cache. ... loop without making it even slower than it currently is. ... * Add bio to back of pending list and wakeup thread ... * Find extent mapping this lo device block to the file block on the real ...
    (Linux-Kernel)
  • Cache size restrictions obsolete for unrolling?
    ... into the cache (i.e. the code of the loops was slightly smaller than ... execution time using a cycle-true simulator. ... unrolling factor stepwise resulting in the unrolled loop that exceeded ... I expected to get a performance decrease, i.e. the stronger the loop ...
    (comp.compilers)