Re: Dramatic speed effect of code-data proximity



spamtrap@xxxxxxxxxx wrote:
Can somebody please explain this effect, observed using a Pentium 4 HT?
The following 32-bit code executes at a dramatically different speed
depending on whether or not the address 'var' is in the same 1 Kbyte
memory page as the code accessing it:

mov ecx,10000000H
test:
mov [var],eax
loop test

The measured timing was about 250 ms when the address 'var' was not in
the same 1 Kbyte page as the code (above or below) but about 36000 ms
(i.e. more than 100 times slower!) when the address 'var' was in the
same 1 Kbyte page as the code. I didn't believe this when I first saw
it but it seems to be easily reproducible.

I'm guessing that the data write invalidates the code cache, requiring
it to be reloaded from main memory each time around the loop. That
would explain the dramatic effect, but I'm not too sure why it should
be necessary.

Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.

Probably aliasing in the caches. I'm cutting and pasting (good luck
reading) from the Intel Optimization manual.

Capacity Limits and Aliasing in Caches
There are cases where addresses with a given stride will compete for
some resource in the memory hierarchy.
Typically, caches are implemented to have multiple ways of set
associativity, with each way consisting of multiple sets of cache lines
(or sectors in some cases). Multiple memory references that compete for
the same set of each way in a cache can cause a capacity issue. There
are aliasing conditions that apply to specific microarchitectures. Note
that first-level cache lines are 64 bytes. Thus the least significant 6
bits are not considered in alias comparisons. For the Pentium 4 and
Intel Xeon processors, data is loaded into the second level cache in a
sector of 128 bytes, so the least significant 7 bits are not considered
in alias comparisons.
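
To make the "ignored low bits" point concrete, here is a minimal sketch
(mine, not from the manual) that drops the low 6 or 7 address bits before
comparing two addresses. The names alias_tag and same_block are
hypothetical helpers made up for illustration.

```python
LINE_BITS = 6    # 64-byte first-level cache line   -> low 6 bits ignored
SECTOR_BITS = 7  # 128-byte second-level sector     -> low 7 bits ignored


def alias_tag(addr: int, ignored_bits: int) -> int:
    """Drop the low bits that take no part in an alias comparison."""
    return addr >> ignored_bits


def same_block(a: int, b: int, ignored_bits: int) -> bool:
    """True if a and b fall in the same line (6 bits) or sector (7 bits)."""
    return alias_tag(a, ignored_bits) == alias_tag(b, ignored_bits)


# 0x1000 and 0x1040 are 64 bytes apart: different 64-byte lines,
# but the same 128-byte sector.
print(same_block(0x1000, 0x1040, LINE_BITS))
print(same_block(0x1000, 0x1040, SECTOR_BITS))
```

So two addresses can count as distinct in a first-level comparison and
still be treated as the same block at the second level.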

Capacity Limits in Set-Associative Caches
Capacity limits may occur if the number of outstanding memory
references that are mapped to the same set in each way of a given cache
exceeds the number of ways of that cache. The conditions that apply to
the first-level data cache and second level cache are listed below:
· L1 Set Conflicts - multiple references map to the same first-level
cache set. The conflicting condition is a stride determined by the
size of the cache in bytes, divided by the number of ways. These
competing memory references can cause excessive cache misses only
if the number of outstanding memory references exceeds the
number of ways in the working set. On Pentium 4 and Intel Xeon
processors with CPUID signature of family encoding 15, model
encoding of 0, 1 or 2, there will be an excess of first-level cache
misses for more than 4 simultaneous, competing memory references
to addresses with 2KB modulus. On Pentium 4 and Intel Xeon
processors with CPUID signature of family encoding 15, model
encoding 3, excessive first-level cache misses occur with more
than 8 simultaneous, competing references to addresses that are
apart by 2KB modulus. On Pentium M processors, a similar
condition applies to more than 8 simultaneous references to
addresses that are apart by 4KB modulus.
· L2 Set Conflicts - multiple references map to the same second-level
cache set. The conflicting condition is also determined by the size of
the cache divided by the number of ways. On Pentium 4 and Intel Xeon
processors, excessive second-level cache misses occur with more
than 8 simultaneous competing references. The strides that can cause
capacity issues are 32KB, 64KB, or 128KB, depending on the size
of the second level cache. On Pentium M processors, the stride sizes
that can cause capacity issues are 128KB or 256KB, depending on
the size of the second level cache.
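
The "size of the cache divided by the number of ways" rule can be worked
through as a rough sketch. The cache geometries in the table below are my
assumptions (they are not stated in the excerpt); they are the commonly
quoted sizes for these parts and were picked because they reproduce the
2KB, 4KB, 32KB, 64KB, 128KB and 256KB strides quoted above.
conflict_stride and same_set are hypothetical helper names.

```python
KB = 1024


def conflict_stride(cache_bytes: int, ways: int) -> int:
    """Stride at which addresses land in the same set: cache size / ways."""
    return cache_bytes // ways


def same_set(a: int, b: int, stride: int, line_bytes: int = 64) -> bool:
    """True if a and b select the same set for the given conflict stride."""
    return (a % stride) // line_bytes == (b % stride) // line_bytes


# (label, cache size in bytes, ways) -- ASSUMED geometries, not from the text
ASSUMED_CACHES = [
    ("P4 model 0-2 L1D", 8 * KB, 4),
    ("P4 model 3 L1D", 16 * KB, 8),
    ("Pentium M L1D", 32 * KB, 8),
    ("P4 L2, 256KB", 256 * KB, 8),
    ("P4 L2, 512KB", 512 * KB, 8),
    ("P4 L2, 1MB", 1024 * KB, 8),
    ("Pentium M L2, 2MB", 2048 * KB, 8),
]

for label, size, ways in ASSUMED_CACHES:
    print(f"{label}: conflict stride {conflict_stride(size, ways) // KB} KB")
```

With a 2KB stride, same_set(0x0000, 0x0800, 2 * KB) is true, so those two
addresses compete for the same set; conflicts only start to hurt once
more such references are outstanding than the cache has ways.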

Aliasing Cases in the Pentium 4 and Intel Xeon Processors
Aliasing conditions that are specific to the Pentium 4 processor and
Intel Xeon processor are:
· 16K for code - there can only be one of these in the trace cache at a
time. If two traces whose starting addresses are 16K apart are in the
same working set, the symptom will be a high trace cache miss rate.
Solve this by offsetting one of the addresses by one or more bytes.
· Data conflict - can only have one instance of the data in the
first-level cache at a time. If a reference (load or store) occurs with
its linear address matching a data conflict condition with another
reference (load or store) which is under way, then the second
reference cannot begin until the first one is kicked out of the cache.
On Pentium 4 and Intel Xeon processors with CPUID signature of
family encoding 15, model encoding of 0, 1 or 2, the data conflict
condition applies to addresses having identical value in bits 15:6
(also referred to as 64K aliasing conflict). If you avoid this kind of
aliasing, you can speed up programs by a factor of three if they load
frequently from preceding stores with aliased addresses and there is
little other instruction-level parallelism available. The gain is
smaller when loads alias with other loads, which cause thrashing in
the first-level cache. On Pentium 4 and Intel Xeon processors with
CPUID signature of family encoding 15, model encoding 3, the
data conflict condition applies to addresses having identical value in
bits 21:6.
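
The "identical value in bits 15:6" (or 21:6) test is just a bit-field
comparison. A minimal sketch, assuming nothing beyond the bit ranges
quoted above; bit_field and data_conflict are hypothetical names, and
this models only the address comparison, not what the hardware does next.

```python
def bit_field(addr: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) of addr."""
    width = hi - lo + 1
    return (addr >> lo) & ((1 << width) - 1)


def data_conflict(a: int, b: int, hi: int, lo: int = 6) -> bool:
    """True if a and b have identical values in bits hi:lo."""
    return bit_field(a, hi, lo) == bit_field(b, hi, lo)


# Model 0, 1 or 2 compares bits 15:6, so addresses 64K apart alias.
print(data_conflict(0x10000, 0x20000, hi=15))
# Model 3 compares bits 21:6, so the same pair no longer aliases...
print(data_conflict(0x10000, 0x20000, hi=21))
# ...but addresses 4M apart still do.
print(data_conflict(0x400000, 0x800000, hi=21))
```

Widening the compared field from 15:6 to 21:6 pushes the aliasing
distance from 64K out to 4M, which is why the newer model is much less
likely to hit it by accident.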
