Re: Dramatic speed effect of code-data proximity
- From: "Mark_Larson" <spamtrap@xxxxxxxxxx>
- Date: 29 Aug 2006 11:58:43 -0700
spamtrap@xxxxxxxxxx wrote:
Can somebody please explain this effect, observed using a Pentium 4 HT?
The following 32-bit code executes at a dramatically different speed
depending on whether or not the address 'var' is in the same 1 Kbyte
memory page as the code accessing it:
mov ecx,10000000H
test:
mov [var],eax
loop test
The measured timing was about 250 ms when the address 'var' was not in
the same 1 Kbyte page as the code (above or below) but about 36000 ms
(i.e. more than 100 times slower!) when the address 'var' was in the
same 1 Kbyte page as the code. I didn't believe this when I first saw
it but it seems to be easily reproducible.
I'm guessing that the data write invalidates the code cache, requiring
it to be reloaded from main memory each time around the loop. That
would explain the dramatic effect, but I'm not too sure why it should
be necessary.
Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.
Probably aliasing in the caches. I'm cutting and pasting (good luck
reading) from the Intel Optimization manual.
Capacity Limits and Aliasing in Caches
There are cases where addresses with a given stride will compete for
some resource in the memory hierarchy.
Typically, caches are implemented to have multiple ways of set
associativity, with each way consisting of multiple sets of cache lines
(or sectors in some cases). Multiple memory references that compete for
the same set of each way in a cache can cause a capacity issue. There
are aliasing conditions that apply to specific microarchitectures. Note
that first-level cache lines are 64 bytes. Thus the least significant 6
bits are not considered in alias comparisons. For the Pentium 4 and
Intel Xeon processors, data is loaded into the second level cache in a
sector of 128 bytes, so the least significant 7 bits are not considered
in alias comparisons.
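As a rough illustration of the "ignored low bits" rule in that paragraph, here is a minimal Python sketch. The function name `alias_key` is made up for this example and does not come from the manual; it simply drops the offset-within-line bits so that two addresses compare equal when they fall in the same 64-byte line or 128-byte sector.

```python
def alias_key(addr: int, line_bytes: int) -> int:
    """Return addr with the offset-within-line bits dropped.

    line_bytes must be a power of two: 64 for a first-level cache line,
    128 for a Pentium 4 second-level sector (sizes taken from the text).
    alias_key is a hypothetical helper, not an Intel-defined function.
    """
    if line_bytes <= 0 or line_bytes & (line_bytes - 1):
        raise ValueError("line_bytes must be a power of two")
    shift = line_bytes.bit_length() - 1  # 64 -> 6 bits, 128 -> 7 bits
    return addr >> shift


if __name__ == "__main__":
    a, b = 0x1000, 0x1040  # 64 bytes apart
    print("same 64-byte line:   ", alias_key(a, 64) == alias_key(b, 64))
    print("same 128-byte sector:", alias_key(a, 128) == alias_key(b, 128))
```

The two example addresses are in different first-level lines but in the same 128-byte sector, which is why the two comparisons can disagree.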
Capacity Limits in Set-Associative Caches
Capacity limits may occur if the number of outstanding memory
references that are mapped to the same set in each way of a given cache
exceeds the number of ways of that cache. The conditions that apply to
the first-level data cache and second level cache are listed below:
· L1 Set Conflicts - multiple references map to the same first-level
cache set. The conflicting condition is a stride determined by the
size of the cache in bytes, divided by the number of ways. These
competing memory references can cause excessive cache misses only
if the number of outstanding memory references exceeds the
number of ways in the working set. On Pentium 4 and Intel Xeon
processors with CPUID signature of family encoding 15, model
encoding of 0, 1 or 2, there will be an excess of first-level cache
misses for more than 4 simultaneous, competing memory references
to addresses with 2KB modulus. On Pentium 4 and Intel Xeon
processors with CPUID signature of family encoding 15, model
encoding 3, excessive first-level cache misses occur with more
than 8 simultaneous, competing references to addresses that are
apart by 2KB modulus. On Pentium M processors, a similar
condition applies to more than 8 simultaneous references to
addresses that are apart by 4KB modulus.
· L2 Set Conflicts - multiple references map to the same second-level
cache set. The conflicting condition is also determined by the size of
the cache divided by the number of ways. On Pentium 4 and Intel Xeon
processors, excessive second-level cache misses occur with more
than 8 simultaneous competing references. The strides that can cause
capacity issues are 32KB, 64KB, or 128 KB, depending on the size
of the second level cache. On Pentium M processors, the stride sizes
that can cause capacity issues are 128 KB or 256 KB, depending on
the size of the second level cache.
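The "cache size divided by number of ways" rule in the two bullets above can be checked with a little arithmetic. The following Python sketch is only an illustration: `conflict_stride` and `same_set` are hypothetical names, and the cache geometries in the table are assumptions chosen because they reproduce the strides quoted in the text. It also ignores the difference between linear and physical addresses.

```python
KB = 1024


def conflict_stride(cache_bytes: int, ways: int) -> int:
    """Stride at which addresses land in the same set: size / ways."""
    if ways <= 0 or cache_bytes % ways:
        raise ValueError("cache size must be a positive multiple of ways")
    return cache_bytes // ways


def same_set(addr_a: int, addr_b: int, cache_bytes: int, ways: int,
             line_bytes: int = 64) -> bool:
    """True if both addresses index the same set of a simple
    set-associative cache (a simplified model, not Intel's exact logic)."""
    stride = conflict_stride(cache_bytes, ways)
    return (addr_a % stride) // line_bytes == (addr_b % stride) // line_bytes


if __name__ == "__main__":
    # Assumed geometries (size, ways); not stated in the quoted text.
    caches = [
        ("P4 model 0-2 L1 data", 8 * KB, 4),
        ("P4 model 3 L1 data", 16 * KB, 8),
        ("Pentium M L1 data", 32 * KB, 8),
        ("P4 L2, 512 KB", 512 * KB, 8),
    ]
    for name, size, ways in caches:
        print(f"{name}: stride {conflict_stride(size, ways) // KB} KB")
```

With these assumed sizes, more references than there are ways, all spaced a multiple of the stride apart, would compete for a single set, which is the situation the manual warns about.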
Aliasing Cases in the Pentium 4 and Intel Xeon Processors
Aliasing conditions that are specific to the Pentium 4 processor and
Intel Xeon processor are:
· 16K for code - there can only be one of these in the trace cache at a
time. If two traces whose starting addresses are 16K apart are in the
same working set, the symptom will be a high trace cache miss rate.
Solve this by offsetting one of the addresses by one or more bytes.
· Data conflict - can only have one instance of the data in the
first-level cache at a time. If a reference (load or store) occurs with
its linear address matching a data conflict condition with another
reference (load or store) which is under way, then the second
reference cannot begin until the first one is kicked out of the cache.
On Pentium 4 and Intel Xeon processors with CPUID signature of
family encoding 15, model encoding of 0, 1 or 2, the data conflict
condition applies to addresses having identical value in bits 15:6
(also referred to as 64K aliasing conflict). If you avoid this kind of
aliasing, you can speed up programs by a factor of three if they load
frequently from preceding stores with aliased addresses and there is
little other instruction-level parallelism available. The gain is
smaller when loads alias with other loads, which cause thrashing in
the first-level cache. On Pentium 4 and Intel Xeon processors with
CPUID signature of family encoding 15, model encoding 3, the
data conflict condition applies to addresses having identical value in
bits 21:6.
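To see what "identical value in bits 15:6" means in practice, here is a small Python sketch. The helper names `bit_field` and `data_conflict` are invented for this example, and treating two references to the very same cache line as "not a conflict" is my own reading of the text, not something the manual states.

```python
def bit_field(addr: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) of addr."""
    width = hi - lo + 1
    return (addr >> lo) & ((1 << width) - 1)


def data_conflict(addr_a: int, addr_b: int, hi_bit: int = 15) -> bool:
    """True if two different 64-byte lines share bits hi_bit:6.

    hi_bit=15 models the "64K aliasing" case (family 15, model 0-2);
    hi_bit=21 models the family 15, model 3 case quoted above.
    Assumption: references to the same line do not count as aliasing.
    """
    different_line = (addr_a >> 6) != (addr_b >> 6)
    return different_line and (
        bit_field(addr_a, hi_bit, 6) == bit_field(addr_b, hi_bit, 6)
    )


if __name__ == "__main__":
    a, b = 0x10000, 0x20000  # exactly 64 KB apart
    print("conflict with bits 15:6:", data_conflict(a, b, 15))
    print("conflict with bits 21:6:", data_conflict(a, b, 21))
```

Two buffers that start a multiple of 64 KB apart match on bits 15:6, so under this model they alias on the older cores, while the wider 21:6 comparison only matches when they are a multiple of 4 MB apart.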