Re: Program performance/optimisation
- From: "Rob Thorpe" <robert.thorpe@xxxxxxxxxxxx>
- Date: 14 Sep 2006 08:09:34 -0700
grid wrote:
Jack Klein wrote:
On Tue, 12 Sep 2006 21:19:34 +0530, grid <prohit99@xxxxxxxxx> wrote in
comp.lang.c:
Hi,
I was exploring the affect of cache on program
performance/optimisation.Is it the compilers responsibility only to
consider this kind of optimisation or the programmer can do his bit in
this case ?
Generally the algorithm chosen has a strong effect on the cache
performance of a program. For example a program that accesses a 1MB
block of memory randomly will have much worse cache performance to one
that accesses that block in sequence. Similarly a program that uses
large amounts of data at the same time has worse cache performance than
one that uses less. Sometimes of-course processing lots of stuff at
once is necessary.
The compiler may have some effect, especially on the instruction cache,
but it's generally a weaker effect than the programmer.
Any special effort a particular compiler makes to use cache, or any
other hardware feature of the platform, is completely a QOI (Quality
Of Implementation) issue, not a language one. The C language and its
standard define the operation of a correctly written program. They
make no mention of, nor do they place any requirements, on the speed
of efficiency, or any program.
What Jack says here doesn't contradict what I have said, since he's
talking about the language itself. Which says nothing about optimality
Reading through the "Expert C Programming" text,it mentions how the
below program can be efficient taking the cache details into accont.
The below program can be executed using the two versions of copy
alternatively and running the time command on the executable on Unix,to
see the difference.As obvious,the slowdown happens in DUMBCOPY.
#include<stdio.h>
#include<string.h>
#define DUMBCOPY for (i = 0; i < 65536; i++) \
destination[i] = source[i]
#define SMARTCOPY memcpy(destination, source, 65536)
int main()
{
char source[65536], destination[65536];
int i, j;
for (j = 0; j < 100; j++)
SMARTCOPY;
/* DUMBCOPY; */
return 0;
}
Below are the reasonings :
The slowdown happens because the source and destination are an exact
multiple of the cache size apart.The particular algorithm used happens
to fill the same line for main memory addresses that are exact multiples
of the cache size apart.
In this particular case both the source and destination use the same
cache line, causing every memory reference to miss the cache and stall
the processor while it waited for regular memory to deliver. The library
memcpy() routine is especially tuned for high performance.
It unrolls the loop to read for one cache line and then write, which
avoids the problem.Using the smart copy, we were able to get a huge
performance improvement. This also shows the folly of drawing
conclusions from simple-minded benchmark programs.
This is possibly correct. There are other possible explanations.
How fast this code is or isn't is very platform dependent, compiler
dependent & library dependent.
I dont fully understand the above 2 paragraphs,so if someone could give
a better explanation.Would also appreciate any helpful pointers.
The example relies on all kinds of assumptions about how
microprocessors work.
The size of the two arrays 65535 elements is an attempt to ensure that
they must use the same cache line. To understand how this works you'll
need to read about microprocessor design. A book like "Computer
architecuture: A quantative approach" by Hennesey & Patterson would be
one way to do that, I'm sure there's lots of information about it
elsewhere though.
The explanation isn't necessarily correct. It could be much simpler.
Generally memcpy is faster than a loop because it copies a larger
amount at a time. That is, if you ask to copy 65535 character it will
copy 65532 of them by treating blocks of 4 bytes as 32-bit integers.
It will then copy the remaining 3 bytes with normal character copies.
The advantage of this is that there is less loop overhead when copying
32-bit integers. When copying integers or characters the processor
must increment the value i for every copy which takes time. But when
copying integers 4 characters are copied for each unit of overhead
instead of one. This is a form of loop unrolling. 64-bit processor may
go further by using 64-bit (8char) copies.
The effect the unrolling has on the cache is incidental, but may be
beneficial.
All of this is entirely platform dependent, but the point about copying
is true for many common platforms.
This might not be something directly related to C,but I thought I would
get better answers in this newsgroup and hence the posting.
Actually, your question is completely off-topic here. As far as C is
concerned, there is no such thing as a cache, cache line, or processor
stall. This is all quite hardware and architecture dependent.
You need to ask questions about this in some sort of platform specific
newsgroup. The moderated group news:comp.lang.asm.x86 is a good place
to discuss the behavior of such things as cache on x86 processors.
You'll have to look to find an appropriate group for other processor
architectures.
Since asking it on a platform specific group would give me all the
platform specific details.I was looking for some generic knowledge which
I can further investigate for the particular platform.
Can I get some good answers outta here.
comp.programming will talk about anything, but this isn't it's
specialist subject. comp.arch or comp.lang.asm.x86 may be better.
.
- References:
- Re: Program performance/optimisation
- From: grid
- Re: Program performance/optimisation
- Prev by Date: Re: Style: using functions/methods/procedures
- Next by Date: Re: A survey for the school project
- Previous by thread: Re: Program performance/optimisation
- Next by thread: Re: Program performance/optimisation
- Index(es):
Relevant Pages
|