New benchmarks (was Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question))



On Oct 31, 3:40 am, moi <r...@xxxxxxxxxxxxxxxxxxx> wrote:
On Wed, 31 Oct 2007 01:27:15 -0700, Robert Maas, see

http://tinyurl.com/uh3twrote:

[good stuff snipped]

Also I mentionned the sort-merge algorithm as a general took that
frequently works for rearranging lots of data. It would be reasonably
fast for your task. It's a good thing to learn how to do efficiently in
any case. But somebody else suggested a completely different algorithm
which is speciic to this task performed on square matrices, and that
sounded like it might be faster than my more general method. So IMO you

Yes, that someone was me. I ran some tests, and my mmap() + tile-flipping
approach is about 10 times faster then Mike's disk-exerciser. Basically
because it does 10 times fewer disk-I/O.

Given your large amount of RAM and your smaller but still substantial
fast cache, it sounds like you can do the following: Read input from one
disk drive, sort en route using RAM and fast cache, write to second disk
drive, so each disk drive is running sequentially, no thrashing there,
and your in-RAM/cache algorithm was hopefully not thrashing the fast
cache either.

It is hard (and silly, IMHO) to tune cache effects for a program that
spends about .12 sec user + .24 sec sys CPU in some 3..30 sec walltime.
The CPU is idling anyway.

Note that the system's LRU caching effectively results in double
buffering. While Mike is juggling his tiny buffers, the system has
probably all of the file cached into system buffers. Constantly hammering
the LRU with writes will probably cause some 'write through' to disk by
the system.

AvK

Alright, I gave yours a try, to compare it to my so-called "disk
exerciser", using the code you provided. On my system, I got
the following timings, with a system having 768 MB of physical
memory:

1024x1024 (4 MB):
real 0m0.035s
user 0m0.012s
sys 0m0.020s

2048x2048 (16 MB):
real 0m0.126s
user 0m0.064s
sys 0m0.060s

4096x4096 (64 MB):
real 0m0.552s
user 0m0.244s
sys 0m0.248s

8192x8192 (256 MB):
real 0m2.334s
user 0m0.964s
sys 0m0.936s

16384x16384 (1024 MB -- now too big to fit in physical memory):
real 4m16.226s
user 0m3.284s
sys 0m4.664s

Now, the question is, why the huge hike on the last one (the
ratio of CPU to wall time is a whopping factor of 50!)? The
machine has 768 MB of RAM, and 400 or more is often
free, so it should be able to get a significant piece of the
data it's working on into memory, shouldn't it? But for some
reason, the hard drive is being thrashed horribly angrily! Why?
It's like once it can't get it ALL into RAM, any performance
gains just go out the window and it turns into a disk hog and
the drive thrashes like a fish that has just been plucked from the
sea -- gives that disk a workout like running a marathon! What
gives? How does it do on your system? Maybe my computer
has a problem?

This is the testing routine that I hooked up to your snippet (I
changed float to modint since that's how my pi program works):

int main(int argc, char **argv)
{
FILE *fd;
struct square *mtx;
int i;
struct square buf;
clock_t startcpu, endcpu;
time_t startwall, endwall;
int rowsize, rowblocks;

/* Get commandline */
if(argc != 2)
{
printf("usage: fastxpose <row size>\n");
return(1);
} else {
rowsize = atoi(argv[1]);
rowblocks = rowsize/32;
}

/* Open matrix file */
fd = fopen("matrix.tmp", "wb+");

printf("Initializing file..."); fflush(stdout);

/* Fill it up with dummy data */
for(i=0;i<1024;i++) buf.data[i] = 0;
for(i=0;i<rowblocks*rowblocks;i++)
fwrite(&buf, sizeof(struct square), 1, fd);
fflush(fd);

printf("done.\n"); fflush(stdout);

/* Mmap the file */
mtx = mmap(0, rowsize*rowsize*sizeof(modint), PROT_READ |
PROT_WRITE,
MAP_SHARED, fileno(fd), 0);

/* Do the transposition */
startcpu = clock(); /* CPU time */
time(&startwall); /* wall time */

printf("Transposing file (%d x %d, %d MB of data)...\n", rowsize,
rowsize, rowsize*rowsize*sizeof(modint)/1048576);
fflush(stdout);

transpose(mtx, rowblocks);

endcpu = clock();
time(&endwall);

printf("Transposition complete.\n");
printf("CPU time required: %ld sec.\n", (endcpu-startcpu)/
CLOCKS_PER_SEC);
printf("Wall time required: %ld sec.\n", (endwall-startwall));

/* Clean up */
printf("Cleaning up temporary file...");
munmap(mtx, rowsize*rowsize*sizeof(modint));
fclose(fd);
remove("matrix.tmp");
printf("done!\n");

return(0);
}

.



Relevant Pages