Re: New benchmarks (was Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question))



On Sat, 03 Nov 2007 00:59:54 -0700, mike3 wrote:

On Oct 31, 3:40 am, moi <r...@xxxxxxxxxxxxxxxxxxx> wrote:
On Wed, 31 Oct 2007 01:27:15 -0700, Robert Maas, see

http://tinyurl.com/uh3twrote:

[good stuff snipped]

Also I mentionned the sort-merge algorithm as a general took that
frequently works for rearranging lots of data. It would be reasonably
fast for your task. It's a good thing to learn how to do efficiently
in any case. But somebody else suggested a completely different
algorithm which is speciic to this task performed on square matrices,
and that sounded like it might be faster than my more general method.
So IMO you

Yes, that someone was me. I ran some tests, and my mmap() +
tile-flipping approach is about 10 times faster then Mike's
disk-exerciser. Basically because it does 10 times fewer disk-I/O.

Given your large amount of RAM and your smaller but still substantial
fast cache, it sounds like you can do the following: Read input from
one disk drive, sort en route using RAM and fast cache, write to
second disk drive, so each disk drive is running sequentially, no
thrashing there, and your in-RAM/cache algorithm was hopefully not
thrashing the fast cache either.

It is hard (and silly, IMHO) to tune cache effects for a program that
spends about .12 sec user + .24 sec sys CPU in some 3..30 sec walltime.
The CPU is idling anyway.

Note that the system's LRU caching effectively results in double
buffering. While Mike is juggling his tiny buffers, the system has
probably all of the file cached into system buffers. Constantly
hammering the LRU with writes will probably cause some 'write through'
to disk by the system.

AvK

Alright, I gave yours a try, to compare it to my so-called "disk
exerciser", using the code you provided. On my system, I got the
following timings, with a system having 768 MB of physical memory:

1024x1024 (4 MB):
real 0m0.035s
user 0m0.012s
sys 0m0.020s

2048x2048 (16 MB):
real 0m0.126s
user 0m0.064s
sys 0m0.060s

4096x4096 (64 MB):
real 0m0.552s
user 0m0.244s
sys 0m0.248s

8192x8192 (256 MB):
real 0m2.334s
user 0m0.964s
sys 0m0.936s

16384x16384 (1024 MB -- now too big to fit in physical memory): real
4m16.226s
user 0m3.284s
sys 0m4.664s

Now, the question is, why the huge hike on the last one (the ratio of
CPU to wall time is a whopping factor of 50!)? The machine has 768 MB of
RAM, and 400 or more is often free, so it should be able to get a

I hope it is not "free" but dedicated to the disk-buffer cache.

significant piece of the data it's working on into memory, shouldn't it?
No. that is not how it works. Pages are read into memory only when needed
('faulted in'). Since every page is needed exactly once, you need exactly
N*N reads and writes to disk.

But for some reason, the hard drive is being thrashed horribly angrily!
Why? It's like once it can't get it ALL into RAM, any performance gains
just go out the window and it turns into a disk hog and the drive

No you got confused.
By first writing the file from within the same program,
you actually primed the systems buffercache. If everything fits into
buffercache, it will still be there once you start transposing.
If the matrix is bigger than available bufferspace, the first (oldest)
part will already have been pushed out once you start accessing it.
When you access the first block (or earlier, or later) , the system needs
to free a buffer for it, kicking out an older block. Once you need that,
it will also have vanished. Perfectly normal.


thrashes like a fish that has just been plucked from the sea -- gives
that disk a workout like running a marathon! What gives? How does it do
on your system? Maybe my computer has a problem?

No, that's the sound of a 1GB file being accessed.

This is the testing routine that I hooked up to your snippet (I changed
float to modint since that's how my pi program works):


They are the same size, so that should work ok.

AvK
.



Relevant Pages

  • New benchmarks (was Re: SortMerge avoids thrashing (was: Baileys "two pass" FFT algorithm question))
    ... fast cache, it sounds like you can do the following: ... disk drive, sort en route using RAM and fast cache, write to second disk ... drive, so each disk drive is running sequentially, no thrashing there, ... sys 0m0.020s ...
    (comp.programming)
  • Re: New benchmarks (was Re: SortMerge avoids thrashing (was: Baileys "two pass" FFT algorithm questi
    ... fast cache, it sounds like you can do the following: ... second disk drive, so each disk drive is running sequentially, no ... While Mike is juggling his tiny buffers, ... sys 0m0.020s ...
    (comp.programming)
  • Re: Larrabee delayed: anyone know whats happening?
    ... Do you mean that there are modern systems that *can't* easily get 90% of theoretical disk bandwidth without using massive buffers simply by using multi-buffering in system RAM? ... stage is not written to enable streaming, ...
    (comp.arch)
  • Re: Larrabee delayed: anyone know whats happening?
    ... Do you mean that there are modern systems that *can't* easily get 90% of theoretical disk bandwidth without using massive buffers simply by using multi-buffering in system RAM? ... All that's required is a disk which can transfer the data for the previous request to the host and accept a subsequent request for the next data to be read while it's reading the current request's data from its platters - or issue completion notification for the previous write request and accept the data for the subsequent write request while transferring the current write request's data to its platters, capabilities present in all current SATA disks, in SCSI disks for decades, and in some disks even earlier. ...
    (comp.arch)
  • [PATCH 2/2] ext3: add data=guarded mode
    ... a workqueue where the real work of updating the on disk i_size is done. ... end_io handler for buffers that are marked as guarded. ... When we start tracking guarded buffers on a given inode, ... and it also takes a reference on the buffer head. ...
    (Linux-Kernel)