Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- From: rem642b@xxxxxxxxx (Robert Maas, see http://tinyurl.com/uh3t)
- Date: Wed, 31 Oct 2007 13:12:16 -0700
Alert: For weeks, right up through Friday aftermoon, Google Groups
was showing search results only through Oct.13, so I had no way to
find more recent articles that mentionned my name, such as
followups to stuff I previously posted. Just Friday night GG
finally enabled me to find a lot of articles before this:
Date: Fri, 19 Oct 2007 00:45:42 +0200Since Friday I've been trying to catch up with backlog.
From: moi <root@xxxxxxxxxxxxxxxxxxx>
I'm less than 6 days behind at the moment.
The
A B C D
E F G H
matrix can also be transposed by "one-touch-football"
this is the shuffle or permutation as explained on the wiki-page.
A Google search for: one-touch-football transpose matrix
doesn't turn up anything related except this thread itself.
I don't know what you mean by "the wiki-page".
However in general I like the idea of breaking the data into chunks
of size that match the backing store to assure locality of reference.
For fast cache backed by RAM, whatever unit of RAM is in each
pagemap entry is the appropriate unit, since RAM is by definition
Random Access, you can load data from any pagemap unit at any time
without any cost beyond the actual access to those memory
locations. Based on the OP's later info, this seems appropriate.
For RAM backed by disk, the key thing which takes the most time is
seeking. First you need absolute control over disk allocation so
that you can store your data in contiguous disk space so as to minimize
the total number of cylinders.
If your data takes more than one cylinder, then the first thing you
need to do is localize your cylinder access so that you never go
back to a cylinder you already were using before, you load
everything you need from one cylinder before moving to the next.
If your dataset fits on one cylinder, then no problem.
Secondarily, you need to localize sector reads within that one
cylinder you are using at the moment, and in fact you need to
arrange your sectors around each track in such a way that you have
time to get ready to read each sector before that sector comes up,
to avoid missing the sector and having to wait a whole rotation of
the disk for it to come around again. Asynchronous interrupt-driven
I/O and/or interleaved sector addressing may solve the problem if
your CPU isn't fast enough to read data at full disk rotation
speed. For example, you might use 2:1 interleaving of sector
addresses, so that "sequential" access of all the sectors in a
track requires two complete rotations:
-------physical-placement-around-the-track------------------------------------>Sec#0 Sec#1 Sec#2 Sec#3 Sec#4 Sec#5 Sec#6 Sec#7
Sec#8 Sec#9 Sec#A Sec#B Sec#C Sec#D Sec#E Sec#F
If you tried to pack all those sectors sequentially and your CPU
isn't ready for the next sector until a tad bit too late, you'd
miss the next sector each time after the first and have to go all
the way around to get it, resulting in fifteen complete rotations
to read the sixteen sectors. When you format a disk, you generally
have control over the level of interleaving of sector addresses
within the track, and you select which level of interleaving based
on how fast a CPU you have relative to the speed of the disk.
Of course in any case you need to read each sector of data just
once and never read the same sector again.
You seem too much obsessed with sequentiality.
This may be good for the inner loops {read row(A), read column(B), write
SUM(a*b) to destination; } but for transposing either A or B, you'll have
to *reorder*, which implies nonsequential access, either at the source or
at the destination.
Yes, that's the same point I was making. The sequence of access
between source and destination is grossly different, so at least
one of the two must be non-sequential. Merge-sort is a multi-pass
algorithm that assures almost-sequential access, which is good
enough for a reasonable implementation. But completely foregoing
the ease of programming by explicitly handling chunks that aren't
rows or columns might be even faster.
One idea I saw hinted at elsewhere in this thread is to not store
the array in row-major or column-major order at all, but instead to
explicitly store as a hierarchial structure of sub-arrays that are
each just the right size for whatever unit is used for backing
store. (For disk-to-RAM paging, one cylinder per large sub-matrix,
and one sector per small sub-matrix, a three-level hierarchy; For
RAM-to-cache paging, one 'page' per sub-matrix, just a two-level
hierarchy.) For example, let's imagine a grossly simplified case
where virtual memory allows four numeric values per 'page', and you
have a 4x4 array to process:
A1 A2 B1 B2
A3 A4 B3 B4
C1 C2 D1 D2
C3 C4 D3 D4
The upper-left 2x2 sub-matrix is stored in one 'page', etc.
So to transpose, you do (in any random order) these four sub-tasks:
- load original sub-matrix A, transpose locally, store in destination A;
- load original sub-matrix B, transpose locally, store in destination C;
- load original sub-matrix C, transpose locally, store in destination B;
- load original sub-matrix D, transpose locally, store in destination D;
So how do you transpose locally? For this special case you simply
swap lower left and upper right elements. For more general case you
have a two-level loop which traverses the upper-right triangle in
parallel with the lower-left triangle.
So how do you write the toplevel algorithm to perform those four tasks?
For this special case you hardwire the calls in any order you want:
LoadLocalTransposeStore(0,0,0,0);
LoadLocalTransposeStore(0,1,1,0);
LoadLocalTransposeStore(1,0,0,1);
LoadLocalTransposeStore(1,1,1,1);
For more general case, you have a two-level loop which passes first
two indexes straight and second two indexes reversed (k = number of
partitions i.e. total size N divided by number of rows,cols in each
sub-matrix, in example above N=4, k=2).
for (i=0, i<k, i++)
for (j=0, j<k, j++)
LoadLocalTransposeStore(i,j,j,i);
Trivial, huh?
Now the *usual* algorithms, such as matrix multiplication which
involves dot product of row from one matrix and transposed column
of other matrix, gets more complicated but not horribly so.
You do *not* perform a dot product of a whole row at a time.
You interleave all the dot products of all the rows that pass
through one row of sub-matrices. I'll leave it to you to work out
the details.
Ao before you start *any* of this, you need to make a complete list
of what matrix operations you need. I'm guessing you mostly need
matrix multiplication, and maybe eigenvalues, right?
Let me go back to the original post to remind myself what the real
purpose of this all was ... OK, you want to do FFT on large sets of
data from the disk and write back out to disk. I've never written
an FFT algorithm, but I vaguely read about the "butterfly" dataflow
used by it, let me check on WikiPedia to make sure I understand it
... unfortunately http://en.wikipedia.org/wiki/Butterfly_diagram
shows most of the useful info in a png file which I have no way to
view here. OK, here's a nice text description of the algorithm:
<http://www.relisoft.com/Science/Physics/fft.html>
I'll comment on locality of reference for each step:
1. Select N that is a power of two. You'll be calculating an N-point
FFT.
2. Gather your samples into a buffer of size N
Sweeps your RAM once when you load original data from disk.
3. Sort the samples in bit-reversed order and put them in a complex
N-point buffer (set the imaginary parts to zero)
;This takes n log(n) time, and using either MergeSort or hardwired
; equivalent specifically for FFT you have nice locality of reference
; for each pass. You sweep n units of memory log(n) times total.
4. Apply the first stage butterfly using adjacent pairs of numbers in
the buffer
;Complete locality of reference, swapping two bites within exact same page.
5. Apply the second stage butterfly using pairs that are separated by
2
;Complete locality of reference, assuming your pages are at least 4
;bytes each.
6. Apply the third stage butterfly using pairs that are separated by
4
;Complete locality of reference, assuming your pages are at least 8
;bytes each.
7. Continue butterflying the numbers in your buffer until you get to
separation of N/2
;At some point, you will be swapping elements that are in
; *different* pages, because the distance you are swapping is larger
; than the size of a page. But in that case, you have only two active
; pages at any one time, and once you finish those two pages you
; never return to them again (during this *step* in the overall
; algorithm, among log(n) steps total).
8. The buffer will contain the Fourier transform
Sweeps your RAM once when you store result to disk.
So my question: Why isn't that straightforward algorithm what you
have chosen to implement? You sweep the pages once per merge pass,
log(n) total (three active pages at any time, two inputs and one
output), then you sweep the pages once per butterfly pass, again
log(n) total (one or two active pages any any one time depending on
whether elements swapped are in same page or not), hence 2 * log(n)
sweeps of pages altogether, plus one sweep when you originally read
the data from disk, and another sweep at the end when you write the
FFT to disk. Isn't that good enough?
My guess: You got fascinated by Bailey's algorithm and decided to
see if you could make it as efficient (in regard to paging
behaviour) as the utterly simple algorithm described by relisoft.com?
One little note to what I said about memory sweeps: If you have
*real* data, and you compute *complex* FFT result, then the memory
sweeps for merge at the start are *real* values whereas the
butterfly swaps are *complex* values, so the butterfly sweeps use
twice as much actual memory as the merge sweeps, minor detail.
This doesn't affect the locality of reference, just the actual run
time is multipled by 1.5. If you have complex input too, like if
you do a subsequent reverse FFT, then all data is complex, so
multiply by 2.
One other little note: Note the merge-sort passes are *not*
general, you don't have to do any comparisons, you don't have to
tag the data with the destination location, you just zipper two
input runs into one output run by alternation, like a "perfect shuffle".
<ot>
Science fiction idea: You have "smart cards", ordinary playing
cards (Contract Bridge, Poker, etc.) which use nanotechnology to
commuicate with neighbors to contruct a model of their relative
locations, then when you try to shuffle them, they bend themselves
so that they deliberately contrive to arrange themselves into
consecutive sequence. That is if the card falling from your left
hand is lower than the the card falling from your right hand, the
card in your left hand bends downward to fall first while the card
in your right hand bends upward to fall later. Of course they
detect sequence breaks within each half-deck in order to detect end
of merge runs from previous shuffle (or chance merge runs at start)
in order to guarantee optimal merge-sort behaviour. Since
log(52)<6, all it takes is six "smart shuffles" between
equally-sized half decks to totally spoil the game!
</ot>
.
- Follow-Ups:
- References:
- Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- From: mike3
- Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- From: moi
- Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- From: mike3
- Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- From: moi
- Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- Prev by Date: Re: Calculate the precision of a floating point number (ie: the number of decimal places)
- Next by Date: Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- Previous by thread: Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- Next by thread: Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- Index(es):