Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- From: mike3 <mike4ty4@xxxxxxxxx>
- Date: Sat, 03 Nov 2007 00:02:35 -0700
On Oct 31, 1:54 pm, moi <r...@xxxxxxxxxxxxxxxxxxx> wrote:
On Wed, 31 Oct 2007 13:12:16 -0700, Robert Maas, see
http://tinyurl.com/uh3twrote:
Alert: For weeks, right up through Friday aftermoon, Google Groups was
showing search results only through Oct.13, so I had no way to find more
recent articles that mentionned my name, such as followups to stuff I
previously posted. Just Friday night GG finally enabled me to find a lot
of articles before this:
Date: Fri, 19 Oct 2007 00:45:42 +0200 From: moiSince Friday I've been trying to catch up with backlog. I'm less than 6
<r...@xxxxxxxxxxxxxxxxxxx>
days behind at the moment.
The
A B C D
E F G H
matrix can also be transposed by "one-touch-football" this is the
shuffle or permutation as explained on the wiki-page.
A Google search for: one-touch-football transpose matrix doesn't turn up
anything related except this thread itself. I don't know what you mean
by "the wiki-page".
Wiki-page has been referred to somewhere at the start of this (or
another, Mike has started more than one thread on the same subject)
http://en.wikipedia.org/wiki/In-place_matrix_transposition
One-touch-football was a term I invented.
However I haven't been able to get anything really helpful, it doesn't
describe for example how to do on-disk transpose efficiently, it just
gives reviews of various stuff.
One idea I saw hinted at elsewhere in this thread is to not store the
array in row-major or column-major order at all, but instead to
explicitly store as a hierarchial structure of sub-arrays that are each
just the right size for whatever unit is used for backing store. (For
disk-to-RAM paging, one cylinder per large sub-matrix, and one sector
per small sub-matrix, a three-level hierarchy; For RAM-to-cache paging,
one 'page' per sub-matrix, just a two-level hierarchy.) For example,
let's imagine a grossly simplified case where virtual memory allows four
numeric values per 'page', and you have a 4x4 array to process:
A1 A2 B1 B2
A3 A4 B3 B4
C1 C2 D1 D2
C3 C4 D3 D4
This is almost the Morton addressing.
The upper-left 2x2 sub-matrix is stored in one 'page', etc. So to
transpose, you do (in any random order) these four sub-tasks: - load
original sub-matrix A, transpose locally, store in destination A; - load
original sub-matrix B, transpose locally, store in destination C; - load
original sub-matrix C, transpose locally, store in destination B; - load
original sub-matrix D, transpose locally, store in destination D;
So how do you transpose locally? For this special case you simply swap
lower left and upper right elements. For more general case you have a
two-level loop which traverses the upper-right triangle in parallel with
the lower-left triangle.
So how do you write the toplevel algorithm to perform those four tasks?
For this special case you hardwire the calls in any order you want:
LoadLocalTransposeStore(0,0,0,0);
LoadLocalTransposeStore(0,1,1,0);
LoadLocalTransposeStore(1,0,0,1);
LoadLocalTransposeStore(1,1,1,1);
For more general case, you have a two-level loop which passes first two
indexes straight and second two indexes reversed (k = number of
partitions i.e. total size N divided by number of rows,cols in each
sub-matrix, in example above N=4, k=2).
for (i=0, i<k, i++)
for (j=0, j<k, j++)
LoadLocalTransposeStore(i,j,j,i);
Trivial, huh?
I guess you must have missed the posting with my code-snippet,
which used a similar trivial flip&swap method.
BTW for non-square matrices things become more complicated.
AvK
Hmm. This is interesting. How does it perform on
fairly large matrices, anyway, especially when they're
too large to all fit in main memory? I've noticed that on
my system, when I have files mapped to virtual memory
that are too large to store in main memory, the
hard drive is accessed very vigorously, even for something
as simple as a digit-by-digit addition of two huge integers
stored on the hard drive, taking much longer than, say, a
simple file copy of a similar amount of data. Why is this?
Also, for the Bailey algorithms under discussion, they
seem to require the data be in column-major order, and
it was mentioned here that the algorithm cannot be
converted to work with data in row-major order. In addition,
if one uses a discontiguous representation of the data
on disk to accelerate the transpose, it may hamper the
row/column FFTs/NTTs, which are best done when
rows/colums are stored contiguously.
.
- Prev by Date: Re: Probably more a Math than Programming question
- Next by Date: Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- Previous by thread: Probably more a Math than Programming question
- Next by thread: Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)
- Index(es):
Relevant Pages
|
|