Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)



On Wed, 31 Oct 2007 13:12:16 -0700, Robert Maas, see
http://tinyurl.com/uh3t wrote:

Alert: For weeks, right up through Friday aftermoon, Google Groups was
showing search results only through Oct.13, so I had no way to find more
recent articles that mentionned my name, such as followups to stuff I
previously posted. Just Friday night GG finally enabled me to find a lot
of articles before this:
Date: Fri, 19 Oct 2007 00:45:42 +0200 From: moi
<root@xxxxxxxxxxxxxxxxxxx>
Since Friday I've been trying to catch up with backlog. I'm less than 6
days behind at the moment.
The
A B C D
E F G H
matrix can also be transposed by "one-touch-football" this is the
shuffle or permutation as explained on the wiki-page.

A Google search for: one-touch-football transpose matrix doesn't turn up
anything related except this thread itself. I don't know what you mean
by "the wiki-page".

Wiki-page has been referred to somewhere at the start of this (or
another, Mike has started more than one thread on the same subject)

http://en.wikipedia.org/wiki/In-place_matrix_transposition
One-touch-football was a term I invented.



One idea I saw hinted at elsewhere in this thread is to not store the
array in row-major or column-major order at all, but instead to
explicitly store as a hierarchial structure of sub-arrays that are each
just the right size for whatever unit is used for backing store. (For
disk-to-RAM paging, one cylinder per large sub-matrix, and one sector
per small sub-matrix, a three-level hierarchy; For RAM-to-cache paging,
one 'page' per sub-matrix, just a two-level hierarchy.) For example,
let's imagine a grossly simplified case where virtual memory allows four
numeric values per 'page', and you have a 4x4 array to process:
A1 A2 B1 B2
A3 A4 B3 B4
C1 C2 D1 D2
C3 C4 D3 D4

This is almost the Morton addressing.

The upper-left 2x2 sub-matrix is stored in one 'page', etc. So to
transpose, you do (in any random order) these four sub-tasks: - load
original sub-matrix A, transpose locally, store in destination A; - load
original sub-matrix B, transpose locally, store in destination C; - load
original sub-matrix C, transpose locally, store in destination B; - load
original sub-matrix D, transpose locally, store in destination D;

So how do you transpose locally? For this special case you simply swap
lower left and upper right elements. For more general case you have a
two-level loop which traverses the upper-right triangle in parallel with
the lower-left triangle.

So how do you write the toplevel algorithm to perform those four tasks?
For this special case you hardwire the calls in any order you want:
LoadLocalTransposeStore(0,0,0,0);
LoadLocalTransposeStore(0,1,1,0);
LoadLocalTransposeStore(1,0,0,1);
LoadLocalTransposeStore(1,1,1,1);
For more general case, you have a two-level loop which passes first two
indexes straight and second two indexes reversed (k = number of
partitions i.e. total size N divided by number of rows,cols in each
sub-matrix, in example above N=4, k=2).
for (i=0, i<k, i++)
for (j=0, j<k, j++)
LoadLocalTransposeStore(i,j,j,i);
Trivial, huh?


I guess you must have missed the posting with my code-snippet,
which used a similar trivial flip&swap method.
BTW for non-square matrices things become more complicated.

AvK




.



Relevant Pages

  • Re: SortMerge avoids thrashing (was: Baileys "two pass" FFT algorithm question)
    ... just the right size for whatever unit is used for backing store. ... original sub-matrix A, transpose locally, store in destination A; - load ...
    (comp.programming)
  • Re: need moving advice
    ... used the POD concept where they drop off a 16 foot container that you (or ... they) load and then they deliver it to your destination or store it. ...
    (soc.senior.issues)
  • Re: SortMerge avoids thrashing (was: Baileys "two pass" FFT algorithm question)
    ... What's the disk ... just the right size for whatever unit is used for backing store. ... original sub-matrix A, transpose locally, store in destination A; ... fast cache, it sounds like you can do the following: ...
    (comp.programming)
  • Re: A "killer" macro
    ... (defconstant load 8) ... (defconstant store 9) ... giving the opcode behind that mnemonic on that architecture. ... sophisticated way to work around the limitations of the "case" macro. ...
    (comp.lang.lisp)
  • Re: How Many Processor Cores Are Enough?
    ... A load by Pi is considered performed at a point in time when the ... A store by Pi is considered performed with respect to Pk (i and k ... It's defined in the Itanium manuals and is equivalent to Sparc TSO ...
    (comp.arch)