Re: SortMerge avoids thrashing (was: Bailey's "two pass" FFT algorithm question)



On Oct 31, 2:40 am, moi <r...@xxxxxxxxxxxxxxxxxxx> wrote:
On Wed, 31 Oct 2007 01:27:15 -0700, Robert Maas, see

http://tinyurl.com/uh3twrote:

[good stuff snipped]

Also I mentionned the sort-merge algorithm as a general took that
frequently works for rearranging lots of data. It would be reasonably
fast for your task. It's a good thing to learn how to do efficiently in
any case. But somebody else suggested a completely different algorithm
which is speciic to this task performed on square matrices, and that
sounded like it might be faster than my more general method. So IMO you

Yes, that someone was me. I ran some tests, and my mmap() + tile-flipping
approach is about 10 times faster then Mike's disk-exerciser. Basically
because it does 10 times fewer disk-I/O.


How does it do when the file size exceeds the amount of physical
RAM in the system? For example, try a transpose with it on
a file of 1024 MB with only 512 MB of RAM. What's the disk
activity like?

On Oct 31, 1:54 pm, moi <r...@xxxxxxxxxxxxxxxxxxx> wrote:
On Wed, 31 Oct 2007 13:12:16 -0700, Robert Maas, see





http://tinyurl.com/uh3twrote:
Alert: For weeks, right up through Friday aftermoon, Google Groups was
showing search results only through Oct.13, so I had no way to find more
recent articles that mentionned my name, such as followups to stuff I
previously posted. Just Friday night GG finally enabled me to find a lot
of articles before this:
Date: Fri, 19 Oct 2007 00:45:42 +0200 From: moi
<r...@xxxxxxxxxxxxxxxxxxx>
Since Friday I've been trying to catch up with backlog. I'm less than 6
days behind at the moment.
The
A B C D
E F G H
matrix can also be transposed by "one-touch-football" this is the
shuffle or permutation as explained on the wiki-page.

A Google search for: one-touch-football transpose matrix doesn't turn up
anything related except this thread itself. I don't know what you mean
by "the wiki-page".

Wiki-page has been referred to somewhere at the start of this (or
another, Mike has started more than one thread on the same subject)

http://en.wikipedia.org/wiki/In-place_matrix_transposition
One-touch-football was a term I invented.


However I haven't been able to get anything really helpful, it doesn't
describe for example how to do on-disk transpose efficiently, it just
gives reviews of various stuff.



One idea I saw hinted at elsewhere in this thread is to not store the
array in row-major or column-major order at all, but instead to
explicitly store as a hierarchial structure of sub-arrays that are each
just the right size for whatever unit is used for backing store. (For
disk-to-RAM paging, one cylinder per large sub-matrix, and one sector
per small sub-matrix, a three-level hierarchy; For RAM-to-cache paging,
one 'page' per sub-matrix, just a two-level hierarchy.) For example,
let's imagine a grossly simplified case where virtual memory allows four
numeric values per 'page', and you have a 4x4 array to process:
A1 A2 B1 B2
A3 A4 B3 B4
C1 C2 D1 D2
C3 C4 D3 D4

This is almost the Morton addressing.





The upper-left 2x2 sub-matrix is stored in one 'page', etc. So to
transpose, you do (in any random order) these four sub-tasks: - load
original sub-matrix A, transpose locally, store in destination A; - load
original sub-matrix B, transpose locally, store in destination C; - load
original sub-matrix C, transpose locally, store in destination B; - load
original sub-matrix D, transpose locally, store in destination D;

So how do you transpose locally? For this special case you simply swap
lower left and upper right elements. For more general case you have a
two-level loop which traverses the upper-right triangle in parallel with
the lower-left triangle.

So how do you write the toplevel algorithm to perform those four tasks?
For this special case you hardwire the calls in any order you want:
LoadLocalTransposeStore(0,0,0,0);
LoadLocalTransposeStore(0,1,1,0);
LoadLocalTransposeStore(1,0,0,1);
LoadLocalTransposeStore(1,1,1,1);
For more general case, you have a two-level loop which passes first two
indexes straight and second two indexes reversed (k = number of
partitions i.e. total size N divided by number of rows,cols in each
sub-matrix, in example above N=4, k=2).
for (i=0, i<k, i++)
for (j=0, j<k, j++)
LoadLocalTransposeStore(i,j,j,i);
Trivial, huh?

I guess you must have missed the posting with my code-snippet,
which used a similar trivial flip&swap method.
BTW for non-square matrices things become more complicated.

AvK

Hmm. This is interesting. How does it perform on
fairly large matrices, anyway, especially when they're
too large to all fit in main memory? I've noticed that on
my system, when I have files mapped to virtual memory
that are too large to store in main memory, the
hard drive is accessed very vigorously, even for something
as simple as a digit-by-digit addition of two huge integers
stored on the hard drive, taking much longer than, say, a
simple file copy of a similar amount of data. Why is this?

Also, for the Bailey algorithms under discussion, they
seem to require the data be in column-major order, and
it was mentioned here that the algorithm cannot be
converted to work with data in row-major order. In addition,
if one uses a discontiguous representation of the data
on disk to accelerate the transpose, it may hamper the
row/column FFTs/NTTs, which are best done when
rows/colums are stored contiguously.

Given your large amount of RAM and your smaller but still substantial
fast cache, it sounds like you can do the following: Read input from one
disk drive, sort en route using RAM and fast cache, write to second disk
drive, so each disk drive is running sequentially, no thrashing there,
and your in-RAM/cache algorithm was hopefully not thrashing the fast
cache either.

It is hard (and silly, IMHO) to tune cache effects for a program that
spends about .12 sec user + .24 sec sys CPU in some 3..30 sec walltime.
The CPU is idling anyway.


How do I get rid of that, anyway? How do I make the wall time
as low as possible?

Note that the system's LRU caching effectively results in double
buffering. While Mike is juggling his tiny buffers, the system has
probably all of the file cached into system buffers. Constantly hammering
the LRU with writes will probably cause some 'write through' to disk by
the system.

AvK


.



Relevant Pages

  • Re: How do you deal with a lot of e-mail?
    ... Your reply is advocating staying away from large disk storage systems. ... any single exchange store would be greater than 100GB each. ... Storing on a SAN is simply the only way to go these days. ... By spreading the disk load across a SAN or multiple SAN's ...
    (microsoft.public.exchange.admin)
  • Re: SortMerge avoids thrashing (was: Baileys "two pass" FFT algorithm question)
    ... just the right size for whatever unit is used for backing store. ... original sub-matrix A, transpose locally, store in destination A; - load ...
    (comp.programming)
  • Re: SortMerge avoids thrashing (was: Baileys "two pass" FFT algorithm question)
    ... One idea I saw hinted at elsewhere in this thread is to not store the ... original sub-matrix A, transpose locally, store in destination A; - load ...
    (comp.programming)
  • Mailbox store dismounted automatically
    ... Both of Exchange mailbox stores dismounted automatically, ... Database error 0xfffffd9a occurred in function JTAB_BASE::EcUpdate while ... Component: Information Store ... A critical input/output error or disk error has occurred. ...
    (microsoft.public.exchange.admin)
  • Re: OT: Ripping Tapes & LPs
    ... >> album, that you can take out of the unit and label and store on a shelf, ... I examined the disk and it was like someone had thrown it ... on the shelf if they'd noticed the condition. ... When I took it back to the store and complained, ...
    (rec.arts.sf.tv.babylon5.moderated)