Re: random writing access to a file in Python



Paul Rubin wrote:
Claudio Grondi <claudio.grondi@xxxxxxxxxx> writes:

Does it mean, that in case of very large files:
the size of available memory for the sorting operation (making it
possible to work on larger chunks of data in memory) has less impact
on the actual sorting speed than
the speed of the data transfer from/to storage device(s)


Transfer speed and parallelism matters most, if the cpu can keep up.
Parallelism means you want multiple drives that you can do i/o to
simultaneously without having to seek. Random access helps simulate
this, but only somewhat. Large database installations always want
lots of parallel disks for similar reasons to this.

The basic method of sorting large files has traditionally been:

1) Read the file in pieces p(1),p(2),...,p(n) where each piece is as
big as will fit in memory. Sort each piece with your favorite
in-memory sort, and write out sorted pieces that we'll designate
r(1,1),...r(1,n). The r(a,b)'s are called "runs". Use multiple
output devices (tape or disk drives) to write out the runs, i.e. each
output tape will contain its own bunch of runs.

2) Read back a bunch of the runs in parallel, i.e. say you have
written r(1,1),r(1,2),...,r(1,5) to five separate tape drives. Read
them back simultaneously and merge them (requires very little external
memory) into a new "second level" run called r(2,1). Similarly merge
r(1,6),...,r(1,10) to the second level run r(2,2), and so forth.

3) Merge the second level runs into third level runs, and so forth
recursively until there's only one giant run. That is the sorted
output.

The amount of memory determines the size of the initial "first level"
runs, which determines how many recursive merges are needed, so more
memory helps.

The number of levels of recursion also depends on the "merge order",
i.e. the number of runs merged at each merge step. You really want
each merge to read from M physical input devices that are separate
from the output device(s).

There are obviously a lot of optimizations that the above description
is missing, and there's a lot of strategy about how to organize the
merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
output, or 5 in and 3 out, or what?
The Windows XP SP 2 '/> sort' (sorting of four Gigs of 20 byte records took 12 CPU and 18 usual hours) has, from what I could observe on the task manager, done the job in only two runs of 'copying' :
1. first to a temporary file,
2. then to the final file.
In the second run the size of processed chunks between reading/writing was in the order of up to tenths of Megabytes, where in the first run in order of up to hundreds Megabytes. I suppose that the procedure behind it was as follows:
1. decision about the size of chunks to split the file into and choosing the size of required memory
2. processing the chunks with in-memory sorting them and writing to the temporary file
3. decision about the size of buffers for merge sorting the chunks into the final file, so that they all fit into the 300 MByte of used memory
4. opening as many 'pipes' as there were chunks filling all of the pipe buffers up when one of them runs out of data
5. continuously switching between reading and writing to the hard disk, writing the results of the merge sorting to the final file always when one of the buffers run out of data and then filling up all of the buffers for the next cycle (concluded from observed scattered reading, smooth writing)


Is this some kind of production deployment problem you're working on,
or do you just have a big pile of data you need to sort once?
The latter.
One of the intermediate reasons behind doing it, is an attempt to get more and better intuitive understanding what are the hardware and software limits of brute force based approaches to solving problems in the area of AI, language processing and data compression.

If you
need to deploy across multiple machines (i.e. you can't just buy a
giant machine if you need to do this sorting frequently), then I'd
suggest reading up on the subject, starting with Knuth vol 3.

Thank you much for your detailed response.

Claudio Grondi
.



Relevant Pages

  • Re: Fast Linear Hex Search
    ... about 110 times as many operations as sorting 100,000 items. ... If the time to sort 8,000,000 items is very short, there's not much real difference -- but I just tried an experiment, and the program crashed before it finished sorting. ... push @digits, int;} my @sorted = sort @digits; ... Since quicksort is an inplace sort, it shouldn't be a memory problem, but in every occasions the perl process memory usage jumped to reach the limit of available memory of my machine. ...
    (comp.lang.perl.moderated)
  • Re: open a very big file
    ... Suppose the file had 50 rows and you had to find the sum of the numbers. ... to load data in chunks which fit in the memory ". ... If you still run into memory issues with the above, ...
    (comp.soft-sys.matlab)
  • Re: Efficient algorithm for finding duplicates in 56-bit number
    ... |) processor time) it would take to sort millions of those values ... and that'a a heck of a big IF (if it fits in memory). ... | chunk fits in memory) and then merges the chunks. ...
    (comp.programming)
  • Re: python memory blow out
    ... >> allocates memory in 256 kB chunks and never releases it. ... Large chunks are returned to the platform C when they become ... > to release memory back to the operating system once the 256kb chunk is ... the patch may return such a 256KB chunk to the platform C ...
    (comp.lang.python)
  • block move clarification - System.Move()
    ... time, with chunks capable of being expressed in 8 bit, 16 bit and 32 bit ... I've been working on pooling memory and working out ways to make the memory ... each buffer already in existence. ... But I do not know ASM, which is why I am here. ...
    (borland.public.delphi.language.basm)