Re: random writing access to a file in Python
- From: Claudio Grondi <claudio.grondi@xxxxxxxxxx>
- Date: Sun, 27 Aug 2006 09:44:30 +0200
Paul Rubin wrote:
Claudio Grondi <claudio.grondi@xxxxxxxxxx> writes:The Windows XP SP 2 '/> sort' (sorting of four Gigs of 20 byte records took 12 CPU and 18 usual hours) has, from what I could observe on the task manager, done the job in only two runs of 'copying' :
Does it mean, that in case of very large files:
the size of available memory for the sorting operation (making it
possible to work on larger chunks of data in memory) has less impact
on the actual sorting speed than
the speed of the data transfer from/to storage device(s)
Transfer speed and parallelism matters most, if the cpu can keep up.
Parallelism means you want multiple drives that you can do i/o to
simultaneously without having to seek. Random access helps simulate
this, but only somewhat. Large database installations always want
lots of parallel disks for similar reasons to this.
The basic method of sorting large files has traditionally been:
1) Read the file in pieces p(1),p(2),...,p(n) where each piece is as
big as will fit in memory. Sort each piece with your favorite
in-memory sort, and write out sorted pieces that we'll designate
r(1,1),...r(1,n). The r(a,b)'s are called "runs". Use multiple
output devices (tape or disk drives) to write out the runs, i.e. each
output tape will contain its own bunch of runs.
2) Read back a bunch of the runs in parallel, i.e. say you have
written r(1,1),r(1,2),...,r(1,5) to five separate tape drives. Read
them back simultaneously and merge them (requires very little external
memory) into a new "second level" run called r(2,1). Similarly merge
r(1,6),...,r(1,10) to the second level run r(2,2), and so forth.
3) Merge the second level runs into third level runs, and so forth
recursively until there's only one giant run. That is the sorted
output.
The amount of memory determines the size of the initial "first level"
runs, which determines how many recursive merges are needed, so more
memory helps.
The number of levels of recursion also depends on the "merge order",
i.e. the number of runs merged at each merge step. You really want
each merge to read from M physical input devices that are separate
from the output device(s).
There are obviously a lot of optimizations that the above description
is missing, and there's a lot of strategy about how to organize the
merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
output, or 5 in and 3 out, or what?
1. first to a temporary file,
2. then to the final file.
In the second run the size of processed chunks between reading/writing was in the order of up to tenths of Megabytes, where in the first run in order of up to hundreds Megabytes. I suppose that the procedure behind it was as follows:
1. decision about the size of chunks to split the file into and choosing the size of required memory
2. processing the chunks with in-memory sorting them and writing to the temporary file
3. decision about the size of buffers for merge sorting the chunks into the final file, so that they all fit into the 300 MByte of used memory
4. opening as many 'pipes' as there were chunks filling all of the pipe buffers up when one of them runs out of data
5. continuously switching between reading and writing to the hard disk, writing the results of the merge sorting to the final file always when one of the buffers run out of data and then filling up all of the buffers for the next cycle (concluded from observed scattered reading, smooth writing)
The latter.
Is this some kind of production deployment problem you're working on,
or do you just have a big pile of data you need to sort once?
One of the intermediate reasons behind doing it, is an attempt to get more and better intuitive understanding what are the hardware and software limits of brute force based approaches to solving problems in the area of AI, language processing and data compression.
If you
need to deploy across multiple machines (i.e. you can't just buy a
giant machine if you need to do this sorting frequently), then I'd
suggest reading up on the subject, starting with Knuth vol 3.
Thank you much for your detailed response.
Claudio Grondi
.
- Follow-Ups:
- Re: random writing access to a file in Python
- From: Paul Rubin
- Re: random writing access to a file in Python
- References:
- random writing access to a file in Python
- From: Claudio Grondi
- Re: random writing access to a file in Python
- From: Tim Peters
- Re: random writing access to a file in Python
- From: Claudio Grondi
- Re: random writing access to a file in Python
- From: Claudio Grondi
- Re: random writing access to a file in Python
- From: Paul Rubin
- Re: random writing access to a file in Python
- From: Claudio Grondi
- Re: random writing access to a file in Python
- From: Paul Rubin
- Re: random writing access to a file in Python
- From: Claudio Grondi
- Re: random writing access to a file in Python
- From: Paul Rubin
- random writing access to a file in Python
- Prev by Date: Re: Learning Python - Have Question.
- Next by Date: avoiding file corruption
- Previous by thread: Re: random writing access to a file in Python
- Next by thread: Re: random writing access to a file in Python
- Index(es):
Relevant Pages
|