Re: Massive sort among dozens of files



Fred <hn.ft.pris@xxxxxxxxx> writes:

Hi, all:
I've got a problem, there are about 50 files with total size
of 5GB, every of them has a size of about 100MB. Each file contains
thousands of lines of data. I need to sort the data in those files by
lexicon order, and write them to a set of files. The memory is limited
to 1.5GB. What is the fastest way to perform it? Thanks for help.

sort(1) should be able to process each of the input files
independently. Once you have 50 files sorted, you can easily write a
merge program (it could even be done in bash). And use split to
split the output over several files.


Something like:

[pjb@thalassa tmp]$ i=0; for f in /data/doc/rfc/rfc111?.txt.gz ; do \
gunzip < $f|sort -u > data-$i ; i=$(( $i + 1 )) ; done
[pjb@thalassa tmp]$ ~/bin/merge data-? | split --lines=1000 -d -a 2 - sorted-


[pjb@thalassa tmp]$ wc data-?
90 789 5454 data-0
204 1542 11011 data-1
590 5203 36602 data-2
1335 10913 83819 data-3
1030 9064 66179 data-4
295 2171 16768 data-5
747 5975 42963 data-6
4956 26338 307880 data-7
974 8616 59442 data-8
4 19 139 data-9
10225 70630 630257 total
[pjb@thalassa tmp]$ wc sorted-*
1000 5009 52972 sorted-00
1000 7537 58101 sorted-01
1000 8217 62181 sorted-02
1000 9606 64981 sorted-03
1000 9625 65049 sorted-04
1000 7926 60900 sorted-05
1000 5611 70161 sorted-06
1000 6489 70032 sorted-07
1000 5368 64402 sorted-08
1000 4290 50368 sorted-09
225 952 11110 sorted-10
10225 70630 630257 total
[pjb@thalassa tmp]$ head sorted-04
for mail reflectors is provided by the NIC in the files
for mail sent to address lists comprising multiple users. In order
for message text encryption only. The CBC mode definition in IS 8372
for purposes of this RFC:
for such quantities, and the encoding mechanism defined in Section
for the AlgorithmIdentifier. An algorithm identifier for RSA is
for the IP service interface, the IP module, the local network
for the change. Understanding the model on which the whole facility
for the local computing and communication environment. There is no
for the user's name, implying that any other identifying information
[pjb@thalassa tmp]$ cat ~/bin/merge
#!/usr/local/bin/clisp -ansi -q -Kfull -E iso-8859-1
;;;; -*- mode: lisp;coding:utf-8 -*-
;;;;**************************************************************************
;;;;FILE: merge
;;;;LANGUAGE: Common-Lisp
;;;;SYSTEM: Common-Lisp
;;;;USER-INTERFACE: NONE
;;;;DESCRIPTION
;;;;
;;;; Merge the files.
;;;;
;;;;AUTHORS
;;;; <PJB> Pascal Bourguignon <pjb@xxxxxxxxxxxxxxxxx>
;;;;MODIFICATIONS
;;;; 2007-06-28 <PJB> Created.
;;;;BUGS
;;;;LEGAL
;;;; GPL
;;;;
;;;; Copyright Pascal Bourguignon 2007 - 2007
;;;;
;;;; This program is free software; you can redistribute it and/or
;;;; modify it under the terms of the GNU General Public License
;;;; as published by the Free Software Foundation; either version
;;;; 2 of the License, or (at your option) any later version.
;;;;
;;;; This program is distributed in the hope that it will be
;;;; useful, but WITHOUT ANY WARRANTY; without even the implied
;;;; warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
;;;; PURPOSE. See the GNU General Public License for more details.
;;;;
;;;; You should have received a copy of the GNU General Public
;;;; License along with this program; if not, write to the Free
;;;; Software Foundation, Inc., 59 Temple Place, Suite 330,
;;;; Boston, MA 02111-1307 USA
;;;;**************************************************************************

(defpackage "MERGE"
(:use "COMMON-LISP")
(:shadow "FILE-STREAM" "MERGE"))
(in-package "MERGE")


(defstruct file
stream
line)


(defun file-open (path)
"
DO: Open the text file at PATH and read the first line.
RETURN: The FILE structured.
"
(let ((result (make-file :stream (open path))))
(file-read-line result)
result))


(defun file-read-line (file)
"
RETURN: The line read, or NIL upon EOF.
"
(setf (file-line file) (read-line (file-stream file) nil nil)))


(defun nsort (vector test &key (key (function identity) keyp))
"
RETURN: VECTOR
DO: Same as CL:SORT, but returns VECTOR itself, keeping
the other attributes of the vector as the fill-pointer, etc.
"
(let ((sorted (if keyp
(sort vector test :key key)
(sort vector test))))
(unless (eq sorted vector)
(replace vector sorted))
vector))


(defun insert (item vector test &key (key (function identity)))
"
DO: Modify VECTOR, shifting elements smaller than ITEM (as by TEST)
from 1 below (length vector) down by one, and insert ITEM in
VECTOR.
PRE: (subseq VECTOR 1) is sorted.
POST: VECTOR is sorted.
RETURN: VECTOR
"
(loop
:for i :from 1 :below (length vector)
:while (funcall test (funcall key (aref vector i)) (funcall key item))
:do (setf (aref vector (1- i)) (aref vector i))
:finally (setf (aref vector (1- i)) item) (return vector)))


(defun vpop (vector)
"

"
(replace vector vector :start1 0 :start2 1)
(decf (fill-pointer vector))
vector)


(defun merge (inpaths)
(let* ((data (remove-if-not (function file-line) ; skip empty files.
(mapcar (function file-open) inpaths)))
(len (length data))
(files (nsort (make-array len :fill-pointer len :initial-contents data)
(function string<=)
:key (function file-line))))
(loop
:while (plusp (length files))
:do (let ((smallest (aref files 0)))
(princ (file-line smallest)) (terpri)
(if (file-read-line smallest)
(progn
(insert smallest files
(function string<=) :key (function file-line))
(nsort files
(function string<=) :key (function file-line)))
(vpop files))))))


(when ext:*args*
(merge ext:*args*))

;;;; THE END ;;;;
[pjb@thalassa tmp]$

--
__Pascal Bourguignon__ http://www.informatimago.com/

NOTE: The most fundamental particles in this product are held
together by a "gluing" force about which little is currently known
and whose adhesive power can therefore not be permanently
guaranteed.
.