Re: Directory sorted in chronological order

From: Uri Guttman (uri_at_stemsystems.com)
Date: 05/06/04


Date: Thu, 06 May 2004 00:50:53 GMT


>>>>> "SH" == Sam Holden <sholden@flexal.cs.usyd.edu.au> writes:

  SH> On Thu, 06 May 2004 01:23:35 +0200, Tore Aursand <tore@aursand.no> wrote:
>> On Wed, 05 May 2004 16:24:00 -0600, Jim Cochrane wrote:
>>> my @dir =
>>> map { $_->[0] }
>>> sort { $a->[1] <=> $b->[1] }
>>> map { [$_, -M] } readdir(DIR);
>>
>> my @dir = sort { -M $a <=> -M $b } readdir( DIR );
>>
  SH> The ST is an optimisation technique and as such is never neccesary.

  SH> However, this case is one where it is almost certainly useful. -M
  SH> results in a stat system call and system calls are slow, even
  SH> worse it may require waiting dor a disk read which is even slower
  SH> (though filesystem level caching should prevent that happening
  SH> more than once on machines without much other disk access
  SH> happening). To make matters even worse stat calls need to access
  SH> the file based on its name, which is linear in the number of files
  SH> in the directory.

  SH> So each -M in the sort comparison function is O(N) (assuming a crappy
  SH> filesystem with such linear access - which is normal) with possibly a
  SH> large constant (hitting the disk). The sort comparison function is
  SH> called O(N*logN) times and hence we have O(N*N*logN).

  SH> The ST version gives us an O(1) comparison function and hence O(N*logN)
  SH> for the sort, which is *much* better.

all of the sort optimization techniques (ST, GRT and orcish) do
that. the real savings is in converting any O( N log N ) work to O(N)
work (running -M on each file only once). you have to count the work of
-M as a constant in general. it may be O(N) when scanning the dir but
that will be cached in ram for sure as it will be hit often. the inodes
could be scattered around (or clustered on some file systems) but that
could be more disk or other complex access. disk accesses outweigh any
linear scan in ram times so ignore the file name search time.

  SH> This is the first example I've seen where the ST actually reduces the
  SH> runtime complexity of the sort by more than a constant factor - since
  SH> it's the first time I've noticed a sort where the comparison function
  SH> isn't O(1) in terms of the size of the list being sorted.

your analysis is slightly off IMO. you need to count inode accesses as
the main work and that is still O(N log N) and with caching who knows?
factoring out the -M to O(N) time is the win.

and the GRT factors out even more by removing the callback code itself
and using the internal cmp function.

and a perfect time to plug my soon to be released Sort::Maker
module. here is how it would be used to solve this:

        use Sort::Maker ;

        my $M_sorter = make_sorter( 'ST', number => '-M' ) ;

        my @sorted = $M_sorter->( readdir(DIR) ) ;

or with the GRT for more speed:

        my $M_sorter = make_sorter( 'GRT',
                        number => {
                                code => '-M',
                                unsigned_float => 1,
                        }
        ) ;

        my @sorted = $M_sorter->( readdir(DIR) ) ;

if anyone wants to get a beta of this module i can get you the url. it
works for all 4 sort styles and has decent pod. it need many more tests,
benchmarks and examples in pod. if you want to be a beta tester and
provide any feedback, email me. my goal is to release it before
yapc::buffalo as i am giving a talk on it there.

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org


Relevant Pages

  • Re: Calculus XOR Probability
    ... I don''t know what you mean by "a line of some sort, ... A disk contradicts a line? ... No, but if you say "you have an elephant in your backyard", and I say ... A disk is not a sequence of points. ...
    (sci.math)
  • Re: "Depreciated" Parameters In SPFILE
    ... the SORT_AREA_SIZE did affect the number of sorts to disk, ... switched from a one pass sort to an optimal sort. ... Total IO sort cost: 4931 Total CPU sort cost: 992293034 ... total temp space used = estimated amount of temporary space needed ...
    (comp.databases.oracle.server)
  • Re: Sorting algorithms was: Factorial
    ... I usually need the sort ... 30-50MByte/sec from Disk to Disk. ... DstFile on the same disk, respectively 25 seconds with DstFile on another ... temporary files (Tmp1 and Tmp2) should be used instead of the merge-arrays. ...
    (microsoft.public.vb.general.discussion)
  • Re: Directory sorted in chronological order
    ... # time perl testtime1.pl ... > SH> worse it may require waiting dor a disk read which is even slower ... To make matters even worse stat calls need to access ... The sort comparison function is ...
    (comp.lang.perl.misc)