Re: File Management



Also...

You can use the Digest::MD5 module to create an MD5 signature for each file,
and compare signatures only among the files that have the same size.
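
Here is a minimal sketch of that idea, not a finished tool. It assumes you
already have a list of paths that share one size; md5_of() and
same_digest_groups() are names made up for this example. A matching MD5 is
a strong hint rather than proof, so you may still want to compare contents
before deleting anything.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hex MD5 digest of a file's contents. binmode matters on Windows,
# where text mode would translate line endings.
# md5_of() is a hypothetical helper name.
sub md5_of {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Given paths already known to share a size, return the groups
# (array refs) whose digests also match.
sub same_digest_groups {
    my (@paths) = @_;
    my %by_md5;
    push @{ $by_md5{ md5_of($_) } }, $_ for @paths;
    return grep { @$_ > 1 } values %by_md5;
}
```

Each returned group is an array ref of paths with identical digests, which
is the short list you would show to the user.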


Teddy


----- Original Message -----
From: "Xavier Noria" <fxn@xxxxxxxxxxx>
To: "beginners perl" <beginners@xxxxxxxx>
Sent: Saturday, July 23, 2005 10:46 AM
Subject: Re: File Management


> On Jul 23, 2005, at 7:56, Joel Divekar wrote:
>
> > We have a Windows-based file server with thousands of
> > user accounts. Each user has thousands of files
> > in his home directory. Most of these files are
> > duplicates or modified / updated versions of
> > existing files. These files are either .doc, .xls
> > or .ppt files which are shared by groups or
> > departments.
> >
> > Due to this my server holds terabytes of data, most
> > of which is redundant, and our sysadmin has a tough time
> > maintaining storage space.
> >
> > For this I thought of writing a small program to locate
> > similar or duplicate files stored on my file server
> > and delete them with the help of the user. The program
> > should work very fast and I don't know where to
> > start.
>
> Well, to come up with the right solution one would need to play around a
> bit on the server. I propose an approach based on the description
> above, just in case it helps.
>
> Since there is a large number of files, and we need to walk the tree at
> least once and store some data for each file to compare, I would choose
> a quick test first that keeps the tree traversal as fast as possible
> and purges the tree, and then do heavier operations on the remaining
> candidates.
>
> For instance:
>
> 1. Walk the tree and build a map using -s
>
> size -> filenames
>
> 2. Purge the entries that have just one filename associated, since
> they have no duplicate for sure
>
> 3. Work on the rest of the entries.
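
Steps 1 and 2 above could look roughly like the following. This is only a
sketch: size_candidates() is a name invented for this example, the root
directory is whatever you pass in, and it assumes the whole map fits in
memory.

```perl
use strict;
use warnings;
use File::Find;

# Steps 1 and 2: map size -> filenames, then purge sizes seen only once.
# size_candidates() is a hypothetical helper name.
sub size_candidates {
    my ($root) = @_;
    my %by_size;
    find({
        no_chdir => 1,                  # $_ holds the full path
        wanted   => sub {
            return unless -f $_;        # plain files only
            my $size = (-s _) || 0;     # reuse the stat done by -f
            push @{ $by_size{$size} }, $_;
        },
    }, $root);
    # A size with a single filename cannot have a duplicate.
    delete $by_size{$_} for grep { @{ $by_size{$_} } < 2 } keys %by_size;
    return \%by_size;
}

# Optional command-line use: print one summary line, not one per file.
if (@ARGV) {
    my $groups = size_candidates($ARGV[0]);
    print scalar(keys %$groups), " sizes are shared by 2 or more files\n";
}
```

What remains in the returned hash is the input for step 3.
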
>
> If the map in (1) gets too big to fit in a hash in memory you could
> use some sort of database table, maybe something simple to set up such as
> SQLite. For (3), if the number of candidates is still not small you
> could make an additional refinement by constructing a map keyed on MD5s,
> until you get a small number of files and can compare their contents.
>
> Trace the tree traversal as little as possible. Printing a debug line
> to the console for each file, for instance, would slow down the script
> by orders of magnitude.
>
> Then, to maintain that tree, I don't know, maybe the time this takes
> is acceptable? Running that procedure periodically might be a simple
> but good enough solution.
>
> -- fxn
>
