Re: File Management
- From: tallison@xxxxxxxxxxx (Tom Allison)
- Date: Sat, 23 Jul 2005 08:30:48 -0400
Joel Divekar wrote:
Hi All
We have a Windows-based file server with thousands of user accounts. Each user has thousands of files in his home directory. Most of these files are duplicates, or modified or updated versions, of existing files. They are .doc, .xls or .ppt files shared by groups or departments.
As a result the server holds terabytes of data, most of it redundant, and our sysadmin has a tough time maintaining storage space.
So I thought of writing a small program to locate similar or duplicate files stored on the file server and delete them with the help of the users. The program should run very fast, and I don't know where to start.
Can anybody point me to some links on how to start? I will take it from there. I would also like to know of a long-term solution for this problem, if there is one. I am comfortable with Linux and shell programming.
Please advise. Thanks a lot.
Regards
Joel Mumbai, India 9821421965
File::Find is one possibility, except that it seems to behave badly when files are modified while the tree is being walked. In my experience 'badly' means duplicated results. Nothing worse than that, but something to be aware of.
So you want to build a hash of FullPath => md5-hash,
and then a second hash of md5-hash => [files]. If a key in the second hash has more than one filename associated with it, those files are duplicates, and you probably want more stat information (mtime) to decide which to purge.
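The two hashes described above could be sketched roughly as follows. This is only an illustration, not tested against a real file server: the sub names (md5_of, group_by_md5, report_duplicates) are made up for the example, it assumes the core modules File::Find and Digest::MD5 and Perl 5.10 or later, and it only reports candidates rather than deleting anything.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Return the hex MD5 digest of a file's contents, or nothing if unreadable.
sub md5_of {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Walk $root and return a hash ref of md5-hash => [files].
sub group_by_md5 {
    my ($root) = @_;
    my %md5_for;    # FullPath => md5-hash
    find({
        no_chdir => 1,    # keep $_ as the full path
        wanted   => sub {
            return unless -f $_;
            # Keying on the path means a path reported twice is stored once.
            my $digest = md5_of($_);
            $md5_for{$_} = $digest if defined $digest;
        },
    }, $root);

    my %files_for;  # md5-hash => [files]
    push @{ $files_for{ $md5_for{$_} } }, $_ for sort keys %md5_for;
    return \%files_for;
}

# Print every digest shared by two or more files, newest file first.
sub report_duplicates {
    my ($files_for) = @_;
    for my $digest (sort keys %$files_for) {
        my @paths = @{ $files_for->{$digest} };
        next if @paths < 2;
        my %mtime;
        $mtime{$_} = (stat $_)[9] // 0 for @paths;
        print "$digest\n";
        for my $path (sort { $mtime{$b} <=> $mtime{$a} } @paths) {
            print "  ", scalar localtime($mtime{$path}), "  $path\n";
        }
    }
}

# Usage: perl finddups.pl /path/to/tree   (the script name is hypothetical)
report_duplicates(group_by_md5($ARGV[0])) if @ARGV;
```

An identical MD5 only tells you the contents match. The "modified or updated version" files will hash differently, so they need some other comparison.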
This could probably be done in RAM if you are under 10^6 files.
Even if you can't hold the entire tree in memory, you could at least do it in chunks, for example by only looking at files within a given size range until you pare things down a little.
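One way the size-range idea might look, as a rough sketch: the sub names (paths_by_size, candidates) and the size bands are invented for the example, and it goes one step beyond what is said above by dropping any file whose exact size is unique, since two files of different sizes cannot be identical. Only the survivors would then need an MD5.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Collect plain files under $root with $min <= size < $max, keyed by exact size.
sub paths_by_size {
    my ($root, $min, $max) = @_;
    my %paths_for;    # size in bytes => [paths]
    find({
        no_chdir => 1,    # keep $_ as the full path
        wanted   => sub {
            return unless -f $_;
            my $size = (stat _)[7];    # reuse the stat done by -f
            return if $size < $min or $size >= $max;
            push @{ $paths_for{$size} }, $_;
        },
    }, $root);
    return \%paths_for;
}

# Keep only sizes shared by two or more files; the rest cannot be duplicates.
sub candidates {
    my ($paths_for) = @_;
    my %keep;
    for my $size (keys %$paths_for) {
        my $paths = $paths_for->{$size};
        $keep{$size} = $paths if @$paths > 1;
    }
    return \%keep;
}

# Hypothetical size bands in bytes; files of 1 GiB or more fall outside them.
my @bands = ([0, 64 * 1024], [64 * 1024, 1024**2], [1024**2, 1024**3]);

if (@ARGV) {
    for my $band (@bands) {
        my $cands = candidates(paths_by_size($ARGV[0], @$band));
        printf "%d..%d bytes: %d sizes worth hashing\n",
            @$band, scalar keys %$cands;
    }
}
```

Each band walks the tree again, so this trades extra directory scans for lower memory use.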