Re: Mailbox-style directory hashing
- From: "Peter J. Holzer" <hjp-usenet2@xxxxxx>
- Date: Fri, 3 Nov 2006 23:04:15 +0100
On 2006-11-02 18:41, s1037989@xxxxxxxxx <s1037989@xxxxxxxxx> wrote:
Peter, thanks for your thoughtful reply! :) My response below...[...]
Peter J. Holzer wrote:
On 2006-10-31 23:40, s1037989@xxxxxxxxx <s1037989@xxxxxxxxx> wrote:
I whipped up this quick and ugly script and I wanted to post it for
code review and others' benefit.
With an array such as:
qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)
The program returns:
# perl list2fs.pl 2
/a/aa/aaa/aaaa/aaaa
/a/aa/aaa/aaab/aaab
/c/cccc
/d/dddd
Now as you can see, what this program does is take a list of filenames
and "hashifies" it like mailbox storing allowing no more than 2 (or
whatever $ARGV[0] is) filenames to be in a single directory. The
point, obviously, is if you have 100000 filenames and ext3 won't store
100000 files in a single directory, you can use this technique to break
them down.
Ext3 will happily store 100000 filenames in a directory - it just won't
be very quick in retrieving them (Even that isn't true for Linux 2.6 any
more - ext3 directories now use a structure called an "htree" to quickly
access files in huge directories). But assuming you need to use ext3
with older kernel versions or other filesystems with linear directories:
Ok, I didn't know that. I thought I had read somewhere that the limit
was 32,000 (and thus the reason for using something like reiserfs), but
that was apparently an ancient article... Oops!
AFAIK, there was never a limit on the number of files in an ext2/ext3
directory. 32 k may be the maximum number of subdirectories (because the
link count is 16 bit, and each ".." entry is a link). But you wouldn't
want more than a few 1000 files in a single directory because it was slow.
aaab aaac aaad aaae aaa
You would need a directory /a/aa/aaa and a file /a/aa/aaa. But you can't
have both.
Actually, in this case, aaa the file would go in aaa the directory.
Right. But my gut feeling that you don't handle that case correctly was
right. See below.
Do you have all the filenames in advance or is it possible to create new
files after the structure has been created? If it is the latter, you
proabably will need a way to split an existing directory if the number
of files in it becomes too large - what happens when somebody accesses
the directory in the middle of the split operation? How do you determine
when you have to split? Count files time you create a new file?
I have the majority of the filenames in advance, but unfortunately the
list will change. To be more specific about what I'm doing: the URLs
are blacklisted domains to be used with squid.
Are they URLs or domains? With URLs you may need to be careful because
of '/' and possibly other characters (especially if you access the
filesystem via Windows/Samba).
Now, that said, this is NOT intended for "hashifying" mail storage
dirs. It IS intended to "hashify" a HUGE list of filenames.
Unfortunately this code is VERY inefficient.
How did you determine that is very inefficient?
By running it... :) It was AWFUL! I ran it against a file containing
nearly 600,000 entries. I never waited long enough for it to finish.
That's probably because it doesn't finish at all if one of the strings
is a prefix of more than $ARGV[0] other strings. Change the
initialization to
my @list = @_ = qw(aa aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd);
and try it. (It will probably crash pretty soon when it runs out of
memory, but if you had an infinite amount of memory it would run
forever)
So, I post it here so people can see my idea if it helps, and so that
people can maybe direct me to an existing CPAN module that would
accomplish the same thing?
When I needed to do similar things I used a hash function (e.g. MD5 or
SHA-1) on the key (the filename in your case) to get nice, uniformly
distributed constant length filenames and then computed the number of
levels from the maximum number of files I had to store. With 256 files
per directory, 2 levels would be enough for 16 million files and 3
levels would be enough for 4 billion files. I never needed more :-).
This is good information, thanks. I'll consider it's practical
purposes, but would defeat the goal of simple blacklist updates via
Samba. :(
You want the directory structure to be created with the script anyway,
don't you? Or do you expect Windows users to create and delete
individual files to edit the blacklist?
Or, perhaps someone likes what I've started and wants to help improve
the code?
If you avoid copying around huge lists it might be faster. But I'm not
sure the code even does what you want - see my questions above.
How could I avoid that?
I was wrong. You are passing the lists by reference, and the hashes
which you copy needlessly aren't that big (at most 70 elements, assuming
printable ASCII). The problem is that you are calling grep way too often
which also causes the endless recursion (and would be slow even without
that).
Here is my version of your function "a" (you have some talent for
choosing horrible names, btw. I could only figure out what the variables
were supposed to be by reading the code - the names were at best
useless and at worst misleading):
sub a {
my ($prefix_len, $list) = @_;
#print " " x $prefix_len, (scalar @$list), " elements\n";
my %hash = ();
push @{ $hash{lc(substr($_, 0, $prefix_len))} }, $_ for @$list;
foreach my $sublist ( values %hash ) {
if ( @$sublist > $ARGV[0] ) {
$sublist = a($prefix_len+1, $sublist);
} else {
$sublist = { map { lc($_) => 1 } @$sublist };
}
}
return \%hash;
}
The biggest change is that I didn't try to figure out the prefixes first
and then find all strings with the same prefix (which caused duplication
and the endless recursion) but do that in one loop - that way each
string can only ever be assigned to one prefix and the lists must get
shorter. The switching between array and hash references is mostly so
that a can be called with an array reference as before. I could have
uses hash references throughout. It also now returns a hash reference
instead of a list-which-is-a-hash, which I consider cleaner.
I didn't have a list of 600000 domain names handy, so I used two lists
with 35000 and 96000 names: The run times (with $ARGV[0] == 1000) were
about 3 and 9 seconds, respectively, on a PIII 850.
hp
--
_ | Peter J. Holzer | > Wieso sollte man etwas erfinden was nicht
|_|_) | Sysadmin WSR | > ist?
| | | hjp@xxxxxx | Was sonst wäre der Sinn des Erfindens?
__/ | http://www.hjp.at/ | -- P. Einstein u. V. Gringmuth in desd
.
- References:
- Mailbox-style directory hashing
- From: s1037989
- Re: Mailbox-style directory hashing
- From: Peter J. Holzer
- Re: Mailbox-style directory hashing
- From: s1037989
- Mailbox-style directory hashing
- Prev by Date: Re: WWW::Mechanize - simulating complete page download
- Next by Date: Re: understanding an error
- Previous by thread: Re: Mailbox-style directory hashing
- Next by thread: Abysmal performance of Net::SFTP (and I have GMP & Pari)
- Index(es):
Relevant Pages
|