Re: Populating a dictionary, fast
- From: Michael Bacarella <mbac@xxxxxxxxxxxxx>
- Date: Sat, 10 Nov 2007 17:18:37 -0800 (PST)
That's an awfully complicated way to iterate over a file. Try this
instead:
dictionary
id2name = {}
for line in open('id2name.txt'):
id,name = line.strip().split(':')
id = long(id)
id2name[id] = name
This takes about 45 *minutes*On my system, it takes about a minute and a half to produce a
with 8191180 entries.
Doing something similar on my system is very fast as well.
$ cat dict-8191180.py
#!/usr/bin/python
v = {}
for i in xrange(8191180):
v[i] = i
$ time ./dict-8191180.py
real 0m5.877s
user 0m4.953s
sys 0m0.924s
But...
30If I comment out the last line in the loop body it takes only about
most_seconds_ to run. This would seem to implicate the line id2name[id] =
name as being excruciatingly slow.
No, dictionary access is one of the most highly-optimized, fastest,
efficient parts of Python. What it indicates to me is that your systemis
running low on memory, and is struggling to find room for 517MB worthof
data.
If only it were so easy.
$ free
total used free shared buffers cached
Mem: 7390244 2103448 5286796 0 38996 1982756
-/+ buffers/cache: 81696 7308548
Swap: 2096472 10280 2086192
Here's your Python implementation running as badly as mine did.
$ wc -l id2name.txt
8191180 id2name.txt
$ cat cache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
id,name = line.strip().split(':',1)
id = long(id)
id2name[id] = name
$ time ./cache-id2name.py
^C
I let it go 30 minutes before killing it since I had to leave. Here it is in top before I did the deed.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18802 root 25 0 1301m 1.2g 1148 R 99.9 17.1 36:05.99 cache-id2name.p
36 minutes, 05.99 seconds.
To rule out the file reading/parsing logic as culprit, here's same thing, but with
dictionary insertion removed.
$ cat nocache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
id,name = line.strip().split(':',1)
id = long(id)
$ time ./nocache-id2name.py
real 0m33.518s
user 0m33.070s
sys 0m0.415s
Here's a Perl implementation running very fast.
$ cat cache-id2name.pl
#!/usr/bin/perl
my %id2name = ();
my $line;
my $id;
my $name;
open(F,"<id2name.txt");
foreach $line (<F>) {
chomp($line);
($id,$name) = split(/:/,$line,1);
$id = int($id);
$id2name{$id} = $name;
}
$ time ./cache-id2name.pl
real 0m46.363s
user 0m43.730s
sys 0m2.611s
So, you think the Python's dict implementation degrades towards O(N)
performance when it's fed millions of 64-bit pseudo-random longs?
.
- Follow-Ups:
- Re: Populating a dictionary, fast
- From: Paul Rubin
- Re: Populating a dictionary, fast
- From: Steven D'Aprano
- Re: Populating a dictionary, fast
- Prev by Date: Re: Valgrind and Python
- Next by Date: python - an eggs...
- Previous by thread: Re: Populating a dictionary, fast
- Next by thread: Re: Populating a dictionary, fast
- Index(es):
Relevant Pages
|