Re: Populating a dictionary, fast



That's an awfully complicated way to iterate over a file. Try this
instead:


id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

This takes about 45 *minutes*

On my system, it takes about a minute and a half to produce a
dictionary with 8191180 entries.

Doing something similar on my system is very fast as well.

$ cat dict-8191180.py
#!/usr/bin/python

v = {}
for i in xrange(8191180):
    v[i] = i

$ time ./dict-8191180.py

real    0m5.877s
user    0m4.953s
sys     0m0.924s

But...

If I comment out the last line in the loop body it takes only about
30 _seconds_ to run. This would seem to implicate the line
id2name[id] = name as being excruciatingly slow.

No, dictionary access is one of the most highly-optimized, fastest,
most efficient parts of Python. What it indicates to me is that your
system is running low on memory, and is struggling to find room for
517MB worth of data.


If only it were so easy.




$ free
             total       used       free     shared    buffers     cached
Mem:       7390244    2103448    5286796          0      38996    1982756
-/+ buffers/cache:      81696    7308548
Swap:      2096472      10280    2086192
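
The free output is system-wide. To see how much memory the Python process itself peaks at, something like the following could be run. This is only a sketch: it assumes Python 3 on a Unix-like system (the resource module is not available on Windows), and peak_rss_kb and the one-million-entry test dictionary are illustrative names and sizes of mine, not part of the scripts in this thread.

```python
import resource
import sys

def peak_rss_kb():
    """Peak resident set size of this process, in kilobytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    return peak // 1024 if sys.platform == 'darwin' else peak

if __name__ == '__main__':
    before = peak_rss_kb()
    # Stand-in data; the real test would load id2name.txt instead.
    table = dict((i, str(i)) for i in range(1000000))
    after = peak_rss_kb()
    print('peak RSS grew by about %d kB for %d entries'
          % (after - before, len(table)))
```

Comparing that figure against the "free" column gives a rough idea of whether the process is anywhere near exhausting physical memory.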





Here's your Python implementation running as badly as mine did.



$ wc -l id2name.txt
8191180 id2name.txt

$ cat cache-id2name.py
#!/usr/bin/python

id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)
    id2name[id] = name

$ time ./cache-id2name.py
^C

I let it go 30 minutes before killing it since I had to leave. Here it is in top before I did the deed.


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18802 root      25   0 1301m 1.2g 1148 R 99.9 17.1  36:05.99 cache-id2name.p



36 minutes, 05.99 seconds.



To rule out the file reading/parsing logic as the culprit, here's the
same thing, but with the dictionary insertion removed.



$ cat nocache-id2name.py
#!/usr/bin/python

id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)

$ time ./nocache-id2name.py

real    0m33.518s
user    0m33.070s
sys     0m0.415s
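
The two costs can also be separated within a single run by timing the parsing and the insertion as distinct phases. A minimal sketch, assuming Python 3 (so int stands in for long) and the same id:name line format; time_phases and the generated sample lines are hypothetical, not taken from either script above.

```python
import time

def time_phases(lines):
    """Parse 'id:name' lines, then insert them, timing each phase separately."""
    start = time.perf_counter()
    pairs = []
    for line in lines:
        key, name = line.strip().split(':', 1)
        pairs.append((int(key), name))
    parsed = time.perf_counter()

    # Insertion only: no file I/O or string splitting inside this loop.
    id2name = {}
    for key, name in pairs:
        id2name[key] = name
    done = time.perf_counter()
    return id2name, parsed - start, done - parsed

if __name__ == '__main__':
    # Stand-in input; pass open('id2name.txt') to time the real file.
    sample = ['%d:name%d\n' % (i, i) for i in range(100000)]
    table, parse_s, insert_s = time_phases(sample)
    print('parse %.3fs, insert %.3fs, %d entries'
          % (parse_s, insert_s, len(table)))
```

Note that holding every parsed pair in a list before inserting roughly doubles the memory in use, so on the full file this measures a slightly different workload than the original loop.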





Here's a Perl implementation running very fast.



$ cat cache-id2name.pl
#!/usr/bin/perl

my %id2name = ();
my $line;
my $id;
my $name;
open(F,"<id2name.txt");

foreach $line (<F>) {
    chomp($line);
    ($id,$name) = split(/:/,$line,1);
    $id = int($id);
    $id2name{$id} = $name;
}

$ time ./cache-id2name.pl

real    0m46.363s
user    0m43.730s
sys     0m2.611s





So, you think Python's dict implementation degrades towards O(N)
performance when it's fed millions of 64-bit pseudo-random longs?
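
One way to put that question to the test is to insert random 64-bit keys in fixed-size batches and watch whether the time per batch climbs as the dictionary grows. This is a sketch only, assuming Python 3; batch_insert_times, the seed, and the batch sizes are arbitrary choices of mine and the keys are generated rather than read from id2name.txt.

```python
import random
import time

def batch_insert_times(total, batch, seed=0):
    """Insert `total` random 64-bit integer keys; return seconds spent per batch."""
    rng = random.Random(seed)
    table = {}
    times = []
    for _ in range(total // batch):
        # Generate the keys outside the timed region.
        keys = [rng.getrandbits(64) for _ in range(batch)]
        start = time.perf_counter()
        for key in keys:
            table[key] = key
        times.append(time.perf_counter() - start)
    return times, table

if __name__ == '__main__':
    times, table = batch_insert_times(1000000, 100000)
    for n, seconds in enumerate(times, 1):
        print('batch %2d: %.3fs' % (n, seconds))
```

If insertion stays amortized O(1), the per-batch times should be roughly flat apart from occasional spikes when the table resizes. A steady climb from batch to batch would point at something that gets worse with size.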



