Re: Writing huge Sets() to disk

From: Martin MOKREJŠ (mmokrejs_at_ribosome.natur.cuni.cz)
Date: 01/17/05


Date: Mon, 17 Jan 2005 13:32:27 +0100
To: duncan.booth@suttoncourtenay.org.uk

Duncan Booth wrote:
> Martin MOKREJŠ wrote:
>
>
>>Hi,
>> could someone tell me which operations copy references in Python
>>and which do not? I have found that my script, after reaching some
>>state and taking, say, 600MB, pushes its internal dictionaries
>>to hard disk. The for loop consumes another 300MB (as reported
>>by vmstat) to push the data to the dictionaries, then releases
>>a little less than 300MB, and the program starts to fill up
>>its internal dictionaries again; when they are "full" it will do
>>the flush again ...
>
>
> Almost anything you do copies references.

But what about this?

x = 'xxxxx'
a = x[2:]
b = z + len(x)
dict[a] = b
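
As a rough illustration of the question, the sketch below annotates each line with what is created and what is merely referenced. It assumes Python 3 and CPython behaviour; the value of z is hypothetical (the snippet above never defines it), and the dictionary is named d here so that the builtin dict is not shadowed.

```python
x = 'xxxxx'
z = 1000            # hypothetical value; z is not defined in the original snippet

a = x[2:]           # slicing builds a new str object holding 'xxx'
b = z + len(x)      # the addition builds a new int object
d = {}              # stand-in for the dictionary called "dict" above
d[a] = b            # the dict stores references to a and b, it copies neither

alias = x           # plain assignment only binds another name to the same object
stored_is_b = d[a] is b   # the lookup hands back the very object that was stored
```

So the slice and the sum allocate new objects, while the assignment into the dictionary only adds references to them.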

>> The point here is that this code takes a lot of extra memory.
>>I believe it's the references problem, and I remember complaints
>>from friends facing the same problem. I'm a newbie, yes, but I don't
>>have this problem with Perl. OK, I want to improve my Python
>>knowledge ... :-))
>>
>>
>>
>>
>
> <long code extract snipped>
>
>>
>> The above routine doesn't release the memory back when it
>>exits.
>
> That's probably because there isn't any memory it can reasonably be
> expected to release. What memory would *you* expect it to release?

Those 300MB. They get allocated/reserved when the posted loop gets
executed. When the loop exits, almost all of it is returned/deallocated.
Yes, almost. :(

>
> The member variables are all still accessible as member variables until you
> run your loop at the end to clear them, so no way could Python release
> them.

OK, I wanted to know whether some assignment there uses a reference
that prevents the internal garbage collector from recycling the memory
because, for example, the target dictionary still keeps a reference to the
temporary dictionary.
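
A minimal sketch of that situation, assuming Python 3 on CPython: the Record class and the dictionary names are hypothetical stand-ins, and a weak reference is used only to observe whether the object is still alive.

```python
import gc
import weakref

class Record:
    """Hypothetical stand-in for a value held in the temporary dictionary."""
    def __init__(self, count):
        self.count = count

tmpdict = {'abc': Record(3)}
target = {}
probe = weakref.ref(tmpdict['abc'])   # observes the object without keeping it alive

target['abc'] = tmpdict['abc']        # copies the reference, not the Record
tmpdict.clear()                       # drops only the temporary dict's reference
gc.collect()
alive_after_tmp_clear = probe() is not None      # target still refers to it

target.clear()                        # the last reference goes away
gc.collect()
alive_after_target_clear = probe() is not None   # now the object can be reclaimed
```

If the values are converted to strings before being written to a disk-backed store, the store holds its own copy, so this kind of sharing would only matter between in-memory containers.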

>
> Some hints:
>
> When posting code, try to post complete examples which actually work. I
> don't know what type the self._dict_on_diskXX variables are supposed to be.
> It makes a big difference if they are dictionaries (so you are trying to
> hold everything in memory at one time) or shelve.Shelf objects which would
> store the values on disc in a reasonably efficient manner.

The self._dict_on_diskXX are bsddb files, self._tmpdictXX are builtin dictionaries.

>
> Even if they are Shelf objects, I see no reason here why you have to

I gathered from previous discussion it's faster to use bsddb directly,
so no shelve.

> process everything at once. Write a simple function which processes one
> tmpdict object into one dict_on_disk object and then closes the

That's what I do, but in the for loop ...

> dict_on_disk object. If you want to compare results later then do that by

OK, I got your point.

> reopening the dict_on_disk objects when you have deleted all the tmpdicts.

That's what I do (not shown).

>
> Extract out everything you want to do into a class which has at most one
> tmpdict and one dict_on_disk. That way your code will be a lot easier to
> read.
>
> Make your code more legible by using fewer underscores.
>
> What on earth is the point of an explicit call to __add__? If Guido had
> meant us to use __add__ he wouldn't have created '+'.

To invoke addition directly on the object. It's faster than letting
Python figure out that I am summing int() plus int(). It definitely
helped a lot when using Decimal(a) + Decimal(b), where I got rid
of thousands of calls to Decimal's __new__, __init__ and, I think, a few
other methods of decimal as well - I think getcontext too.
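
Whether the explicit call really is faster can be measured rather than guessed. The sketch below assumes Python 3 and uses hypothetical operands; it makes no claim about which spelling wins, since that depends on the interpreter version and the machine.

```python
import timeit

a, b = 12345, 67890   # hypothetical operands

# Both spellings give the same result for two ints.
same = (a + b) == a.__add__(b)

# Time each spelling over the same number of iterations.
names = {'a': a, 'b': b}
t_operator = timeit.timeit('a + b', globals=names, number=100000)
t_method = timeit.timeit('a.__add__(b)', globals=names, number=100000)

print('a + b        : %.4f s' % t_operator)
print('a.__add__(b) : %.4f s' % t_method)
```

Note that for mixed operand types a direct __add__ call may return NotImplemented, where '+' would go on to try the other operand's __radd__.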

> What is the purpose of dict_on_disk? Is it for humans to read the data? If
> not, then don't store everything as a string. Far better to just store a

The output for humans is produced later.

> tuple of your values then you don't have to use split or cast the strings

bsddb complains that I can store only a string as a key or value.
I'd love to store a tuple.

>>> import bsddb
>>> _words1 = bsddb.btopen('db1.db', 'c')
>>> _words1['a'] = 1

Traceback (most recent call last):
 File "<stdin>", line 1, in ?
 File "/usr/lib/python2.3/bsddb/__init__.py", line 120, in __setitem__
   self.db[key] = value
TypeError: Key and Data values must be of type string or None.

>>>

How can I record a number then?
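
One possible answer, sketched under assumptions: convert the numbers to a string before storing and back after reading. The helper names and the four-integer record layout below are hypothetical. The text form stays human-readable; the struct form is fixed-size and compact (struct.pack returns a str in Python 2 and bytes in Python 3).

```python
import struct

# Hypothetical record layout: four signed 64-bit integers, little-endian.
RECORD = struct.Struct('<4q')

def to_text(values):
    """Encode a tuple of ints as a space-separated string."""
    return ' '.join(str(v) for v in values)

def from_text(text):
    """Decode a space-separated string back into a tuple of ints."""
    return tuple(int(part) for part in text.split())

def to_binary(values):
    """Encode four ints as a fixed-size binary record."""
    return RECORD.pack(*values)

def from_binary(raw):
    """Decode a fixed-size binary record back into a tuple of ints."""
    return RECORD.unpack(raw)

values = (7, 0, 0, 0)
text_form = to_text(values)
binary_form = to_binary(values)
```

With either pair of helpers, the store would hold the encoded form and the program would decode it after each lookup.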

> to integers. If you do want humans to read some final output then produce
> that separately from the working data files.
>
> You split out 4 values from dict_on_disk and set three of them to 0. Is
> that really what you meant or should you be preserving the previous values?

No, overwrite them, i.e. invalidate them. Originally I recorded only the first,
but computing the latter numbers is so expensive that I have to store them.
As walking through the dictionaries is so slow, I gave up on the idea of
storing just one and, much later in the program, walking through the
dictionary once again to 'fix' it by computing the missing values.

>
> Here is some (untested) code which might help you:
>
> import shelve

Why shelve? To be able to record a tuple? Isn't it cheaper
to convert to a string and back and write to bsddb, compared to this overhead?

>
> def push_to_disc(data, filename):
>     database = shelve.open(filename)
>     try:
>         for key in data:
>             if database.has_key(key):
>                 count, a, b, expected = database[key]
>                 database[key] = count+data[key], a, b, expected
>             else:
>                 database[key] = data[key], 0, 0, 0
>     finally:
>         database.close()
>
>     data.clear()
>
> Call that once for each input dictionary and your data will be written out
> to a disc file and the internal dictionary cleared without any great spike
> of memory use.
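
For readers on a current interpreter, here is a hedged Python 3 rendering of the quoted function together with a small usage run. The temporary directory, the file name and the sample counts are hypothetical; the merge logic follows the quoted code.

```python
import os
import shelve
import tempfile

def push_to_disc(data, filename):
    """Merge the counts in `data` into the shelf at `filename`, then empty `data`."""
    with shelve.open(filename) as database:
        for key, count in data.items():
            if key in database:
                old_count, a, b, expected = database[key]
                database[key] = (old_count + count, a, b, expected)
            else:
                database[key] = (count, 0, 0, 0)
    data.clear()

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'words')     # hypothetical file name

push_to_disc({'abc': 2, 'xyz': 1}, path)  # first batch creates the entries
batch = {'abc': 5}
push_to_disc(batch, path)                 # second batch updates 'abc'

with shelve.open(path) as database:
    merged = dict(database)               # read everything back for inspection
```

Each call touches only the keys present in its batch, and the in-memory dictionary is empty again afterwards.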

Can I use the mmap() feature on bsddb or any .db file? Most of the time I do
updates, not inserts! I don't want to rewrite a 300MB file all the time.
I want to update it in place. What do I need for that? To know the maximal
length of a string value kept in the .db file? Can I get rid of locking
support in those huge files?
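
To make the fixed-length idea concrete, here is a sketch of in-place updates through mmap. It assumes Python 3 and works on a plain flat file with a hypothetical record layout and file name; it is not a bsddb file, whose internal format is not meant to be edited this way.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct('<4q')   # hypothetical layout: count, a, b, expected

path = os.path.join(tempfile.mkdtemp(), 'records.bin')   # hypothetical file

# Create a small file holding three fixed-size records.
with open(path, 'wb') as f:
    for i in range(3):
        f.write(RECORD.pack(i, 0, 0, 0))

# Update record number 1 in place; the rest of the file is not rewritten.
with open(path, 'r+b') as f:
    with mmap.mmap(f.fileno(), 0) as view:
        offset = 1 * RECORD.size
        count, a, b, expected = RECORD.unpack_from(view, offset)
        RECORD.pack_into(view, offset, count + 10, a, b, expected)
        view.flush()

# Read the file back to check the result.
with open(path, 'rb') as f:
    records = [RECORD.unpack(chunk)
               for chunk in iter(lambda: f.read(RECORD.size), b'')]
```

Because every record has the same size, the offset of record n is simply n times the record size, which is why a known maximal length matters for this approach.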

Definitely I can improve my algorithm. But I believe I'll always have to work
with those huge files.
Martin


