Re: Writing huge Sets() to disk

From: Steve Holden (steve_at_holdenweb.com)
Date: 01/17/05


Date: Mon, 17 Jan 2005 07:20:00 -0500

Martin MOKREJŠ wrote:

> Hi,
> could someone tell me what all does and what all doesn't copy
> references in python. I have found my script after reaching some
> state and taking say 600MB, pushes its internal dictionaries
> to hard disk. The for loop consumes another 300MB (as gathered
> by vmstat) to push the data to dictionaries, then releases
> little bit less than 300MB and the program start to fill-up
> again its internal dictionaries, when "full" will do the
> flush again ...
>
> The point here is, that this code takes a lot of extra memory.
> I believe it's the references problem, and I remember complaints
> of friends facing same problem. I'm a newbie, yes, but don't
> have this problem with Perl. OK, I want to improve my Python
> knowledge ... :-))
>
Right ho! In fact I suspect you are still quite new to programming as a
whole, for reasons that may become clear as we proceed.
>
>
>
> def push_to_disk(self):
>     _dict_on_disk_tuple = (None, self._dict_on_disk1,
>         self._dict_on_disk2, self._dict_on_disk3, self._dict_on_disk4,
>         self._dict_on_disk5, self._dict_on_disk6, self._dict_on_disk7,
>         self._dict_on_disk8, self._dict_on_disk9, self._dict_on_disk10,
>         self._dict_on_disk11, self._dict_on_disk12, self._dict_on_disk13,
>         self._dict_on_disk14, self._dict_on_disk15, self._dict_on_disk16,
>         self._dict_on_disk17, self._dict_on_disk18, self._dict_on_disk19,
>         self._dict_on_disk20)

It's a bit unfortunate that all those instance variables are global to
the method, as it means we can't clearly see what you intend them to do.
However ...

Whenever I see such code, it makes me suspect that the approach to the
problem could be more subtle. It appears you have decided to partition
your data into twenty chunks somehow. The algorithm is clearly not coded
in a way that would make it easy to modify the number of chunks.

[Hint: by "easy" I mean modifying a statement that reads

     chunks = 20

to read

     chunks = 40

for example]. To avoid this, we might use (say) a list of temp dicts
(the length of which could then easily be parameterized as
mentioned). So where (my psychic powers tell me) your __init__() method
currently contains

     self._dict_on_disk1 = something()
     self._dict_on_disk2 = something()
         ...
     self._dict_on_disk20 = something()

I would have written

     self._disk_dicts = []
     for i in range(20):
         self._disk_dicts.append(something())

Then again, I probably have an advantage over you. I'm such a crappy
typist I can guarantee I'd make at least six mistakes doing it your way :-)
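
A minimal sketch of the parameterized version follows. The class name
WordCounter and the make_store argument are hypothetical (the real class
and whatever something() does are not shown in this thread), and plain
dicts stand in for the on-disk stores so the sketch runs on its own.

```python
class WordCounter:
    """Hypothetical container; the real class is not shown in the thread."""

    def __init__(self, chunks=20, make_store=dict):
        # One store per chunk, built in a loop instead of twenty attributes.
        # make_store stands in for something(); dict is only a placeholder.
        self._disk_dicts = [make_store() for _ in range(chunks)]
        self._tmp_dicts = [{} for _ in range(chunks)]


counter = WordCounter(chunks=40)
print(len(counter._disk_dicts))
```

Changing the number of chunks is then a matter of changing one argument.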

>     _size = 0

What with all these leading underscores I presume it must be VERY
important to keep this object's instance variables private. Do you have
a particular reason for that, or just general Perl-induced paranoia? :-)

>     #
>     # sizes of these tmpdicts range from 10-10000 entries for each!
>     for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
>             self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
>             self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
>             self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
>             self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
>             self._tmpdict20):
>         _size += 1
>         if _tmpdict:
>             _dict_on_disk = _dict_on_disk_tuple[_size]
>             for _word, _value in _tmpdict.iteritems():
>                 try:
>                     _string = _dict_on_disk[_word]
>                     # I discard _a and _b, maybe _string.find(' ')
>                     # combined with slice would do better?
>                     _abs_count, _a, _b, _expected_freq = _string.split()
>                     _abs_count = int(_abs_count).__add__(_value)
>                     _t = (str(_abs_count), '0', '0', '0')
>                 except KeyError:
>                     _t = (str(_value), '0', '0', '0')
>
>                 # this writes a copy to the dict, right?
>                 _dict_on_disk[_word] = ' '.join(_t)
>
>     #
>     # clear the temporary dictionaries in ourself
>     # I think this works as expected and really does release memory
>     #
>     for _tmpdict in (self._tmpdict1, self._tmpdict2, self._tmpdict3,
>             self._tmpdict4, self._tmpdict5, self._tmpdict6, self._tmpdict7,
>             self._tmpdict8, self._tmpdict9, self._tmpdict10, self._tmpdict11,
>             self._tmpdict12, self._tmpdict13, self._tmpdict14, self._tmpdict15,
>             self._tmpdict16, self._tmpdict17, self._tmpdict18, self._tmpdict19,
>             self._tmpdict20):
>         _tmpdict.clear()
>
There you go again with that huge tuple. You just like typing, don't
you? You already wrote that one out just above. Couldn't you have
assigned it to a local variable?

By the way, remind me again of the reason for the leading None in the
_dict_on_disk_tuple, would you?
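
With the dicts held in two lists, the whole routine can pair them up
directly, which removes both the repeated tuple and the need for a
placeholder at index zero. The sketch below is only an illustration:
the function signature and the two list arguments are assumptions,
plain dicts stand in for the on-disk stores, and it is written for
Python 3, so items() replaces iteritems().

```python
def push_to_disk(disk_dicts, tmp_dicts):
    """Merge each temporary dict into its paired on-disk dict, then empty it.

    Both arguments are hypothetical lists of equal length. Any mapping
    will do for the disk side; plain dicts are used in the demo below.
    """
    for disk_dict, tmp_dict in zip(disk_dicts, tmp_dicts):
        for word, value in tmp_dict.items():
            try:
                # Stored format, as in the quoted code: "count 0 0 0".
                abs_count = int(disk_dict[word].split()[0]) + value
            except KeyError:
                abs_count = value
            disk_dict[word] = ' '.join((str(abs_count), '0', '0', '0'))
        # Empty the temporary dict once its contents have been merged.
        tmp_dict.clear()


disk = [{'spam': '3 0 0 0'}, {}]
tmp = [{'spam': 2, 'eggs': 1}, {'ham': 5}]
push_to_disk(disk, tmp)
print(disk)
```

Taking only the first field of the split string also sidesteps the
unused _a and _b variables from the original.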

The crucial misunderstanding here might be the meaning of "release
memory". While clearing the dictionary will indeed remove references to
the objects formerly contained therein, and thus (possibly) render those
items subject to garbage collection, that *won't* make the working set
(i.e. virtual memory pages allocated to your process's data storage) any
smaller. The garbage collector doesn't return memory to the operating
system; it merely aggregates it for use in storing new Python objects.
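
The first half of that distinction can be observed from inside Python.
The sketch below assumes CPython and uses a hypothetical Payload class
as the stored value; a weak reference watches the object without
keeping it alive, so it goes dead once the dict lets go.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a value stored in one of the temp dicts."""


cache = {'word': Payload()}
probe = weakref.ref(cache['word'])   # watches the object without owning it

assert probe() is not None           # still referenced by the dict
cache.clear()                        # drops the dict's reference
gc.collect()                         # not needed in CPython here, but harmless
print(probe() is None)
```

This shows only that the object became unreachable and was reclaimed by
the interpreter. Whether the process size reported by vmstat or top
shrinks afterwards is a separate matter that the snippet does not
measure.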
>
>
>
> The above routine doesn't release the memory back when it
> exits.
>
And your evidence for this assertion is ...?
>
> See, the loop takes 25 minutes already, and it's prolonging
> as the program is in about 1/3 or 1/4 of the total input.
> The rest of my code is fast in contrast to this (below 1 minute).
>
> -rw------- 1 mmokrejs users 257376256 Jan 17 11:38 diskdict12.db
> -rw------- 1 mmokrejs users 267157504 Jan 17 11:35 diskdict11.db
> -rw------- 1 mmokrejs users 266534912 Jan 17 11:28 diskdict10.db
> -rw------- 1 mmokrejs users 253149184 Jan 17 11:21 diskdict9.db
> -rw------- 1 mmokrejs users 250232832 Jan 17 11:14 diskdict8.db
> -rw------- 1 mmokrejs users 246349824 Jan 17 11:07 diskdict7.db
> -rw------- 1 mmokrejs users 199999488 Jan 17 11:02 diskdict6.db
> -rw------- 1 mmokrejs users 66584576 Jan 17 10:59 diskdict5.db
> -rw------- 1 mmokrejs users 5750784 Jan 17 10:57 diskdict4.db
> -rw------- 1 mmokrejs users 311296 Jan 17 10:57 diskdict3.db
> -rw------- 1 mmokrejs users 295895040 Jan 17 10:56 diskdict20.db
> -rw------- 1 mmokrejs users 293634048 Jan 17 10:49 diskdict19.db
> -rw------- 1 mmokrejs users 299892736 Jan 17 10:43 diskdict18.db
> -rw------- 1 mmokrejs users 272334848 Jan 17 10:36 diskdict17.db
> -rw------- 1 mmokrejs users 274825216 Jan 17 10:30 diskdict16.db
> -rw------- 1 mmokrejs users 273104896 Jan 17 10:23 diskdict15.db
> -rw------- 1 mmokrejs users 272678912 Jan 17 10:18 diskdict14.db
> -rw------- 1 mmokrejs users 260407296 Jan 17 10:13 diskdict13.db
>
> Some spoke about mmaped files. Could I take advantage of that
> with bsddb module or bsddb?
>
No.

> Is gdbm better in some ways? Recently you have said dictionary
> operations are fast ... Once more. I want to turn off locking support.
> I can make the values as strings of fixed size, if mmap() would be
> available. The number of keys doesn't grow much in time, mostly
> there are only updates.
>
Also (possibly because I come late to this thread) I don't really
understand your caching strategy. I presume at some stage you look in
one of the twenty temp dicts, and if you don't find something you read
it back in from disk?

This whole thing seems a little disorganized. Perhaps if you started
with a small dataset your testing and development work would proceed
more quickly, and you'd be less intimidated by the clear need to
refactor your code.

regards
  Steve

-- 
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119

