Re: Newbie completely confused



Jeroen Hegeman wrote:
Thanks for the comments,

(First, I had to add timing code to ReadClasses: the code you posted
doesn't include it, and only shows timings for ReadLines.)

Your program uses quite a bit of memory. I guess it gets harder and
harder to allocate the required amounts of memory.

Well, I guess there could be something in that, but why is there a significant increase after the first time? And after that, single-trip time pretty much flattens out. No more obvious increases.

Sorry, I have no idea.

If I change this line in ReadClasses:

built_classes[len(built_classes)] = HugeClass(long_line)

to

dummy = HugeClass(long_line)

then both times the files are read and your data structures are built,
but after each run the data structure is freed. The result is that both
runs are equally fast.

Isn't the 'del LINES' supposed to achieve the same thing? And really, reading 30MB files should not be such a problem, right? (I'm also running with 1GB of RAM.)

'del LINES' deletes the lines that are read from the file, but not all of your data structures that you created out of them.
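
To illustrate the point about 'del LINES', here is a minimal sketch in Python 3 syntax (the code elsewhere in this thread is Python 2). HugeClass below is only a hypothetical stand-in for the class in the posted program, and the names lines, built_classes and probe are made up for the example. It uses a weak reference to check whether an object built from the lines is still alive after the lines themselves are deleted.

```python
import gc
import weakref


class HugeClass(object):
    """Hypothetical stand-in: keeps some data derived from one line."""

    def __init__(self, line):
        self.fields = line.split()


lines = ["a b c", "d e f"]

# Build objects from the lines and keep them in a dict, as ReadClasses does.
built_classes = {}
for line in lines:
    built_classes[len(built_classes)] = HugeClass(line)

# A weak reference lets us observe the object without keeping it alive.
probe = weakref.ref(built_classes[0])

# Deleting the list of lines does not touch the objects stored in the dict.
del lines
gc.collect()
alive_after_del_lines = probe() is not None

# Only dropping the dict entries releases the objects themselves.
built_classes.clear()
gc.collect()
alive_after_clear = probe() is not None

print(alive_after_del_lines, alive_after_clear)
```

The object survives 'del lines' because the dict still refers to it; it goes away only once the dict entry is removed.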
Now, indeed, reading 30 MB files should not be a problem, and I am confident that just reading the data is not the problem. To make sure, I created a simple test:

import time

input_files = ["./test_file0.txt", "./test_file1.txt"]

total_start = time.time()
data = {}
for input_fn in input_files:
    file_start = time.time()
    f = file(input_fn, 'r')
    data[input_fn] = f.read()
    f.close()
    file_done = time.time()
    # Note: len(data) is the number of files read so far, not a byte count;
    # len(data[input_fn]) would give the size of the current file.
    print '%s: %f to read %d bytes' % (input_fn, file_done - file_start, len(data))
total_done = time.time()
print 'all done in %f' % (total_done - total_start)


When I run that with test_file0.txt and test_file1.txt as you described (each 30 MB), I get this output:

../test_file0.txt: 0.260000 to read 1 bytes
../test_file1.txt: 0.251000 to read 2 bytes
all done in 0.521000

Therefore I think the problem is not in reading the data, but in processing it and creating the data structures.
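
One way to check this on the real program would be to time the two phases separately. The following is only a sketch in Python 3 syntax, under the assumption that each line is turned into one object; Record and time_phases are hypothetical names, not part of the posted code.

```python
import time


class Record(object):
    """Hypothetical stand-in for HugeClass: splits one line into fields."""

    def __init__(self, line):
        self.fields = line.split()


def time_phases(path):
    """Return (read_seconds, build_seconds, records) for one input file."""
    t0 = time.perf_counter()
    with open(path, "r") as f:
        lines = f.readlines()          # phase 1: plain file I/O
    t1 = time.perf_counter()

    records = {}
    for line in lines:                 # phase 2: building the data structures
        records[len(records)] = Record(line)
    t2 = time.perf_counter()

    return t1 - t0, t2 - t1, records
```

Calling time_phases once per input file and printing the two durations side by side should show which phase slows down on the second file.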

You read the files, but don't use the contents; instead you use
long_line over and over. I suppose you do that because this is a test,
not your actual code?

Yeah ;-) (Do I notice a lack of trust in the responses I get? Should I not mention 'newbie'?)

I didn't mean to attack you; it's just that the program reads 30 MB of data, twice, but doesn't do anything with it. It only uses the data that was stored in long_line, which is never replaced. That is very strange for real code, but as a test it can have its uses. That's why I asked.

Let's get a couple of things out of the way:
- I do know about meaningful variable names and case-conventions, but ... First of all I also have to live with inherited code (I don't like people shouting in their code either), and secondly (all the itemx) most of these members normally _have_ descriptive names but I'm not supposed to copy-paste the original code to any newsgroups.

Ok.

- I also know that a plain 'return' in python does not do anything but I happen to like them. Same holds for the sys.exit() call.

Ok.

- The __init__ methods normally actually do something: they initialise some member variables to meaningful values (by calling the clear() method, actually).
- The __clear__ method normally brings objects back into a well-defined 'empty' state.
- The __del__ methods are actually needed in this case (well, in the _real_ code anyway). The python code loads a module written in C++ and some of the member variables actually point to C++ objects created dynamically, so one actually has to call their destructors before unbinding the python var.

That sounds a bit weird to me; I would think such explicit memory management belongs in the C++ code instead of in the Python code, but I must admit that I know next to nothing about extending Python so I assume you are right.

All right, thanks for the tips. I guess the issue itself is still open, though.

I'm afraid so. Sorry I can't help.

One thing that helped me in the past to speed up input is using memory mapped I/O instead of stream I/O. But that was in C++ on Windows; I don't know if the same applies to Python on Linux.
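
For reference, Python's standard library does provide memory-mapped files through the mmap module. The sketch below (Python 3 syntax) only shows what using it looks like; count_lines_mmap is a hypothetical helper, and whether this is any faster than ordinary reads for the program in this thread is untested. Note that mapping an empty file raises ValueError.

```python
import mmap


def count_lines_mmap(path):
    """Count newline characters in a file via a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this fails on an empty file.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count
        finally:
            mm.close()
```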

--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven