Re: Newbie - converting csv files to arrays in NumPy - Matlab vs. Numpy comparison



Thank you so much. Your solution works! I greatly appreciate your
help.




sturlamolden wrote:
oyekomova wrote:

Thanks for your note. I have 1 GB of RAM. Also, Matlab has no problem
reading the file into memory. I am just running Istvan's code that
was posted earlier.

You have a CSV file of about 520 MiB, which is read into memory. Then
you have a list of lists of floats, created by a list comprehension, which
is larger than 274 MiB. Additionally, you try to allocate a NumPy array
slightly larger than 274 MiB. Your process is now already exceeding 1
GiB, and you are probably running other processes too. That is why you
run out of memory.
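
The 274 MiB figure appears to correspond to 6 million rows of 6 double-precision values, the same shape as the test file generated in the code further down. Here is a rough sketch of that arithmetic. The helper names (array_mib, list_of_lists_mib) are made up for illustration, and the list estimate is only a CPython-specific lower bound, not an exact measurement:

```python
import struct
import sys

MIB = 2.0 ** 20

def array_mib(rows, cols, itemsize=8):
    """Size in MiB of a dense array of 8-byte floats."""
    return rows * cols * itemsize / MIB

def list_of_lists_mib(rows, cols):
    """Rough lower bound in MiB for a list of lists of floats (CPython)."""
    pointer = struct.calcsize("P")      # one pointer per list slot
    float_obj = sys.getsizeof(1.0)      # every float is a separate object
    per_row = sys.getsizeof([]) + cols * (pointer + float_obj)
    return rows * (pointer + per_row) / MIB

if __name__ == "__main__":
    # Assumed shape: 6 million rows, 6 columns.
    print("NumPy array:    %.1f MiB" % array_mib(6000000, 6))   # about 274.7 MiB
    print("list of lists: >%.1f MiB" % list_of_lists_mib(6000000, 6))
```

The list of lists comes out larger because each float is a full Python object with its own header, and each row is a separate list object on top of that.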

So you have three options:

1. Buy more RAM.

2. Write a low-level CSV reader in C.

3. Read the data in chunks. That would mean something like this:


import time, csv, random
import numpy

def make_data(rows=6E6, cols=6):
    # Write a test file of random floats, one comma-separated row per line.
    fp = open('data.txt', 'wt')
    counter = range(cols)
    for row in xrange( int(rows) ):
        vals = map(str, [ random.random() for x in counter ] )
        fp.write( '%s\n' % ','.join( vals ) )
    fp.close()

def read_test():
    start = time.clock()
    arrlist = None   # linked list of chunks: [newest_array, older_chunks]
    r = 0            # total number of rows read so far
    CHUNK_SIZE_HINT = 4096 * 4 # seems to be good
    fid = file('data.txt')
    while 1:
        # Read roughly CHUNK_SIZE_HINT bytes' worth of complete lines.
        chunk = fid.readlines(CHUNK_SIZE_HINT)
        if not chunk: break
        reader = csv.reader(chunk)
        data = [ map(float, row) for row in reader ]
        arrlist = [ numpy.array(data,dtype=float), arrlist ]
        r += arrlist[0].shape[0]
        # Release the temporaries before reading the next chunk.
        del data
        del reader
        del chunk
    fid.close()
    print 'Created list of chunks, elapsed time so far: ', time.clock() - start
    print 'Joining list...'
    data = numpy.empty((r,arrlist[0].shape[1]),dtype=float)
    r1 = r
    # The newest chunk is at the head, so fill the array from the bottom up.
    while arrlist:
        r0 = r1 - arrlist[0].shape[0]
        data[r0:r1,:] = arrlist[0]
        r1 = r0
        # Drop the copied chunk, then step to the rest of the linked list.
        del arrlist[0]
        arrlist = arrlist[0]
    print 'Elapsed time:', time.clock() - start

make_data()
read_test()

This can process a CSV file of 6 million rows in about 150 seconds on
my laptop. A CSV file of 1 million rows takes about 25 seconds.

Just reading the 6 million row CSV file (using fid.readlines()) takes
about 40 seconds on my laptop. Python lists are not particularly
efficient. You can probably reduce the time to ~60 seconds by writing a
new CSV reader for NumPy arrays in a C extension.
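
The code above uses Python 2 constructs (xrange, the print statement, file(), time.clock()). For anyone on Python 3, the same chunked idea could be sketched roughly as follows. This is not the code that was timed above. The function name read_csv_in_chunks and its parameters are made up for illustration, and it assumes every field in the file parses as a float:

```python
import csv

import numpy as np

def read_csv_in_chunks(path, chunk_size_hint=4096 * 4):
    """Read an all-numeric CSV file into one 2-D float array, chunk by chunk."""
    chunks = []
    with open(path, newline="") as fid:
        while True:
            # readlines(hint) returns complete lines totalling roughly `hint` bytes.
            lines = fid.readlines(chunk_size_hint)
            if not lines:
                break
            rows = [[float(x) for x in row] for row in csv.reader(lines) if row]
            if rows:
                chunks.append(np.array(rows, dtype=float))

    if not chunks:
        return np.empty((0, 0), dtype=float)

    # Allocate the final array once, then copy each chunk into place.
    total_rows = sum(c.shape[0] for c in chunks)
    data = np.empty((total_rows, chunks[0].shape[1]), dtype=float)
    r0 = 0
    for i, c in enumerate(chunks):
        r1 = r0 + c.shape[0]
        data[r0:r1, :] = c
        r0 = r1
        chunks[i] = None  # release each chunk as soon as it has been copied
    return data
```

Calling read_csv_in_chunks('data.txt') would return the whole file as one array. Only one chunk at a time ever exists as a list of Python floats, which is what keeps the peak memory down.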
