Re: out of memory
- From: Jürgen Exner <jurgenex@xxxxxxxxxxx>
- Date: Fri, 31 Oct 2008 13:09:23 -0700
"friend.05@xxxxxxxxx" <hirenshah.05@xxxxxxxxx> wrote:
> I have two large files. I will read one file and see if that is also
> present in second file.
The way you wrote this means you are checking if file A is a subset of
file B. However, I have a strong feeling that you are talking about the
records in each file, not the files themselves.
> I also need count how many time it is appear
> in both the file. And according I do other processing.
> so if I process line by line both the file then it will be like (eg.
> file1 has 10 line and file2 has 10 line. for each line file1 it will
> loop 10 times. so total 100 loops.) I am dealing millions of lines so
> this approach will be very slow.
So you need to pre-process your data.
One possibility: read only the smaller file into a hash. Then you can
compare the larger file line by line against this hash. This is a linear
algorithm. Of course this only works if at least the relevant data from
the smaller file will fit into RAM.
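A minimal sketch of the hash approach, written in Python rather than Perl (a dict-like Counter plays the role of the hash). It assumes a "record" is a whole line and that the distinct lines of the smaller file fit into RAM; the function name and the file names in the usage note are hypothetical.

```python
from collections import Counter

def count_matches(small_lines, large_lines):
    """Return {record: (count_in_small, count_in_large)} for records in both inputs."""
    # Assumption: the distinct records of the smaller input fit into memory.
    small_counts = Counter(line.rstrip("\n") for line in small_lines)
    large_counts = Counter()
    # Single pass over the larger input; each lookup is a constant-time hash probe.
    for line in large_lines:
        key = line.rstrip("\n")
        if key in small_counts:
            large_counts[key] += 1
    return {key: (small_counts[key], n) for key, n in large_counts.items()}
```

With real files this could be called as `count_matches(open("small.txt"), open("large.txt"))`; file objects are read line by line, so the larger file is never held in memory as a whole.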
Another approach: sort both input files. There are many sorting
algorithms around, including those that sort completely on disk and
require very little RAM. They were very popular back when 32kB was a
lot of memory. Then you can walk through both files line by line in
parallel, requiring only a tiny little bit of RAM.
Depending upon the sorting algorithm this would be O(n log n) or
somewhat worse.
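The parallel walk could look roughly like the following Python sketch. It assumes both inputs have already been sorted (for example by an external, disk-based sort) in the same order that the string comparison in the code uses; the function names are hypothetical. Only the current record of each input is held in memory.

```python
from itertools import groupby

def counted(sorted_lines):
    """Collapse runs of identical lines into (record, count) pairs."""
    stripped = (line.rstrip("\n") for line in sorted_lines)
    for key, group in groupby(stripped):
        yield key, sum(1 for _ in group)

def merge_common(sorted_a, sorted_b):
    """Yield (record, count_in_a, count_in_b) for records present in both inputs."""
    a, b = counted(sorted_a), counted(sorted_b)
    item_a, item_b = next(a, None), next(b, None)
    while item_a is not None and item_b is not None:
        if item_a[0] == item_b[0]:
            yield item_a[0], item_a[1], item_b[1]
            item_a, item_b = next(a, None), next(b, None)
        elif item_a[0] < item_b[0]:
            item_a = next(a, None)   # record only in A: skip it
        else:
            item_b = next(b, None)   # record only in B: skip it
```

The walk itself is linear in the total number of lines, so the sorting step dominates the overall cost.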
Yet another option: put your relevant data into a database and use
database operators to extract the information you want, in your case a
simple intersection: all records that are in both A and B. Database
systems are optimized to handle large sets of data efficiently.
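As an illustration of the database route, here is a small Python sketch using the standard sqlite3 module. The table names, the column name and the function name are assumptions made for this example, and SQLite is just one possible choice of database. The default `:memory:` database is only for demonstration; for data that does not fit into RAM one would pass a path to a new on-disk database file instead.

```python
import sqlite3

def common_records(lines_a, lines_b, db_path=":memory:"):
    """Return [(record, count_in_a, count_in_b)] for records present in both inputs."""
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical schema: one table per input file, one row per line.
        conn.execute("CREATE TABLE a (rec TEXT)")
        conn.execute("CREATE TABLE b (rec TEXT)")
        conn.executemany("INSERT INTO a VALUES (?)",
                         ((line.rstrip("\n"),) for line in lines_a))
        conn.executemany("INSERT INTO b VALUES (?)",
                         ((line.rstrip("\n"),) for line in lines_b))
        # Count each record per table, then keep only records found in both.
        query = """
            SELECT ca.rec, ca.n, cb.n
            FROM (SELECT rec, COUNT(*) AS n FROM a GROUP BY rec) AS ca
            JOIN (SELECT rec, COUNT(*) AS n FROM b GROUP BY rec) AS cb
              ON ca.rec = cb.rec
            ORDER BY ca.rec
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

The inner join on the two grouped subqueries is the intersection, with the per-file counts carried along.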
> this is my current code. It runs fine with small file.
Well, that is great. But it seems you still don't believe me when I
say that your problem cannot be fixed by a little tweak in your
existing code. Any gain you may get by storing a smaller data item or
similar will very soon be eaten up by larger data sets.
THIS IS NOT GOING TO WORK. YOU HAVE TO RETHINK YOUR APPROACH AND CHOOSE
A DIFFERENT STRATEGY/ALGORITHM!
jue