Re: out of memory
- From: Jürgen Exner <jurgenex@xxxxxxxxxxx>
- Date: Fri, 31 Oct 2008 13:18:46 -0700
Jürgen Exner <jurgenex@xxxxxxxxxxx> wrote:
"friend.05@xxxxxxxxx" <hirenshah.05@xxxxxxxxx> wrote:
I have two large files. I will read one file and see if that is also
present in second file.
The way you wrote this means you are checking whether file A is a subset
of file B. However, I have a strong feeling that you are talking about
the records in each file, not the files themselves.
I also need to count how many times it appears in both files, and I do
other processing accordingly.
If I process both files line by line, it will be like this: e.g. file1
has 10 lines and file2 has 10 lines; for each line of file1 it will loop
10 times, so 100 loops in total. I am dealing with millions of lines, so
this approach will be very slow.
So you need to pre-process your data.
One possibility: read only the smaller file into a hash. Then you can
compare the larger file line by line against this hash. This is a linear
algorithm. Of course this only works if at least the relevant data from
the smaller file will fit into RAM.
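A minimal sketch of the hash approach, written in Python and assuming one record per line and that the whole line is the key. The function name count_common and the two path parameters are made up for illustration; they are not from the original question.

```python
from collections import Counter


def count_common(small_path, large_path):
    """Return {record: (count_in_small, count_in_large)} for shared records."""
    # Only the smaller file is held in memory, as a record -> count mapping.
    with open(small_path, encoding="utf-8") as fh:
        small_counts = Counter(line.rstrip("\n") for line in fh)

    # Stream the larger file once; only records already in the hash are counted.
    large_counts = Counter()
    with open(large_path, encoding="utf-8") as fh:
        for line in fh:
            record = line.rstrip("\n")
            if record in small_counts:
                large_counts[record] += 1

    return {rec: (small_counts[rec], n) for rec, n in large_counts.items()}
```

Each file is read exactly once, so the work grows with the total number of lines rather than with the product of the two line counts.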
Another approach: sort both input files. There are many sorting
algorithms around, including those that sort completely on disk and
require very little RAM. They were very popular back when 32kB was a
lot of memory. Then you can walk through both files line by line in
parallel, requiring only a tiny bit of RAM.
Depending upon the sorting algorithm this would be O(n log n) or
somewhat worse.
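The parallel walk could look roughly like the Python sketch below. It assumes both files were already sorted beforehand, one record per line, in plain code point order (the order Python uses when comparing strings). The names sorted_runs and merge_common are hypothetical.

```python
from itertools import groupby


def sorted_runs(path):
    """Yield (record, count) for each run of equal lines in a sorted file."""
    with open(path, encoding="utf-8") as fh:
        records = (line.rstrip("\n") for line in fh)
        for record, run in groupby(records):
            yield record, sum(1 for _ in run)


def merge_common(sorted_a, sorted_b):
    """Yield (record, count_in_a, count_in_b) for records present in both files."""
    runs_a, runs_b = sorted_runs(sorted_a), sorted_runs(sorted_b)
    a, b = next(runs_a, None), next(runs_b, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            yield a[0], a[1], b[1]
            a, b = next(runs_a, None), next(runs_b, None)
        elif a[0] < b[0]:
            a = next(runs_a, None)  # record only in A, skip it
        else:
            b = next(runs_b, None)  # record only in B, skip it
```

Only the current record of each file is held at any time, which is where the tiny memory footprint comes from. If the sort order of the files does not match the comparison used here, matches will be missed.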
Yet another option: put your relevant data into a database and use
database operators to extract the information you want, in your case a
simple intersection: all records that are in both A and B. Database
systems are optimized to handle large sets of data efficiently.
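As an illustration of the database route, here is a sketch using SQLite through Python's sqlite3 module. SQLite is just one possible choice, and the table names a and b, the column rec, and the function names are assumptions made for this example. It expects a fresh database; pass a file name instead of ":memory:" if the data does not fit into RAM.

```python
import sqlite3


def load(conn, table, path):
    """Load one record per line from `path` into a new single-column table."""
    conn.execute(f"CREATE TABLE {table} (rec TEXT)")
    with open(path, encoding="utf-8") as fh:
        rows = ((line.rstrip("\n"),) for line in fh)
        conn.executemany(f"INSERT INTO {table} VALUES (?)", rows)
    # The index lets the database group and join on rec efficiently.
    conn.execute(f"CREATE INDEX {table}_rec ON {table} (rec)")


def common_counts(db_path, path_a, path_b):
    """Return [(record, count_in_a, count_in_b)] for records in both files."""
    conn = sqlite3.connect(db_path)
    try:
        load(conn, "a", path_a)
        load(conn, "b", path_b)
        query = """
            SELECT ca.rec, ca.n, cb.n
            FROM (SELECT rec, COUNT(*) AS n FROM a GROUP BY rec) AS ca
            JOIN (SELECT rec, COUNT(*) AS n FROM b GROUP BY rec) AS cb
              ON ca.rec = cb.rec
            ORDER BY ca.rec
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

The inner join keeps only records that occur in both tables, and the two grouped subqueries supply the per-file counts.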
Forgot one other common approach: bucketize your data.
Create buckets of IPs or IDs or whatever criterion works for your case.
Then sort the data from each of your input files into 20 or 50 or 100
individual buckets (aka files), using the same rule for both files. And
then compare bucket x from file A with bucket x from file B.
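A rough Python sketch of the bucket idea follows. It assigns buckets by a CRC32 checksum of the whole line rather than by IP or ID ranges, which is a substitution made here only because the actual record layout is not known. The function names, the work_dir parameter, and the bucket file names are all hypothetical.

```python
import os
import zlib
from collections import Counter


def split_into_buckets(path, out_dir, prefix, n_buckets):
    """Distribute the lines of `path` over n_buckets files; return their paths."""
    paths = [os.path.join(out_dir, f"{prefix}_{i}.txt") for i in range(n_buckets)]
    files = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                record = line.rstrip("\n")
                # crc32 is stable, so equal records always land in the same bucket.
                i = zlib.crc32(record.encode("utf-8")) % n_buckets
                files[i].write(record + "\n")
    finally:
        for f in files:
            f.close()
    return paths


def bucketed_common(path_a, path_b, work_dir, n_buckets=50):
    """Yield (record, count_in_a, count_in_b) for records present in both files."""
    buckets_a = split_into_buckets(path_a, work_dir, "a", n_buckets)
    buckets_b = split_into_buckets(path_b, work_dir, "b", n_buckets)
    for bucket_a, bucket_b in zip(buckets_a, buckets_b):
        # Only one pair of buckets is in memory at a time.
        with open(bucket_a, encoding="utf-8") as fh:
            counts_a = Counter(line.rstrip("\n") for line in fh)
        with open(bucket_b, encoding="utf-8") as fh:
            counts_b = Counter(line.rstrip("\n") for line in fh)
        for record in counts_a.keys() & counts_b.keys():
            yield record, counts_a[record], counts_b[record]
```

Because both files are split by the same rule, a record from file A can only ever match something in the bucket of file B with the same number, so each pair of buckets can be compared on its own.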
jue