Re: out of memory



Jürgen Exner <jurgenex@xxxxxxxxxxx> wrote:
"friend.05@xxxxxxxxx" <hirenshah.05@xxxxxxxxx> wrote:
I have two large files. I will read one file and see if that is also
present in second file.

The way you wrote this means you are checking if file A is a subset of
file B. However I have a strong feeling, you are talking about the
records in each file, not the files themself.

I also need count how many time it is appear
in both the file. And according I do other processing.

so if I process line by line both the file then it will be like (eg.
file1 has 10 line and file2 has 10 line. for each line file1 it will
loop 10 times. so total 100 loops.) I am dealing millions of lines so
this approach will be very slow.

So you need to pre-process your data.

One possibility: read only the smaller file into a hash. Then you can
compare the larger file line by line against this hash. This is a linear
algorithm. Of course this only works if at least the relevant data from
the smaller file will fit into RAM.

Another approach: sort both input files. There are many sorting
algorithms around, including those that sort completely on disk and
require very minimum RAM. They were very popular back when 32kB was a
lot of memory. Then you can walk through both files line by line in
parallel, requiring only a tiny little bit of RAM.
Depending upon the sorting algorithm this would be O(n)log(n) or
somewhat worse.

Yet another option: put your relevant data into a database and use
database operators to extract the information you want, in your case a
simple intersection: all records, that are in A and in B. Database
systems are optimized to handle large sets of data efficiently.

Forgot one other common approach: bucketize your data.
Create buckets of IPs or IDs or whatever criteria works for your case.
Then sort the data into 20 or 50 or 100 individual buckets (aka files)
for each of your input files. And then compare bucket x from file A with
bucket x from file B.

jue
.



Relevant Pages

  • Re: out of memory
    ... read only the smaller file into a hash. ... the smaller file will fit into RAM. ... Depending upon the sorting algorithm this would be Ologor ... put your relevant data into a database and use ...
    (comp.lang.perl.misc)
  • Re: out of memory
    ... the smaller file will fit into RAM. ... Depending upon the sorting algorithm this would be Ologor ... put your relevant data into a database and use ...
    (comp.lang.perl.misc)
  • Re: efficient compare
    ... I want a generic algorithm which works under all circumstances, where you pay a "cost proportional to the amount of asynchrony". ... It makes no sense to choose N 1 updates in each bucket meaning you will send Omessages plus Otuples. ... O) [for transfering M checksum path of the checksum tree) + ...
    (comp.databases.theory)
  • Re: SQL Server Indexing VS. FTS Indexing
    ... different algorithm to calculate weights and ranking. ... What happens is each word is indexed and placed in a bucket. ... whittling down your results sets when you have multiple search terms. ... Keep in mind that the performance of your SQL FTS solution is most sensitive ...
    (microsoft.public.sqlserver.fulltext)
  • Re: Exchange server defragging using ESEutil
    ... I was thinking along the lines that some of email accounts are no longer used ... I don't have alot of space available to increase the store and I ... bucket for yor email store. ... bucket.(and your Exchange database will grow as needed to the ...
    (microsoft.public.windows.server.sbs)

Loading