Re: out of memory



On Fri, 31 Oct 2008 13:09:23 -0700, Jürgen Exner <jurgenex@xxxxxxxxxxx> wrote:

"friend.05@xxxxxxxxx" <hirenshah.05@xxxxxxxxx> wrote:
I have two large files. I will read one file and see if that is also
present in the second file.

The way you wrote this means you are checking if file A is a subset of
file B. However, I have a strong feeling you are talking about the
records in each file, not the files themselves.

I also need to count how many times it appears
in both files. And accordingly I do other processing.

So if I process both files line by line, it will be like this (e.g.
file1 has 10 lines and file2 has 10 lines; for each line of file1 it will
loop 10 times, so 100 loops in total). I am dealing with millions of lines,
so this approach will be very slow.

So you need to pre-process your data.

One possibility: read only the smaller file into a hash. Then you can
compare the larger file line by line against this hash. This is a linear
algorithm. Of course this only works if at least the relevant data from
the smaller file will fit into RAM.
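
The hash approach above can be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function name count_matches is made up here, and it assumes that a whole line (minus its newline) is the record being compared and that the smaller input fits in memory.

```python
from collections import Counter

def count_matches(small_lines, large_lines):
    """Return {record: (count in smaller input, count in larger input)}
    for every record that occurs in both inputs."""
    # Load only the smaller input into memory: record -> occurrences.
    small_counts = Counter(line.rstrip("\n") for line in small_lines)
    large_counts = Counter()
    # Stream the larger input one line at a time; records that never
    # appeared in the smaller input are skipped and cost no memory.
    for line in large_lines:
        record = line.rstrip("\n")
        if record in small_counts:
            large_counts[record] += 1
    return {rec: (small_counts[rec], n) for rec, n in large_counts.items()}
```

Both arguments can be open file handles, so the larger file is never held in memory as a whole. Each line is looked up once, which is what makes the pass linear.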

Another approach: sort both input files. There are many sorting
algorithms around, including those that sort completely on disk and
require very little RAM. They were very popular back when 32kB was a
lot of memory. Then you can walk through both files line by line in
parallel, requiring only a tiny little bit of RAM.
Depending upon the sorting algorithm this would be O(n log n) or
somewhat worse.
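
The parallel walk over two sorted inputs might look like the sketch below. It is written in Python for illustration; merge_count is a hypothetical name, and the sketch assumes both inputs are already sorted in the same order Python uses when comparing strings, with trailing newlines already stripped from each record.

```python
def merge_count(sorted_a, sorted_b):
    """Yield (record, count_in_a, count_in_b) for each record that
    occurs in both inputs. Both inputs must already be sorted."""
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a < b:
            a = next(it_a, None)      # a is behind, advance it
        elif b < a:
            b = next(it_b, None)      # b is behind, advance it
        else:
            # Same record on both sides: count the run in each input.
            record, n_a, n_b = a, 0, 0
            while a == record:
                n_a += 1
                a = next(it_a, None)
            while b == record:
                n_b += 1
                b = next(it_b, None)
            yield record, n_a, n_b
```

Only the current record from each input is held at any time, so memory use stays constant no matter how large the files are. The sorting itself is the expensive step and is assumed to have been done beforehand, for example by an external sort.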

Yet another option: put your relevant data into a database and use
database operators to extract the information you want, in your case a
simple intersection: all records, that are in A and in B. Database
systems are optimized to handle large sets of data efficiently.
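
As a rough sketch of the database route, the following uses Python's built-in sqlite3 module. The choice of SQLite, the table names a and b, the column name rec and the function name common_records are all assumptions made for this illustration; the thread does not name a particular database.

```python
import sqlite3

def common_records(lines_a, lines_b, db_path=":memory:"):
    """Return [(record, count_in_a, count_in_b), ...] for records
    present in both inputs, ordered by record."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE a (rec TEXT)")
    con.execute("CREATE TABLE b (rec TEXT)")
    # executemany consumes the generators lazily, one row at a time.
    con.executemany("INSERT INTO a VALUES (?)",
                    ((line.rstrip("\n"),) for line in lines_a))
    con.executemany("INSERT INTO b VALUES (?)",
                    ((line.rstrip("\n"),) for line in lines_b))
    # Count each record per table, then keep records found in both.
    query = """
        SELECT ca.rec, ca.n, cb.n
        FROM (SELECT rec, COUNT(*) AS n FROM a GROUP BY rec) AS ca
        JOIN (SELECT rec, COUNT(*) AS n FROM b GROUP BY rec) AS cb
          ON ca.rec = cb.rec
        ORDER BY ca.rec
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows
```

The default ":memory:" database is only useful for trying the query out. For data that does not fit in RAM, db_path would have to point at a new file on disk so that the database engine can do the work there.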

this is my current code. It runs fine with small file.

Well, that is great. But it seems you still don't believe me when I'm
saying that your problem cannot be fixed by a little tweak in your
existing code. Any gain you may get by storing a smaller data item or
similar will very soon be eaten up by larger data sets.
THIS IS NOT GOING TO WORK. YOU HAVE TO RETHINK YOUR APPROACH AND CHOOSE
A DIFFERENT STRATEGY/ALGORITHM!

jue

He cannot get past the idea of 'millions' of lines in a file, even
though he states items of interest. He won't think of items, just
the millions of lines.

In today's large data mining, there are billions of lines to consider.
Of course the least common denominator reduces that down to billions
of items.

Like a hash, it can be separated into alphabetical sequence files,
matched with available memory, usually 16 gigabytes, then reduced
exponentially until the desired form is achieved.
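
One way to read the idea of splitting the data into files sized to fit memory is sketched below in Python. Note that it swaps in hash-based buckets for the alphabetical ranges mentioned above, since hashing tends to spread records more evenly. The names partition and bucketed_counts and the default of 16 buckets are made up for this illustration.

```python
import os
import tempfile
import zlib
from collections import Counter

def partition(lines, directory, prefix, n_buckets):
    """Write each record to one of n_buckets files, chosen by its hash."""
    paths = [os.path.join(directory, f"{prefix}.{i}") for i in range(n_buckets)]
    files = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        for line in lines:
            record = line.rstrip("\n")
            # crc32 gives the same value on every run, so identical
            # records from either input land in the same bucket number.
            bucket = zlib.crc32(record.encode("utf-8")) % n_buckets
            files[bucket].write(record + "\n")
    finally:
        for f in files:
            f.close()
    return paths

def bucketed_counts(lines_a, lines_b, n_buckets=16):
    """Yield (record, count_in_a, count_in_b) for records in both inputs,
    holding only one pair of buckets in memory at a time."""
    with tempfile.TemporaryDirectory() as tmp:
        paths_a = partition(lines_a, tmp, "a", n_buckets)
        paths_b = partition(lines_b, tmp, "b", n_buckets)
        for path_a, path_b in zip(paths_a, paths_b):
            with open(path_a, encoding="utf-8") as fa:
                counts_a = Counter(l.rstrip("\n") for l in fa)
            with open(path_b, encoding="utf-8") as fb:
                counts_b = Counter(rec for rec in (l.rstrip("\n") for l in fb)
                                   if rec in counts_a)
            for rec, n_b in counts_b.items():
                yield rec, counts_a[rec], n_b
```

A record can only match records in the bucket with the same number, so each bucket pair can be handled independently. The number of buckets would be chosen so that one bucket of each input fits comfortably in the available memory.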

But his outlook is panicky and without resolve. The world is coming
to an end for him and he would like to share it with the world.

sln
