Re: Large Amount of Data



Jack wrote:
"John Nagle" <nagle@xxxxxxxxxxx> wrote in message news:nfR5i.4273$C96.1640@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Jack wrote:
I need to process a large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

Thanks.
What are you trying to do? At one extreme, you're implementing something
like a search engine that needs gigabytes of bitmaps to do joins fast as
hundreds of thousands of users hit the server, and need to talk seriously
about 64-bit address space machines. At the other, you have no idea how
to either use a database or do sequential processing. Tell us more.

> I have tens of millions (could be more) of documents in files. Each of them
> has other properties in separate files. I need to check if they exist,
> update and merge properties, etc. And this is not a one-time job. Because
> of the quantity of the files, I think querying and updating a database
> will take a long time...
>
And I think you are wrong. But of course the only way to find out who's right and who's wrong is to do some experiments and get some benchmark timings.
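
One way to get such timings is a small harness like the sketch below. It is only an illustration under stated assumptions: the key format ("doc0", "doc1", ...), the table layout, and the function names are all made up here, SQLite stands in for "a database", and the sample size is far smaller than the real data set, so the numbers it prints say nothing until it is rerun at a realistic scale.

```python
import os
import sqlite3
import tempfile
import time


def time_dict(n):
    """Time n inserts followed by n existence checks on an in-memory dict."""
    start = time.perf_counter()
    table = {}
    for i in range(n):
        table[f"doc{i}"] = i
    hits = sum(1 for i in range(n) if f"doc{i}" in table)
    return time.perf_counter() - start, hits


def time_sqlite(n, path):
    """Time the same workload against an on-disk SQLite table."""
    start = time.perf_counter()
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per document, keyed by name.
    conn.execute("CREATE TABLE docs (name TEXT PRIMARY KEY, props INTEGER)")
    with conn:  # commit all inserts as a single transaction
        conn.executemany(
            "INSERT INTO docs VALUES (?, ?)",
            ((f"doc{i}", i) for i in range(n)),
        )
    hits = 0
    for i in range(n):
        row = conn.execute(
            "SELECT 1 FROM docs WHERE name = ?", (f"doc{i}",)
        ).fetchone()
        if row is not None:
            hits += 1
    conn.close()
    return time.perf_counter() - start, hits


def run_benchmark(n=10000):
    """Run both workloads and return {name: (seconds, hits)}."""
    with tempfile.TemporaryDirectory() as tmp:
        dict_result = time_dict(n)
        sqlite_result = time_sqlite(n, os.path.join(tmp, "docs.db"))
    return {"dict": dict_result, "sqlite": sqlite_result}


if __name__ == "__main__":
    for name, (secs, hits) in run_benchmark().items():
        print(f"{name}: {secs:.3f}s, {hits} keys found")
```

The dict will very likely win while everything fits in RAM; the interesting measurement is what happens to each approach once n pushes the dict past physical memory.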

All I *would* say is that it's unwise to proceed with a memory-only architecture when you only have assumptions about the limitations of particular architectures, and your problem might actually grow to exceed the memory limits of a 32-bit architecture anyway.

Swapping might, depending on access patterns, cause your performance to take a real nose-dive. Then where do you go? Much better to architect the application so that you anticipate exceeding memory limits from the start, I'd hazard.
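
As a rough sketch of what anticipating the limit could look like, the standard-library shelve module gives a dictionary-like object whose entries live on disk rather than in RAM. The helper name merge_properties, the document IDs, and the property names below are invented for illustration, and whether shelve is fast enough for tens of millions of keys is exactly the kind of thing that would need measuring.

```python
import os
import shelve
import tempfile


def merge_properties(store, doc_id, new_props):
    """Create or update the property dict stored under doc_id."""
    props = store.get(doc_id, {})
    props.update(new_props)
    # Assign back: a shelf only persists a value when the key is set again.
    store[doc_id] = props
    return props


def demo():
    """Write two updates for one document, reopen the shelf, and read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "docs")
        with shelve.open(path) as store:
            merge_properties(store, "doc1", {"size": 10})
            merge_properties(store, "doc1", {"lang": "en"})
        # Reopening shows the data came from disk, not from process memory.
        with shelve.open(path) as store:
            return "doc1" in store, "doc2" in store, dict(store["doc1"])


if __name__ == "__main__":
    print(demo())
```

The calling code treats the shelf like a dict (existence check, lookup, assignment), so the storage behind it can be swapped for something heavier later without rewriting the application logic.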

> Let's say I want to do something a search engine needs to do in terms of
> the amount of data to be processed on a server. I doubt any serious search
> engine would use a database for indexing and searching. A hash table is
> what I need, not powerful queries.
>
You might be surprised. Google, for example, use a widely-distributed and highly-redundant storage format, but they certainly don't keep the whole Internet in memory :-)

Perhaps you need to explain the problem in more detail if you still need help.

regards
Steve


--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
