Re: Optimizing size of very large dictionaries



On 2008-07-31 02:29, python@xxxxxxxxxxx wrote:
> Are there any techniques I can use to strip a dictionary data
> structure down to the smallest memory overhead possible?
>
> I'm working on a project where my available RAM is limited to 2G
> and I would like to use very large dictionaries vs. a traditional
> database.
>
> Background: I'm trying to identify duplicate records in very
> large text based transaction logs. I'm detecting duplicate
> records by creating a SHA1 checksum of each record and using this
> checksum as a dictionary key. This works great except for several
> files whose size is such that their associated checksum
> dictionaries are too big for my workstation's 2G of RAM.
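
For reference, the in-memory approach described in the question might look
roughly like the sketch below. It is not the poster's actual code: the
function name find_duplicates and the assumption that records arrive as text
strings are made up for illustration. It keeps the raw 20-byte digest in a
set rather than a hex string as a dictionary key, which is one plausible way
to trim per-entry overhead, though it will still outgrow RAM for large
enough files.

```python
import hashlib

def find_duplicates(records):
    """Yield every record whose SHA1 digest has already been seen.

    `records` is assumed to be an iterable of text strings (hypothetical
    input format; the original post does not specify one).
    """
    seen = set()  # raw 20-byte digests; a set stores no unused values
    for record in records:
        # .digest() is 20 bytes, half the length of the 40-char hex form
        digest = hashlib.sha1(record.encode("utf-8")).digest()
        if digest in seen:
            yield record
        else:
            seen.add(digest)
```

Calling list(find_duplicates(lines)) on an iterable of log lines would return
the repeated lines in the order their duplicates are encountered.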

If you don't mind taking a small performance hit, I'd suggest
having a look at mxBeeBase, which is an on-disk dictionary
implementation:

http://www.egenix.com/products/python/mxBase/mxBeeBase/

Of course, you could also use a database table for this. With a
proper index, that should work as well, though it's likely slower
than mxBeeBase.
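
As a rough illustration of the database-table option, here is a minimal
sketch using the sqlite3 module from the Python standard library (swapped in
here as an example database; it is not mxBeeBase and does not use its API).
The table name seen, the helper names open_store and is_duplicate, and the
assumption that records are text strings are all hypothetical. Declaring the
digest column as PRIMARY KEY gives the table the index that makes lookups
fast.

```python
import hashlib
import sqlite3

def open_store(path):
    """Open (or create) an on-disk table of record digests."""
    conn = sqlite3.connect(path)
    # PRIMARY KEY creates the index used for the lookups below
    conn.execute("CREATE TABLE IF NOT EXISTS seen (digest BLOB PRIMARY KEY)")
    return conn

def is_duplicate(conn, record):
    """Return True if the record's SHA1 digest is already stored.

    Otherwise store the digest and return False. The caller is
    responsible for calling conn.commit() to persist the inserts.
    """
    digest = hashlib.sha1(record.encode("utf-8")).digest()
    row = conn.execute(
        "SELECT 1 FROM seen WHERE digest = ?", (digest,)
    ).fetchone()
    if row is not None:
        return True
    conn.execute("INSERT INTO seen (digest) VALUES (?)", (digest,))
    return False
```

Because the digests live on disk, memory use stays roughly constant no matter
how many records are processed, at the cost of slower lookups than an
in-memory dictionary.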

--
Marc-Andre Lemburg
eGenix.com
