Re: look up very large table



In article <hku3e0$3fs$1@xxxxxxxxxxxxxxxxxxxxxxxxx>, ela
<ela@xxxxxxxxxx> wrote:

> I have some large data in pieces, e.g.
>
> asia.gz.tar 300M
>
> or
>
> roads1.gz.tar 100M
> roads2.gz.tar 100M
> roads3.gz.tar 100M
> roads4.gz.tar 100M
>
> I wonder whether I should concatenate them all into a single ultra-large
> file and then parse it into a large table (I don't know whether Perl can
> handle that...).

There is no benefit that I can see to concatenating the files. Use the
File::Find module to find all files with a certain naming convention,
read each one, and process the information in each file. As for the
amount of information that Perl can handle, that is mostly determined
by the available memory and how smart you are at condensing the data,
keeping only what you need and throwing away what you don't.
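
A minimal sketch of that approach (the top-level directory and the
filename pattern below are just placeholders for your own):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my @files;
find(
    sub {
        # collect files whose names match the convention
        push @files, $File::Find::name if /^roads\d+\.gz\.tar$/;
    },
    '/path/to/data'        # placeholder top-level directory
);

# process each file in turn -- no need to concatenate them first
for my $file (@files) {
    print "processing $file\n";
}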


> The final table should look like this:
>
> ID1   ID2  INFO
> X1    Y9   san diego; california; West Coast; America; North America; Earth
> X2.3  H9   Beijing; China; Asia

Perl does not have tables. It has arrays and hashes. You can nest
arrays and hashes to store complex datasets in memory by using
references.
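
For example, a hash of hash references keyed on ID1 could hold the
table above (the field names are just illustrative):

my %table = (
    'X1'   => { id2  => 'Y9',
                info => 'san diego; california; West Coast; America; North America; Earth' },
    'X2.3' => { id2  => 'H9',
                info => 'Beijing; China; Asia' },
);

# look up one record by its ID1
print $table{'X2.3'}{info}, "\n";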

[...]

> each row may come from a big file of >100M (as mentioned above):
>
> CITY Beijing
> NOTE Capital
> RACE Chinese
> ...
>
> And then I have another much smaller table which contains all the IDs
> (either ID1 or ID2, maybe 100,000 records, <20M), and I just need to
> annotate this 20M file with the INFO. Hashing does not seem to be a
> solution on my 32G, 8-core machine...
>
> Any advice? Or should I resort to some other language?

Try reading all the files and saving the data you want. If you run out
of memory, then think about a different approach. 32GB of memory is
quite a lot.
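
As a sketch of that first step, assuming each record is a blank-line-
separated block of KEY/value lines like the CITY/NOTE/RACE example above
(the filename and the fields kept are placeholders):

use strict;
use warnings;

my %note_for;    # keep only the fields you actually need

{
    local $/ = '';    # paragraph mode: read one blank-line-separated record at a time
    open my $fh, '<', 'asia.txt' or die "Cannot open asia.txt: $!";
    while ( my $record = <$fh> ) {
        my %field = $record =~ /^(\S+)[ \t]+(.*)$/mg;    # "KEY value" lines into a hash
        next unless defined $field{CITY};
        $note_for{ $field{CITY} } = $field{NOTE} // '';  # throw the rest away
    }
    close $fh;
}

print "$_ => $note_for{$_}\n" for sort keys %note_for;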

If you can't fit all of your data into memory at one time, you might
consider using a database that will store your data in files. Perl has
support for many databases. But I would first determine whether or not
you can fit everything in memory.
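
For example, DBD::SQLite stores everything in a single file and is used
through the standard DBI interface; a minimal sketch (the database,
table, and column names here are made up):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=lookup.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE IF NOT EXISTS info (id1 TEXT, id2 TEXT, info TEXT)');

# load a record, then look it up by ID1 without holding the data in memory
my $ins = $dbh->prepare('INSERT INTO info (id1, id2, info) VALUES (?, ?, ?)');
$ins->execute( 'X2.3', 'H9', 'Beijing; China; Asia' );

my ($info) = $dbh->selectrow_array(
    'SELECT info FROM info WHERE id1 = ?', undef, 'X2.3' );
print "$info\n";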

--
Jim Gibson