Re: Using hashes to sort number sequences

From: Martin Foster (mdfoster44_at_netscape.net)
Date: 05/13/04


Date: 13 May 2004 09:03:38 -0700

Bob Walton <invalid-email@rochester.rr.com> wrote in message news:<40A2E65B.1020108@rochester.rr.com>...
> Martin Foster wrote:
>
> ...
> > I have two files: a.txt & b.txt
> >
> > a.txt=
> > 191_6_270328 T1 4 10 19 34 55 72 88 116 157 200 280 332 388 451 756 4
> > 0 5 0 4 0 6 2 6 2 8 0
> > 191_6_270328 T2 4 9 17 22 34 56 83 112 146 181 266 320 376 431 665 3 0
> ...
> > b.txt=
> > 191_6_9908682 T1 4 8 14 25 41 60 83 115 153 190 276 321 374 437 694 4
> > 0 4 0 4 0 6 0 4 0 8 0
> > 191_6_9908682 T2 4 10 19 30 44 64 92 122 155 198 285 338 394 446 739 4
> > 0 5 0 4 0 6 0 8 0 8 2
> ...
>
>
> > Each file contains in the first column an identifier, I call it $name.
> > The 2nd column contains an entry T1 or T2 or T3 ... until T6.
> > After these two columns each row contains a number sequence.
> >
> > What I would like to do is to read file a.txt, six lines at a time
> > (from T1 to T6)
> > and search for similar number sequences in file b.txt.
> > The number sequences in file b.txt must also be within each block of
> > six lines,
> > but they can be in any order.
>
>
> Why don't you just sort (using the Unix or maybe even the Win32 sort
> command) the two files, and then, using Perl, read and compare from the
> two sorted files? Or maybe the -u switch on Unix's sort could give you
> what you want in one go. Or maybe (if the data for matching lines is
> all the same), after the sorts, use diff to do the compare, and just
> process the output of diff with Perl? Or if there is something in the
> data which indicates if it from INFILE1 versus INFILE2, the files could
> be concatenated, sorted, and processed as one file (I don't think that
> last method would have any advantages).
>

I may need to tell you a little more about the data. I'm not sure a sort
would help me, but maybe you have an idea.

Each $name tag is the name of a crystal structure. Each T1, T2, etc. describes
an atom. For each structure there are six atoms. To identify whether two crystal
structures are the same, one can compare the coordination sequences (the number
sequences that follow the T1, T2, etc.). For each structure, all six sequences
must completely match the six sequences of another structure, but they can
be in any order, i.e. the atoms labelled T1 and T2 in one structure may be
called T3 and T6 in the other. The important part is that each structure has
six lines, which is why I want to read them in separately. If I do a sort, I
will get matching lines of sequences grouped together. For some structures,
only one or two lines will match the original structure, and I will have to do
careful counting throughout the output to get what I want.
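
One way to make the "any order" comparison concrete is to build an
order-independent key per structure: drop the name and T-label, sort the
sequences as strings, and join them. The following is only a minimal sketch,
assuming one record per line; block_key() and the shortened two-line sample
structures are hypothetical and not taken from the original script.

```perl
use strict;
use warnings;

# Build an order-independent key for one structure: keep only the number
# sequences, sort them as strings, and join them. Two structures holding
# the same sequences in any order then produce the same key.
# (block_key is a hypothetical helper name.)
sub block_key {
    my @lines = @_;    # lines such as "191_6_270328 T1 4 10 19 ..."
    my @seqs;
    for my $line (@lines) {
        my ($name, $label, @numbers) = split ' ', $line;
        push @seqs, join ' ', @numbers;    # drop the name and the T-label
    }
    return join '|', sort @seqs;
}

# Shortened, made-up records: same sequences, different labels and order.
my @first  = ('s1 T1 4 10 19', 's1 T2 4 9 17');
my @second = ('s2 T1 4 9 17',  's2 T2 4 10 19');

print block_key(@first) eq block_key(@second) ? "same\n" : "different\n";
```

Because the whole block is compared at once, a structure in which only one or
two lines match never counts as a match, so no counting through sorted output
is needed.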

> That sort (punny, huh?) of method will avoid reading your $infile2 many
> hundreds of thousands of times, which will take almost forever.
>
Oh no, I hope not! My first attempt was to do this directly from the
MySQL database. I retrieved the data with queries for each structure.
That did take forever! So now I've put all the data into files.

I may need to rewrite the script so that it does not reopen the b file again
and again, maybe by reading all the data into arrays and then shifting six
lines at a time from the array into the hashes. I've got 512 MB of memory; how
big can the arrays be? I've got 29 columns, and the a & b files have ~127,000
rows. I'm not sure.
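
As a rough sketch of the read-once idea: at ~127,000 rows of roughly 100
characters (judging from the sample lines), the raw text is on the order of 10
to 15 MB, so holding it in memory seems plausible with 512 MB, although Perl's
per-scalar overhead adds to that. The code below reads each file a single time
and indexes the b structures by their sorted sequences. The helper names and
the tiny in-memory sample data are hypothetical, and it assumes one record per
line.

```perl
use strict;
use warnings;

# Read all records once and group the number sequences by structure name.
# Returns a hash ref: name => [ "seq", "seq", ... ].
sub read_structures {
    my ($fh) = @_;
    my %seqs_for;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;    # skip blank lines
        my ($name, $label, @numbers) = split ' ', $line;
        push @{ $seqs_for{$name} }, join ' ', @numbers;
    }
    return \%seqs_for;
}

# Index structures by an order-independent key (sorted, joined sequences).
# Returns a hash ref: key => [ names of structures with that key ].
sub index_by_key {
    my ($seqs_for) = @_;
    my %names_for;
    for my $name (sort keys %$seqs_for) {
        my $key = join '|', sort @{ $seqs_for->{$name} };
        push @{ $names_for{$key} }, $name;
    }
    return \%names_for;
}

# Made-up sample data, read through in-memory filehandles for the demo.
my $a_text = "a1 T1 4 10 19\na1 T2 4 9 17\n";
my $b_text = "b1 T1 4 9 17\nb1 T2 4 10 19\nb2 T1 4 8 14\nb2 T2 4 10 19\n";
open my $a_fh, '<', \$a_text or die "open: $!";
open my $b_fh, '<', \$b_text or die "open: $!";

my $a_structs = read_structures($a_fh);
my $b_index   = index_by_key(read_structures($b_fh));

for my $name (sort keys %$a_structs) {
    my $key     = join '|', sort @{ $a_structs->{$name} };
    my @matches = @{ $b_index->{$key} || [] };
    print "$name matches: @matches\n";
}
```

With a hash lookup per structure, the b data is scanned once to build the
index instead of once per structure in a.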

> BTW, you would need to either close and reopen the file in your inner
> loop, or seek() it back to the beginning every time you go through the
> outer loop.
I will try this. Thanks.
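
For reference, a minimal sketch of the seek() approach: rewinding a filehandle
to byte 0 before each pass lets the inner loop read the file again without
closing and reopening it. The in-memory sample text stands in for the real
file and is made up for the demo.

```perl
use strict;
use warnings;

# Made-up stand-in for the second input file.
my $text = "line 1\nline 2\n";
open my $fh, '<', \$text or die "open: $!";

my @seen;
for my $pass (1 .. 2) {
    # Rewind to the start: position 0, whence 0 (from the beginning).
    seek($fh, 0, 0) or die "seek: $!";
    while (my $line = <$fh>) {
        chomp $line;
        push @seen, "pass $pass: $line";
    }
}

print "$_\n" for @seen;
```

This removes the cost of reopening, but the file is still read in full on
every pass.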

> Also recognize that your while(<INFILE1>) and
> while(<INFILE2>) constructions will read a record from the corresponding
> file and place it into $_. You are discarding that data, so you are
> really reading data 7 records at a time, discarding the first of each
> chunk of 7.

I'm not quite sure what you mean by this. Are you suggesting that I use a
variable other than $_ in the inner loop?
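
To illustrate the point as I understand it: the condition while(<FH>) itself
reads one record into $_, so if the loop body then reads six more lines, the
record fetched by the condition is lost unless it is used. The original script
is not shown in this message, so its shape is an assumption here. A minimal
sketch that keeps the condition's line as the first line of each block
(read_blocks and the sample rows are hypothetical):

```perl
use strict;
use warnings;

# Read records in blocks of $size lines. The line fetched by the while
# condition is kept as the first line of the block instead of being dropped.
sub read_blocks {
    my ($fh, $size) = @_;
    my @blocks;
    while (defined(my $first = <$fh>)) {
        chomp $first;
        my @block = ($first);
        while (@block < $size) {
            my $next = <$fh>;
            last unless defined $next;    # file ended mid-block
            chomp $next;
            push @block, $next;
        }
        push @blocks, \@block;
    }
    return @blocks;
}

# Made-up sample: six rows read as two blocks of three.
my $text = "row 1\nrow 2\nrow 3\nrow 4\nrow 5\nrow 6\n";
open my $fh, '<', \$text or die "open: $!";

my @blocks = read_blocks($fh, 3);
print scalar(@blocks), " blocks; second starts with '$blocks[1][0]'\n";
```

Using a named lexical such as $first instead of $_ also avoids the inner and
outer loops overwriting each other's current line.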

>
> HTH.
>
>
> ...
>
>
> > Martin.


