Re: Perl script to mimic uniq

nobull_at_mail.com
Date: 02/05/04


Date: 5 Feb 2004 06:14:24 -0800

mdfoster44@netscape.net (Martin Foster) wrote in message news:<6a20f90a.0402041647.6920fd75@posting.google.com>...
> nobull@mail.com wrote in message news:<4dafc536.0402030120.6236ac20@posting.google.com>...
> > mdfoster44@netscape.net (Martin Foster) spits TOFU in my face:
> >
> > > # Perl script to find most common CS
> >
> > I still don't get how this comment relates to what your program does
> > nor what you say you want it to do.
>
> The data list is a sequence of numbers, which are called coordination
> sequences, CS for short. My program tries to find the most common CS
> in the data file.

I still don't see anything in your program that relates to finding the
most common CS. It looks to me like your program is printing out the
number of occurrences of each CS.

> > > I would like the script to search the file,
> > > identify a sequence as unique. If there are duplicate sequences
> > > in that file then print out how many and do not revisit that line
> > > if it has been counted as a duplicate.
> >
> > It's not clear what you are saying.
>
> There is a list of number sequences. Each list is labelled uniquely
> by an identifier. I want to sort through the list, so I start at the
> 1st row and then my code loops through the list checking the
> sequences. If it finds a match, then that row does not need to be
> revisited again later in the loop, since it has been identified as a
> match to the 1st row. I guess I need to keep
> an index of some sort while looping the list. Then when I start at
> the 2nd row, I only loop over the sequences which are indexed as 'not
> yet matched'.

I think you are mixing up your definition of the problem you are
trying to solve with the implementation of a partial solution.

> I hope this makes more sense.

Not much.
  
> > Are you saying you want the first ID (only) and the number of
> > occurrences of each distinct sequence?
>
> Yes. This is very helpful.

Right. So what you want is one output line for each distinct CS, in
no particular order. You don't want to find the CS that appears
most often.

If you wanted the output sorted in order of frequency you would have
to put a sort in there somewhere.
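
As a rough sketch of where such a sort could go, assuming a hash shaped like the %count built by the loop quoted below (each key a CS, each value a reference to an array of IDs). The sample IDs and numbers here are made up, and by_frequency is just a name chosen for illustration:

```perl
use strict;
use warnings;

# Hypothetical data: each key is a CS, each value is a reference to the
# array of IDs that share that CS. The IDs and numbers are invented.
my %count = (
    '4 10 20 34' => [ 'A 1', 'C 1', 'D 2' ],
    '4 9 19 35'  => [ 'B 1' ],
    '4 12 25 42' => [ 'E 1', 'F 1' ],
);

# Return the CS keys ordered by how many IDs each one has, most common
# first. Ties are broken by comparing the keys as strings.
sub by_frequency {
    my ($h) = @_;
    return sort { @{ $h->{$b} } <=> @{ $h->{$a} } or $a cmp $b } keys %$h;
}

for my $cs ( by_frequency( \%count ) ) {
    my $ids = $count{$cs};
    print "$ids->[0] occurs ", scalar(@$ids), " times\n";
}
```

The <=> operator puts both arrays in scalar context, so it compares element counts. The first line printed is then the most common CS.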

> >
> > while (<INFILE>) {
> > s/^(\S+\s+){2}// or die;
> > push @{$count{$_}}, $1;
> > };
> >
> > for ( values %count ) {
> > print "$_->[0]occurs ",scalar(@$_)," times\n";
> > }

> '$_->[0]' looks like a pointer.

This is no accident. The values of %count are references (pointers)
to arrays of IDs.

> So your piece of code, maps the $1 column of the original
> line as a pointer to the values of the %count array.

$1 in Perl is not like it is in awk.

In Perl $1 is whatever was captured by the first () capture in the
most recent regex in the current scope.

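In this case $1 was meant to be the first two columns (and the
following whitespace) of the original line. I believe, from what
you've said previously, that this is some sort of ID (identifier) and
is not part of the CS.

Actually, because the group is repeated by {2}, $1 ends up holding
only the last repetition, i.e. the second column and its trailing
whitespace. You should capture both columns in a single group, and
you probably should throw away the whitespace between the ID and the
CS while you're at it.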

  s/^(\S+\s+\S+)\s+// or die;
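
It is worth checking what $1 actually holds with each pattern, because a capture group that is repeated by a quantifier keeps only its last repetition. A small sketch, assuming an input line with a two-column ID followed by the CS (the sample line and the sub names are invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical input line: a two-column ID followed by the CS numbers.
my $line = "T1 site1 4 10 20 34";

# Strip the ID using the quantified group and return ( $1, remainder ).
sub split_quantified {
    my ($s) = @_;
    $s =~ s/^(\S+\s+){2}// or die "no match";
    my $captured = $1;    # copy $1 before anything else can reset it
    return ( $captured, $s );
}

# Strip the ID using one group that spans both columns.
sub split_single_group {
    my ($s) = @_;
    $s =~ s/^(\S+\s+\S+)\s+// or die "no match";
    my $captured = $1;
    return ( $captured, $s );
}

my ( $id_a, $rest_a ) = split_quantified($line);
my ( $id_b, $rest_b ) = split_single_group($line);

print "quantified group: [$id_a] [$rest_a]\n";
print "single group:     [$id_b] [$rest_b]\n";
```

Both substitutions leave the same CS behind. Only the single-group version puts the whole two-column ID in $1.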

Also, if you want to improve readability you could avoid $_ and $1 and
also rename %count to something more appropriate to its new role:

  my ( $id, $cs ) = /^(\S+\s+\S+)\s+(.*)/ or die;
  push @{$ids_by_cs{$cs}}, $id;
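
Put together, the whole thing might look something like the sketch below. The sample data is made up, an in-memory filehandle stands in for INFILE, and group_ids_by_cs is a hypothetical name, so treat it as an illustration, not a drop-in replacement:

```perl
use strict;
use warnings;

# Made-up sample input: two ID columns, then the CS numbers.
my $sample = <<'END';
T1 site1 4 10 20 34
T2 site1 4 9 19 35
T3 site2 4 10 20 34
END

# Read lines from a filehandle and return a reference to a hash that maps
# each distinct CS to an array of the IDs that carry it.
sub group_ids_by_cs {
    my ($fh) = @_;
    my %ids_by_cs;
    while ( my $line = <$fh> ) {
        my ( $id, $cs ) = $line =~ /^(\S+\s+\S+)\s+(.*)/
            or die "Unparseable line $.: $line";
        push @{ $ids_by_cs{$cs} }, $id;
    }
    return \%ids_by_cs;
}

# An in-memory filehandle stands in for INFILE here.
open my $in, '<', \$sample or die "Cannot open sample data: $!";
my $ids_by_cs = group_ids_by_cs($in);
close $in;

# One output line per distinct CS, in no particular order.
for my $ids ( values %$ids_by_cs ) {
    print "$ids->[0] occurs ", scalar(@$ids), " times\n";
}
```

Since (.*) stops at the newline, the CS used as the hash key has no trailing newline.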

> Then the "values" of
> %count are the unique "keys" of that array and "scalar" is counting
> the number of lines that are the same. Is that right?

There is nothing for "that array" to refer to in the previous
sentence.

The values of the hash %count (or %ids_by_cs) are (a list of) pointers
to arrays. Each array contains the series of IDs that correspond to a
single CS. The keys of the hash are the distinct CSs themselves.

As to the uniqueness of the IDs, nothing in the program either
ensures or cares that the IDs in the input data are unique.

> "scalar" is counting the number of lines that are the same.

scalar is counting the number of elements in the array of IDs that
correspond to a single CS. So, yes, in effect this counts the number
of lines that were the same.
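
A tiny sketch of the difference context makes, using an invented array of IDs that are assumed to share one CS:

```perl
use strict;
use warnings;

# Hypothetical reference to an array of IDs that all share one CS.
my $ids = [ 'T1 site1', 'T3 site2', 'T7 site1' ];

my $count   = scalar(@$ids);    # array in scalar context: number of elements
my ($first) = @$ids;            # list assignment: takes the first element
my $same    = $ids->[0];        # arrow syntax: also the first element

# print supplies list context, so a bare @$ids would print the IDs
# themselves; scalar() forces the count instead.
print "first ID: $first\n";
print "count:    ", scalar(@$ids), "\n";
```

Without scalar() the print in the original loop would output every ID in the array, not the number of them.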

> Perl is great, but it is so difficult to read if you don't have a clue.

Oh, you noticed that, did you? :-)


