Re: perl file parsing



Hello,

Use bioperl (http://www.bioperl.org/wiki/Main_Page) for this task.

This should do what you want:

#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;

my $fastaFile = 'myfile';
my $pattern = 'CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT';
my $blockThreshold = '500';
my $numFasta = '0';
my $match = '0';

my $q = Bio::SeqIO->new(-format=>'fasta', -file=>$fastaFile, -alphabet=>'dna');
while (my $seq = $q->next_seq()){
++$numFasta;
if ($seq->seq() =~ /$pattern/){
++$match;
}
}
$q->close();

print "Max number of blocks to search: $blockThreshold\n";
print "Number of blocks found in this file: $numFasta\n";
print "Total matches in $blockThreshold blocks: $match\n";

__END__

Dave

On Fri, 24 Oct 2008 03:38:33 +1000, minky arora <minky.arora@xxxxxxxxx> wrote:

Dear Gurus,

I have a file of the follwoing form

FFM50HR02GMY4E length=75 xy=2604_3772 region=2 run=R_2008_08_19_08_32_31_

GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT

TCGACTAGCGTCGTG

FFM50HR02F5QTS length=59 xy=2408_2686 region=2 run=R_2008_08_19_08_32_31_

AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG





It has over 5000 such blocks, each starting with ">". I need to search for a
given pattern (String of characters) in the second line of each block and
then print the block header (>FFM50HR02F5QTS). I only need to parse the
first 500 blocks of each file. Of these 500 blocks, I then need to output
the number of times the pattern has occured. My code is below. I didn't
think I has missed anythign till I manually went into each file to compare
the results, which don't match. Can someone point me to whats going wrong
here?







#!/usr/bin/perl

$file_to_parse = "/home/myfile";
$pattern = "CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT";
#$pattern = "abc";

$max_blocks = 500;

# open the data file
open (DAT, "$file_to_parse") || die ("Cannot open file: $file_to_parse");
$match_count = 0;
$block_count = 0;
$block = "";
while (<DAT>){

chomp (); #remove newline characters

if ($_ =~ /^>/ && $. > 0){ #beginning of the next block reached

#look for matches in the current block

if ($block_count <= $max_blocks){ # check not more than $max_blocks

$num_matches = () = $block =~ /$pattern/g; #how many matches in this
block
$match_count += $num_matches; #increase global match coutner

$block =~ /^(>.+?)\s/g; #get block ID, e.g. >FIFKRKM06HCSVV
$block_id = $1;

if ($num_matches > 0){ #output information
print "Block ID: $block_id\nBlock #: $block_count\nNumber of matches in
this block: $num_matches\n\n";
}

}

$block = ""; #empty block holder variable
$block_count++; #increase block count


}

#build the block, concatenate lines
$block .= $_;

}
close DAT;

print "Max number of blocks to search: $max_blocks\n";
print "Number of blocks found in this file: $block_count\n";
print "Total matches in $max_blocks blocks: $match_count\n\n";

# exit
exit;



--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

.



Relevant Pages

  • Re: perl file parsing
    ... AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG ... the number of times the pattern has occured. ... chomp; #remove newline characters ...
    (perl.beginners)
  • Re: Calculate Last Digits
    ... This might work faster by adding the following line at this point: ... Exit For, as discussed at enormous length in older threads, but this is ... I did get a big boost by switching from strings to longs for the pattern ...
    (microsoft.public.vb.general.discussion)
  • what are the USERID,PASSWD in /etc/passwd
    ... echo "Usage: `basename $0` USERNAME" ... exit $E_WRONGARGS ... file_excerpt $pattern ...
    (comp.unix.shell)
  • Re: regexp (vim6)
    ... pattern would match Function, unless preceeded by Exit or End - which is ... Exit or End, followed by Function. ... Prev by Date: ...
    (comp.editors)
  • Re: Matching neighbouring words of a pattern using Regex
    ... > How can I match 'n' number of neighbouring words of a pattern using ... > concatenated together - so the text has many newline characters ... I only want the words on the same line as the pattern. ... > doesn't work for some reason. ...
    (comp.lang.perl.misc)