Re: perl file parsing
- From: d.tang@xxxxxxxxxxxxx (Dave Tang)
- Date: Fri, 24 Oct 2008 09:35:10 +1000
Hello,
Use bioperl (http://www.bioperl.org/wiki/Main_Page) for this task.
This should do what you want:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $fastaFile = 'myfile';
my $pattern = 'CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT';
my $blockThreshold = '500';
my $numFasta = '0';
my $match = '0';
my $q = Bio::SeqIO->new(-format=>'fasta', -file=>$fastaFile, -alphabet=>'dna');
while (my $seq = $q->next_seq()){
++$numFasta;
if ($seq->seq() =~ /$pattern/){
++$match;
}
}
$q->close();
print "Max number of blocks to search: $blockThreshold\n";
print "Number of blocks found in this file: $numFasta\n";
print "Total matches in $blockThreshold blocks: $match\n";
__END__
Dave
On Fri, 24 Oct 2008 03:38:33 +1000, minky arora <minky.arora@xxxxxxxxx> wrote:
Dear Gurus,
I have a file of the follwoing form
FFM50HR02GMY4E length=75 xy=2604_3772 region=2 run=R_2008_08_19_08_32_31_
GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT
TCGACTAGCGTCGTG
FFM50HR02F5QTS length=59 xy=2408_2686 region=2 run=R_2008_08_19_08_32_31_
AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG
It has over 5000 such blocks, each starting with ">". I need to search for a
given pattern (String of characters) in the second line of each block and
then print the block header (>FFM50HR02F5QTS). I only need to parse the
first 500 blocks of each file. Of these 500 blocks, I then need to output
the number of times the pattern has occured. My code is below. I didn't
think I has missed anythign till I manually went into each file to compare
the results, which don't match. Can someone point me to whats going wrong
here?
#!/usr/bin/perl
$file_to_parse = "/home/myfile";
$pattern = "CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT";
#$pattern = "abc";
$max_blocks = 500;
# open the data file
open (DAT, "$file_to_parse") || die ("Cannot open file: $file_to_parse");
$match_count = 0;
$block_count = 0;
$block = "";
while (<DAT>){
chomp (); #remove newline characters
if ($_ =~ /^>/ && $. > 0){ #beginning of the next block reached
#look for matches in the current block
if ($block_count <= $max_blocks){ # check not more than $max_blocks
$num_matches = () = $block =~ /$pattern/g; #how many matches in this
block
$match_count += $num_matches; #increase global match coutner
$block =~ /^(>.+?)\s/g; #get block ID, e.g. >FIFKRKM06HCSVV
$block_id = $1;
if ($num_matches > 0){ #output information
print "Block ID: $block_id\nBlock #: $block_count\nNumber of matches in
this block: $num_matches\n\n";
}
}
$block = ""; #empty block holder variable
$block_count++; #increase block count
}
#build the block, concatenate lines
$block .= $_;
}
close DAT;
print "Max number of blocks to search: $max_blocks\n";
print "Number of blocks found in this file: $block_count\n";
print "Total matches in $max_blocks blocks: $match_count\n\n";
# exit
exit;
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
.
- Prev by Date: Re: Help required in writing a script
- Next by Date: Re: perl file parsing
- Previous by thread: error while installing the module
- Next by thread: Re: perl file parsing
- Index(es):
Relevant Pages
|