Re: perl file parsing
- From: rob.dixon@xxxxxxx (Rob Dixon)
- Date: Fri, 24 Oct 2008 02:06:05 +0100
minky arora wrote:
I have a file of the following form:
>FFM50HR02GMY4E length=75 xy=2604_3772 region=2 run=R_2008_08_19_08_32_31_
GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT
TCGACTAGCGTCGTG
>FFM50HR02F5QTS length=59 xy=2408_2686 region=2 run=R_2008_08_19_08_32_31_
AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG
It has over 5000 such blocks, each starting with ">". I need to search for a
given pattern (String of characters) in the second line of each block and
then print the block header (>FFM50HR02F5QTS). I only need to parse the
first 500 blocks of each file. Of these 500 blocks, I then need to output
the number of times the pattern has occurred. My code is below. I didn't
think I had missed anything till I manually went into each file to compare
the results, which don't match. Can someone point me to what's going wrong
here?
#!/usr/bin/perl
$file_to_parse = "/home/myfile";
$pattern = "CTTGGCGAGAAGGGCCGCTACCTGCTGGCCGCCTCCTTCGGCAACGT";
#$pattern = "abc";
$max_blocks = 500;
# open the data file
open (DAT, "$file_to_parse") || die ("Cannot open file: $file_to_parse");
$match_count = 0;
$block_count = 0;
$block = "";
while (<DAT>){
chomp (); #remove newline characters
if ($_ =~ /^>/ && $. > 0){ #beginning of the next block reached
#look for matches in the current block
if ($block_count <= $max_blocks){ # check not more than $max_blocks
$num_matches = () = $block =~ /$pattern/g; #how many matches in this block
$match_count += $num_matches; #increase global match counter
$block =~ /^(>.+?)\s/g; #get block ID, e.g. >FIFKRKM06HCSVV
$block_id = $1;
if ($num_matches > 0){ #output information
print "Block ID: $block_id\nBlock #: $block_count\nNumber of matches in this block: $num_matches\n\n";
}
}
$block = ""; #empty block holder variable
$block_count++; #increase block count
}
#build the block, concatenate lines
$block .= $_;
}
close DAT;
print "Max number of blocks to search: $max_blocks\n";
print "Number of blocks found in this file: $block_count\n";
print "Total matches in $max_blocks blocks: $match_count\n\n";
# exit
exit;
I'm afraid there is too much wrong with your program for me to try to rescue it.
Instead, I hope I can make some suggestions and point a few things out.
First of all, /always/
use strict;
use warnings;
at the start of every program, especially one that you are asking for help with.
You will then have to declare every variable using 'my', and it will save you
from a lot of simple mistakes.
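As a small illustration of the kind of mistake this catches, here is a sketch that compiles a snippet containing a misspelled variable name. The snippet and the name $match_cout are made up for the example; the point is only that 'use strict' refuses to compile it rather than silently creating a second variable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A snippet that misspells $match_count as $match_cout.
# Under strict this is rejected at compile time.
my $snippet = <<'END_SNIPPET';
use strict;
use warnings;
my $match_count = 0;
$match_cout += 1;    # typo: undeclared variable
1;
END_SNIPPET

# eval returns undef and sets $@ if the snippet fails to compile
my $ok    = eval $snippet;
my $error = $@;

print "Compiled cleanly\n" if $ok;
print "strict caught: $error" unless $ok;
```

Without 'use strict' the same snippet compiles, and the counter you meant to increment stays at zero.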
Next, it is a bad idea to make /anything/ without trying it part-way through
building. If I was making a motor car from a collection of parts, no matter how
carefully I had followed the manual, I would be amazed if I could simply get
into the seat, turn the key, and drive down the road. But that is what you have
done with your program, and is what many less experienced programmers expect to do.
Instead, you should write an incremental series of programs, with targets
something like this:
1 - Open your file, and make sure that you can read and print each line
2 - Print out just the block IDs in the file
3 - Accumulate the block data and print that out with its block ID
4 - Search for and count the substring within each block, and show those results too
After that, but not before, you will have something approaching a working solution.
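To give an idea of how small each of those intermediate programs can be, here is a sketch of stage 2 only. It reads from an in-memory copy of the two sample blocks so that it runs on its own; the sub name block_ids is just something I have made up for the example, and I am assuming every header line starts with ">" as you describe.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two sample blocks in the format described in the question
my $sample = <<'END_DATA';
>FFM50HR02GMY4E length=75 xy=2604_3772 region=2 run=R_2008_08_19_08_32_31_
GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT
TCGACTAGCGTCGTG
>FFM50HR02F5QTS length=59 xy=2408_2686 region=2 run=R_2008_08_19_08_32_31_
AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG
END_DATA

# Stage 2: collect just the block IDs from the header lines
sub block_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        chomp $line;
        # A header starts with '>'; the ID runs up to the first space
        push @ids, $1 if $line =~ /^(>\S+)/;
    }
    return @ids;
}

open my $fh, '<', \$sample or die "Cannot open sample data: $!";
my @ids = block_ids($fh);
close $fh;

print "$_\n" for @ids;
```

Once that prints the IDs you expect, move on to stage 3 and check again before adding the search.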
I won't say much more except that you are getting confused about what is in
$block and $block_id. Because the ID appears before the data in the file you are
associating each ID with the data in the previous block. Apart from being simply
wrong, it means that the first block ID has no corresponding data, and the data
from the last block in the file is just thrown away.
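One possible way to keep each ID with its own data is sketched below. This is not a repair of your program, just an illustration under some assumptions of mine: the sub name count_in_blocks is invented, the pattern is treated as a literal string, the data lines of a block are joined before searching, and the sample pattern is one I picked because it straddles the line break in the first block. The important parts are that the ID is remembered when the header is seen, the count is done when the next header arrives, and the final block is handled after the loop ends.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns [block_id, match_count] pairs for the first $max_blocks blocks
sub count_in_blocks {
    my ($fh, $pattern, $max_blocks) = @_;
    my (@results, $id, $data);

    # Count matches in the block collected so far, if there is one
    my $flush = sub {
        return unless defined $id;
        my $count = () = $data =~ /\Q$pattern\E/g;
        push @results, [ $id, $count ];
        undef $id;
    };

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^(>\S+)/) {
            my $next_id = $1;
            $flush->();                        # finish the previous block
            last if @results >= $max_blocks;   # enough blocks examined
            ($id, $data) = ($next_id, '');     # start the new block
        }
        elsif (defined $id) {
            $data .= $line;                    # data belongs to current ID
        }
    }
    $flush->();    # the last block has no header following it
    return @results;
}

my $sample = <<'END_DATA';
>FFM50HR02GMY4E length=75 xy=2604_3772 region=2 run=R_2008_08_19_08_32_31_
GGGGTCAATGGGTCCGACGGAGAAAGCGCGACAGAGGGGAAAGCCCTTTCCCCTCCCCGT
TCGACTAGCGTCGTG
>FFM50HR02F5QTS length=59 xy=2408_2686 region=2 run=R_2008_08_19_08_32_31_
AGGACATGCGGCCCGGCGACCTCATCATCTACTTCGACGACGCCAGCCACGTCGGGATG
END_DATA

open my $fh, '<', \$sample or die "Cannot open sample data: $!";
my @results = count_in_blocks($fh, 'CCGTTCGA', 500);
close $fh;

for my $result (@results) {
    my ($block_id, $count) = @$result;
    print "Block ID: $block_id\nMatches: $count\n\n" if $count > 0;
}
```

If the pattern must be found only on the second line of each block, as your description says, search that line alone instead of the joined data.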
One more thing. $. is always greater than zero after a successful file read.
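You can check this for yourself with a few lines like the following, which record $. after each read from a small in-memory file (the two lines of text are made up). The numbering starts at 1 for the first line, so the test '$. > 0' in your condition can never be false inside the loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @line_numbers;
while (my $line = <$fh>) {
    # $. holds the number of the line just read
    push @line_numbers, $.;
}
close $fh;

print "Line numbers seen: @line_numbers\n";
```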
HTH,
Rob