Re: Regular Expression - Matching Multiples of 3 Characters exactly.
- From: blaine <frikker@xxxxxxxxx>
- Date: Sun, 27 Apr 2008 19:31:42 -0700 (PDT)
On Apr 27, 10:24 pm, castiro...@xxxxxxxxx wrote:
On Apr 27, 8:31 pm, blaine <frik...@xxxxxxxxx> wrote:
Hey everyone,
For the regular expression gurus...
I'm trying to write a string matching algorithm for genomic
sequences. I'm pulling out Genes from a large genomic pattern, with
certain start and stop codons on either side. This is simple
enough... for example:
start = AUG stop=AGG
BBBBBBAUGWWWWWWAGGBBBBBB
So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
This works great with my current regular expression.
The problem, however, is that codons come in sets of 3 bases. So
there are actually three different 'frames' I could be using. For
example:
ABCDEFGHIJ
I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.
So finally, my question. How can I represent this in a regular
expression? :) This is what I'd like to do:
(Find all groups of any three characters) (Find a start codon) (find
any other codons) (Find an end codon)
Is this possible? It seems that I'd want to do something like this:
(\w\w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
three non-whitespace characters, followed by AUG \s AGG, and then
anything else. I hope I am making sense. Obviously, however, this
will make sure that ANY set of three characters exist before a start
codon. Is there a way to match exactly, to say something like 'Find
all sets of three, then AUG and AGG, etc.'. This way, I could scan
for genes, remove the first letter, scan for more genes, remove the
first letter again, and scan for more genes. This would
hypothetically yield different genes, since the frame would be
shifted.
This might be a lot of information... I appreciate any insight. Thank
you!
Blaine
Here's one idea (untested):
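    # assumes genes is the sequence string
    s = {}
    # map each offset to the 3-base window starting there;
    # len(genes) - 2 so the last full triplet is included
    for x in range(len(genes) - 2):
        s[x] = genes[x:x + 3]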
You might like Python's 'string slicing' feature.
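To connect the slicing idea to the three reading frames described in the question, here is a minimal sketch. The function name codons and its frame parameter are illustrative assumptions, not code from this thread; it drops any incomplete trailing codon.

```python
def codons(seq, frame=0):
    """Split seq into whole codons starting at offset frame (0, 1, or 2).

    Illustrative helper, not from the thread. Any leftover bases at the
    end that do not fill a complete codon are dropped.
    """
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


if __name__ == "__main__":
    # Print the codons for each of the three reading frames.
    for frame in range(3):
        print(frame, codons("ABCDEFGHIJ", frame))
```

With the ABCDEFGHIJ example from the question, frame 0 gives ABC DEF GHI, frame 1 gives BCD EFG HIJ, and frame 2 gives CDE FGH (the trailing IJ is discarded).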
True - I could try something like that. In fact I have a 'codon'
function that does exactly that. The problem is that I then have to
go back through and loop over the list. I'm trying to use Regular
Expressions so that my processing is quicker. Complexity is key since
this genomic string is pretty large.
Thanks for the suggestion though!
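For what it's worth, one way to keep this in a single regular expression is to require a whole number of three-character groups between the start and stop codons, so the stop is always in the same frame as the start. The sketch below is an assumption about what is wanted, not a tested solution for real genomic data: the names START, STOP, GENE and find_genes are made up for illustration, and \w{3} stands in for "any codon" as in the pattern from the question.

```python
import re

# Illustrative names, not from the thread.
START, STOP = "AUG", "AGG"

# Start codon, then as few whole 3-character groups as possible, then the
# first stop codon that lands in the same frame as the start. Wrapping the
# pattern in a lookahead lets finditer() try every position, so genes in
# all three frames (and overlapping genes) are reported.
GENE = re.compile(r"(?=(" + START + r"(?:\w{3})*?" + STOP + r"))")


def find_genes(seq):
    """Return (offset, frame, gene) for each in-frame start...stop span."""
    return [(m.start(1), m.start(1) % 3, m.group(1))
            for m in GENE.finditer(seq)]


if __name__ == "__main__":
    print(find_genes("BBBBBBAUGWWWWWWAGGBBBBBB"))
```

On the example string from the question this reports one gene, AUGWWWWWWAGG, at offset 6 in frame 0. A start codon with no in-frame stop after it makes the engine scan to the end of the string before giving up, so on a very large sequence it would be worth timing this against the plain slicing loop.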