Re: Regular Expression - Matching Multiples of 3 Characters exactly.



On Apr 27, 8:31 pm, blaine <frik...@xxxxxxxxx> wrote:
Hey everyone,
  For the regular expression gurus...

I'm trying to write a string matching algorithm for genomic
sequences.  I'm pulling out Genes from a large genomic pattern, with
certain start and stop codons on either side.  This is simple
enough... for example:

start = AUG stop=AGG
BBBBBBAUGWWWWWWAGGBBBBBB

So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
This works great with my current regular expression.

The problem, however, is that codons come in sets of 3 bases.  So
there are actually three different 'frames' I could be using.  For
example:
ABCDEFGHIJ
I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.

So finally, my question.  How can I represent this in a regular
expression? :)  This is what I'd like to do:
(Find all groups of any three characters) (Find a start codon) (find
any other codons) (Find an end codon)

Is this possible? It seems that I'd want to do something like this:
(\w\w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
three non-whitespace characters, followed by AUG \s AGG, and then
anything else.  I hope I am making sense.  Obviously, however, this
will make sure that ANY set of three characters exist before a start
codon.  Is there a way to match exactly, to say something like 'Find
all sets of three, then AUG and AGG, etc.'.  This way, I could scan
for genes, remove the first letter, scan for more genes, remove the
first letter again, and scan for more genes.  This would
hypothetically yield different genes, since the frame would be
shifted.

This might be a lot of information... I appreciate any insight.  Thank
you!
Blaine

Here's one idea (untested):

s = {}
# len(genes) - 2 so that the final triplet is included
for x in range(len(genes) - 2):
    s[x] = genes[x:x + 3]

You might like Python's 'string slicing' feature.
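
If you would rather stay with a regular expression, here is a rough sketch of one way to keep the stop codon in frame with the start codon. The function name find_genes and the assumption that bases are uppercase letters ([A-Z]) are mine, not from the original post. The lookahead lets overlapping candidates be reported, and the lazy (?:[A-Z]{3})*? steps through the body one codon at a time, so the match ends at the first in-frame stop codon.

```python
import re

def find_genes(seq, start="AUG", stop="AGG"):
    """Return (position, frame, gene) for every in-frame start...stop match.

    Hypothetical helper; assumes the sequence uses uppercase letters only.
    """
    # The lookahead (?=...) consumes nothing, so candidates that begin
    # inside an earlier match are still found.
    # (?:[A-Z]{3})*? takes whole codons lazily, which stops the match at
    # the first stop codon that lines up with the start codon.
    pattern = re.compile(
        r"(?=(%s(?:[A-Z]{3})*?%s))" % (re.escape(start), re.escape(stop))
    )
    # m.start() % 3 is the reading frame (0, 1 or 2) of the start codon.
    return [(m.start(), m.start() % 3, m.group(1)) for m in pattern.finditer(seq)]

print(find_genes("BBBBBBAUGWWWWWWAGGBBBBBB"))
# → [(6, 0, 'AUGWWWWWWAGG')]
```

Because the frame comes out of the match position, there should be no need to strip the first letter and rescan. A matched gene can then be cut into codons with slicing, e.g. [g[i:i+3] for i in range(0, len(g), 3)].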