Need feedback on ORF-extracting code

gb345 · Aug 12, 2009

Hi everyone. I'm relatively new to Python, and could use your
feedback on the code below.

First some nomenclature. A "nucleotide" is one of A, C, G, T, or
U. (In practice, a sequence of such nucleotides never contains
both T and U, but this fact is not important in what follows.) A
"codon" is a sequence of 3 of these. A "stop codon" is any one of
UAA, UAG, UGA, TAA, TAG, or TGA. Two codons are said to be "in
frame" in a containing sequence of nucleotides if their positions
differ by a multiple of 3. An "open reading frame," or ORF, is
defined as a maximal subsequence of nucleotides whose length is a
multiple of 3, begins with either AUG or ATG, terminates right
before a stop codon in the original sequence, and contains no stop
codons that are in frame with the initial codon (AUG or ATG).

The fact that ORFs have lengths that are multiples of 3 means that
there are three possible "registers" for ORFs (depending the modulo
3 of their starting positions), and that ORFs in different registers
can overlap. I'll refer to these registers as 0, 1, and 2, because
they contain ORFs that are in frame with the codons at positions
0, 1, and 2 of the original sequence, respectively.

In the code below, routine extract_orfs takes as input a string,
assumed to be a sequence of nucleotides, and returns a tuple of
tuples, describing ORFs. These ORFs can overlap.

The helper routine _xo takes as input a string (a sequence of
nucleotides), and an offset, and returns a tuple of tuples, again
representing ORFs, but this time all in the same register, since
they are all in frame with the position passed as the second argument
to the function.

I would appreciate your comments on this code. I feel that it is
not as clear as it could be, but I don't know how to make it any
clearer.

(NOTE: I realize that, in all likelihood, publicly available Python
code already exists for this. At the moment I'm more interested
in improving my Python skills.)

Many thanks in advance!

Gabe

# BEGINNING OF CODE
import sys
import re

_start = r'A[TU]G'
_stop = r'(?:[TU]A[AG]|[TU]GA)'
_nonstop = r'(?:[CAG][TUCAG]{2}|[TU](?:[TUC][TUCAG]|[AG][TUC])|[TU]GG)'
_codon = r'(?:[TUCAG]{3})'
_orf_re = re.compile('(' + _codon + r'*?)(A[TU]G' +
_nonstop + '*)(' + _stop + ')', flags=re.I)
_lead_re = re.compile(r'[TUCAG]*?A[TU]G', flags=re.I)

def _xo(seq, pos):
"""
Helper routine that finds all the non-overlapping in-frame orfs
starting from a specific position in the input sequence.

input: a string of letters in the set 'tucagTUCAG', and a starting
position;
output: a tuple of tuples; each internal tuple consists of a
starting position, an orf, and the stop codon that
terminates it.
"""

ret = []
while True:
m = _orf_re.match(seq, pos)
if not m:
break
orf = m.group(2)
stop = m.group(3)
assert len(orf) % 3 == 0
ret.append((m.start() + len(m.group(1)), orf, stop))
pos = m.end()
return ret

def extract_orfs(seq):
"""
Extracts all (possibly overlapping) maximal open reading frames,
defined as sequences beginning with AUG (or ATG), ending with an
in-frame stop codon, and containing no in-frame stop codons
in-between.

input: a string of letters in the set 'tucagTUCAG';
output: a tuple of tuples; each internal tuple consists of a
starting position, an orf, and the stop codon that
terminates it.
"""

m = _lead_re.match(seq)
if not m:
return ()
pos = m.start()
ret = []

for i in range(min(3, len(seq))):
ret.extend(_xo(seq, pos + i))
ret.sort()
return tuple(ret)

# END OF CODE

Terry Reedy · Aug 13, 2009

gb345 said:
Hi everyone. I'm relatively new to Python, and could use your
feedback on the code below.

First some nomenclature. A "nucleotide" is one of A, C, G, T, or
U. (In practice, a sequence of such nucleotides never contains
both T and U, but this fact is not important in what follows.) A
"codon" is a sequence of 3 of these. A "stop codon" is any one of
UAA, UAG, UGA, TAA, TAG, or TGA. Two codons are said to be "in
frame" in a containing sequence of nucleotides if their positions
differ by a multiple of 3. An "open reading frame," or ORF, is
defined as a maximal subsequence of nucleotides whose length is a
multiple of 3, begins with either AUG or ATG, terminates right
before a stop codon in the original sequence, and contains no stop
codons that are in frame with the initial codon (AUG or ATG).

The fact that ORFs have lengths that are multiples of 3 means that
there are three possible "registers" for ORFs (depending the modulo
3 of their starting positions), and that ORFs in different registers
can overlap. I'll refer to these registers as 0, 1, and 2, because
they contain ORFs that are in frame with the codons at positions
0, 1, and 2 of the original sequence, respectively.

In the code below, routine extract_orfs takes as input a string,
assumed to be a sequence of nucleotides, and returns a tuple of
tuples, describing ORFs. These ORFs can overlap.

The helper routine _xo takes as input a string (a sequence of
nucleotides), and an offset, and returns a tuple of tuples, again
representing ORFs, but this time all in the same register, since
they are all in frame with the position passed as the second argument
to the function.

I would appreciate your comments on this code. I feel that it is
not as clear as it could be, but I don't know how to make it any
clearer.

(NOTE: I realize that, in all likelihood, publicly available Python
code already exists for this. At the moment I'm more interested
in improving my Python skills.)

Many thanks in advance!

Gabe

# BEGINNING OF CODE
import sys
import re

_start = r'A[TU]G'
_stop = r'(?:[TU]A[AG]|[TU]GA)'
_nonstop = r'(?:[CAG][TUCAG]{2}|[TU](?:[TUC][TUCAG]|[AG][TUC])|[TU]GG)'
_codon = r'(?:[TUCAG]{3})'
_orf_re = re.compile('(' + _codon + r'*?)(A[TU]G' +
_nonstop + '*)(' + _stop + ')', flags=re.I)
_lead_re = re.compile(r'[TUCAG]*?A[TU]G', flags=re.I)

def _xo(seq, pos):
"""
Helper routine that finds all the non-overlapping in-frame orfs
starting from a specific position in the input sequence.

input: a string of letters in the set 'tucagTUCAG', and a starting
position;
output: a tuple of tuples; each internal tuple consists of a
starting position, an orf, and the stop codon that
terminates it.
"""

ret = []
while True:
m = _orf_re.match(seq, pos)
if not m:
break
orf = m.group(2)
stop = m.group(3)
assert len(orf) % 3 == 0

I am not currently familiar with the re module so I cannot comment in
detail. The assert should be a claim that the re should *never* match
anything other than a multiple of 3, so that it is a program bug if it
ever do so. If this is not true, you should use an if statement.

tjr

ret.append((m.start() + len(m.group(1)), orf, stop))
pos = m.end()
return ret

def extract_orfs(seq):
"""
Extracts all (possibly overlapping) maximal open reading frames,
defined as sequences beginning with AUG (or ATG), ending with an
in-frame stop codon, and containing no in-frame stop codons
in-between.

input: a string of letters in the set 'tucagTUCAG';
output: a tuple of tuples; each internal tuple consists of a
starting position, an orf, and the stop codon that
terminates it.
"""

m = _lead_re.match(seq)
if not m:
return ()
pos = m.start()
ret = []

for i in range(min(3, len(seq))):
ret.extend(_xo(seq, pos + i))
ret.sort()
return tuple(ret)

# END OF CODE

feedback on code design	23	May 30, 2012
I Need Help with making a function that draws in a canvas using location data.	1	Dec 17, 2021
processing the genetic code with python?	13	Mar 6, 2006
Need for speed -> a C extension?	27	Apr 18, 2011
Regular Expression - Matching Multiples of 3 Characters exactly.	6	Apr 28, 2008
Need feedback on improving Ruby code	1	Mar 16, 2007
Some silly code for Easter holiday	0	Mar 23, 2010
iterator question	14	Sep 26, 2006

Need feedback on ORF-extracting code

gb345

Terry Reedy

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads