Need feedback on ORF-extracting code

G

gb345

Hi everyone. I'm relatively new to Python, and could use your
feedback on the code below.

First some nomenclature. A "nucleotide" is one of A, C, G, T, or
U. (In practice, a sequence of such nucleotides never contains
both T and U, but this fact is not important in what follows.) A
"codon" is a sequence of 3 of these. A "stop codon" is any one of
UAA, UAG, UGA, TAA, TAG, or TGA. Two codons are said to be "in
frame" in a containing sequence of nucleotides if their positions
differ by a multiple of 3. An "open reading frame," or ORF, is
defined as a maximal subsequence of nucleotides whose length is a
multiple of 3, begins with either AUG or ATG, terminates right
before a stop codon in the original sequence, and contains no stop
codons that are in frame with the initial codon (AUG or ATG).

The fact that ORFs have lengths that are multiples of 3 means that
there are three possible "registers" for ORFs (depending the modulo
3 of their starting positions), and that ORFs in different registers
can overlap. I'll refer to these registers as 0, 1, and 2, because
they contain ORFs that are in frame with the codons at positions
0, 1, and 2 of the original sequence, respectively.

In the code below, routine extract_orfs takes as input a string,
assumed to be a sequence of nucleotides, and returns a tuple of
tuples, describing ORFs. These ORFs can overlap.

The helper routine _xo takes as input a string (a sequence of
nucleotides), and an offset, and returns a tuple of tuples, again
representing ORFs, but this time all in the same register, since
they are all in frame with the position passed as the second argument
to the function.

I would appreciate your comments on this code. I feel that it is
not as clear as it could be, but I don't know how to make it any
clearer.

(NOTE: I realize that, in all likelihood, publicly available Python
code already exists for this. At the moment I'm more interested
in improving my Python skills.)


Many thanks in advance!

Gabe



# BEGINNING OF CODE
import sys
import re

_start = r'A[TU]G'
_stop = r'(?:[TU]A[AG]|[TU]GA)'
_nonstop = r'(?:[CAG][TUCAG]{2}|[TU](?:[TUC][TUCAG]|[AG][TUC])|[TU]GG)'
_codon = r'(?:[TUCAG]{3})'
_orf_re = re.compile('(' + _codon + r'*?)(A[TU]G' +
_nonstop + '*)(' + _stop + ')', flags=re.I)
_lead_re = re.compile(r'[TUCAG]*?A[TU]G', flags=re.I)

def _xo(seq, pos):
"""
Helper routine that finds all the non-overlapping in-frame orfs
starting from a specific position in the input sequence.

input: a string of letters in the set 'tucagTUCAG', and a starting
position;
output: a tuple of tuples; each internal tuple consists of a
starting position, an orf, and the stop codon that
terminates it.
"""

ret = []
while True:
m = _orf_re.match(seq, pos)
if not m:
break
orf = m.group(2)
stop = m.group(3)
assert len(orf) % 3 == 0
ret.append((m.start() + len(m.group(1)), orf, stop))
pos = m.end()
return ret

def extract_orfs(seq):
"""
Extracts all (possibly overlapping) maximal open reading frames,
defined as sequences beginning with AUG (or ATG), ending with an
in-frame stop codon, and containing no in-frame stop codons
in-between.

input: a string of letters in the set 'tucagTUCAG';
output: a tuple of tuples; each internal tuple consists of a
starting position, an orf, and the stop codon that
terminates it.
"""

m = _lead_re.match(seq)
if not m:
return ()
pos = m.start()
ret = []

for i in range(min(3, len(seq))):
ret.extend(_xo(seq, pos + i))
ret.sort()
return tuple(ret)

# END OF CODE
 
T

Terry Reedy

gb345 said:
Hi everyone. I'm relatively new to Python, and could use your
feedback on the code below.

First some nomenclature. A "nucleotide" is one of A, C, G, T, or
U. (In practice, a sequence of such nucleotides never contains
both T and U, but this fact is not important in what follows.) A
"codon" is a sequence of 3 of these. A "stop codon" is any one of
UAA, UAG, UGA, TAA, TAG, or TGA. Two codons are said to be "in
frame" in a containing sequence of nucleotides if their positions
differ by a multiple of 3. An "open reading frame," or ORF, is
defined as a maximal subsequence of nucleotides whose length is a
multiple of 3, begins with either AUG or ATG, terminates right
before a stop codon in the original sequence, and contains no stop
codons that are in frame with the initial codon (AUG or ATG).

The fact that ORFs have lengths that are multiples of 3 means that
there are three possible "registers" for ORFs (depending the modulo
3 of their starting positions), and that ORFs in different registers
can overlap. I'll refer to these registers as 0, 1, and 2, because
they contain ORFs that are in frame with the codons at positions
0, 1, and 2 of the original sequence, respectively.

In the code below, routine extract_orfs takes as input a string,
assumed to be a sequence of nucleotides, and returns a tuple of
tuples, describing ORFs. These ORFs can overlap.

The helper routine _xo takes as input a string (a sequence of
nucleotides), and an offset, and returns a tuple of tuples, again
representing ORFs, but this time all in the same register, since
they are all in frame with the position passed as the second argument
to the function.

I would appreciate your comments on this code. I feel that it is
not as clear as it could be, but I don't know how to make it any
clearer.

(NOTE: I realize that, in all likelihood, publicly available Python
code already exists for this. At the moment I'm more interested
in improving my Python skills.)


Many thanks in advance!

Gabe



# BEGINNING OF CODE
import sys
import re

_start = r'A[TU]G'
_stop = r'(?:[TU]A[AG]|[TU]GA)'
_nonstop = r'(?:[CAG][TUCAG]{2}|[TU](?:[TUC][TUCAG]|[AG][TUC])|[TU]GG)'
_codon = r'(?:[TUCAG]{3})'
_orf_re = re.compile('(' + _codon + r'*?)(A[TU]G' +
_nonstop + '*)(' + _stop + ')', flags=re.I)
_lead_re = re.compile(r'[TUCAG]*?A[TU]G', flags=re.I)

def _xo(seq, pos):
"""
Helper routine that finds all the non-overlapping in-frame orfs
starting from a specific position in the input sequence.

input: a string of letters in the set 'tucagTUCAG', and a starting
position;
output: a tuple of tuples; each internal tuple consists of a
starting position, an orf, and the stop codon that
terminates it.
"""

ret = []
while True:
m = _orf_re.match(seq, pos)
if not m:
break
orf = m.group(2)
stop = m.group(3)
assert len(orf) % 3 == 0

I am not currently familiar with the re module so I cannot comment in
detail. The assert should be a claim that the re should *never* match
anything other than a multiple of 3, so that it is a program bug if it
ever do so. If this is not true, you should use an if statement.

tjr

ret.append((m.start() + len(m.group(1)), orf, stop))
pos = m.end()
return ret

def extract_orfs(seq):
"""
Extracts all (possibly overlapping) maximal open reading frames,
defined as sequences beginning with AUG (or ATG), ending with an
in-frame stop codon, and containing no in-frame stop codons
in-between.

input: a string of letters in the set 'tucagTUCAG';
output: a tuple of tuples; each internal tuple consists of a
starting position, an orf, and the stop codon that
terminates it.
"""

m = _lead_re.match(seq)
if not m:
return ()
pos = m.start()
ret = []

for i in range(min(3, len(seq))):
ret.extend(_xo(seq, pos + i))
ret.sort()
return tuple(ret)

# END OF CODE
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,233
Members
46,820
Latest member
GilbertoA5

Latest Threads

Top