Describing degenerate DNA strings

George George

I am working with strings over the four-letter alphabet a, c, t, g that describe
biological DNA sequences. Sometimes a sequence can be written as
ac[ta]cct, meaning that position 3 can hold either a 't' or an 'a'
without changing the biological function of the sequence.

Given ac[ta]cct as input, I would like to generate the set of all strings
that the degenerate sequence can represent, e.g.
1. actcct
2. acacct

Both satisfy the above degeneracy.

Any ideas?
Thank you
 
Brian Candler

any ideas?

Here's a simple recursive expansion, with a block callback for each
sequence found.

def expand_seq(src, &blk)
  if src =~ /\A(.*?)\[(.*?)\](.*)\z/m
    prefix, chars, suffix = $1, $2, $3
    chars.split(//).each do |ch|
      expand_seq(prefix + ch + suffix, &blk)
    end
  else
    yield src
  end
end

expand_seq "ac[ta]cct[gt]c" do |seq|
  puts seq
end
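
For callers who would rather get an array back than pass a block, the same expansion can probably be sketched without recursion using Array#product. The method name expand_degenerate is an assumption made for illustration and is not part of the code above; the sketch also assumes brackets are well formed, non-empty, and not nested.

```ruby
# Hypothetical helper (not from the thread): expand a degenerate sequence
# such as "ac[ta]cct" into every concrete sequence it describes.
def expand_degenerate(src)
  # Split the input into positions: a bracketed class becomes the list of
  # its characters, any other character becomes a one-element list.
  positions = src.scan(/\[([^\]]*)\]|(.)/m).map do |klass, single|
    klass ? klass.chars : [single]
  end
  return [""] if positions.empty?

  # The Cartesian product of all positions yields every combination.
  first, *rest = positions
  first.product(*rest).map(&:join)
end

p expand_degenerate("ac[ta]cct")
```

Returning an array keeps every sequence in memory at once, so the block-based version above may still be preferable for patterns with many degenerate positions.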
 
George George

Thank you!


 
Jesús Gabriel y Galán

Hi, this reminded me so much of a Ruby Quiz I solved that I wanted to
mention it :)

http://rubyquiz.com/quiz143.html
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/274375 (my solution)

This code generates all strings that match a regexp. So we are left
with the task of converting your strings to regexps:

irb(main):010:0> require 'quiz143'
=> true
irb(main):011:0> def expand a
irb(main):012:1> re = Regexp.new(a.gsub(/\[(.*?)\]/) {|m| "(#{$1.split(//).join("|")})"})
irb(main):013:1> re.generate
irb(main):014:1> end
=> nil
irb(main):015:0> expand "ac[ta]cct"
=> ["actcct", "acacct"]

It's probably overkill for your needs.

Jesus.
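
The conversion step from the irb session can be tried on its own, without the quiz143 library. The following is a minimal sketch under that assumption: to_alternation is a hypothetical name, and only the core Regexp class is used to check the result.

```ruby
# Rewrite each bracketed class as an alternation group: "[ta]" -> "(t|a)".
# to_alternation is a hypothetical name used only for this sketch.
def to_alternation(seq)
  seq.gsub(/\[(.*?)\]/) { "(#{$1.chars.join('|')})" }
end

source = to_alternation("ac[ta]cct")
puts source                            # the regexp source string

# Anchor the pattern so it must match the whole sequence.
re = Regexp.new("\\A#{source}\\z")
puts re.match?("actcct")               # one of the allowed sequences
puts re.match?("acgcct")               # 'g' is not allowed at position 3
```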
 
George George

Hi Jesus!
Thank you for referring me to that quiz; it's nice to study the code.
It solves exactly one of the problems I had while looking for DNA
motifs, which are represented as regular expressions but need to be
expanded if you are going to use them as possible DNA primers, and then
search back and see which one gives the best predictive value ... blah blah ...
Sorry for the bio talk :)

Thank you so much!!

GG
 
Jesús Gabriel y Galán

You are welcome. Just a comment on the above: I have realized that if
each position of the sequence is just one character, then your
original string is already a valid regexp for the problem. There is no need
to change [ta] to (t|a) as I was doing, because [ta] is a character
class with those two possibilities, and character classes work too:

irb(main):001:0> require 'quiz143'
=> true
irb(main):002:0> /#{"ac[ta]cc"}/.generate
=> ["actcc", "acacc"]

:)

Jesus.
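
The observation that the degenerate string is already a valid regexp can be checked without the quiz143 library. This is a brute-force sketch, assuming the alphabet is limited to a, c, g, t and the sequence is short enough to enumerate: it tries every 5-letter string and keeps the ones the pattern matches. The order of the result follows the enumeration, so it may differ from what generate returns.

```ruby
pattern = "ac[ta]cc"

# Use the degenerate string directly as a regexp, anchored at both ends.
re = Regexp.new("\\A#{pattern}\\z")

# Enumerate every 5-letter string over the DNA alphabet (4**5 = 1024 strings)
# and keep only those the pattern accepts.
alphabet = %w[a c g t]
candidates = alphabet.repeated_permutation(5).map(&:join)
matches = candidates.select { |s| re.match?(s) }

p matches
```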
 
Rob Biedenharn

irb(main):001:0> require 'quiz143'
=> true
irb(main):002:0> /#{"ac[ta]cc"}/.generate
=> ["actcc", "acacc"]

No need to do the string interpolation there:
/ac[ta]cc/.generate
Or if you have that in a string:
x = "ac[ta]cc"
Regexp.new(x).generate


-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
 
Jesús Gabriel y Galán

On Jan 16, 2009, at 9:10 AM, Jesús Gabriel y Galán wrote:
irb(main):002:0> /#{"ac[ta]cc"}/.generate
=> ["actcc", "acacc"]

No need to do the string interpolation there:
/ac[ta]cc/.generate
Or if you have that in a string:
x = "ac[ta]cc"
Regexp.new(x).generate

Good catch !!
Thanks.

Jesus.
 
George George

Thank you so much for all the replies. Here is a simple benchmark of
Brian's and Jesus's approaches. I ran it on Ubuntu 8.04 with 1 GB RAM and
2 CPUs at 3.40 GHz.

.....
...
require 'benchmark'
Benchmark.bm do |bm|

  bm.report("Brian:") do
    expand_seq "t[ac][tc]aaattaag[ga]gaag[ac]ttggtgga" do |seq|
      #puts seq
    end
  end

  bm.report("Jesus:") do
    /t[ac][tc]aaattaag[ga]gaag[ac]ttggtgga/.generate
  end
end

            user     system      total        real
Brian:  0.000000   0.000000   0.000000 (  0.000642)
Jesus:  0.000000   0.000000   0.000000 (  0.003574)
 
Robert Klemme

You probably need to execute each variant in a loop multiple times to
get meaningful results.

Kind regards

robert
 
George George

Thanks Robert, here are the results with each approach run 100000 times:

require 'benchmark'

iterations = 100000
Benchmark.bm do |bm|

  bm.report("Brian:") do
    iterations.times do
      expand_seq "t[ac][tc]aaattaag[ga]gaag[ac]ttggtgga" do |seq|
        # puts seq
      end
    end
  end

  bm.report("Jesus:") do
    iterations.times do
      /t[ac][tc]aaattaag[ga]gaag[ac]ttggtgga/.generate
    end
  end
end

              user     system      total        real
Brian:   36.500000   2.080000  38.580000 ( 38.738666)
Jesus:  217.180000  30.710000 247.890000 (248.848401)
 
Jesús Gabriel y Galán

It shows that a specialized solution could be more streamlined :).
Anyway, my solution was never optimized for performance. Could be an
interesting project...

Jesus.
 
