Looking for a trick with strings

J

January Weiner

Hi,

here is the problem.

I have strings (protein sequences, to be precise) that contain letters +
'-':

my $s = 'Y---ERI-TTKDIV----EIKRHLDYLQAPRITNNDLE' ; # for example
my $s_nogaps = $s ;
$s_nogaps =~ s/-//g ;

Then I get two numbers:

my ($from, $length) = (0, 5) ;

These are the indices of a substring within $s_nogaps (and not $s). The
above indices correspond to a string 'YERIT'.

Now I would like to have fragments of $s corresponding to this substring.
For example, given the above indices, I should get 'Y--ERI-T'.

The best that I can come up with is looping over each character separately,
and incrementing a counter only if it is:


sub get_subseq {
my ($s, $f, $l) = @_ ;
my $i = 0 ;
my $ret ;
my $within = 0 ;

for(split //, $s) {
last if($i >= $f + $l) ;
if($_ ne '-') { $i++ ; }
next unless($i > $f) ;
$ret .= $_ ;
}

return $ret ;
}

any better ideas?

j.

--
 
L

Lukas Mai

January Weiner said:
I have strings (protein sequences, to be precise) that contain letters +
'-':

my $s = 'Y---ERI-TTKDIV----EIKRHLDYLQAPRITNNDLE' ; # for example
my $s_nogaps = $s ;
$s_nogaps =~ s/-//g ;

Then I get two numbers:

my ($from, $length) = (0, 5) ;

These are the indices of a substring within $s_nogaps (and not $s). The
above indices correspond to a string 'YERIT'.

Now I would like to have fragments of $s corresponding to this substring.
For example, given the above indices, I should get 'Y--ERI-T'.

Hmm, you could define a gappy letter as [A-Z], followed by 0 or more
dashes. Then you need to match a prefix of $from gappy letters, followed
by $length gappy letter, which you extract.

sub get_subseq {
my ($str, $from, $len) = @_;
return ($str =~ /(?:[A-Z]-*){$from}((?:[A-Z]-*){$len})/)[0];
}

The only problem with this approach is that it will return all trailing
dashes, i.e. get_subseq('Y--ERI-TT', 0, 1) eq 'Y--'.

HTH, Lukas
 
I

it_says_BALLS_on_your_forehead

January said:
Hi,

here is the problem.

I have strings (protein sequences, to be precise) that contain letters +
'-':

my $s = 'Y---ERI-TTKDIV----EIKRHLDYLQAPRITNNDLE' ; # for example
my $s_nogaps = $s ;
$s_nogaps =~ s/-//g ;

Then I get two numbers:

my ($from, $length) = (0, 5) ;

These are the indices of a substring within $s_nogaps (and not $s). The
above indices correspond to a string 'YERIT'.

Now I would like to have fragments of $s corresponding to this substring.
For example, given the above indices, I should get 'Y--ERI-T'.

--8 said:
any better ideas?


unless you really need some homegrown solution, why reinvent the wheel?
go to www.ncbi.nih.gov and use the free BLAST tool for sequence
alignment. that's what i did when i was doing bioinformatics data
mining.
 
A

Anno Siegel

January Weiner said:
Hi,

here is the problem.

I have strings (protein sequences, to be precise) that contain letters +
'-':

my $s = 'Y---ERI-TTKDIV----EIKRHLDYLQAPRITNNDLE' ; # for example
my $s_nogaps = $s ;
$s_nogaps =~ s/-//g ;

Then I get two numbers:

my ($from, $length) = (0, 5) ;

These are the indices of a substring within $s_nogaps (and not $s). The
above indices correspond to a string 'YERIT'.

Now I would like to have fragments of $s corresponding to this substring.
For example, given the above indices, I should get 'Y--ERI-T'.

The best that I can come up with is looping over each character separately,
and incrementing a counter only if it is:


sub get_subseq {
my ($s, $f, $l) = @_ ;
my $i = 0 ;
my $ret ;
my $within = 0 ;

for(split //, $s) {
last if($i >= $f + $l) ;
if($_ ne '-') { $i++ ; }
next unless($i > $f) ;
$ret .= $_ ;
}

return $ret ;
}

Create an offset table @offset that points, for every character position
in $s_nogaps, to the corresponding character in $s. That is, the array
would start out ( 0, 4, 5, 6, 8, ...). Use that table to calculate
the character positions in $s, given character positions in $s_nogap.

my $s = 'Y---ERI-TTKDIV----EIKRHLDYLQAPRITNNDLE' ; # for example

my @offset = do {
my @char = split //, $s;
grep $char[ $_] ne '-', 0 .. $#char;
};

(my $s_nogaps = $s) =~ tr/-//d;

my ($from, $length) = ( 0, 5);
my $ret = do {
my $to = $from + $length;
substr( $s, $offset[ $from], $offset[ $to] - $offset[ $from]);
};

print "$ret\n";

Anno
 
J

January Weiner

it_says_BALLS_on_your_forehead

no it doesn't :p

unless you really need some homegrown solution, why reinvent the wheel?
go to www.ncbi.nih.gov and use the free BLAST tool for sequence
alignment. that's what i did when i was doing bioinformatics data
mining.

:) Nope, I am not doing sequence alignments (and apart from that, BLAST is
a heuristic algorithm not guaranteed to achieve the optimal alignment, but
then, of course, one has the Bioperl suite as well). I need it for a
program, which (among other things) visualises multiple alignments, and
does various stuff on fragments thereof.

j.

--
 
A

Aaron Baugher

January Weiner said:
I have strings (protein sequences, to be precise) that contain
letters + '-':

my $s = 'Y---ERI-TTKDIV----EIKRHLDYLQAPRITNNDLE' ; # for example
my $s_nogaps = $s ;
$s_nogaps =~ s/-//g ;

Then I get two numbers:

my ($from, $length) = (0, 5) ;

These are the indices of a substring within $s_nogaps (and not $s).
The above indices correspond to a string 'YERIT'.

Now I would like to have fragments of $s corresponding to this
substring. For example, given the above indices, I should get
'Y--ERI-T'.

I'd start out the way you are: strip the hyphens from the line and get
your substring. Then turn the substring into a search regex that will
accept any number of hyphens between each letter. Tested:

#!/usr/bin/perl -w
use strict;

my( $from, $length ) = (0,5);

my $s = 'Y---ERI-TTKDIV----EIKRHLDYLQAPRITNNDLE';
( my $nogaps = $s ) =~ s/-//g;
my $string = substr( $nogaps, $from, $length );
my $search = join '-*', split //, $string;
print "Search: $search\n";
$s =~ /($search)/;
my $result = $1;
print "Result: $result\n";

That could be simplified to use fewer lines and variables.
 
A

Anno Siegel

Aaron Baugher said:
I'd start out the way you are: strip the hyphens from the line and get
your substring. Then turn the substring into a search regex that will
accept any number of hyphens between each letter. Tested:

[code snipped]

That doesn't give a unique identification in all cases. If

$s = 'E-E-E-E-E';

the substring "EE" of $s_nogaps can't be uniquely identified. Less trivial
examples are also possible.

Anno
 
J

January Weiner

Aaron Baugher said:
I'd start out the way you are: strip the hyphens from the line and get
your substring. Then turn the substring into a search regex that will
accept any number of hyphens between each letter. Tested:

There can be different, but identical substrings in $s_nogaps, which,
however, have different numbers of gaps :).

j.

--
 
J

January Weiner

Anno Siegel said:
Create an offset table @offset that points, for every character position
in $s_nogaps, to the corresponding character in $s. That is, the array
would start out ( 0, 4, 5, 6, 8, ...). Use that table to calculate
the character positions in $s, given character positions in $s_nogap.

(snip)

Thanks! This is an elegant solution to my problem. I'm running benchmarks
to see which one is quicker (say, you have hundreds sequences, each with a
length of a couple of thousands, and you are searching for several
substrings in each sequence...).

Cheers,

January

--
 
J

January Weiner

Lukas Mai said:
Hmm, you could define a gappy letter as [A-Z], followed by 0 or more
dashes. Then you need to match a prefix of $from gappy letters, followed
by $length gappy letter, which you extract.
sub get_subseq {
my ($str, $from, $len) = @_;
return ($str =~ /(?:[A-Z]-*){$from}((?:[A-Z]-*){$len})/)[0];
}

Oww, this is nice and compact.
The only problem with this approach is that it will return all trailing
dashes, i.e. get_subseq('Y--ERI-TT', 0, 1) eq 'Y--'.

This is not a problem at all, I can alwas strip the remaining -'s.
Could you explain why you use the (?: ) construct? It says in the perlre
that

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So

@fields = split(/\b(?:a|b|c)\b/)

is like

@fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture
characters if you don't need to.

I assume that with (?: ) you dont get $1 and $2 in our example, but why
should it matter?

Cheers,
j.

--
 
S

Samwyse

January said:
There can be different, but identical substrings in $s_nogaps, which,
however, have different numbers of gaps :).

This is untested, but based on Aaron's code:

#!/usr/bin/perl -w
use strict;

my( $from, $length ) = (0,5);
my (@matches, @offsets, @lengths);

my $s = 'Y---ERI-TTKDIV----EIKRHLDYLQAPRITNNDLE';
( my $nogaps = $s ) =~ s/-//g;
my $string = substr( $nogaps, $from, $length );
my $search = join '-*', split //, $string;
print "Search: $search\n";
while ($s =~ /($search)/g) {
push @matches, $1;
push @offsets, pos $s;
push @lengths, length $1;
}

This gives you all of the matching subsequences, where they are in the
original string, and their lengths (including the gaps).

BTW, does anyone know if "study $s;" be useful for very long values of
$s? Obviously, all of these solutions need to be tested (via "use
Benchmark;") to see which works best for your typical data.
 
L

Lukas Mai

January Weiner said:
Lukas Mai said:
return ($str =~ /(?:[A-Z]-*){$from}((?:[A-Z]-*){$len})/)[0];

Could you explain why you use the (?: ) construct? It says in the perlre
that

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So

@fields = split(/\b(?:a|b|c)\b/)

is like

@fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture
characters if you don't need to.

I assume that with (?: ) you dont get $1 and $2 in our example, but why
should it matter?

I use (?: ) by default because I need grouping more often than
capturing. I could change the pattern to
/([A-Z]-*){$from}(([A-Z]-*){$len})/ but then I'd have to use
($str =~ /.../)[1] to extract the correct match variable. Not using
plain () except where needed makes my life easier because I don't have
to count all groups to find the one I'm interested in, plus it's also
faster.

HTH, Lukas
 
A

Anno Siegel

January Weiner said:
(snip)

Thanks! This is an elegant solution to my problem. I'm running benchmarks
to see which one is quicker (say, you have hundreds sequences, each with a
length of a couple of thousands, and you are searching for several
substrings in each sequence...).

Several substrings per sequence is good because you have to set up the
offset array only once per sequence. Looking up a substring of $s_nogaps
(given by position and length) in $s will happen in constant time,
independent of the lengths of $s, $s_nogaps and the substring.

Anno
 
T

Tad McClellan

January Weiner said:
return ($str =~ /(?:[A-Z]-*){$from}((?:[A-Z]-*){$len})/)[0];


The m// here is in list context (as part of a list slice).

I assume that with (?: ) you dont get $1 and $2 in our example,

Correct.


but why
should it matter?


Because then [0] would not be the correct index into the list,
you'd need [1] instead (to return the value of $2).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,338
Messages
2,571,783
Members
48,589
Latest member
puppyslingcarrier

Latest Threads

Top