J
Joe Davison
I'm searching the genome with a perl script I wrote and encountered a
surprise when I tried to improve the performance -- it got worse, much
worse, and I'm wondering if there's a better way to do my second effort.
Here's the basic problem:
Given a short sequence, say AGTACT, and a chromosome, say
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB string).
I want to find all the places in the chromosome where the sequence
occurs.
Method 1: $genome =~ s/($sequence)/\n$1/g;
I then wrote the chopped string to a file and counted the lengths of the
lines using a simple awk program (don't worry about why).
That runs in about 4 seconds on my G3 iBook. But I figured I didn't
really need a second copy of a 30MB file that differed only in the
placement and number of newlines, so why not just save the positions
where it starts? I looked at somebody else's code and tried:
Method 2: $_=$genome;
while( m/$sequence/g) { push @indices,pos();}
and then write out the indices.
I only waited half an hour on my iBook before I killed that one.
I tried again with a couple of smaller files on my G4 desktop.
Method 5MB time 11MB time
1 2.6 sec 5.1 sec
2 48.1 sec 3:20.2 = 200.2 sec
I could give more details, but I tried things like preallocating the
space for @indices without much difference.
It also seems that method 2 is O(N^2). I figure pos() must count chars
from the front every time.
Isn't there a better way to get the position of the last match?
I recognize the benefit in not breaking out of the RE engine, but I'm
surprised at the difference.
Am I missing something completely?
joe
surprise when I tried to improve the performance -- it got worse, much
worse, and I'm wondering if there's a better way to do my second effort.
Here's the basic problem:
Given a short sequence, say AGTACT, and a chromosome, say
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB string).
I want to find all the places in the chromosome where the sequence
occurs.
Method 1: $genome =~ s/($sequence)/\n$1/g;
I then wrote the chopped string to a file and counted the lengths of the
lines using a simple awk program (don't worry about why).
That runs in about 4 seconds on my G3 iBook. But I figured I didn't
really need a second copy of a 30MB file that differed only in the
placement and number of newlines, so why not just save the positions
where it starts? I looked at somebody else's code and tried:
Method 2: $_=$genome;
while( m/$sequence/g) { push @indices,pos();}
and then write out the indices.
I only waited half an hour on my iBook before I killed that one.
I tried again with a couple of smaller files on my G4 desktop.
Method 5MB time 11MB time
1 2.6 sec 5.1 sec
2 48.1 sec 3:20.2 = 200.2 sec
I could give more details, but I tried things like preallocating the
space for @indices without much difference.
It also seems that method 2 is O(N^2). I figure pos() must count chars
from the front every time.
Isn't there a better way to get the position of the last match?
I recognize the benefit in not breaking out of the RE engine, but I'm
surprised at the difference.
Am I missing something completely?
joe