performance surprise -- why?

J

Joe Davison

I'm searching the genome with a perl script I wrote and encountered a
surprise when I tried to improve the performance -- it got worse, much
worse, and I'm wondering if there's a better way to do my second effort.

Here's the basic problem:

Given a short sequence, say AGTACT, and a chromosome, say
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB string).

I want to find all the places in the chromosome where the sequence
occurs.

Method 1: $genome =~ s/($sequence)/\n$1/g;
I then wrote the chopped string to a file and counted the lengths of the
lines using a simple awk program (don't worry about why).

That runs in about 4 seconds on my G3 iBook. But I figured I didn't
really need a second copy of a 30MB file that differed only in the
placement and number of newlines, so why not just save the positions
where it starts? I looked at somebody else's code and tried:

Method 2: $_=$genome;
while( m/$sequence/g) { push @indices,pos();}
and then write out the indices.

I only waited half an hour on my iBook before I killed that one.

I tried again with a couple of smaller files on my G4 desktop.


Method 5MB time 11MB time
1 2.6 sec 5.1 sec
2 48.1 sec 3:20.2 = 200.2 sec

I could give more details, but I tried things like preallocating the
space for @indices without much difference.

It also seems that method 2 is O(N^2). I figure pos() must count chars
from the front every time.

Isn't there a better way to get the position of the last match?

I recognize the benefit in not breaking out of the RE engine, but I'm
surprised at the difference.

Am I missing something completely?

joe
 
K

kent fritz

Try:
Method 2: $_=$genome;
while( m/$sequence/go) { push @indices,pos();}

I think your regex is recompiling every time. It wouldn't in method #1.
 
U

Uri Guttman

kf> Try:
kf> Method 2: $_=$genome;
kf> while( m/$sequence/go) { push @indices,pos();}

kf> I think your regex is recompiling every time. It wouldn't in method #1.

BZZZT! that makes no sense. in neither case is the regex recompiling as
$sequence doesn't change (and /o is fairly obsolete now).

my guess is the difference between munging a string and allocating
scalars for the push. in the first case perl will start building up a
new string and just append stuff as the s///g proceeds so that is fairly
fast. in the second case a new scalar has to be allocated (which takes
time) for each match.

uri
 
M

Matt Garrish

Joe Davison said:
I'm searching the genome with a perl script I wrote and encountered a
surprise when I tried to improve the performance -- it got worse, much
worse, and I'm wondering if there's a better way to do my second effort.

Here's the basic problem:

Given a short sequence, say AGTACT, and a chromosome, say
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB string).

I want to find all the places in the chromosome where the sequence
occurs.

Method 1: $genome =~ s/($sequence)/\n$1/g;
I then wrote the chopped string to a file and counted the lengths of the
lines using a simple awk program (don't worry about why).

That runs in about 4 seconds on my G3 iBook. But I figured I didn't
really need a second copy of a 30MB file that differed only in the
placement and number of newlines, so why not just save the positions
where it starts? I looked at somebody else's code and tried:

Method 2: $_=$genome;
while( m/$sequence/g) { push @indices,pos();}
and then write out the indices.

I only waited half an hour on my iBook before I killed that one.

You want foreach, otherwise you get an endless loop (because if the pattern
matches it will keep on matching forever if you're not operating on the
string). As an example, try the following:

my $str = 123454321;

foreach ($str =~ s/1/2/) {
print 1;
}

while ($str =~ /2/) {
print 2;
}

And see if you don't get a whole lot of 2's... : )

Matt
 
M

Matt Garrish

Matt Garrish said:
You want foreach, otherwise you get an endless loop (because if the pattern
matches it will keep on matching forever if you're not operating on the
string). As an example, try the following:

my $str = 123454321;

foreach ($str =~ s/1/2/) {
print 1;
}

while ($str =~ /2/) {
print 2;
}

And see if you don't get a whole lot of 2's... : )

That's what I get for skimming. I jumped to conclusions based on your
observation without looking at the modifiers on your regex. The above is not
pertinent in this case...

Matt
 
A

Anno Siegel

Joe Davison said:
I'm searching the genome with a perl script I wrote and encountered a
surprise when I tried to improve the performance -- it got worse, much
worse, and I'm wondering if there's a better way to do my second effort.

Here's the basic problem:

Given a short sequence, say AGTACT, and a chromosome, say
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT...(30MB string).

I want to find all the places in the chromosome where the sequence
occurs.

Method 1: $genome =~ s/($sequence)/\n$1/g;
I then wrote the chopped string to a file and counted the lengths of the
lines using a simple awk program (don't worry about why).

That runs in about 4 seconds on my G3 iBook. But I figured I didn't
really need a second copy of a 30MB file that differed only in the
placement and number of newlines, so why not just save the positions
where it starts? I looked at somebody else's code and tried:

Method 2: $_=$genome;
while( m/$sequence/g) { push @indices,pos();}
and then write out the indices.

I only waited half an hour on my iBook before I killed that one.

I tried again with a couple of smaller files on my G4 desktop.


Method 5MB time 11MB time
1 2.6 sec 5.1 sec
2 48.1 sec 3:20.2 = 200.2 sec

That is unexpected. The substitution method must move parts of the
string for every match, so I'd expect it to be slower than global
matching.

I benchmarked both, and also a method based on the index() function.
The results show indexing and global matching in the same ballpark,
both almost twice as fast as substitution (code appended below):

substitute 13.6/s -- -41% -45%
indexing 23.1/s 69% -- -7%
globmatch 24.8/s 82% 7% --

This was on a slow matching with much smaller (200 K) samples, but
the dependence on size should be largely linear. Where your
quadratic (well, more-than-linear) behavior comes from is anybody's
guess.

Anno

=============================================================

#!/usr/local/bin/perl
use strict; use warnings; $| = 1;
use Vi::QuickFix;

use Benchmark ':all';

use constant SIZE => 400_000;
my $sequence = 'AGTACT';
my $str;
$str .= ( qw( A C G T))[ rand 4] for 1 .. SIZE;

goto bench;

print "globmatch: ", globmatch(), "\n";
print "substitute: ", substitute(), "\n";
print "indexing: ", indexing(), "\n";
exit;

bench:
cmpthese( -1, {
globmatch => 'globmatch',
substitute => 'substitute',
indexing => 'indexing',
});

######################################################################

sub globmatch {
my @indices;
$_ = $str;
push @indices, pos while /$sequence/g;
scalar @indices;
}

sub substitute {
$_ = $str;
s/$sequence/\n$sequence/g;
tr/\n//;
}

sub indexing {
$_ = $str;
my @indices;
my $pos = 0;
while ( 1 ) {
last if ( $pos = index( $_, $sequence, $pos)) < 0;
push @indices, $pos;
$pos += length $sequence;
}
scalar @indices;
}
 
J

Joe Davison

That is unexpected. The substitution method must move parts of the
string for every match, so I'd expect it to be slower than global
matching.

I benchmarked both, and also a method based on the index() function.
The results show indexing and global matching in the same ballpark,
both almost twice as fast as substitution (code appended below):

substitute 13.6/s -- -41% -45%
indexing 23.1/s 69% -- -7%
globmatch 24.8/s 82% 7% --

This was on a slow matching with much smaller (200 K) samples, but
the dependence on size should be largely linear. Where your
quadratic (well, more-than-linear) behavior comes from is anybody's
guess.

Anno

Thanks.
I took your program and modified it so that instead of generating the
string you're matching, it read if from a file, and took the file name
from the command line -- so I could feed it the same files I'd used
before.

My files are k1, k10, k50, k100 and k200
(k1 being 1000 lines, k10 being 10,000 lines, etc)
the file sizes are:
k1 59K
k10 595K
k50 2M
k100 5M
k200 11M

The files have newlines every 60 char or so, but I strip them before
running the test. -- Oh yes, I moved the nl stripping out of substitute(),
and did it last.

Here's my results -- the time line is for the whole program, which
includes time to read the file and strip the newlines.
Interesting that on my system
867 MHz Apple G4 1.25GB ram, Mac OS X(10.3.5), perl 5.8.1
substitute is close to indexing, both significantly faster than
globmatch. Looks like that may be worth trying, at any rate.

I don't understand the percentages being displayed here --

k1:
Rate globmatch substitute indexing
globmatch 355/s -- -56% -60%
substitute 815/s 129% -- -7%
indexing 878/s 147% 8% --

time: 3.08s user 1.68s system 90% cpu 5.261 total


k10:
Rate globmatch substitute indexing
globmatch 3.60/s -- -95% -96%
substitute 67.6/s 1777% -- -20%
indexing 84.6/s 2248% 25% --

time: 2.91s user 1.54s system 90% cpu 4.922 total

k50:
(warning: too few iterations for a reliable count)
Rate globmatch substitute indexing
globmatch 9.77e-02/s -- -99% -99%
substitute 11.3/s 11492% -- -20%
indexing 14.2/s 14399% 25% --

time: 10.38s user 13.20s system 89% cpu 26.479 total

k100:
(warning: too few iterations for a reliable count)
Rate globmatch substitute indexing
globmatch 2.42e-02/s -- -100% -100%
substitute 5.71/s 23477% -- -16%
indexing 6.80/s 27941% 19% --

time: 34.66s user 50.17s system 88% cpu 1:35.80 total

k200:
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
Rate globmatch substitute indexing
globmatch 5.65e-03/s -- -100% -100%
substitute 2.83/s 50017% -- -17%
indexing 3.42/s 60440% 21% --

time: 142.39s user 214.24s system 85% cpu 6:56.92 total

joe
 
C

ctcgag

Joe Davison said:
Thanks.
I took your program and modified it so that instead of generating the
string you're matching, it read if from a file, and took the file name
from the command line -- so I could feed it the same files I'd used
before.

Let us crawl before we walk. Run the code exactly like Anno posted it.

If globmatch is dogging it when you do that, we know that the problem
lies in your machine or perl installation, not in your Perl program.

(I tested it up to SIZE of 400_000_000, and globmatch always
came out on top)

Xho
 
J

Joe Davison

Let us crawl before we walk. Run the code exactly like Anno posted
it.

If globmatch is dogging it when you do that, we know that the problem
lies in your machine or perl installation, not in your Perl program.

(I tested it up to SIZE of 400_000_000, and globmatch always
came out on top)

Xho


I did that before I modified it. Here's the result -- notice that
globmatch is fastest in this case.



Rate substitute indexing globmatch
substitute 104/s -- -21% -49%
indexing 131/s 26% -- -36%
globmatch 206/s 98% 57% --

time: 4.58s user 1.04s system 94% cpu 5.970 total


The changes I made to his code were trivial, here's my version:

----------------------
#!/usr/bin/perl
use strict; use warnings; $| = 1;
#use Vi::QuickFix;

use English;

use Benchmark ':all';

use constant SIZE => 400_000;
my ($sequence,$ifile,$whatItWas);
my $str;

$sequence = shift @ARGV;
$ifile = shift @ARGV;

unless (open INPUT, $ifile) {
print STDERR "Can't open $ifile: $! \n";
next FILE;
}


# Suck the whole ifile in for processing

$str=<INPUT>; #toss the first record (the >CHR1v03212003 line in one case...)
$whatItWas = $INPUT_RECORD_SEPARATOR;
undef $INPUT_RECORD_SEPARATOR;
$str = <INPUT>;
$INPUT_RECORD_SEPARATOR = $whatItWas;

close INPUT;

# Now do what we came for

$str =~ s/\n//gs; # delete newlines

goto bench;

print "globmatch: ", globmatch(), "\n";
print "indexing: ", indexing(), "\n";
print "substitute: ", substitute(), "\n";
exit;

bench:
cmpthese( -1, {
globmatch => 'globmatch',
substitute => 'substitute',
indexing => 'indexing',
});

######################################################################

sub globmatch {
my @indices;
$_ = $str;
push @indices, pos while /$sequence/g;
scalar @indices;
}

sub substitute {
$_ = $str;
s/$sequence/\n$sequence/g;
}

sub indexing {
$_ = $str;
my @indices;
my $pos = 0;
while ( 1 ) {
last if ( $pos = index( $_, $sequence, $pos)) < 0;
push @indices, $pos;
$pos += length $sequence;
}
scalar @indices;
}
 
J

Joe Davison

Here's one weirdness!

use English;

I created a file with 400_000 chars from the big file I use, to ease
comparison with Anno's original.

Here's the output form Anno's script on my system:

Rate substitute indexing globmatch
substitute 103/s -- -22% -49%
indexing 133/s 28% -- -35%
globmatch 204/s 97% 54% --

time: 4.94s user 0.99s system 91% cpu 6.500 total


Here's the output from my current version, without "Use English;"

Rate indexing substitute globmatch
indexing 125/s -- -12% -32%
substitute 142/s 13% -- -23%
globmatch 184/s 47% 30% --

time: 3.75s user 1.05s system 91% cpu 5.257 total

And here's the output from my current versin, with "Use English;"

Rate globmatch substitute indexing
globmatch 8.00/s -- -93% -94%
substitute 113/s 1314% -- -10%
indexing 125/s 1466% 11% --

time: 2.70s user 1.54s system 90% cpu 4.685 total

And, to make that more believable, here's the diff between my two
versions:
=================== diff AnnosProg1.pl AnnosProg5.pl=======
4a5,6
use English;
19d20
<
23,24c24,25
< $whatItWas = $/;
< undef $/;
---
$whatItWas = $INPUT_RECORD_SEPARATOR;
undef $INPUT_RECORD_SEPARATOR;
28c29
< $/ = $whatItWas;
---
$INPUT_RECORD_SEPARATOR = $whatItWas;
=============================================================

And here's the source for AnnosProg1.pl -- the one without use English.

===================================
#!/usr/bin/perl
use strict; use warnings; $| = 1;
#use Vi::QuickFix;

use Benchmark ':all';

use constant SIZE => 400_000;
my ($ifile,$whatItWas);
my ($str,$fsz);

my $sequence = 'AGTACT';

$ifile = shift @ARGV;

unless (open INPUT, $ifile) {
print STDERR "Can't open $ifile: $! \n";
}


# Suck the whole ifile in for processing

$str=<INPUT>; #toss the first record (the >CHR1v03212003 line in one case...)
$whatItWas = $/;
undef $/;
$fsz = -s $ifile;
$str = " " x $fsz; # preallocate filesized string!!
$str = <INPUT>;
$/ = $whatItWas;

close INPUT;

# Now do what we came for

$str =~ s/\n//gs; # delete newlines

goto bench;

print "globmatch: ", globmatch(), "\n";
print "indexing: ", indexing(), "\n";
print "substitute: ", substitute(), "\n";
exit;

bench:
cmpthese( -1, {
globmatch => 'globmatch',
substitute => 'substitute',
indexing => 'indexing',
});

######################################################################

sub globmatch {
my @indices;
$_ = $str;
push @indices, pos while /$sequence/g;
scalar @indices;
}

sub substitute {
$_ = $str;
s/$sequence/\n$sequence/g;
}

sub indexing {
$_ = $str;
my @indices;
my $pos = 0;
while ( 1 ) {
last if ( $pos = index( $_, $sequence, $pos)) < 0;
push @indices, $pos;
$pos += length $sequence;
}
scalar @indices;
}
 
A

Anno Siegel

[shortened]
Here's one weirdness!

use English;

I created a file with 400_000 chars from the big file I use, to ease
comparison with Anno's original.

Here's the output form Anno's script on my system:
globmatch 204/s

Here's the output from my current version, without "Use English;"
globmatch 184/s

And here's the output from my current versin, with "Use English;"
globmatch 8.00/s

"Performance surprise" indeed. Something is dreadfully wrong there.
The only reason that hasn't come up earlier is because nobody uses
the English module. Well, sorry, almost nobody :)

Time for a serious bug fix. Or to deprecate "English" officially.

Anno
 
C

ctcgag

Joe Davison said:
Here's one weirdness!

use English;

aha!

perldoc English

....
PERFORMANCE
This module can provoke sizeable inefficiencies for regular expres-
sions, due to unfortunate implementation details. If performance
matters in your application and you don't need $PREMATCH, $MATCH,
or $POSTMATCH, try doing

use English qw( -no_match_vars ) ;

Xho
 
U

Uri Guttman

A> Or one could just read the manual page. From 'man English':

A> PERFORMANCE
A> This module can provoke sizeable inefficiencies for regu­
A> lar expressions, due to unfortunate implementation
A> details. If performance matters in your application and
A> you don't need $PREMATCH, $MATCH, or $POSTMATCH, try doing

A> use English qw( -no_match_vars ) ;

and as damian conway says (and i agree) they chose the wrong default for
that option. it should have been -use_match_vars and the default for
english should be to not alias them.

uri
 
J

Joe Davison

A> Or one could just read the manual page. From 'man English':

A> PERFORMANCE
A> This module can provoke sizeable inefficiencies for
A> regu­ lar expressions, due to unfortunate
A> implementation details. If performance matters in
A> your application and you don't need $PREMATCH,
A> $MATCH, or $POSTMATCH, try doing

A> use English qw( -no_match_vars ) ;

and as damian conway says (and i agree) they chose the wrong default
for that option. it should have been -use_match_vars and the default
for english should be to not alias them.

uri

Certainly true; mea culpa. I suppose one could spend all ones time
RTFMing and never run into such problems :)

So, say I do use English qw( -no_match_vars). I presume I'd still be
able to use the non-english versions to access the match variables,
should the need arise? RTFM, aye.

I wonder, is it even possible to read all the documentation in a single
lifetime, assuming, say, one were to be paid to read it, so it didn't
interfere with earning a living. Maybe today, but as we work to
improve the documentation, it will surely become impossible.

Not surprisingly, modifying Anno's original program to use English, and again
to use English qw( -no_match_vars ) successfully demonstrates the point.

Thanks for the help. I think I'll spend some time understanding
Benchmark better -- it certainly seems to be useful.

joe
 
J

John W. Krahn

Joe said:
[snip]

#!/usr/bin/perl
use strict; use warnings; $| = 1;
#use Vi::QuickFix;

use Benchmark ':all';

use constant SIZE => 400_000;
my ($ifile,$whatItWas);
my ($str,$fsz);

my $sequence = 'AGTACT';

$ifile = shift @ARGV;

unless (open INPUT, $ifile) {
print STDERR "Can't open $ifile: $! \n";
^^^^^^^^^^^^
If the file can't be opened you print an error message but then you continue
on as if nothing had happened. You should do something that avoids trying to
read from an invalid filehandle, like exit() or die() maybe? You may also
want to use binmode() on the filehandle.

}

# Suck the whole ifile in for processing

$str=<INPUT>; #toss the first record (the >CHR1v03212003 line in one case...)
$whatItWas = $/;
undef $/;
$fsz = -s $ifile;
$str = " " x $fsz; # preallocate filesized string!!
$str = <INPUT>;
$/ = $whatItWas;

The usual idiom to slurp a file is:

my $str = do { local $/; <INPUT> };

Another alternative (if you are going to stat the file anyway) is:

( my $fsz = read INPUT, my $str, -s INPUT ) or die "read error: $!";
$fsz == -s _ or die "Read $fsz bytes, expected " . -s _ . " bytes.\n":

It seems (at least to me) that since you are slurping the whole file,
preallocating the string is not required.

You may also want to have a look at the File::Slurp module which is AFAIK very
efficient.

close INPUT;

# Now do what we came for

$str =~ s/\n//gs; # delete newlines

Use transliteration instead of substitution for speed:

$str =~ tr/\n//d; # delete newlines



John
 
U

Uri Guttman

A> Uri Guttman ([email protected]) wrote on MMMMXIII September MCMXCIII in
A> <URL:A> ""
A> "" A> Or one could just read the manual page. From 'man English':
A> ""
A> "" A> PERFORMANCE
A> "" A> This module can provoke sizeable inefficiencies for regu­
A> "" A> lar expressions, due to unfortunate implementation
A> "" A> details. If performance matters in your application and
A> "" A> you don't need $PREMATCH, $MATCH, or $POSTMATCH, try doing
A> ""
A> "" A> use English qw( -no_match_vars ) ;
A> ""
A> "" and as damian conway says (and i agree) they chose the wrong default for
A> "" that option. it should have been -use_match_vars and the default for
A> "" english should be to not alias them.


A> Well, yes, but Perl is full of things that are there because of
A> wrong decisions made in the past, and won't be changed for the sake
A> of backwards compatability.

but that was a fairly recent decision. the slowness of $& was known for
a long time and also that english triggered it. adding the ability to
not have english alias $& was a good idea. just no one seemed to think
it through and select the proper default. this is a good thing to learn
when designing apis. think carefully about which param values will be
the most common and make them defaults.

uri
 
C

ctcgag

Joe Davison said:
So, say I do use English qw( -no_match_vars). I presume I'd still be
able to use the non-english versions to access the match variables,
should the need arise? RTFM, aye.

Yes and no. You could still use them, but it would have a poor effect
on performance. Using the match variables anywhere causes all regex to
slow down[1]. And apparently "use English" triggers this same penalty
even if you don't use explicitly use the match variables anywhere. So you
could use them, but would get the same performance problem.

Xho

[1] The docs say they slow down all regex in the program, but I think one
of the smarty-pants here said it actually only slows down all regex in the
scopes containing the use.
 
A

Anno Siegel

["use English" causes regex slowdown]
A> Well, yes, but Perl is full of things that are there because of
A> wrong decisions made in the past, and won't be changed for the sake
A> of backwards compatability.

but that was a fairly recent decision. the slowness of $& was known for
a long time and also that english triggered it. adding the ability to
not have english alias $& was a good idea. just no one seemed to think
it through and select the proper default. this is a good thing to learn
when designing apis. think carefully about which param values will be
the most common and make them defaults.

Another option would be to conserve the documented behavior, so that
the speed penalty only occurs when $MATCH, etc. are used. One could
tie them first, so they only alias (and untie) themselves on first
usage. I don't particularly care to work out the details (a simple-
minded attempt didn't work immediately), but I'm sure it can be done.

Anno
 
U

Uri Guttman

c> Yes and no. You could still use them, but it would have a poor effect
c> on performance. Using the match variables anywhere causes all regex to
c> slow down[1]. And apparently "use English" triggers this same penalty
c> even if you don't use explicitly use the match variables anywhere. So you
c> could use them, but would get the same performance problem.

the issue is that if the perl compiler sees $& (or its siblings)
anywhere in the code, it will enable copying on s///. note that this
should affect only s/// and not m//. the need for a copy is because $&
refers to the matched string and that may have been destroyed by the
replacement of s///. m// doesn't change anything and so $& could refer
to the original string. now, if the original string is then changed, $&
would need a copy so maybe m/// does trigger it. $& is a global so that
is why it can be anywhere in the code and trigger the copy flag. English
aliases $& to some word (who cares, i never use it) and the appearance
of $& in there triggered the copy flag. the new option causes $& to not
be seen by the compiler (if wanted, it is executed with a string
eval). as i said, this was dumb and they should have inverted the logic
of the English flag so $& is not aliased by default.

uri
 
A

Anno Siegel

Abigail said:
Anno Siegel ([email protected]) wrote on MMMMXIV September
MCMXCIII in <URL:;;
;; ["use English" causes regex slowdown]
[...]

;; Another option would be to conserve the documented behavior, so that
;; the speed penalty only occurs when $MATCH, etc. are used. One could
;; tie them first, so they only alias (and untie) themselves on first
;; usage. I don't particularly care to work out the details (a simple-
;; minded attempt didn't work immediately), but I'm sure it can be done.

That wouldn't work. $& etc can only work if work is done when the

Not sure what you are saying here...
match occurs - if you delay it till $& is actually used, the necessary
information is gone. That's why the penalty occurs if the *compiler*
detect $&, not the runtime.

....but the meaning is clear: The regex engine must know it is expected
to save all matches when the match occurs. It won't do to tell it after-
wards. Pretty obvious, put that way.

[demo snipped]

So much for that idea. Thanks for shooting it down. Back to
deprecating "English", then :)

Anno
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,159
Messages
2,570,883
Members
47,419
Latest member
ArturoBres

Latest Threads

Top