Koszalek Opalek
#!/usr/bin/perl
=pod
The code below benchmarks two regexps that look for lines
starting with the equals sign in a multiline string. The
regexps differ only by how the line break is detected.
The first regexp uses the ^ metacharacter and the /m flag:
qr{\G^=[^\n]*}ism
The other relies on a positive look-behind assertion:
qr{\G(?<=\n)=[^\n]*}ism
One difference between the two regexps is that the ^
version matches '=' at the beginning of the string,
whereas the other does not. But there is something
else as well. The second (look-behind) version is at least
50 times faster!
Note that both regexps use the \G assertion (match only at
pos()) -- and the position is set to a random number in each
loop iteration. I assumed both regexps would be very fast
(because they have to be checked at only one position in the string)
-- apparently not so.
Could someone explain what's going on behind the scenes in
the regexp engine? Is it scanning the complete string for
line breaks if I use ^, even though it has to match only
at pos() ?
K.
=cut
use strict;
use Time::HiRes qw( time );

$| = 1;    # unbuffered output

# Build 1000 lines of random printable characters (codes 32..91,
# which includes '='), each 0..49 characters long.
my $gibberish;
for ( 1 .. 1000 ) {
    for ( 1 .. int( rand 50 ) ) {
        $gibberish .= chr( int( rand 60 ) + 32 );
    }
    $gibberish .= "\n";
}

my $l   = length $gibberish;
my $cnt = 100_000;

# Pick the random start offsets once, so both regexps see the same ones.
my @positions;
for ( 1 .. $cnt ) { push @positions, int( rand $l ) }

print "String length: $l.\n\n";

for my $re (
    qr{\G(?<=\n)=[^\n]*}ism,
    qr{\G^=[^\n]*}ism,
) {
    my $succ  = 0;
    my $start = time;
    foreach ( @positions ) {
        pos $gibberish = $_;    # \G anchors the match at this offset
        $succ++ if ( $gibberish =~ m/$re/g );
    }
    print "Regexp: $re.\n";
    print "Successful matches $succ.\n";
    printf "Time = %f.\n\n", time - $start;
}

print "$cnt matches for each regexp.\n";
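
# A minimal sketch, not part of the benchmark above, of the behavioural
# difference mentioned in the POD: with /m, ^ also matches at offset 0,
# whereas (?<=\n) needs a preceding newline. The sample string and the
# variable names below are made up for illustration only.
my $sample          = "=first\nfoo\n=third\n";
my @caret_hits      = $sample =~ m{^=[^\n]*}mg;          # both '=' lines
my @lookbehind_hits = $sample =~ m{(?<=\n)=[^\n]*}mg;    # misses '=first'
print "caret: @caret_hits\n";
print "look-behind: @lookbehind_hits\n";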