Anything to be done about utf8 regexp performance?

Jochen Lehmeier · Nov 3, 2009

Hello,

perl -V|head

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
Platform:
osname=linux, osvers=2.6.22-3-k7, archname=i486-linux-gnu-thread-multi
uname='linux k 2.6.22-3-k7 #1 smp mon oct 22 22:51:54 utc 2007 i686
gnulinux

cat test.pl

#!/usr/local/bin/perl

use strict;
use warnings;

my $a = "a".("x" x 1000);
my $b = "\x{1234}".("x" x 1000);

for (0..1000)
{
    $a =~ s/r/xxx/;
    $a =~ s/r/xxx/i;
    $b =~ s/r/xxx/;
    $b =~ s/r/xxx/i;
}

perl -d:SmallProf test.pl

================ SmallProf version 2.02 ================
Profile of test.pl
Page 94
=================================================================
count  wall tm  cpu time  line
    0  0.00000   0.00000     1:#!/usr/local/bin/perl
    0  0.00000   0.00000     2:
    0  0.00000   0.00000     3:use strict;
    0  0.00000   0.00000     4:use warnings;
    0  0.00000   0.00000     5:
    1  0.00005   0.00000     6:my $a = "a".("x" x 1000);
    1  0.00006   0.00000     7:my $b = "\x{1234}".("x" x 1000);
    0  0.00000   0.00000     8:
    1  0.00000   0.00000     9:for (0..1000)
    0  0.00000   0.00000    10:{
 1001  0.00596   0.07000    11: $a =~ s/r/xxx/;
 1001  0.01276   0.03000    12: $a =~ s/r/xxx/i;
 1001  0.04787   0.14000    13: $b =~ s/r/xxx/;
 1004  2.05547   2.10000    14: $b =~ s/r/xxx/i;
    0  0.00000   0.00000    15:}

I can live with line 13, but line 14 is not funny anymore. 344 times
slower than a latin1 regexp... or 161 times slower than a
latin1 case-insensitive one.
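
For reference, the two ratios come from the "wall tm" column of the profile above. A minimal sketch of the arithmetic (the variable names are only illustrative, and the results are truncated, not rounded):

#!/usr/bin/perl
use strict;
use warnings;

# Wall-clock times copied from the SmallProf output for lines 11, 12 and 14.
my $latin1_plain  = 0.00596;   # line 11: $a =~ s/r/xxx/
my $latin1_nocase = 0.01276;   # line 12: $a =~ s/r/xxx/i
my $utf8_nocase   = 2.05547;   # line 14: $b =~ s/r/xxx/i

# int() truncates toward zero, which reproduces the figures quoted in the post.
my $vs_plain  = int($utf8_nocase / $latin1_plain);
my $vs_nocase = int($utf8_nocase / $latin1_nocase);

print "utf8 /i vs latin1:    ${vs_plain}x\n";
print "utf8 /i vs latin1 /i: ${vs_nocase}x\n";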

I understand that case calculations are much more complex in utf8 than
latin1. Is there anything that can be done, anyway?
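
As far as I can tell, the difference between $a and $b is not the pattern but how perl stores the string: the "\x{1234}" forces $b into the internal utf8 representation, while $a stays a plain byte string. A minimal sketch to check that, assuming only the core utf8::is_utf8 function (the variable names here are my own, chosen to stay clear of sort's $a and $b):

#!/usr/bin/perl
use strict;
use warnings;

# Same contents as $a and $b in test.pl.
my $bytes = "a" . ("x" x 1000);
my $wide  = "\x{1234}" . ("x" x 1000);

# utf8::is_utf8 reports whether the string carries perl's internal UTF-8 flag.
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;

print "byte string flagged as utf8: $bytes_flag\n";
print "wide string flagged as utf8: $wide_flag\n";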

Eric Pozharski · Nov 4, 2009

On 2009-11-03, Jochen Lehmeier said:
> #!/usr/local/bin/perl
>
> use strict;
> use warnings;
>
> my $a = "a".("x" x 1000);
> my $b = "\x{1234}".("x" x 1000);
>
> for (0..1000)
> {
> $a =~ s/r/xxx/;
> $a =~ s/r/xxx/i;
> $b =~ s/r/xxx/;
> $b =~ s/r/xxx/i;
> }
*SKIP*
> I can live with line 13, but line 14 is not funny anymore. 344 times
> slower than a latin1 regexp... or 161 times slower than a
> latin1 case-insensitive one.
>
> I understand that case calculations are much more complex in utf8 than
> latin1. Is there anything that can be done, anyway?

HTH (as you can see, that idea has its limitations):

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw{ cmpthese timethese };

my $a = "a" . ("x" x 1000);
my $b = "\x{1234}" . ("x" x 1000);

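# Run each variant for at least 3 CPU seconds, then compare the rates.
cmpthese timethese -3, {
    code00 => sub { $a =~ s/r/xxx/i; },     # byte string, case-insensitive
    code01 => sub { $b =~ s/r/xxx/i; },     # utf8 string, case-insensitive
    code02 => sub { $b =~ s/[rR]/xxx/; },   # utf8 string, explicit character class
};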

__END__
Benchmark: running code00, code01, code02 for at least 3 CPU seconds...
code00: 2 wallclock secs ( 3.02 usr + 0.00 sys = 3.02 CPU) @ 316342.72/s (n=955355)
code01: 4 wallclock secs ( 3.20 usr + 0.00 sys = 3.20 CPU) @ 4509.38/s (n=14430)
code02: 2 wallclock secs ( 3.13 usr + 0.00 sys = 3.13 CPU) @ 57964.86/s (n=181430)
           Rate code01 code02 code00
code01   4509/s     --   -92%   -99%
code02  57965/s  1185%     --   -82%
code00 316343/s  6915%   446%     --
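
One limitation beyond the speed gap: an explicit character class is only equivalent to /i when no non-ASCII character case-folds to the letter in question. That should hold for "r", but not for every letter. A minimal sketch, assuming a perl with full Unicode case folding, where KELVIN SIGN (U+212A) folds to "k"; the variable names are illustrative:

#!/usr/bin/perl
use strict;
use warnings;

# KELVIN SIGN: a non-ASCII character whose case fold is the ASCII letter "k".
my $kelvin = "\x{212A}";

# Under /i, Unicode case folding lets "k" match the Kelvin sign ...
my $match_nocase = ($kelvin =~ /k/i) ? 1 : 0;

# ... while the hand-written class only knows the two ASCII letters.
my $match_class = ($kelvin =~ /[kK]/) ? 1 : 0;

print "/k/i   matches U+212A: $match_nocase\n";
print "/[kK]/ matches U+212A: $match_class\n";

So replacing /x/i with /[xX]/ is a per-letter decision, and it only works if the ASCII-only meaning is what you actually want.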
