how to speed up a string-substitution loop?

A

Adam Kellas

Hi,

Looking for suggestions on how to speed up the function below. It's
intended to "re-macroize" the output of make; in other words given a
sequence of command lines generated by make, and a set of make macros,
I need to substitute in the make variables such that "gcc -c -g -O2 -
Wall -DNDEBUG" might become (say) "$(CC) -c $(CFLAGS) $(DFLAGS)", as
much as possible as it was in the original Makefile. The %Variables
hash maps strings to macro names; thus with the above example we would
have

$Variables{'gcc'} = '$(CC)';
$Variables{'-g -O2 -Wall'} = '$(CFLAGS)';
$Variables{'-DNDEBUG'} = '$(DFLAGS)';

Anyway, the function below seems to work but scales very badly and
becomes unusable when enough variables are in use. Any ideas on how to
write it better?

sub varify {
my $word = shift;
$word =~ s%\$%\$\$%g;
for my $substr (keys %Variables) {
while ((my $start = index($word, $substr)) >= 0) {
substr($word, $start, length($substr)) = $Variables{$substr};
}
}
return $word;
}

Thanks,
AK
 
J

Jim Gibson

Adam Kellas said:
Hi,

Looking for suggestions on how to speed up the function below. It's
intended to "re-macroize" the output of make; in other words given a
sequence of command lines generated by make, and a set of make macros,
I need to substitute in the make variables such that "gcc -c -g -O2 -
Wall -DNDEBUG" might become (say) "$(CC) -c $(CFLAGS) $(DFLAGS)", as
much as possible as it was in the original Makefile. The %Variables
hash maps strings to macro names; thus with the above example we would
have

$Variables{'gcc'} = '$(CC)';
$Variables{'-g -O2 -Wall'} = '$(CFLAGS)';
$Variables{'-DNDEBUG'} = '$(DFLAGS)';

Anyway, the function below seems to work but scales very badly and
becomes unusable when enough variables are in use. Any ideas on how to
write it better?

sub varify {
my $word = shift;
$word =~ s%\$%\$\$%g;
for my $substr (keys %Variables) {
while ((my $start = index($word, $substr)) >= 0) {
substr($word, $start, length($substr)) = $Variables{$substr};
}
}
return $word;
}

I don't see how you are going to get much of a speedup. It seems you
are already doing the minimum amount of work with no wasted steps.

You might try using the each() function instead of keys. That saves
generating the key array and the hash lookup for the replacement
string:

while( my($key,$replace) = each( %Variables) ) {
...
substr($word, $start, length($substr)) = $replace;
}

You could also pre-compute the replacement string lengths so you don't
have to call the length() function for each replacement. Thus, you
might be better off using three arrays or a two-dimensional (N,3) array
to hold (key,replacement,length(replacement)) values.

How many key:replacement pairs are you using? I am surprised this
doesn't scale very well. It would appear to be O(n) in the number of
search strings.

As always, only benchmarking can ensure you are getting any speedups or
tell you where the actual bottlenecks are.
 
U

Uri Guttman

AK> Hi,
AK> Looking for suggestions on how to speed up the function below. It's
AK> intended to "re-macroize" the output of make; in other words given a
AK> sequence of command lines generated by make, and a set of make macros,
AK> I need to substitute in the make variables such that "gcc -c -g -O2 -
AK> Wall -DNDEBUG" might become (say) "$(CC) -c $(CFLAGS) $(DFLAGS)", as
AK> much as possible as it was in the original Makefile. The %Variables
AK> hash maps strings to macro names; thus with the above example we would
AK> have

AK> $Variables{'gcc'} = '$(CC)';
AK> $Variables{'-g -O2 -Wall'} = '$(CFLAGS)';
AK> $Variables{'-DNDEBUG'} = '$(DFLAGS)';

AK> sub varify {
AK> my $word = shift;
AK> $word =~ s%\$%\$\$%g;
AK> for my $substr (keys %Variables) {
AK> while ((my $start = index($word, $substr)) >= 0) {
AK> substr($word, $start, length($substr)) = $Variables{$substr};
AK> }
AK> }
AK> return $word;
AK> }

can you make a pattern that would match ANY of the strings you want to
match? even a alternation might do well enough. then you can do a
single s/// with the replacement value being looked up in the $variables
(poor name) hash.

also why would speed be an issue here? it looks like it would be done
off line to fix up makefile output.

uri
 
J

Jürgen Exner

Adam Kellas said:
Looking for suggestions on how to speed up the function below. It's
intended to "re-macroize" the output of make; in other words given a
sequence of command lines generated by make, and a set of make macros,
I need to substitute in the make variables such that "gcc -c -g -O2 -
Wall -DNDEBUG" might become (say) "$(CC) -c $(CFLAGS) $(DFLAGS)", as
much as possible as it was in the original Makefile. The %Variables
hash maps strings to macro names; thus with the above example we would
have

$Variables{'gcc'} = '$(CC)';
$Variables{'-g -O2 -Wall'} = '$(CFLAGS)';
$Variables{'-DNDEBUG'} = '$(DFLAGS)';

Anyway, the function below seems to work but scales very badly and
becomes unusable when enough variables are in use. Any ideas on how to
write it better?

sub varify {
my $word = shift;
$word =~ s%\$%\$\$%g;
for my $substr (keys %Variables) {
while ((my $start = index($word, $substr)) >= 0) {
substr($word, $start, length($substr)) = $Variables{$substr};

Probably I am missing the obvious,but why are you doing the replacements
manually, thus incurring a lot of string copy, instead of simply doing a
s///?
I would replace the whole while loop with a straight-forward
s/$substr/$Variables{$substr}/g;
May have to add \Q...\E if needed.

BTW: $substr is an awful name considering there is a function substr()
and capitalized names ($Variables) normally indicate file handles.

jue
 
A

Adam Kellas

can you make a pattern that would match ANY of the strings you want to
match? even a alternation might do well enough. then you can do a
single s/// with the replacement value being looked up in the $variables
(poor name) hash.

Thanks, I will try this.
also why would speed be an issue here? it looks like it would be done
off line to fix up makefile output.

You're right, this is done more or less off line and in theory should
not be too performance sensitive. But this exhibits really
pathological behavior - for reasons I don't understand, though it
works fine in small unit-test setups it can appear to hang for hours
on end in real-world situations. During that time strace shows that
perl is calling the brk() system call over and over.

Thanks,
AK
 
A

Adam Kellas

Probably I am missing the obvious,but why are you doing the replacements
manually, thus incurring a lot of string copy, instead of simply doing a
s///?
I would replace the whole while loop with a straight-forward
        s/$substr/$Variables{$substr}/g;
May have to add \Q...\E if needed.

BTW: $substr is an awful name considering there is a function substr()
and capitalized names ($Variables) normally indicate file handles.

I figured that s/// would be slower than the index/substr technique,
which in theory should resolve to a strstr(), strcpy, and realloc() in
the perl binary. Or at least so I thought.

AK
 
J

Jürgen Exner

Adam Kellas said:
I figured that s/// would be slower than the index/substr technique,

I'd guess it's a simple enough change to just give it a try. If you
really need hard data then there's always the Benchmark module.

jue
 
U

Uri Guttman

AK> Thanks, I will try this.

AK> You're right, this is done more or less off line and in theory should
AK> not be too performance sensitive. But this exhibits really
AK> pathological behavior - for reasons I don't understand, though it
AK> works fine in small unit-test setups it can appear to hang for hours
AK> on end in real-world situations. During that time strace shows that
AK> perl is calling the brk() system call over and over.

you would have to post all the code and some data and hopefully that
would exhibit the problem. calling brk() means you are doing serious
data allocation which shouldn't happen in this style of code. maybe you
are doing something else that you didn't tell us. s/// ops on a string
may add some ram needs but not insane amounts. massive unnecessary ram
needs are usually leaks, either a rare perl bug or some poorly designed
data structure that leaks.

uri
 
U

Uri Guttman

BM> As a rule of thumb, any single perl op (s/// is a single op) is likely
BM> to be faster than the equivalent algorithm using multiple ops (substr,
BM> index, the assignment and the while loop add up to quite a lot of ops
BM> between them). Lvalue substr is also likely to be slower than 4-arg
BM> substr (since it's much more magical).

totally agree. perl guts are almost always faster than the equivilent in
perl lang. the rule is to try to stay inside perl as much as
possible. there are a few exceptions but not in this case.

BM> Your post xthread (about the process sitting in brk(2)) suggests a lot
BM> of string copying. You could try running through the source string and
BM> building a single result string as an alternative, which should avoid
BM> most of the copying. If you have some idea of an upper bound on the
BM> length of the destination string (if your de-substitutions always make
BM> the string shorter, for instance) you could even try preallocating the
BM> destination string like this:

BM> my $dest = "x" x $dest_size;
BM> $dest = "";

and my solution of a single s///g with all the keys and lookup in a hash
for the replacement will speed it up. fewer ops and much less realloc as
it will build up a single new string as it does the replacements. it may
realloc some but nowhere like with the lvalue substr and such the OP
used.

uri
 
A

Adam Kellas

you would have to post all the code and some data and hopefully that
would exhibit the problem. calling brk() means you are doing serious
data allocation which shouldn't happen in this style of code. maybe you
are doing something else that you didn't tell us. s/// ops on a string
may add some ram needs but not insane amounts. massive unnecessary ram
needs are usually leaks, either a rare perl bug or some poorly designed
data structure that leaks.

For the record, this rewrite seems to solve the problem:

sub varify {
my $text = shift;
$text =~ s%\$%\$\$%g;
for (keys %MakeVars) {
my $name = $MakeVars{$_};
$text =~ s%$_%$name%g;
}
return $text;
}

Here the hash keys are precompiled REs, e.g.

my $re = qr/\Q$text\E/;
$MakeVars{$re} = $macro_name;

Thanks for the advice,
AK
 
U

Uri Guttman

AK> For the record, this rewrite seems to solve the problem:

AK> sub varify {
AK> my $text = shift;
AK> $text =~ s%\$%\$\$%g;

why the strange use of % for a delimiter? it is impossible to
read. actually one of the worst choices you can make IMO. besides you
don't need to do that as / is fine since you have no / in your regex.

AK> for (keys %MakeVars) {

use a named variable for loops. it is cleaner and easier to read. $_ can
be easily clobbered or some action at a distance can happen. my vars are
lexical to the loop and much safer

AK> my $name = $MakeVars{$_};
AK> $text =~ s%$_%$name%g;

again with the % delimiter. i would go blind reading that. and there is
no need for the temp $name as you can put that hash lookup directly in
the replacement string.

AK> }
AK> return $text;
AK> }

AK> Here the hash keys are precompiled REs, e.g.

AK> my $re = qr/\Q$text\E/;

you don't use % there, so why in the s/// ops?

AK> $MakeVars{$re} = $macro_name;

again, you don't need the temp var. and that will cause you to lose the
precompiled nature of qr. all keys in a hash are just plain strings (not
perl values). it may work anyhow but if you had something that needed to
be a real regex feature that broke when putting it in as a hash key, you
would be very sorry. why don't you make the hash based on the name
instead with the qr// as the value? it will always work correctly as a regex.

uri
 
J

Jürgen Exner

Adam Kellas said:
For the record, this rewrite seems to solve the problem:

As execution speed was the issue at hand ...
sub varify {
my $text = shift;
$text =~ s%\$%\$\$%g;
for (keys %MakeVars) {
my $name = $MakeVars{$_};
$text =~ s%$_%$name%g;

.... I would get rid of that unnecessary temporary variable $name and
thus save a string copy:
$text =~ s%$_%$MakeVars{$_}%g;

BTW: I find your % signs as s///-delimeters hard to read. Yes, they are
legal, yes there are occasions where using the standard slash is awkward
because you have to escape slashes that are part of the RE or the
replacement string. But none of those reasons exist here.

jue
 
A

Adam Kellas

why the strange use of % for a delimiter? it is impossible to
read. actually one of the worst choices you can make IMO. besides you
don't need to do that as / is fine since you have no / in your regex.

Style ... aesthetics ... religion ... I decided years ago that my
personal style guide would be to always use m%% and s%%%. It may be
ugly but it's consistent; I can always find patterns in my code by
searching for m% or s%. Try that with //! The other problem with // is
that half the time there are slashes in the pattern, and often the
pattern changes in such a way that you either can or have to change
the delimiter. I've never regretted adding this convention to my style
guide, though I'm sure a case could be made for preferring some other
inert character like s,,,.
  AK>     for (keys %MakeVars) {

use a named variable for loops. it is cleaner and easier to read. $_ can
be easily clobbered or some action at a distance can happen. my vars are
lexical to the loop and much safer

Use of $_ was deliberate here because I gather it's marginally faster
and I was trying to tune the loop for speed. But in general I agree
that named variables are preferable.
  AK>        my $name = $MakeVars{$_};
  AK>        $text =~ s%$_%$name%g;

again with the % delimiter. i would go blind reading that. and there is
no need for the temp $name as you can put that hash lookup directly in
the replacement string.

Style ... aesthetics ... religion ...
again, you don't need the temp var. and that will cause you to lose the
precompiled nature of qr. all keys in a hash are just plain strings (not
perl values). it may work anyhow but if you had something that needed to
be a real regex feature that broke when putting it in as a hash key, you
would be very sorry. why don't you make the hash based on the name
instead with the qr// as the value? it will always work correctly as a regex.

Thanks, I was not aware of that and will make changes as suggested.

AK
 
A

Adam Kellas

BTW: I find your % signs as s///-delimeters hard to read. Yes, they are
legal, yes there are occasions where using the standard slash is awkward
because you have to escape slashes that are part of the RE or the
replacement string. But none of those reasons exist here.

As noted, this is a deliberate tradeoff. Yes, it requires an
additional keystroke (or 4 if you count the shift key) and may
reasonably be called ugly. But OTOH I've legislated leaning toothpick
syndrome out of existence, and I've never once had to escape a % in a
regular expression. YMMV.

AK
 
A

Adam Kellas

BTW: $substr is an awful name considering there is a function substr()

Granted, changed.
and capitalized names ($Variables) normally indicate file handles.

OT I know, but I thought ALLCAPS names indicated file handles (as in
STDOUT) and CamelCase indicated global (or more properly "widely
scoped") variables?

AK
 
U

Uri Guttman

AK> Style ... aesthetics ... religion ... I decided years ago that my
AK> personal style guide would be to always use m%% and s%%%. It may be
AK> ugly but it's consistent; I can always find patterns in my code by
AK> searching for m% or s%. Try that with //! The other problem with // is
AK> that half the time there are slashes in the pattern, and often the
AK> pattern changes in such a way that you either can or have to change
AK> the delimiter. I've never regretted adding this convention to my style
AK> guide, though I'm sure a case could be made for preferring some other
AK> inert character like s,,,.

this isn't religion. there are plenty of religious wars on the net. this
is about quality code. readability matters. it makes your code easier to
understand, maintain, modify, etc. you may think this is cute and your
style but most others don't. someone else posted the same thing. as for
having / in the data, there is no way it is used that often. and regexes
don't usually change on the fly too often. finally the best (and this
comes from damian conway, among many others) alternate delimiter is
{}. it allows nested pairs of {} inside without escaping, it is easy to
see, it doesn't hide the regex itself. and you can even split the two
parts of s{} {} with spaces and it works.


AK> Use of $_ was deliberate here because I gather it's marginally faster
AK> and I was trying to tune the loop for speed. But in general I agree
AK> that named variables are preferable.

not fast enough to be worth it. we are talking minute speedups and your
code has other issues. readibility and safety easily trump the miniscule
speedup you may get in this case.

AK> Style ... aesthetics ... religion ...

wrong, wronger and wrongest.

uri
 
C

C.DeRykus

Style ... aesthetics ... religion ... I decided years ago that my
personal style guide would be to always use m%% and s%%%. It may be
...


You know of course religious threads will go on and on :)

The non-denominational /x switch can help with 's' pattern
readability:

$text =~ s% $_ %$name%gx;

And bracketing delimiters such as (),<>,{},[] can help in
seeing pattern/replacement parts if you're ever tempted to
stray from The True Way:


$test =~ s{ $_ } {$name}gx;

and can even be on multiple lines:

$test =~ s[ really long pattern ]
[some_replacement]gx;
 
A

Adam Kellas

this isn't religion. there are plenty of religious wars on the net. this
is about quality code. readability matters. it makes your code easier to
understand, maintain, modify, etc. you may think this is cute and your
style but most others don't. someone else posted the same thing. as for
having / in the data, there is no way it is used that often. and regexes
don't usually change on the fly too often. finally the best (and this
comes from damian conway, among many others) alternate delimiter is
{}. it allows nested pairs of {} inside without escaping, it is easy to
see, it doesn't hide the regex itself. and you can even split the two
parts of s{}  {} with spaces and it works.

I don't consider it cute, I consider it a convention. The value of a
convention or standard is independent of its objective right-ness.
Perhaps if I could go back in time I'd make some different choices,
but wouldn't we all? Though how you can argue that % is ugly while /
is not while explicitly rejecting the notion that it's an aesthetic
disagreement - and that "quality code" belongs to your side - is
beyond me. You might as well argue that yellow is uglier than blue.

I don't object to this line of discussion though. Perhaps you will
save some naive young person from the horrors of addiction to percent
signs. For me it's 15 years and thousands of lines of maintained code
too late.
  AK> Use of $_ was deliberate here because I gather it's marginally faster
  AK> and I was trying to tune the loop for speed. But in general I agree
  AK> that named variables are preferable.

not fast enough to be worth it. we are talking minute speedups and your
code has other issues. readibility and safety easily trump the miniscule
speedup you may get in this case.

Really, we have no disagreement here. It's just that when one is
trying desperately to speed something up it's reasonable to try
everything you know of. What I posted was not a published, supported
piece of code, it was the result of a tuning exercise.

AK
 
U

Uri Guttman

AK> I don't consider it cute, I consider it a convention. The value of a
AK> convention or standard is independent of its objective right-ness.
AK> Perhaps if I could go back in time I'd make some different choices,
AK> but wouldn't we all? Though how you can argue that % is ugly while /
AK> is not while explicitly rejecting the notion that it's an aesthetic
AK> disagreement - and that "quality code" belongs to your side - is
AK> beyond me. You might as well argue that yellow is uglier than blue.

but convention implies many people doing it. you are the only one i have
ever seen (and i see tons of code) using % for a delimiter. so you are
not conventional but instead out of the norm. it is your convention and
others find it hard to read. you write code for others, not
yourself. get over yourself with this. you are wrong in conventional
ways.

AK> I don't object to this line of discussion though. Perhaps you will
AK> save some naive young person from the horrors of addiction to percent
AK> signs. For me it's 15 years and thousands of lines of maintained code
AK> too late.

you can always change style. i have many times over my career as i learn
better ways to code.

AK> Really, we have no disagreement here. It's just that when one is
AK> trying desperately to speed something up it's reasonable to try
AK> everything you know of. What I posted was not a published, supported
AK> piece of code, it was the result of a tuning exercise.

desperation is not a reason to try everything under the sun for
speedups. just applying the benchmark module is a saner way to find out
what is actually faster and by how much.

uri
 
A

Adam Kellas

  AK> Really, we have no disagreement here. It's just that when one is
  AK> trying desperately to speed something up it's reasonable to try
  AK> everything you know of. What I posted was not a published, supported
  AK> piece of code, it was the result of a tuning exercise.

desperation is not a reason to try everything under the sun for
speedups. just applying the benchmark module is a saner way to find out
what is actually faster and by how much.

OK, if you insist, we do have a disagreement. I will simply note that
the benchmark module is of little use when the code you're fixing is
so slow as to never finish.

AK
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,982
Messages
2,570,185
Members
46,736
Latest member
AdolphBig6

Latest Threads

Top