Newbie question: "Get substring of line"

  • Thread starter Petterson Mikael

Matt Garrish

Michele Dondi said:
(my $str=$_) =~ s/.*=//; # e.g.

Isn't that a little more work than is necessary? (And perhaps a bit greedy,
though it's hard to say.)

my ($str) = /[^=]+=(.*)/;
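To make the greediness concern concrete, here is a minimal sketch, assuming a hypothetical input line that contains two '=' signs (the variable names are illustrative and not from the thread):

```perl
use strict;
use warnings;

my $line = "a=b=c";   # hypothetical input with two '=' signs

# Greedy substitution: .* runs up to the LAST '=', so only "c" survives.
(my $greedy = $line) =~ s/.*=//;

# Capture after the FIRST '=': [^=]+ cannot cross an '=' sign.
my ($first) = $line =~ /[^=]+=(.*)/;

print "$greedy\n";   # c
print "$first\n";    # b=c
```

With a single '=' in the line, both forms give the same result; they only diverge when the value itself may contain '='.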

Matt
 
jl_post

Uri Guttman replied:
and it is not commonly used as well. so why did you
even bother to mention it?


Since you asked, I'll tell you:

Have you ever tried teaching Perl to someone who just didn't have a
good grasp of regular expressions? To someone who sort of understood
them, but couldn't get his mind around the fact that the regular
expression "[fee|fie|foe]" is identical to "[feio|]"? Or that, no
matter how many times you tell him (and get him to agree) that "*"
means "match zero or more occurrences of", he still thinks it's wrong
that ".*" can successfully match an empty string?
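Both points can be checked directly. The following is a small sketch (the variable names are made up for illustration) that compares the two character classes character by character and then tests ".*" against an empty string:

```perl
use strict;
use warnings;

# Inside [...] each character is a set member; '|' is a literal, not alternation.
my $class_a = qr/^[fee|fie|foe]$/;
my $class_b = qr/^[feio|]$/;

# Every single character gets the same verdict from both classes.
my $same = 1;
for my $ch ('f', 'e', 'i', 'o', '|', 'x') {
    my $in_a = $ch =~ $class_a ? 1 : 0;
    my $in_b = $ch =~ $class_b ? 1 : 0;
    $same = 0 if $in_a != $in_b;
    print "$ch: $in_a $in_b\n";
}

# A whole word needs alternation inside a group, not a character class.
my $word_in_class = "fee" =~ $class_a            ? 1 : 0;   # 0
my $word_in_group = "fee" =~ /^(?:fee|fie|foe)$/ ? 1 : 0;   # 1

# '*' allows zero occurrences, so .* is satisfied by the empty string.
my $empty_matches = "" =~ /^.*$/ ? 1 : 0;                   # 1

print "same=$same class=$word_in_class group=$word_in_group empty=$empty_matches\n";
```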

I've dealt with people like this. And I've found that, in many
cases, instead of spending five or ten minutes explaining to them why
the use of $`, $&, and $' is a bad idea and making them jump through
hoops to avoid their use (hoops for them, not for me), it's just best
to teach them the simpler solution first and let them learn that. The
milliseconds of run time they waste by running their script with the
"poorer" solution are more than made up for by the time they save: time
otherwise lost writing code in a way they don't understand very well,
finding and correcting the bugs that may result, and listening to an
explanation of how to "do it right."

I personally think that the line:

$string = $' if "abc=xyz" =~ m/=/;

is easier to read and understand than the line:

$string = $1 if "abc=xyz" =~ m/=(.*)$/;

I have a feeling that you might disagree. Whatever the case, I've
found that, even when a beginning Perl programmer understands each of
the symbols in the pattern-match "m/=(.*)$/", he can still have a
difficult time putting them all together to deduce the purpose of the
entire match. In those cases, I've found that it's good to start out
simple (like "m/=/") and then expand to a more complex explanation if
necessary.
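One practical difference between the two lines is worth seeing before picking either one. The sketch below assumes a hypothetical input line that still carries its trailing newline (as it would when read from a file without chomp):

```perl
use strict;
use warnings;

my $line = "abc=xyz\n";   # hypothetical unchomped input line

my ($post, $cap);
$post = $' if $line =~ m/=/;        # everything after the first '=', newline included
$cap  = $1 if $line =~ m/=(.*)$/;   # '.' does not match the newline, so it is left out

print length($post), "\n";   # 4
print length($cap), "\n";    # 3
```

For the literal string "abc=xyz" used in the thread, both forms yield "xyz".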

As for the performance penalty of using $`, $&, and $', I believe
that it's not as bad as most people think. The penalty does not grow
as N-squared (in Big-O terms). I tend to think (but I could be wrong)
that it's more on the order of N, which isn't really all that bad in
the big scheme of things. You may save milliseconds running your
script if you remove all instances of $`,
$&, and $', but it's not going to remove any bottleneck that's taking
too much processor time. In fact, I'd be surprised if you saved over
ten seconds by running a script (with $`, $&, and $' removed)
repeatedly for one whole day straight. In fact, I'm not aware of
anyone who has ever removed $`, $&, and $' from his scripts to find
that the scripts ran noticeably faster. (And if there's evidence of
this to the contrary, I'd be interested in knowing about it.)

And I've wondered: just how much IS the performance penalty,
anyway? I decided to perform a Benchmark test:


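#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my $count = 1e7;  # iterations per snippet
# Both snippets are strings, so Benchmark eval's them in this same process.
my $bad = q!$string = $' if "abc=xyz" =~ m/=/!;         # post-match variable
my $good = q!$string = $1 if "abc=xyz" =~ m/=(.*)$/!;   # explicit capture
timethese($count, {bad => $bad, good => $good});
__END__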


The results surprised me. I got the following output:
Benchmark: timing 10000000 iterations of bad, good...
bad: 18 wallclock secs (16.42 usr + -0.03 sys = 16.39 CPU) @ 610090.90/s
good: 22 wallclock secs (20.73 usr + 0.00 sys = 20.73 CPU) @ 482299.60/s


Apparently, the "bad" code (with $') ran faster than the "good" code!
This seems strange considering that use of $' is supposed to incur a
performance penalty, not an optimization. Thinking about this, I would
come to the conclusion that the penalty mainly happens on the regular
expressions that don't use $`, $&, or $' in that they use extra
processor time figuring out these variables when they don't need them.
In other words, if every single regular expression used $', there would
be no real performance penalty at all.

If I'm right with that reasoning, then that would actually make the
regular expression that uses $' the preferred choice for one-line Perl
scripts.

So it looks like there is a reason to learn $' after all. And even
if your script runs slightly slower as a result of using it, I doubt
you'll ever notice the difference.

nor did you tell the OP where to learn about regexes or whatever.

It's often difficult to tell the tone of a response in a plain-text
message, but it seems like you're irritated at me for some reason, Uri.
I don't know whether you are irritated at me personally or at just my
post, but if I said something that offended you, then I apologize.

It's just that when I read a message like yours where first I'm
criticized for putting in too much information and then criticized for
not putting enough in, I get the impression that it's not my post you
are upset at, but at me personally.

When I post messages to Usenet, I usually post messages that I think
will be helpful to a user, depending on what level of experience I
think he's at. Obviously, if I think he's a beginner, the information
I post will probably not be helpful to an advanced programmer. Of
course, I could be wrong about what is helpful and what is not, but
that's the beauty of Usenet -- everyone is free to post what they want.

Again, Uri, I'm sorry if I offended you in this post or in an
earlier post.

-- Jean-Luc
 
Matt Garrish

And I've wondered: just how much IS the performance penalty,
anyway? I decided to perform a Benchmark test:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my $count = 1e7;
my $bad = q!$string = $' if "abc=xyz" =~ m/=/!;
my $good = q!$string = $1 if "abc=xyz" =~ m/=(.*)$/!;
timethese($count, {bad => $bad, good => $good})
__END__


The results surprised me. I got the following output:

That's not a useful benchmark. You'd need to benchmark the two in different
scripts in order to avoid the time penalty on the grouping regex incurred
because you've used $' in the first. It's also not the optimal way to
group a pattern. Please recheck perlre:

<quote>
WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the
program, it has to provide them for every pattern match. This may
substantially slow your program. Perl uses the same mechanism to produce $1,
$2, etc, so you also pay a price for each pattern that contains capturing
parentheses. (To avoid this cost while retaining the grouping behaviour, use
the extended regular expression (?: ... ) instead.)
</quote>
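As a small illustration of the last sentence of that passage, the sketch below (with a made-up sample string) shows that (?: ... ) keeps the grouping while dropping the capture:

```perl
use strict;
use warnings;

my $text = "key=value";   # hypothetical sample input

# Capturing parentheses: both groups are returned in list context.
my @with_capture = $text =~ /(key|name)=(.*)/;

# (?: ... ) still groups the alternation but returns nothing for it.
my @without_capture = $text =~ /(?:key|name)=(.*)/;

print scalar(@with_capture), " captures: @with_capture\n";        # 2 captures: key value
print scalar(@without_capture), " capture: @without_capture\n";   # 1 capture: value
```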

Matt
 
jl_post

Matt Garrish replied:
That's not a useful benchmark. You'd need to benchmark
the two in different scripts in order to avoid the time
penalty on the grouping regex incurred because you've
used $' in the first.


That thought did occur to me before I posted that code. For
curiosity's sake, I ran separate copies of the script, each one only
running one test line. These are my results:

The "bad" code:
33 wallclock secs (31.89 usr + 0.01 sys = 31.90 CPU)
@ 31343

The "good" code:
45 wallclock secs (43.69 usr + 0.00 sys = 43.69 CPU)
@ 22890

(If you are wondering why the results are so much different than the
ones in the previous post, it's because I ran this code on a different
computer.)

As you can see, the "bad" code still is faster than the "good" code.
Besides, those code snippets are eval'ed (according to "perldoc
Benchmark"), so I don't think the "good" code suffers the penalty the
"bad" code introduces (but I could be wrong about that). Either way,
as a one-liner, the above code that uses $' is more efficient than the
above code that does not.

-- Jean-Luc
 
Uri Guttman

jpc> That thought did occur to me before I posted that code. For
jpc> curiosity's sake, I ran separate copies of the script, each one only
jpc> running one test line. These are my results:

jpc> The "bad" code:
jpc> The "good" code:
jpc> (If you are wondering why the results are so much different than the
jpc> ones in the previous post, it's because I ran this code on a different
jpc> computer.)

jpc> As you can see, the "bad" code still is faster than the "good" code.
jpc> Besides, those code snippets are eval'ed (according to "perldoc
jpc> Benchmark"), so I don't think the "good" code suffers the penalty the
jpc> "bad" code introduces (but I could be wrong about that). Either way,
jpc> as a one-liner, the above code that uses $' is more efficient than the
jpc> above code that does not.

you still don't get the issue. that benchmark is totally useless to
illustrate the problem. $' causes ALL OTHER regexes to copy their entire
string in case some code somewhere references $' or friends. the actual
regex that uses $' might be faster which proves a single use of $' may
be faster than $1. the penalty lies elsewhere. please learn to properly
analyze a problem and how to properly isolate it in a benchmark.
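One possible way to isolate the effect, offered only as a rough sketch and not as the benchmark anyone in this thread ran: time the same workload of ordinary matches in two separate perl processes, where the second program merely mentions $' once. The workload, string length, iteration count and the %program names are all assumptions chosen for illustration, and the timings will depend heavily on the perl version and machine.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Shared workload: many plain matches against a long string, none using $'.
my $work = q{my $s = 'x' x 100_000; my $n = 0; for (1 .. 20_000) { $n++ if $s =~ /x/ } };

# The second program is identical except for one reference to $' at the end.
# The code lives in q{} strings here, so this driver itself never compiles $'.
my %program = (
    clean   => $work,
    tainted => $work . q{ my $unused; $unused = $' if 'a=b' =~ /=/; },
);

for my $name (sort keys %program) {
    my $start = Time::HiRes::time();
    # $^X is the perl binary running this script; each child is a fresh process.
    system($^X, '-e', $program{$name}) == 0 or die "child '$name' failed: $?";
    printf "%-8s %.3f s\n", $name, Time::HiRes::time() - $start;
}
```

Any difference between the two lines of output would then be attributable to the other regexes, which is where the penalty is said to lie, and not to the one match that uses $'.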

do you think that caveat would be there if it was not true? or that
avoiding $' would stay in the docs and perl culture for so long? or that
avoiding use English would be encouraged as it uses $'? or that
English.pm would have been modified to allow it to not export aliases
for $'? of course you know all of that!!

when you have calmed down from your snit after you read this, try to
actually isolate the issue in a proper benchmark. spend some deep time
in thought in how to do it. post your benchmark for review. then talk
about $' with some confidence. until then, please don't refer newbies to
use $' as it is still bad for speed and i still think it is not good
code. it is a special case variable that doesn't extend well to multiple
grabs. it doesn't do anything special in list context as explicit grabs
do. it is another special variable to explain (actually 3). they can be
trivially emulated with explicit grabs. perl6 has dropped them. do you
need any more reasons why they are bad?
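A minimal sketch of that emulation, assuming the same sample string used earlier in the thread (the variable names are illustrative): explicit captures, or the match offsets in @- and @+, give the same three pieces without touching $`, $& or $'.

```perl
use strict;
use warnings;

my $str = "abc=xyz";

# Explicit captures standing in for $` (before), $& (match) and $' (after).
my ($pre, $match, $post) = $str =~ /\A(.*?)(=)(.*)\z/s;

# Alternative: slice the string with the match offsets in @- and @+.
my ($pre2, $post2);
if ($str =~ /=/) {
    $pre2  = substr($str, 0, $-[0]);
    $post2 = substr($str, $+[0]);
}

# Explicit captures extend naturally to several values in list context.
my ($key, $value) = $str =~ /([^=]+)=(.*)/;

print "$pre|$match|$post\n";   # abc|=|xyz
print "$pre2|$post2\n";        # abc|xyz
print "$key => $value\n";      # abc => xyz
```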

uri
 
Michele Dondi

my $bad = q!$string = $' if "abc=xyz" =~ m/=/!;
my $good = q!$string = $1 if "abc=xyz" =~ m/=(.*)$/!; [snip]
The results surprised me. I got the following output:
Benchmark: timing 10000000 iterations of bad, good...
bad: 18 wallclock secs (16.42 usr + -0.03 sys = 16.39 CPU)
@ 610090.90/s
good: 22 wallclock secs (20.73 usr + 0.00 sys = 20.73 CPU)
@ 482299.60/s

Apparently, the "bad" code (with $') ran faster than the "good" code!

This is not that surprising:

(from 'perldoc perlre')

| WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in
| the program, it has to provide them for every pattern match. This may
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| substantially slow your program. Perl uses the same mechanism to produce
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| $1, $2, etc, so you also pay a price for each pattern that contains
  ^^^^^^

So it looks like there is a reason to learn $' after all. And even
if your script runs slightly slower as a result of using it, I doubt
you'll ever notice the difference.

I think that the issues people tend to have with $' et similia only
partially have to do with optimization/performance matters.


Michele
 