Since you are asking this question, it is not clear to you at all, April.
You really know me. However, with your inspiration, I'm pretty sure
I'll be getting better sooner.
Look at that expression. '[\w.-]+@' is a hard anchor. That must be satisfied
first, especially the '@'.
Not sure I agree with this and the following Church ranking thing ...
The fact is '[\w.-]+' can be satisfied with a single character.
Agree.
The other fact is '.*?' can get by with one character, but it is at the bottom
of the precedence order.
I believe '.*?' can get by with zero characters too.
Yes, it can take no characters. It's a filter, but it's a
one-character-at-a-time filter.
However '.*' wants to take as much of the string as possible,
Agree, but it will still check whether that allows the
following [\w.-] to be satisfied.
Yes, but '[\w.-]+' can be satisfied with a single character.
Thus '.*' will grab all before that single character in a greedy fashion.
This will take as long as the non-greedy but will not get the right results.
Here is the hierarchy from the top down:
1 - '@' is GOD
2 - '[\w.-]+' is CHRIST
3 - '.*' is the greedy HOLY GHOST
4 - '.*?' is the single ANGEL
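The point about greedy '.*' leaving only a single character for '[\w.-]+' can be checked directly. This is just a small sketch using the same sample string as the example further down; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Same sample header as in the example further down the post.
my $data = "From: ]]]][[[[*****\\ -2ame\@yahoo.com";

# Greedy '.*' swallows everything, then backs off only far enough
# to leave ONE character for '[\w.-]+' in front of the '@'.
my ($greedy) = $data =~ /^From:.*([\w.-]+@[\w.-]+)/;

# Non-greedy '.*?' advances one character at a time and stops at the
# first place where '[\w.-]+@' can match.
my ($lazy) = $data =~ /^From:.*?([\w.-]+@[\w.-]+)/;

print "greedy: $greedy\n";   # e@yahoo.com
print "lazy:   $lazy\n";     # -2ame@yahoo.com
```

So both forms match, but only the non-greedy one captures the whole local part.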
[snip]
Just found and read "Regular Expression Tutorial Part 5: Greedy and
Non-Greedy Quantification" by Andrew Johnson (which can be found on the
Internet by searching). Andrew provides a pretty convincing
explanation of how '.*?' works. I believe '.*?' handles both the case
of no space and the case of one or more other characters (space, tab,
etc.) that appear before the real email address but are not matched
by [\w.-].
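Here is a quick way to see that for yourself. It is only a sketch: the three header strings and the example.com address are invented, and the extra capture group around '.*?' is there only so the skipped text can be measured.

```perl
use strict;
use warnings;

# Hypothetical headers: no separator, one space, a tab plus a bracket.
my @headers = (
    "From:bob\@example.com",
    "From: bob\@example.com",
    "From:\t<bob\@example.com>",
);

my @found;
for my $h (@headers) {
    # $1 is whatever '.*?' had to skip, $2 is the address itself.
    if ($h =~ /^From:(.*?)([\w.-]+@[\w.-]+)/) {
        push @found, [ length($1), $2 ];
        printf "skipped %d char(s), captured %s\n", length($1), $2;
    }
}
```

In all three cases the same address is captured; '.*?' skips 0, 1 and 2 characters respectively.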
I never read that book. It's probably good.
In my experience, negative greed (a greedy quantifier on a negated character
class, such as '[^\w.-]*') is one of the most useful concepts.
I always look to add negative greed to expressions.
In terms of greed, once the engine knows what not to look for, it will
grab all it can up to that point. Then it will look at the next term
in the regex.
This is the same as non-greedy, but the greedy one grabs a chunk of
matched data at a time, whereas the non-greedy one grabs one occurrence
at a time. They both then check the next term for a match.
Knowing this, you can shorten the time the data takes to process.
Example:
$data = "From: ]]]][[[[*****\\ -2ame\@yahoo.com";
$data =~ /^From:[^\w.-]*([\w.-]+@[\w.-]+)/
is about 130% faster (roughly 2-3x as fast) than this
$data =~ /^From:.*?([\w.-]+@[\w.-]+)/
The reason is that the engine grabs the greedy chunk first.
It just so happens we stopped the greed at a boundary where
the next character satisfies the next term '[\w.-]+'.
Non-greedy takes only one character at a time, checking between each one whether
the next character will satisfy the next term '[\w.-]+'. The repeated iteration
consumes a very large chunk of processing time.
The more a non-greedy term has to process the longer it takes. It could be
non-linear as well, not sure.
If the above were '$data = "From: -2ame\@yahoo.com";', the processing times
would be equal. The more characters '.*?' has to skip, the longer it takes.
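If you want to check how the gap grows with the amount of junk, something like the following could be used. It is a sketch only: the junk lengths and the iteration count are arbitrary choices of mine, and the timings will differ from machine to machine, so no numbers are shown. It uses cmpthese from the standard Benchmark module.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $lazy_re  = qr/^From:.*?([\w.-]+@[\w.-]+)/;
my $class_re = qr/^From:[^\w.-]*([\w.-]+@[\w.-]+)/;

my %captured;
for my $junk_len (0, 15, 150) {    # arbitrary junk lengths
    my $data = "From: " . ("]" x $junk_len) . " -2ame\@yahoo.com";

    # Sanity check: remember what each pattern captures for this input.
    ($captured{lazy}{$junk_len})  = $data =~ $lazy_re;
    ($captured{class}{$junk_len}) = $data =~ $class_re;

    print "\njunk length $junk_len\n";
    cmpthese(200_000, {
        lazy  => sub { $data =~ $lazy_re },
        class => sub { $data =~ $class_re },
    });
}
```

Both patterns should capture the same address at every junk length; only the rates printed by cmpthese should change.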
There are times when you don't know exactly where to stop the greed,
but by all means possible, try to let the greed be there. You just have
to think about it and test all possible scenarios.
In a looping scenario, say a parser, where everything is processed in a
repeating fashion, there is usually a sink/filter that picks up waste, comments,
or formatting data, and it typically takes the '.*?' form. This gives
the patterns a chance to match at the next character.
If only 1, 2 or 3 characters can start the pattern matches, a greedy
(negative) term can take you up to them quickly, giving the patterns a chance
to match without checking at every character. In that case you can use negative
greed; you just have to know how to get past those characters in case the
patterns don't match.
Typically:
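my $lcbpos = 0;    # position just past the last '<' we backed up over
while (/<($pat1|$pat2|$pat3)>|([^<]*)(<?)/g) {
    if (defined $2) {
        # Filler matched. If it swallowed a '<', back up one character
        # (once per position) so the patterns get a try at that '<'.
        if (length($3) && $lcbpos != pos($_)) {
            $lcbpos = pos($_);
            pos($_) = $lcbpos - 1;
        }
        next;
    }
    # found pattern
}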
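A self-contained version of the loop above might look like this. The three tag names and the input string are hypothetical, picked only so the loop has something to find; the backing-up logic is the same as in the fragment.

```perl
use strict;
use warnings;

# Hypothetical patterns and input, for illustration only.
my ($pat1, $pat2, $pat3) = ('b', 'i', 'br');
$_ = "a < b <b>bold<i>it<<br>z";

my @tags;
my $lcbpos = 0;
while (/<($pat1|$pat2|$pat3)>|([^<]*)(<?)/g) {
    if (defined $2) {
        # The greedy filler '[^<]*' ran up to a '<' and consumed it.
        # Step back once so the tag alternative is tried at that '<'.
        # $lcbpos remembers the spot so we only step back one time.
        if (length($3) && $lcbpos != pos($_)) {
            $lcbpos = pos($_);
            pos($_) = $lcbpos - 1;
        }
        next;
    }
    push @tags, $1;    # found pattern
}

print join(',', @tags), "\n";   # b,i,br
```

The stray '<' characters that do not start a tag are skipped, and the tag right after '<<' is still found.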
So negative greed is a good thing indeed. It's advisable to always try to be greedy.
But this is unavoidable sometimes: /ANCHOR's.*?AWAY/
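A case like that, where the stop condition is a whole word and a negated character class can't express it, might look like this. The input string is made up, and I added capture groups just to show what each form grabs.

```perl
use strict;
use warnings;

# Hypothetical input with two closing markers.
my $text = "ANCHOR one AWAY two AWAY";

# Non-greedy stops at the first 'AWAY'.
my ($lazy) = $text =~ /ANCHOR(.*?)AWAY/;

# Greedy runs to the end and backs up to the last 'AWAY'.
my ($greedy) = $text =~ /ANCHOR(.*)AWAY/;

print "lazy:   '$lazy'\n";     # ' one '
print "greedy: '$greedy'\n";   # ' one AWAY two '
```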
Here are some benchmarks concerning greed and your email regexp.
--------------------------
use strict;
use warnings;
use Benchmark ':hireswallclock';
my $email = "From: ]]]][[[[*****\\ -2ame\@yahoo.com";
my ($t0, $t1, $tdif);

### Non-Greedy '.*?'
$t0 = Benchmark->new;
for (1 .. 10000)
{
    $email =~ /^From:.*?([\w.-]+@[\w.-]+)/;
}
$t1 = Benchmark->new;
$tdif = timediff($t1, $t0);
print "\nNon-greedy '.*?' --\n the code took:",timestr($tdif),"\n";

### Greedy '[^\w.-]*'
$t0 = Benchmark->new;
for (1 .. 10000)
{
    $email =~ /^From:[^\w.-]*([\w.-]+@[\w.-]+)/;
}
$t1 = Benchmark->new;
$tdif = timediff($t1, $t0);
print "\nGreedy '[^\\w.-]*' --\n the code took:",timestr($tdif),"\n";
__END__
Non-greedy '.*?' --
the code took:0.03332 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
Greedy '[^\w.-]*' --
the code took:0.016902 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)