Robert said:
This has a potential for disastrous backtracking with large strings.
This one is better - if you can guarantee there there is no "%" besides
the one preceding the "2F":
=> "2006%2F10%2Fasdfasdf"
s.match(/^([^%]*)%2F([^%]*)%2F(.*)$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]
Or maybe even
s.match(/^((?>[^%]*))%2F((?>[^%]*))%2F((?>.*))$/).to_a
=> ["2006%2F10%2Fasdfasdf", "2006", "10", "asdfasdf"]
I have a couple of questions about this; I'm always trying to further my
(currently basic) understanding of regular expressions.
If you are really interested in the matter I can recommend "Mastering
Regular Expressions". Even I got valuable insights from it although I
would have regarded me "senior" with regard to RX.
1. Why does my first regex have a potential for disastrous backtracking?
(By disastrous I assume you mean inefficient and CPU-time-consuming,
right?)
Correct. The first ".*" will match greedily as far as it can which
means: to the end of the sequence. Then the RX engine (it is a NFA in
the case of Ruby) detects that it cannot get an overall match with that
because there is no "%2F" following. So it starts backing up by
stepping back one character and trying the "%2F" again etc. This will
go until the first group matches "2006%2F10". Ah, now we can match the
first "%2F" in the pattern. Then comes the next greedy ".*" and the
game starts over again with that. Match to the end, then try to back
up. Eventually the engine will find out that with the first group
eating up the first "%2F" as well there is no overall match since in the
remaining portion there is no more "%2F". Then backing up the first
group starts again until the first group's match is reduced to "2006".
2. What does the "?>" do in your second regex? I haven't seen that
before.
That's an atomic sub RX. Basically it will not give back any characters
that it has consumed. Using that in this example with ".*" will make
the overall match fail:
=> []
Actually I believe atomic grouping is not needed in this case as the
[^%] cannot match past a "%" and so there is probably no potential for
backtracking. Benchmarking probably shows the whole picture. It is
definitively harmful with ".*" because then the backtracking (see above)
cannot start and there will be no overall match.
You can easily see the backtracking with a tool like "Regex Coach" with
which you can step graphically through the match.
Kind regards
robert