re.match() performance

E

Emanuele D'Arrigo

Sorry for the previous post, hit the Enter button by mistake... here's
the complete one:

Hi everybody!

I've written the code below to test the differences in performance
between compiled and non-compiled regular expression matching but I
don't quite understand the results. It appears that the compiled the
pattern only takes 2% less time to process the match. Is there some
caching going on in the uncompiled section that prevents me from
noticing its otherwise lower speed?

Manu


------------

import re
import time

## Setup
pattern = "<a>(.*)</a>"
compiledPattern = re.compile(pattern)

longMessage = "<a>"+ "a" * 100000 +"</a>"

numberOfRuns = 1000

## TIMED FUNCTIONS
startTime = time.clock()
for i in range(0, numberOfRuns):
re.match(pattern, longMessage)
patternMatchingTime = time.clock() - startTime

startTime = time.clock()
for i in range(0, numberOfRuns):
compiledPattern.match(longMessage)
compiledPatternMatchingTime = time.clock() - startTime

ratioCompiledToNot = compiledPatternMatchingTime / patternMatchingTime

## PRINT OUTS
print("")
print(" Pattern Matching Time: " + str(patternMatchingTime))
print("(Compiled) Pattern Matching Time: " + str
(compiledPatternMatchingTime))
print("")
print("Ratio Compiled/NotCompiled: " + str(ratioCompiledToNot))
print("")
 
D

Diez B. Roggisch

Emanuele said:
Sorry for the previous post, hit the Enter button by mistake... here's
the complete one:

Hi everybody!

I've written the code below to test the differences in performance
between compiled and non-compiled regular expression matching but I
don't quite understand the results. It appears that the compiled the
pattern only takes 2% less time to process the match. Is there some
caching going on in the uncompiled section that prevents me from
noticing its otherwise lower speed?

Yes. There is even a purge-function to clear that cache, for whatever
reason.

To answer that question yourself, you could have taken a look into the
python library, it's not as scary as you might think :)

Diez
 
M

MRAB

Emanuele said:
Sorry for the previous post, hit the Enter button by mistake... here's
the complete one:

Hi everybody!

I've written the code below to test the differences in performance
between compiled and non-compiled regular expression matching but I
don't quite understand the results. It appears that the compiled the
pattern only takes 2% less time to process the match. Is there some
caching going on in the uncompiled section that prevents me from
noticing its otherwise lower speed?
[snip]

Yes, the regular expression is compiled and cached internally.
 
P

Peter Otten

Emanuele said:
I've written the code below to test the differences in performance
between compiled and non-compiled regular expression matching but I
don't quite understand the results. It appears that the compiled the
pattern only takes 2% less time to process the match. Is there some
caching going on in the uncompiled section that prevents me from
noticing its otherwise lower speed?
Yes:
{(<class 'str'>, 'yadda', 0): <_sre.SRE_Pattern object at 0x2ac6e66e9e70>}

Hint: questions like this are best answered by the source code, and Python
is open source. You don't even have to open an editor:
def match(pattern, string, flags=0):
"""Try to apply the pattern at the start of the string, returning
a match object, or None if no match was found."""
return _compile(pattern, flags).match(string)
def _compile(*key):
# internal: compile pattern
cachekey = (type(key[0]),) + key
p = _cache.get(cachekey)
if p is not None:
return p
pattern, flags = key
if isinstance(pattern, _pattern_type):
if flags:
raise ValueError(
"Cannot process flags argument with a compiled pattern")
return pattern
if not sre_compile.isstring(pattern):
raise TypeError("first argument must be string or compiled pattern")
p = sre_compile.compile(pattern, flags)
if len(_cache) >= _MAXCACHE:
_cache.clear()
_cache[cachekey] = p
return p


Peter
 
P

Pierre-Alain Dorange

Emanuele D'Arrigo said:
I've written the code below to test the differences in performance
between compiled and non-compiled regular expression matching but I
don't quite understand the results. It appears that the compiled the
pattern only takes 2% less time to process the match. Is there some
caching going on in the uncompiled section that prevents me from
noticing its otherwise lower speed?

Running your sample i got also a 2% the first time, but next time i got
a different speed : 4 time faster.

Running 1st time
Pattern Matching Time: 0.122432
(Compiled) Pattern Matching Time: 0.12012
Ratio Compiled/NotCompiled: 0.981116048092

2nd time and more
Pattern Matching Time: 0.00257
(Compiled) Pattern Matching Time: 0.000619
Ratio Compiled/NotCompiled: 0.240856031128


Config python 2.5.1 / MacOS X 10.5
 
S

Steven D'Aprano

I've written the code below to test the differences in performance ....
## TIMED FUNCTIONS
startTime = time.clock()
for i in range(0, numberOfRuns):
re.match(pattern, longMessage)
patternMatchingTime = time.clock() - startTime
....

You probably don't need to re-invent the wheel. See the timeit module. In
my opinion, the best idiom for timing small code snippets is:


from timeit import Timer
t = Timer("func(arg)", "from __main__ import func, arg")
time_taken = min(t.repeat(number=N))/N

where N will depend on how patient you are, but probably shouldn't be
less than 100. For small enough code snippets, the default of 1000000 is
recommended.

For testing re.match, I didn't have enough patience for one million
iterations, so I used ten thousand.

My results were:
t1 = Timer("re.match(pattern, longMessage)", .... "from __main__ import pattern, re, compiledPattern, longMessage")
t2 = Timer("compiledPattern.match(longMessage)", .... "from __main__ import pattern, re, compiledPattern, longMessage")
t1.repeat(number=10000) [3.8806509971618652, 3.4309241771697998, 4.2391560077667236]
t2.repeat(number=10000)
[3.5613579750061035, 2.725193977355957, 2.936690092086792]


which were typical over a few runs. That suggests that even with no
effort made to defeat caching, using pre-compiled patterns is
approximately 20% faster than re.match(pattern).

However, over 100,000 iterations that advantage falls to about 10%. Given
that each run took about 30 seconds, I suspect that the results are being
contaminated with some other factor, e.g. networking events or other
processes running in the background. But whatever is going on, 10% or
20%, pre-compiled patterns are slightly faster even with caching --
assuming of course that you don't count the time taken to compile it in
the first place.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,152
Members
46,698
Latest member
LydiaHalle

Latest Threads

Top