What is built-in method sub

Jeremy · Jan 11, 2010

I just profiled one of my Python scripts and discovered that >99% of
the time was spent in

{built-in method sub}

What is this function and is there a way to optimize it?

Thanks,
Jeremy

Carl Banks · Jan 11, 2010

I just profiled one of my Python scripts and discovered that >99% of
the time was spent in

{built-in method sub}

What is this function and is there a way to optimize it?

I'm guessing this is re.sub (or, more likely, a method sub of an
internal object that is called by re.sub).

If all your script does is to make a bunch of regexp substitutions,
then spending 99% of the time in this function might be reasonable.
Optimize your regexps to improve performance. (We can help you if you
care to share any.)

If my guess is wrong, you'll have to be more specific about what your
script does, and maybe share the profile printout or something.

Carl Banks
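
For readers who want to confirm which sub a profile entry refers to, a minimal sketch follows. The function name strip_spaces, the helper profile_report, and the sample text are made up for illustration. The exact label the profiler prints for the regex call differs between Python versions, so the sketch only captures the report as text and leaves the label to be read off the output.

```python
import cProfile
import io
import pstats
import re


def strip_spaces(text):
    # Hypothetical workload: one regex substitution over a large string.
    return re.sub(r" +\n", "\n", text)


def profile_report(func, *args):
    """Run func under cProfile and return the printed stats as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats()
    return buffer.getvalue()


report = profile_report(strip_spaces, "some text   \n" * 10000)
print(report)
```

Sorting by cumulative time puts the caller (strip_spaces here) next to the sub entry, which makes it easier to see where the call comes from.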

MRAB · Jan 11, 2010

Jeremy said:
I just profiled one of my Python scripts and discovered that >99% of
the time was spent in

{built-in method sub}

What is this function and is there a way to optimize it?

I think it's the subtraction operator. The only way to optimise it is to
reduce the number of subtractions that you do!

Jeremy · Jan 11, 2010

I'm guessing this is re.sub (or, more likely, a method sub of an
internal object that is called by re.sub).

If all your script does is to make a bunch of regexp substitutions,
then spending 99% of the time in this function might be reasonable.
Optimize your regexps to improve performance. (We can help you if you
care to share any.)

If my guess is wrong, you'll have to be more specific about what your
script does, and maybe share the profile printout or something.

Carl Banks

Your guess is correct. I had forgotten that I was using that
function.

I am using the re.sub command to remove trailing whitespace from lines
in a text file. The commands I use are copied below. If you have any
suggestions on how they could be improved, I would love to know.

Thanks,
Jeremy

lines = self._outfile.readlines()
self._outfile.close()

line = string.join(lines)

if self.removeWS:
    # Remove trailing white space on each line
    trailingPattern = '(\S*)\ +?\n'
    line = re.sub(trailingPattern, '\\1\n', line)

Diez B. Roggisch · Jan 11, 2010

Jeremy said:
Your guess is correct. I had forgotten that I was using that
function.

I am using the re.sub command to remove trailing whitespace from lines
in a text file. The commands I use are copied below. If you have any
suggestions on how they could be improved, I would love to know.

Thanks,
Jeremy

lines = self._outfile.readlines()
self._outfile.close()

line = string.join(lines)

if self.removeWS:
    # Remove trailing white space on each line
    trailingPattern = '(\S*)\ +?\n'
    line = re.sub(trailingPattern, '\\1\n', line)

line = line.rstrip()?

Diez
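
A minimal sketch of the rstrip() suggestion applied line by line, assuming the whole file has already been read into one string. The function name strip_trailing_whitespace is made up for illustration.

```python
def strip_trailing_whitespace(text):
    """Strip trailing whitespace from every line of text.

    Splitting on "\n" (rather than calling rstrip() once on the whole
    string) keeps the line structure, including a final newline.
    """
    return "\n".join(line.rstrip() for line in text.split("\n"))


cleaned = strip_trailing_whitespace("first line   \nsecond line\t\n")
```

Note that rstrip() with no argument also removes a trailing "\r", so Windows line endings come out as plain "\n" in this sketch.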

Philip Semanchuk · Jan 11, 2010

Jeremy said:
Yep. I was trying to reinvent the wheel. I just remove the trailing
whitespace before joining the lines.

I second the suggestion to use rstrip(), but for future reference you
should also check out the compile() function in the re module. You
might want to time the code above against a version using a compiled
regex to see how much difference it makes.

Cheers
Philip

Diez B. Roggisch · Jan 11, 2010

Philip said:
I second the suggestion to use rstrip(), but for future reference you
should also check out the compile() function in the re module. You might
want to time the code above against a version using a compiled regex to
see how much difference it makes.

For his use case, none. There is caching built into re that will
take care of this.

Diez
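
One way to check the compiled-versus-uncompiled question on your own machine is to time both forms. The sketch below assumes a made-up sample text and pattern, and it makes no claim about which form wins; the numbers depend on the Python version and the input.

```python
import re
import timeit

# Hypothetical input: 1000 lines, each with trailing spaces.
text = "some text   \n" * 1000
pattern = r" +\n"
compiled = re.compile(pattern)


def with_module_function():
    # re.sub looks the pattern up (or compiles it) on every call.
    return re.sub(pattern, "\n", text)


def with_compiled_pattern():
    # The pattern object is created once, outside the timed call.
    return compiled.sub("\n", text)


module_seconds = timeit.timeit(with_module_function, number=200)
compiled_seconds = timeit.timeit(with_compiled_pattern, number=200)
print("re.sub:       %.4f s" % module_seconds)
print("compiled.sub: %.4f s" % compiled_seconds)
```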

Chris Rebert · Jan 11, 2010

On Mon, Jan 11, 2010 at 12:34 PM, Steven D'Aprano wrote:

If you can avoid regexes in favour of ordinary string methods, do so. In
general, something like:

source.replace(target, new)

will potentially be much faster than:

regex = re.compile(target)
regex.sub(new, source)
# equivalent to re.sub(target, new, source)

(assuming of course that target is just a plain string with no regex
specialness). If you're just cracking a peanut, you probably don't need
the 30 lb sledgehammer of regular expressions.

Of course, but is the regex library really not smart enough to
special-case and optimize vanilla string substitutions?

Cheers,
Chris

Steven D'Aprano · Jan 11, 2010

Chris Rebert said:

Of course, but is the regex library really not smart enough to
special-case and optimize vanilla string substitutions?

Apparently not in Python 2.5:

>>> from timeit import Timer
>>> t1 = Timer('x.sub("Dutch", "Nobody expects the Spanish Inquisition!")',
...     'from re import compile; x = compile("Spanish")')
>>> t2 = Timer('x.replace("Spanish", "Dutch")',
...     'x="Nobody expects the Spanish Inquisition!"')
>>> t1.repeat()
[3.7209370136260986, 2.7262279987335205, 2.6416280269622803]
>>> t2.repeat()
[2.2915709018707275, 1.2584249973297119, 1.2730350494384766]

Even if it did, I wouldn't rely on that sort of special casing unless the
language guaranteed it. Keep in mind that regexes are essentially a
programming language (although not Turing Complete), and the engine
implementation may choose purity and simplicity over such optimizations.
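
The caveat about "regex specialness" matters in practice: the two approaches are only interchangeable when the target contains no regex metacharacters. A small sketch, with made-up strings, of how they diverge and how re.escape() brings them back in line:

```python
import re

target = "3.50"
source = "3x50 and 3.50"

# Plain string replacement: only the literal text "3.50" is replaced.
plain = source.replace(target, "4.00")

# Used directly as a regex, "." matches any character, so "3x50" matches too.
naive = re.sub(target, "4.00", source)

# re.escape() quotes the metacharacters so the regex matches literally.
escaped = re.sub(re.escape(target), "4.00", source)

print(plain)    # 3x50 and 4.00
print(naive)    # 4.00 and 4.00
print(escaped)  # 3x50 and 4.00
```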

John Machin · Jan 12, 2010

Jeremy said:
Yep. I was trying to reinvent the wheel. I just remove the trailing
whitespace before joining the lines.

Actually you don't do that. Your regex has three components:

(1) (\S*)  : zero or more occurrences of not-whitespace
(2) \ +?   : one or more (non-greedy) occurrences of SPACE
(3) \n     : a newline

Component (2) should be \s+?

In any case this is a round-about way of doing it. Try writing a regex
that does it simply: replace trailing whitespace by an empty string.

Another problem with your approach: it doesn't work if the line is not
terminated by \n -- this is quite possible if the lines are being read
from a file.

A wise person once said: Re-inventing the wheel is often accompanied
by forgetting to re-invent the axle.
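
A sketch of the "replace trailing whitespace by an empty string" idea, next to the original pattern for comparison. The pattern [ \t]+$ with re.MULTILINE is one possible choice and an assumption here, not something given in the thread. It uses [ \t] rather than \s because \s also matches "\n" and could swallow line breaks.

```python
import re

# Jeremy's original pattern, written as a raw string.
ORIGINAL = re.compile(r"(\S*)\ +?\n")

# Assumed simpler alternative: spaces or tabs right before a line end.
# With re.MULTILINE, "$" matches before each newline and at end of string.
TRAILING = re.compile(r"[ \t]+$", re.MULTILINE)


def strip_original(text):
    return ORIGINAL.sub(r"\1\n", text)


def strip_simple(text):
    return TRAILING.sub("", text)


# Trailing spaces, a trailing tab, and a last line with no newline.
sample = "one  \ntwo\t\nthree  "
print(repr(strip_original(sample)))  # 'one\ntwo\t\nthree  '
print(repr(strip_simple(sample)))    # 'one\ntwo\nthree'
```

The original pattern leaves both the tab and the unterminated last line untouched, which are the two problems described above.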

Phlip · Jan 12, 2010

trailingPattern = '(\S*)\ +?\n'
What happens with this?

trailingPattern = '\s+$'
line = re.sub(trailingPattern, '', line)

I'm guessing that $ terminates \s+'s greediness without snarfing the underlying
\n. Then I'm guessing that the lack of a \1 replacer will help the sub work
faster with less internal string shuffling.

line.rstrip() is probably faster still, but there might be a technical reason to avoid it.

But these uncertainties are why I write unit tests, including tests for the edge
cases. (What if it's a \r\n? What if the \n is missing? etc.) That way I don't
need to memorize re's exact behavior, and if I find a reason to swap in a
.rstrip(), I can pass all the tests and make sure the substitution works the same.
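
A sketch of that kind of edge-case check, comparing the regex with rstrip() on single lines. The function names and the list of cases are made up for illustration; a real test suite would use whatever inputs the script actually sees.

```python
import re


def strip_with_regex(line):
    return re.sub(r"\s+$", "", line)


def strip_with_rstrip(line):
    return line.rstrip()


# Edge cases mentioned above: "\r\n" endings, a missing "\n", empty input.
EDGE_CASES = [
    "text  \n",
    "text\t\r\n",
    "text  ",
    "text",
    "",
    "   ",
    "two words \n",
]

# Collect every case where the two implementations disagree.
mismatches = [
    case for case in EDGE_CASES
    if strip_with_regex(case) != strip_with_rstrip(case)
]
print(mismatches)
```

These cases are single lines. On a multi-line string, both versions only touch the very end of the string, so stripping every line needs a per-line loop or a multi-line regex.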
