re.sub does not replace all occurrences

Christoph Krammer · Aug 7, 2007

Hello everybody,

I wanted to use re.sub to strip all HTML tags out of a given string. I
learned that there are better ways to do this without the re module,
but I would like to know why my code is not working. I use the
following:

def stripHtml(source):
    source = re.sub("[\n\r\f]", " ", source)
    source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
    source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
    return source

But the result still has some tags in it. When I call the second line
multiple times, all tags disappear, but since HTML tags cannot be
overlapping, I do not understand this behavior. There is even a
difference when I omit the re.I (IGNORECASE) option. Without this
option, some tags containing only capital letters (like </FONT>) were
kept in the string when doing one processing run but removed when
doing multiple runs.

Perhaps anyone can tell me why this regex is behaving like this.

Thanks and regards,
Christoph

Marc 'BlackJack' Rintsch · Aug 7, 2007

Can you give some example HTML where it fails?

Ciao,
Marc 'BlackJack' Rintsch

Neil Cerutti · Aug 7, 2007

Help on function sub in module re:

sub(pattern, repl, string, count=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl. repl can be either a string or a callable;
    if a callable, it's passed the match object and must return
    a replacement string to be used.

And from the Python Library Reference for re.sub:

The pattern may be a string or an RE object; if you need to
specify regular expression flags, you must use a RE object,
or use embedded modifiers in a pattern; for example,
"sub("(?i)b+", "x", "bbbb BBBB")" returns 'x x'.

The optional argument count is the maximum number of pattern
occurrences to be replaced; count must be a non-negative
integer. If omitted or zero, all occurrences will be
replaced. Empty matches for the pattern are replaced only
when not adjacent to a previous match, so "sub('x*', '-',
'abc')" returns '-a-b-c-'.

In other words, the fourth argument to sub is count, not a set of
re flags.
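
To make this concrete, here is a minimal sketch (not part of the original thread) of how the flag value ends up acting as a replacement limit. The sample string `html` and the names `flags_as_int`, `partial`, `tag_re`, `clean_compiled` and `clean_keyword` are made up for illustration. The sketch passes `count=` explicitly to reproduce what the positional call in the question does, then shows two ways to actually supply flags: a compiled pattern, as the quoted documentation suggests, and the `flags=` keyword, which `re.sub` accepts in newer Python versions (2.7 and 3.1 onward) but which is not in the signature quoted above.

```python
import re

# Hypothetical sample input: 30 opening and 30 closing tags, 60 in total.
html = "<p>" * 30 + "text" + "</p>" * 30

# Flags are plain integers, so in the fourth position they are read as count.
flags_as_int = int(re.S | re.I | re.M)  # 16 + 2 + 8 = 26

# Equivalent to the call in the question: at most 26 replacements are made.
partial = re.sub("<.*?>", "", html, count=flags_as_int)
remaining = len(re.findall("<.*?>", partial))  # 60 - 26 tags are left over

# Option 1: compile the pattern with the flags, then call its sub method.
tag_re = re.compile("<.*?>", re.S | re.I | re.M)
clean_compiled = tag_re.sub("", html)

# Option 2 (Python 2.7 / 3.1 and later): pass the flags by keyword.
clean_keyword = re.sub("<.*?>", "", html, flags=re.S | re.I | re.M)

print(flags_as_int, remaining, clean_compiled, clean_keyword)
```

This would also account for the difference seen when re.I is dropped: the limit changes from 26 to 24 (re.S | re.M), so a different number of tags survives a single pass, and each further pass removes another batch. The same applies to the entity pattern in the question, where re.I in the fourth position limits the call to 2 replacements.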

Christoph Krammer · Aug 7, 2007

Neil said:
In other words, the fourth argument to sub is count, not a set of
re flags.

I knew it had to be something very stupid.

Thanks a lot.
