[newbie] Strange behavior of the re module

Fred · Aug 21, 2004

Hi,

While parsing through a bunch of HTML pages using the latest
ActivePython, I experienced something funny using the re module. I
extracted the part that generates the errors (I'm just trying to
substitute one item with another in a string):

--------------------------------
import re

#NOK : doesn't like a single, ending backslash
#stuff = "\colortbl\red0\green0\"
# => SyntaxError: EOL while scanning single-quoted string

#NOK : doesn't like gn0?

stuff="\colortbl\red0\gn0"

# => traceback (most recent call last):
# File "C:\test.py", line 10, in ?
# template = re.sub('BLA', stuff, template)
# File "G:\Python23\lib\sre.py", line 143, in sub
# return _compile(pattern, 0).sub(repl, string, count)
# File "G:\Python23\lib\sre.py", line 257, in _subx
# template = _compile_repl(template, pattern)
# File "G:\Python23\lib\sre.py", line 244, in _compile_repl
# raise error, v # invalid expression
#sre_constants.error: bad group name

#OK....
stuff="\colortbl\red0\n0"

template = "BLA"

template = re.sub('BLA', stuff, template)
--------------------------------

=> It appears that the re module isn't very friendly with backslashes,
at least on the Windows platform. Does someone know why, and what I
could do, since I can't rewrite the source HTML documents that contain
backslashes.

Thank you
Fred.

Hans Nowak · Aug 21, 2004

Fred said:
stuff="\colortbl\red0\n0"

template = "BLA"

template = re.sub('BLA', stuff, template)
--------------------------------

=> It appears that the re module isn't very friendly with backslashes,
at least on the Windows platform. Does someone know why, and what I
could do, since I can't rewrite the source HTML documents that contain
backslashes.

It's not the re module, it's that backslashes have special meaning in string
literals. See also:

http://docs.python.org/tut/node5.html#SECTION005120000000000000000

http://docs.python.org/ref/strings.html

To use a non-escaping backslash in a string literal, use a double backslash:

stuff = "\\colortbl\\red0\\n0"

or a raw string:

stuff = r"\colortbl\red0\n0"
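A minimal sketch of the difference between the two spellings, written for current Python 3 (an assumption; the thread uses Python 2.3). The variable names are made up for illustration. It shows that the doubled and raw forms produce the same characters, while a plain literal silently turns \r and \n into control characters:

```python
# Both spellings produce the same characters: three real backslashes.
doubled = "\\colortbl\\red0\\n0"
raw = r"\colortbl\red0\n0"
same = doubled == raw
backslashes = raw.count("\\")

# In a plain literal, \r and \n are escape sequences, so no backslash survives.
cooked = "\red0\n0"            # CR, 'e', 'd', '0', LF, '0'
cooked_len = len(cooked)
raw_len = len(r"\red0\n0")     # backslashes are kept, so two more characters
```

Note that the r prefix only affects how a literal in source code is read. A string read from a file already contains its backslashes as ordinary characters.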

HTH,

Fred · Aug 21, 2004

To use a non-escaping backslash in a string literal, use a double backslash:

stuff = "\\colortbl\\red0\\n0"

or a raw string:

stuff = r"\colortbl\red0\n0"

Thx Hans for the prompt answer. I'll have to use the second form since
I can't modify the content of the HTML pages I'm looping through...
but no matter which option I use (either r or R), Python is still not
happy:

---------------------------------------
import re

#NOK
stuff=r"\colortbl\red0\gn0"
#NOK
stuff=R"\colortbl\red0\gn0"

template = "BLA"
template = re.sub('BLA', stuff, template)
---------------------------------------

Traceback (most recent call last):
File "C:\test.py", line 9, in ?
template = re.sub('BLA', stuff, template)
File "G:\Python23\lib\sre.py", line 143, in sub
return _compile(pattern, 0).sub(repl, string, count)
File "G:\Python23\lib\sre.py", line 257, in _subx
template = _compile_repl(template, pattern)
File "G:\Python23\lib\sre.py", line 244, in _compile_repl
raise error, v # invalid expression
sre_constants.error: bad group name

Maybe the r/R prefix is not available in ActivePython?

Thanks
Fred.

Tim Peters · Aug 21, 2004

Fred said:
import re

#NOK
stuff=r"\colortbl\red0\gn0"
#NOK
stuff=R"\colortbl\red0\gn0"

template = "BLA"
template = re.sub('BLA', stuff, template)
---------------------------------------

Traceback (most recent call last):
File "C:\test.py", line 9, in ?
template = re.sub('BLA', stuff, template)
File "G:\Python23\lib\sre.py", line 143, in sub
return _compile(pattern, 0).sub(repl, string, count)
File "G:\Python23\lib\sre.py", line 257, in _subx
template = _compile_repl(template, pattern)
File "G:\Python23\lib\sre.py", line 244, in _compile_repl
raise error, v # invalid expression
sre_constants.error: bad group name

I can't figure out what you're trying to accomplish here, but the
error msg makes sense. You should pause to read the docs for re.sub.
In

re.sub('BLA', stuff, template)

'BLA' is the regular expression, stuff is the substitution pattern,
and template is the input string. As the docs say, \g in the
substitution pattern has special meaning, specifying the name of a
named capturing group. Your regular expression ('BLA') has no
capturing groups (let alone named ones), so using \g in the
substitution pattern can't work.
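To make the point about \g concrete, here is a small sketch for current Python 3 (an assumption; the traceback above is from Python 2.3, and the exact error text differs between versions). The group names first and second are invented for the example:

```python
import re

# \g<name> in the replacement refers to a named group in the pattern.
swapped = re.sub(
    r"(?P<first>\w+) (?P<second>\w+)",
    r"\g<second> \g<first>",
    "hello world",
)

# With no such group (and no <name> after \g), the replacement is rejected.
try:
    re.sub("BLA", r"\gn0", "BLA")
    failed = False
except re.error as exc:
    failed = True
    reason = str(exc)  # wording varies by Python version
```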

If you really want to search for the regular expression 'BLA' in
template and replace each occurrence with the string

r"\colortbl\red0\gn0"

then you need to escape all characters with special meaning in the
substitution pattern, via re.escape():
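The code that followed this sentence is not present here. As a hedged sketch rather than the original example: on current Python 3 (an assumption), the re documentation advises against re.escape() for replacement strings and says only backslashes should be escaped. Two ways to get a literal replacement are doubling the backslashes or passing a callable, which re.sub() uses verbatim:

```python
import re

stuff = r"\colortbl\red0\gn0"
template = "BLA"

# Option 1: double every backslash so the replacement parser sees \\ pairs.
via_doubling = re.sub("BLA", stuff.replace("\\", "\\\\"), template)

# Option 2: pass a function; its return value is inserted without parsing.
via_callable = re.sub("BLA", lambda match: stuff, template)
```

Either way the result is the original string with its three backslashes intact.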

Fred · Aug 21, 2004

I can't figure out what you're trying to accomplish here, but the
error msg makes sense.

I'm actually writing a script that extracts parts of HTML pages, but
some pages contain backslashes, which is why the script failed when
massaging those particular pages.

That did it

Thx a bunch.

Fred.

Fred · Aug 21, 2004

That did it Thx a bunch.

Mmmm... The above links and hints did teach more about the infamous
"backslash plague", but I'm still stuck because all the examples
consider static strings, while I'm building it dynamically by
extracting data from a web page through the re module:

----------------------------------------------
import sys
import re

#1. Extract stuff between BODY tags
input = "<body>c:\temp</body>"
body = re.search('<body.*?>(.*?)</body>',input,re.IGNORECASE |
re.DOTALL)
if body:
    body = body.group(1)
    print "Body = " + body

#2. Insert extracted stuff into template
output = "<body>here's the path: </body>"
output = re.sub('</body>', body + "</body>", output)
print output
----------------------------------------------

I also tried running this before so that the problem would go away,
but Python doesn't like it either:

body = re.sub(r'\',r'\\',input)
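The line above fails because a raw string literal cannot end in a single backslash: r'\' is read as an unterminated string. A minimal sketch of a working spelling, assuming current Python 3 and using made-up variable names:

```python
import re

one_backslash = "\\"      # a string holding a single backslash character
pattern = r"\\"           # regex that matches one literal backslash
text = "c:" + one_backslash + "temp"

# Replacement r"\\\\" is two escaped backslashes, so each match is doubled.
doubled = re.sub(pattern, r"\\\\", text)

# The original spelling does not even compile.
try:
    compile("x = r'\\'", "<demo>", "exec")
    raw_ok = True
except SyntaxError:
    raw_ok = False
```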

The script does run, but
Fred.

Fred · Aug 21, 2004

The script does run, but

Guess I hit the Send button instead of Save ;-)

OK, for those interested, here's some working code, although it's
pretty slow (2mn30 when massaging a 200KB file on a P3 host):

--------------------
#The goal is to read an HTML file, extract whatever's between <body>
#and </body>, read a template file, and insert what we extracted from
#the first document:

import sys
import re

fp=open("./mydoc.html")
input = fp.read()
fp.close()

#Needed if the document contains any backslash
input = input.replace('\\', '\\\\')
body = re.search('<body.*?>(.*?)</body>',input,re.IGNORECASE |
re.DOTALL)
if body:
    body = body.group(1)
else:
    body = "no body section found"

fp=open("./template.tpl")
output = fp.read()
fp.close()

body = body + "</body>"
output = re.sub('</body>', body, output)
fp=open("./mynewfile.html","w")
fp.write(output)
fp.close()
--------------------

Sion Arrowsmith · Aug 23, 2004

Fred said:
output = re.sub('</body>', body, output)

Here's another hint: string.replace() is a lot faster than re.sub(),
and doesn't require any extra escaping of the replacement string.

Regular expressions are a bit of a Swiss Army knife in Python.
They'll do the job, but the proper tool will do it better.
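A short sketch of the hint, assuming current Python 3 and reusing the sample strings from earlier in the thread. The replace() method treats both arguments as plain text, so backslashes in the extracted body need no doubling:

```python
import re

body = r"c:\temp"                          # stands in for text extracted from a page
output = "<body>here's the path: </body>"

# Plain string replacement: no escaping rules apply to either argument.
with_replace = output.replace("</body>", body + "</body>")

# Equivalent re.sub() call, using a callable so the body is not parsed.
with_sub = re.sub("</body>", lambda match: body + "</body>", output)
```

Both calls give the same result here; replace() simply avoids the replacement-string parsing step.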

Fred · Aug 24, 2004

Here's another hint: string.replace() is a lot faster than re.sub(),
and doesn't require any extra escaping of the replacement string.

Indeed. The replacing line on the same 200KB document takes over
2:30mn using re.sub() but... less than a second with output.replace().

Thx a bunch for the tip

Fred.
