[newbie] Strange behavior of the re module

Fred

Hi,

While parsing through a bunch of HTML pages using the latest
ActivePython, I experienced something funny using the re module. I
extracted the part that generates the errors (I'm just trying to
substitute one item with another in a string):

--------------------------------
import re

#NOK : doesn't like a single, ending backslash
#stuff = "\colortbl\red0\green0\"
# => SyntaxError: EOL while scanning single-quoted string

#NOK : doesn't like gn0? :)
stuff="\colortbl\red0\gn0"

# => traceback (most recent call last):
# File "C:\test.py", line 10, in ?
# template = re.sub('BLA', stuff, template)
# File "G:\Python23\lib\sre.py", line 143, in sub
# return _compile(pattern, 0).sub(repl, string, count)
# File "G:\Python23\lib\sre.py", line 257, in _subx
# template = _compile_repl(template, pattern)
# File "G:\Python23\lib\sre.py", line 244, in _compile_repl
# raise error, v # invalid expression
#sre_constants.error: bad group name

#OK....
stuff="\colortbl\red0\n0"

template = "BLA"

template = re.sub('BLA', stuff, template)
--------------------------------

=> It appears that the re module isn't very friendly with backslashes,
at least on the Windows platform. Does someone know why, and what I
could do, since I can't rewrite the source HTML documents that contain
backslashes.

Thank you
Fred.
 
Hans Nowak

Fred said:
stuff="\colortbl\red0\n0"

template = "BLA"

template = re.sub('BLA', stuff, template)
--------------------------------

=> It appears that the re module isn't very friendly with backslashes,
at least on the Windows platform. Does someone know why, and what I
could do, since I can't rewrite the source HTML documents that contain
backslashes.

It's not the re module, it's that backslashes have special meaning in string
literals. See also:

http://docs.python.org/tut/node5.html#SECTION005120000000000000000

http://docs.python.org/ref/strings.html

To use a non-escaping backslash in a string literal, use a double backslash:

stuff = "\\colortbl\\red0\\n0"

or a raw string:

stuff = r"\colortbl\red0\n0"
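To make the difference concrete, here is a small sketch (written for current Python 3; the variable names are only for illustration) comparing the two spellings with a literal where \n really is an escape:

```python
# Doubled backslashes and a raw string spell the same text.
plain = "\\colortbl\\red0\\n0"
raw = r"\colortbl\red0\n0"

# Without either precaution, \n becomes a single newline character.
escaped = "a\n0"

print(plain == raw)    # the two spellings are identical strings
print(len(raw))        # every backslash counts as one character
print(len(escaped))    # 'a', newline, '0'
```

Note that this only settles how the literal is read by Python; it says nothing yet about how re.sub() treats backslashes in its replacement argument.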

HTH,
 
Fred

Hans Nowak said:
To use a non-escaping backslash in a string literal, use a double backslash:

stuff = "\\colortbl\\red0\\n0"

or a raw string:

stuff = r"\colortbl\red0\n0"

Thx Hans for the prompt answer. I'll have to use the second form since
I can't modify the content of the HTML pages I'm looping through...
but no matter which option I use (either r or R), Python is still not
happy:

---------------------------------------
import re

#NOK
stuff=r"\colortbl\red0\gn0"
#NOK
stuff=R"\colortbl\red0\gn0"

template = "BLA"
template = re.sub('BLA', stuff, template)
---------------------------------------

Traceback (most recent call last):
File "C:\test.py", line 9, in ?
template = re.sub('BLA', stuff, template)
File "G:\Python23\lib\sre.py", line 143, in sub
return _compile(pattern, 0).sub(repl, string, count)
File "G:\Python23\lib\sre.py", line 257, in _subx
template = _compile_repl(template, pattern)
File "G:\Python23\lib\sre.py", line 244, in _compile_repl
raise error, v # invalid expression
sre_constants.error: bad group name

Maybe the r/R prefix is not available in ActivePython?

Thanks
Fred.
 
Tim Peters

Fred said:
import re

#NOK
stuff=r"\colortbl\red0\gn0"
#NOK
stuff=R"\colortbl\red0\gn0"

template = "BLA"
template = re.sub('BLA', stuff, template)
---------------------------------------

Traceback (most recent call last):
File "C:\test.py", line 9, in ?
template = re.sub('BLA', stuff, template)
File "G:\Python23\lib\sre.py", line 143, in sub
return _compile(pattern, 0).sub(repl, string, count)
File "G:\Python23\lib\sre.py", line 257, in _subx
template = _compile_repl(template, pattern)
File "G:\Python23\lib\sre.py", line 244, in _compile_repl
raise error, v # invalid expression
sre_constants.error: bad group name

I can't figure out what you're trying to accomplish here, but the
error msg makes sense. You should pause to read the docs for re.sub.
In

re.sub('BLA', stuff, template)

'BLA' is the regular expression, stuff is the substitution pattern,
and template is the input string. As the docs say, \g in the
substitution pattern has special meaning, specifying the name of a
named capturing group. Your regular expression ('BLA') has no
capturing groups (let alone named ones), so using \g in the
substitution pattern can't work.
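A short sketch of that behaviour, assuming current Python 3 (where the error message differs from the 2.3 traceback above, but the cause is the same); the group name "word" is made up for the example:

```python
import re

# \g<name> in the replacement refers to a named group in the pattern.
wrapped = re.sub(r'(?P<word>BLA)', r'[\g<word>]', 'BLA BLA')
print(wrapped)

# A bare \g with no valid group reference makes the replacement
# template invalid, so re.sub() raises re.error.
failed = False
try:
    re.sub('BLA', r'\gn0', 'BLA')
except re.error as exc:
    failed = True
    print("re.error:", exc)
```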

If you really want to search for the regular expression 'BLA' in
template and replace each occurrence with the string

r"\colortbl\red0\gn0"

then you need to escape all characters with special meaning in the
substitution pattern, via re.escape():
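A minimal sketch of that idea, assuming current Python 3 and not necessarily the code originally posted. The second option, passing a function as the replacement, is an alternative added here for comparison:

```python
import re

stuff = r"\colortbl\red0\gn0"
template = "BLA"

# Option 1: escape the replacement so re.sub() sees no special sequences.
# For this particular string only the backslashes get escaped.
via_escape = re.sub('BLA', re.escape(stuff), template)
print(via_escape)

# Option 2: pass a function; its return value is inserted as-is,
# with no backslash processing at all.
via_function = re.sub('BLA', lambda match: stuff, template)
print(via_function)
```

re.escape() is designed for patterns rather than replacements, so on strings containing other punctuation it may leave extra backslashes in the output; the function form sidesteps that.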
 
Fred

Tim Peters said:
I can't figure out what you're trying to accomplish here, but the
error msg makes sense.

I'm actually writing a script that extracts parts of HTML pages, but
some pages contain backslashes, which is why the script failed when
massaging those particular pages.

That did it :) Thx a bunch.

Fred.
 
Fred

That did it :) Thx a bunch.

Mmmm... The above links and hints did teach me more about the infamous
"backslash plague", but I'm still stuck because all the examples
use static strings, while I'm building mine dynamically by
extracting data from a web page through the re module:

----------------------------------------------
import sys
import re

#1. Extract stuff between BODY tags
input = "<body>c:\temp</body>"
body = re.search('<body.*?>(.*?)</body>', input, re.IGNORECASE | re.DOTALL)
if body:
    body = body.group(1)
    print("Body = " + body)

#2. Insert extracted stuff into template
output = "<body>here's the path: </body>"
output = re.sub('</body>', body + "</body>", output)
print(output)
----------------------------------------------

I also tried running this before so that the problem would go away,
but Python doesn't like it either:

body = re.sub(r'\',r'\\',input)

The script does run, but
Fred.
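For reference, the r'\' spelling is rejected because a raw string literal cannot end in a single backslash: the backslash still protects the closing quote from ending the literal. A sketch of a spelling that does parse, assuming current Python 3 and a made-up sample string:

```python
import re

text = "c:" + "\\" + "temp"      # c:\temp with one real backslash

# As a regex pattern, a literal backslash is written r'\\'.
# In the replacement, r'\\\\' stands for two literal backslashes.
doubled = re.sub(r'\\', r'\\\\', text)
print(doubled)

# The same doubling without the re module.
print(doubled == text.replace('\\', '\\\\'))
```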
 
Fred

The script does run, but

Guess I hit the Send button instead of Save ;-)

OK, for those interested, here's some working code, although it's
pretty slow (2 min 30 s when massaging a 200KB file on a P3 host):

--------------------
# The goal is to read an HTML file, extract whatever's between <body>
# and </body>, read a template file, and insert what we extracted from
# the first document.

import sys
import re

fp = open("./mydoc.html")
input = fp.read()
fp.close()

# Needed if the document contains any backslash, because re.sub()
# treats backslashes in the replacement string as escapes
input = input.replace('\\', '\\\\')
body = re.search('<body.*?>(.*?)</body>', input, re.IGNORECASE | re.DOTALL)
if body:
    body = body.group(1)
else:
    body = "no body section found"

fp = open("./template.tpl")
output = fp.read()
fp.close()

body = body + "</body>"
output = re.sub('</body>', body, output)
fp = open("./mynewfile.html", "w")
fp.write(output)
fp.close()
--------------------
 
Sion Arrowsmith

Fred said:
output = re.sub('</body>', body, output)

Here's another hint: string.replace() is a lot faster than re.sub(),
and doesn't require any extra escaping of the replacement string.

Regular expressions are a bit of a Swiss Army knife in Python.
They'll do the job, but the proper tool will do it better.
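A minimal sketch of the same insertion done with the string method, assuming current Python 3; the sample strings are made up for illustration:

```python
# Hypothetical extracted text containing backslashes.
body = r"c:\temp\new" + "</body>"
output = "<body>here's the path: </body>"

# str.replace() inserts the replacement verbatim: no regex engine and
# no escape processing, so the backslashes need no doubling.
result = output.replace("</body>", body)
print(result)
```

With this approach the earlier step of doubling every backslash in the input is no longer needed.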
 
Fred

Sion Arrowsmith said:
Here's another hint: string.replace() is a lot faster than re.sub(),
and doesn't require any extra escaping of the replacement string.

Indeed. The replacing line on the same 200KB document takes over
2 min 30 s using re.sub() but... less than a second with output.replace().

Thx a bunch for the tip :)
Fred.
 
