regular expressions, substituting and adding in one step?

John Salerno

Ok, this might look familiar. I'd like to use regular expressions to
change this line:

self.source += '<p>' + paragraph + '</p>\n\n'

to read:

self.source += '<p>%s</p>\n\n' % paragraph

Now, matching the middle part and replacing it with '%s' is easy, but
how would I add the extra string to the end of the line? Is it done all
at once, or must I make a new regex to match?

Also, I figure I'd use a group to match the word 'paragraph', and use
that group to insert the word at the end, but how will I 'retain' the
state of \1 if I use more than one regex to do this?

I'd like to do this for several lines, so I'm trying not to make it too
specific (i.e., matching the entire line, for example, and then adding
text after it, if that's possible).

So the questions are, how do you use regular expressions to add text to
the end of a line, even if you aren't matching the end of the line in
the first place? Or does that entail using separate regexes that *do* do
this? If the latter, how do I retain the value of the groups taken from
the first re?

Thanks, hope that made some sense.
 
John Salerno

John said:
So the questions are, how do you use regular expressions to add text to
the end of a line, even if you aren't matching the end of the line in
the first place? Or does that entail using separate regexes that *do* do
this? If the latter, how do I retain the value of the groups taken from
the first re?

Here's what I have so far:

-----------

import re

txt_file = open(r'C:\Python24\myscripts\re_test.txt')
new_string = re.sub(r"' \+ ([a-z]+) \+ '", '%s', txt_file.read())
new_string = re.sub(r'$', ' % paragraph', new_string)
txt_file.close()

-----------

re_test.txt contains:

self.source += '<p>' + paragraph + '</p>\n\n'

Both substitutions work, but now I just need to figure out how to
replace the hard-coded ' % paragraph' parameter with something that uses
the group taken from the first regex. I'm guessing if I don't use it at
that time, then it's lost. I suppose I could create a MatchObject and
save group(1) as a variable for later use, but that would be a lot of
extra steps, so I wanted to see if there's a way to do it all at one
time with regular expressions.

Thanks.
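
A minimal sketch of the "save group(1) for later" idea described above, for comparison with a one-step substitution. It assumes a single matching line held in a string; the names `line`, `pattern`, and `new_line` are illustrative and not part of the original script.

```python
import re

# One line as it would be read from re_test.txt; the raw string keeps
# the \n sequences as literal backslash-n, just as they appear in the file.
line = r"self.source += '<p>' + paragraph + '</p>\n\n'"

pattern = re.compile(r"' \+ ([a-z]+) \+ '")

match = pattern.search(line)
if match:
    # Keep the captured name so it is still available after the substitution.
    name = match.group(1)
    new_line = pattern.sub("%s", line, count=1) + " % " + name
else:
    # No match: leave the line untouched.
    new_line = line

print(new_line)
```

This works, but it needs a search plus a substitution per line, which is the extra bookkeeping the question is trying to avoid.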
 
Paul McGuire

John Salerno said:
Ok, this might look familiar. I'd like to use regular expressions to
change this line:

self.source += '<p>' + paragraph + '</p>\n\n'

to read:

self.source += '<p>%s</p>\n\n' % paragraph

John -

You've been asking for re-based responses, so I apologize in advance for
this digression. Pyparsing is an add-on Python module that can provide a
number of features beyond just text matching and parsing. Pyparsing allows
you to define callbacks (or "parse actions") that get invoked during the
parsing process, and these callbacks can modify the matched text.

Since your re approach seems to be on a fairly convergent path, I felt I
needed to come up with more demanding examples to justify a pyparsing
solution. So I contrived these additional cases:

self.source += '<p>' + paragraph + '</p>\n\n'
listItem1 = '<li>' + someText + '</li>'
listItem2 = '<li>' + someMoreText + '</li>'
self.source += '<ul>' + listItem1 + '\n' + listItem2 + '\n' + '</ul>\n\n'

The following code processes these expressions. Admittedly, it is not as
terse as your re-based code samples have been, but it may give you another
data point in your pursuit of a solution. (The pyparsing home wiki is at
http://pyparsing.wikispaces.com.)

The purpose of the intermediate classes is to convert the individual terms
of the string expression into a list of string terms, either variable
references or quoted literals. This conversion is done in the term-specific
parse actions created by makeTermParseAction. Then the overall string
expression gets its own parse action, which processes the list of term
objects, and creates the modified string expression. Two different string
expression conversion functions are shown, one generating string
interpolation expressions, and one generating "".join() expressions.

Hope this helps, or is at least mildly entertaining,
-- Paul


================
from pyparsing import *

testLines = r"""
self.source += '<p>' + paragraph + '</p>\n\n'
listItem1 = '<li>' + someText + '</li>'
listItem2 = '<li>' + someMoreText + '</li>'
self.source += '<ul>' + listItem1 + '\n' + listItem2 + '\n' + '</ul>\n\n'
"""

# define some classes to use during parsing
class StringExprTerm(object):
    def __init__(self,content):
        self.content = content

class VarRef(StringExprTerm):
    pass

class QuotedLit(StringExprTerm):
    pass

def makeTermParseAction(cls):
    def parseAction(s,l,tokens):
        return cls(tokens[0])
    return parseAction

# define parts we want to recognize as terms in a string expression
varName = Word(alphas+"_", alphanums+"_")
varName.setParseAction( makeTermParseAction( VarRef ) )
quotedString.setParseAction( removeQuotes, makeTermParseAction( QuotedLit ) )
stringTerm = varName | quotedString

# define a string expression in terms of term expressions
PLUS = Suppress("+")
EQUALS = Suppress("=")
stringExpr = EQUALS + stringTerm + ZeroOrMore( PLUS + stringTerm )

# define a parse action, to be invoked every time a string expression is found
def interpolateTerms(originalString,locn,tokens):
    out = []
    refs = []
    terms = tokens
    for term in terms:
        if isinstance(term,QuotedLit):
            out.append( term.content )
        elif isinstance(term,VarRef):
            out.append( "%s" )
            refs.append( term.content )
        else:
            print("hey! this is impossible!")

    # generate string to be interpolated, and interp operator
    outstr = "'" + "".join(out) + "' % "

    # generate interpolation argument tuple
    if len(refs) > 1:
        outstr += "(" + ",".join(refs) + ")"
    else:
        outstr += ",".join(refs)

    # return generated string (don't forget leading = sign)
    return "= " + outstr

stringExpr.setParseAction( interpolateTerms )

# testLines begins and ends with a newline, so each label sits on its own line
print("Original:" + testLines)
print("Modified:" + stringExpr.transformString( testLines ))

# define slightly different parse action, to use list join instead of string interp
def createListJoin(originalString,locn,tokens):
    out = []
    terms = tokens
    for term in terms:
        if isinstance(term,QuotedLit):
            out.append( "'" + term.content + "'" )
        elif isinstance(term,VarRef):
            out.append( term.content )
        else:
            print("hey! this is impossible!")

    # generate the list literal whose items will be joined
    outstr = "[" + ",".join(out) + "]"

    # return generated string (don't forget leading = sign)
    return "= ''.join(" + outstr + ")"

del stringExpr.parseAction[:]
stringExpr.setParseAction( createListJoin )

print("Modified (2):" + stringExpr.transformString( testLines ))

================
Prints out:
Original:
self.source += '<p>' + paragraph + '</p>\n\n'
listItem1 = '<li>' + someText + '</li>'
listItem2 = '<li>' + someMoreText + '</li>'
self.source += '<ul>' + listItem1 + '\n' + listItem2 + '\n' + '</ul>\n\n'

Modified:
self.source += '<p>%s</p>\n\n' % paragraph
listItem1 = '<li>%s</li>' % someText
listItem2 = '<li>%s</li>' % someMoreText
self.source += '<ul>%s\n%s\n</ul>\n\n' % (listItem1,listItem2)

Modified (2):
self.source += ''.join(['<p>',paragraph,'</p>\n\n'])
listItem1 = ''.join(['<li>',someText,'</li>'])
listItem2 = ''.join(['<li>',someMoreText,'</li>'])
self.source += ''.join(['<ul>',listItem1,'\n',listItem2,'\n','</ul>\n\n'])
================
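
For comparison, the callback idea can be approximated with the standard library alone, since re.sub() accepts a function as the replacement. The following is only a sketch under stated assumptions: the names `TERM`, `EXPR`, `interpolate`, and `test_lines` are made up for illustration, quoted literals are assumed to contain no embedded single quotes or literal % signs, and it is not a substitute for a real parser.

```python
import re

test_lines = r"""
self.source += '<p>' + paragraph + '</p>\n\n'
listItem1 = '<li>' + someText + '</li>'
listItem2 = '<li>' + someMoreText + '</li>'
self.source += '<ul>' + listItem1 + '\n' + listItem2 + '\n' + '</ul>\n\n'
"""

# One term of a string expression: a single-quoted literal or a variable name.
TERM = r"(?:'[^']*'|[A-Za-z_]\w*)"
# "= term + term [+ term ...]": an equals sign followed by two or more terms.
EXPR = re.compile(r"=\s*(%s(?:\s*\+\s*%s)+)" % (TERM, TERM))

def interpolate(match):
    """Called by re.sub() for each match; returns the replacement text."""
    out, refs = [], []
    for term in re.findall(TERM, match.group(1)):
        if term.startswith("'"):
            out.append(term[1:-1])   # quoted literal: keep its content
        else:
            out.append("%s")         # variable: placeholder in the format string
            refs.append(term)
    if not refs:
        return match.group(0)        # only literals: leave the text unchanged
    args = refs[0] if len(refs) == 1 else "(" + ",".join(refs) + ")"
    return "= '" + "".join(out) + "' % " + args

result = EXPR.sub(interpolate, test_lines)
print(result)
```

The string returned by the function is inserted as-is, so the literal \n sequences in the source lines are not reinterpreted as escapes.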
 
Kent Johnson

John said:
Ok, this might look familiar. I'd like to use regular expressions to
change this line:

self.source += '<p>' + paragraph + '</p>\n\n'

to read:

self.source += '<p>%s</p>\n\n' % paragraph

Now, matching the middle part and replacing it with '%s' is easy, but
how would I add the extra string to the end of the line? Is it done all
at once, or must I make a new regex to match?

Also, I figure I'd use a group to match the word 'paragraph', and use
that group to insert the word at the end, but how will I 'retain' the
state of \1 if I use more than one regex to do this?

Do it all in one match / substitution using \1 to insert the value of
the paragraph group at the new location:

In [19]: test = "self.source += '<p>' + paragraph + '</p>\n\n'"

In [20]: re.sub(r"'<p>' \+ (.*?) \+ '</p>\n\n'", r"'<p>%s</p>\n\n' % \1", test)
Out[20]: "self.source += '<p>%s</p>\n\n' % paragraph"

Kent
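
To apply the same one-step idea to several lines without hard-coding the tag, something along these lines might work. It is a sketch, not a tested general solution: it assumes each line has exactly the shape "quoted literal + name + quoted literal" at the end of the line, and the names `source`, `pattern`, and `result` are illustrative. Lines with a different shape are left unchanged.

```python
import re

# Raw string, so the \n sequences stay as literal backslash-n,
# the way they would appear in a source file read from disk.
source = r"""self.source += '<p>' + paragraph + '</p>\n\n'
listItem1 = '<li>' + someText + '</li>'"""

# Group 1: everything before the first quoted literal
# Group 2: content of the first literal
# Group 3: the variable name
# Group 4: content of the last literal
# re.MULTILINE makes ^ and $ match at the start and end of every line.
pattern = re.compile(r"^(.*?)'([^']*)' \+ (\w+) \+ '([^']*)'$", re.MULTILINE)

# Rebuild each line in a single substitution, moving group 3 to the end.
result = pattern.sub(r"\1'\2%s\4' % \3", source)
print(result)
```

Because the pattern is anchored with $, the text appended after the closing quote lands at the end of the line, and all groups are still available in the same replacement.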
 
John Salerno

Kent said:
Do it all in one match / substitution using \1 to insert the value of
the paragraph group at the new location:

In [19]: test = "self.source += '<p>' + paragraph + '</p>\n\n'"

In [20]: re.sub(r"'<p>' \+ (.*?) \+ '</p>\n\n'", r"'<p>%s</p>\n\n' % \1", test)
Out[20]: "self.source += '<p>%s</p>\n\n' % paragraph"

Interesting. Thanks! I was just doing some more reading of the re
module, so now I understand sub() better. I'll give this a try too. Call
me crazy, but I'm interested in regular expressions right now. :)
 
Kent Johnson

John said:
Call
me crazy, but I'm interested in regular expressions right now. :)

Not crazy at all. REs are a powerful and useful tool that every
programmer should know how to use. They're just not the right tool for
every job!

Kent
 
John Salerno

Kent said:
They're just not the right tool for
every job!

Thank god for that! As easy as they've become to me (after seeming
utterly cryptic and impenetrable), they are still a little unwieldy.
Next step: learn how to write look-ahead and look-behind REs! :)
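
As a small taste of look-ahead and look-behind applied to the line from this thread, here is a hedged sketch. The names `line`, `pattern`, and `result` are illustrative, and note that Python's re module only allows fixed-width look-behind patterns.

```python
import re

line = r"self.source += '<p>' + paragraph + '</p>\n\n'"

# (?<=<p>)  look-behind: the match must be preceded by "<p>"
# (?=</p>)  look-ahead:  the match must be followed by "</p>"
# Neither assertion consumes text, so the tags themselves are not replaced.
pattern = re.compile(r"(?<=<p>)' \+ \w+ \+ '(?=</p>)")

result = pattern.sub("%s", line)
print(result)
```

This only handles the first half of the original problem (inserting the %s); moving the variable name to the end of the line still needs a capturing group.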
 
