Problem with re module

John Harrington · Mar 22, 2011

I'm trying to use the following substitution,

lineList=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList)

I intend this to match any string "\begin{document}" that doesn't end
in a line ending. If there's no line ending, then, I want to place
two carriage returns between the string and the non-line end
character.

However, this places carriage returns even when the string is followed
directly after with a line ending. Can someone explain to me why this
match is not behaving as I intend it to, especially the ([^$])?

Also, how can I write a regex that matches what I wish to match, as
described above?

Many thanks,
John

John Bokma · Mar 22, 2011

John Harrington said:
I'm trying to use the following substitution,

lineList=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList)

I intend this to match any string "\begin{document}" that doesn't end
in a line ending. If there's no line ending, then, I want to place
two carriage returns between the string and the non-line end
character.

However, this places carriage returns even when the string is followed
directly after with a line ending. Can someone explain to me why this
match is not behaving as I intend it to, especially the ([^$])?

[^$] matches: not a $ character

You might want [^\n]
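
To make the difference concrete, here is a small sketch (Python 3 syntax, with made-up sample strings that are not taken from John's file). Inside a character class, $ is just a literal dollar sign, so [^$] also matches the newline that follows \begin{document}, while [^\n] does not:

```python
import re

# Both patterns look for the literal text \begin{document} plus one more character.
dollar_class = re.compile(r'(\\begin{document})([^$])')    # any character except '$'
newline_class = re.compile(r'(\\begin{document})([^\n])')  # any character except a newline

at_eol = '\\begin{document}\n'        # sample: marker already at the end of a line
mid_line = '\\begin{document}Extra'   # sample: marker followed by more text

# The newline is "not a $", so [^$] matches it and the substitution fires.
print(repr(dollar_class.sub(r'\1\n\n\2', at_eol)))     # '\\begin{document}\n\n\n'

# [^\n] refuses the newline, so the line is left alone.
print(repr(newline_class.sub(r'\1\n\n\2', at_eol)))    # '\\begin{document}\n'

# Text on the same line is still pushed down.
print(repr(newline_class.sub(r'\1\n\n\2', mid_line)))  # '\\begin{document}\n\nExtra'
```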

Peter Otten · Mar 22, 2011

John said:
I'm trying to use the following substitution,

lineList=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList)

I intend this to match any string "\begin{document}" that doesn't end
in a line ending. If there's no line ending, then, I want to place
two carriage returns between the string and the non-line end
character.

However, this places carriage returns even when the string is followed
directly after with a line ending. Can someone explain to me why this
match is not behaving as I intend it to, especially the ([^$])?

Quoting http://docs.python.org/library/re.html:
"""
Special characters are not active inside sets. For example, [akm$] will
match any of the characters 'a', 'k', 'm', or '$';
"""

Also, how can I write a regex that matches what I wish to match, as
described above?


I think you want a "negative lookahead assertion", (?!...):
xxx\naaa xxx bbb\nxxx")
aaa bbb xxx
aaa xxx** bbb
xxx
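
The start of the interactive example above appears to have been lost, so the following is only a sketch of how a negative lookahead might be applied to the original problem. The pattern and the helper name split_after_begin are assumptions for illustration, not the code Peter posted:

```python
import re

# (?!\n|$) is a negative lookahead: it succeeds only when the text right after
# \begin{document} is neither a newline nor the end of the string.
# A lookahead consumes nothing, so no second group is needed.
pattern = re.compile(r'(\\begin{document})(?!\n|$)')

def split_after_begin(text):
    """Insert a blank line after the marker when more text follows on the same line."""
    return pattern.sub(r'\1\n\n', text)

print(repr(split_after_begin('\\begin{document}\nThis line stays put\n')))
# '\\begin{document}\nThis line stays put\n'
print(repr(split_after_begin('\\begin{document}Extra stuff here\n')))
# '\\begin{document}\n\nExtra stuff here\n'
```

Because the lookahead does not consume the following character, the replacement only has to re-emit group 1.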

John Harrington · Mar 22, 2011

John Bokma said:
[^$] matches: not a $ character

You might want [^\n]

Thank you, John.

I thought that when you use "r" before the regex, $ matches an end of
line. But, in any case, if I use "[^\n]" as you suggest I get the
same result.

Here's a script that illustrates the problem. Any help would be
appreciated!:

#BEGIN SCRIPT
import re

outlist = []
myfile = "raw.tex"

fin = open(myfile, "r")
lineList = fin.readlines()
fin.close()

for i in range(0,len(lineList)):
    lineList[i]=re.sub(r'(\\begin{document})([^\n])',r'\1\n\n\2',lineList[i])
    outlist.append(lineList[i])

fou = open(myfile, "w")
for i in range(len(outlist)):
    fou.write(outlist[i])
fou.close()
#END SCRIPT

And the file raw.tex:

%BEGIN TeX FILE
\begin{document}
This line should remain right after the above line in the output, but
doesn't

\begin{document}Extra stuff here should appear below the begin line
and does in the output.
%END TeX FILE

Benjamin Kaplan · Mar 22, 2011

John Harrington said:
I thought that when you use "r" before the regex, $ matches an end of
line. But, in any case, if I use "[^\n]" as you suggest I get the
same result.

r before a string has nothing to do with regexes. It signals a raw
string: backslash escape sequences won't be interpreted.

>>> print 'a\tb'
a       b
>>> print r'a\tb'
a\tb

We use raw strings for regexes because otherwise you'd have to
remember to double up all your backslashes, and double up your
doubled-up backslashes when you really want a backslash.
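
A short sketch of the same point in Python 3 syntax (the variable names are just for illustration). It compares a raw and a non-raw literal, then shows how many backslashes the \begin{document} pattern needs in each form:

```python
import re

normal = 'a\tb'   # \t is converted to a single tab character
raw = r'a\tb'     # the backslash and the 't' stay as two separate characters

print(len(normal))  # 3
print(len(raw))     # 4

# The regex engine needs two backslashes to match one literal backslash.
# Without the r prefix, each of those must itself be doubled in the source.
with_raw = r'\\begin{document}'
without_raw = '\\\\begin{document}'
print(with_raw == without_raw)  # True

# Either spelling matches the literal text \begin{document}.
print(bool(re.search(with_raw, '\\begin{document}')))  # True
```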

John Harrington said:
Here's a script that illustrates the problem. Any help would be
appreciated!:

Works for me. Do you have a space after the \begin{document} or
something? Because that gets moved. You might want to check for
non-whitespace characters in the regex instead of just non-newlines.

John Harrington · Mar 22, 2011

Benjamin Kaplan said:

r before a string has nothing to do with regexes. It signals a raw
string: backslash escape sequences won't be interpreted.

>>> print 'a\tb'
a       b
>>> print r'a\tb'
a\tb

We use raw strings for regexes because otherwise you'd have to
remember to double up all your backslashes, and double up your
doubled-up backslashes when you really want a backslash.

Works for me. Do you have a space after the \begin{document} or
something? Because that gets moved. You might want to check for
non-whitespace characters in the regex instead of just non-newlines.

Matching the non-whitespace works, but I'm troubled I can't match a
non-end-of-line. No, there was no space after the string.

Thank you for your help, Ben
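
The thread never establishes why [^\n] misbehaved on John's file. One possible explanation, offered purely as an assumption, is Windows-style line endings. If a line ends in "\r\n", the "\r" is a non-newline character, so [^\n] matches it, whereas \S (non-whitespace) does not. A sketch with made-up sample strings:

```python
import re

newline_class = re.compile(r'(\\begin{document})([^\n])')
non_space = re.compile(r'(\\begin{document})(\S)')

unix_line = '\\begin{document}\n'
dos_line = '\\begin{document}\r\n'    # hypothetical: file saved with CRLF line endings
mid_line = '\\begin{document}Extra\n'

# With LF endings both patterns leave the line untouched.
print(repr(newline_class.sub(r'\1\n\n\2', unix_line)))  # '\\begin{document}\n'
print(repr(non_space.sub(r'\1\n\n\2', unix_line)))      # '\\begin{document}\n'

# With CRLF endings, [^\n] matches the '\r' and inserts the blank line anyway.
print(repr(newline_class.sub(r'\1\n\n\2', dos_line)))   # '\\begin{document}\n\n\r\n'
print(repr(non_space.sub(r'\1\n\n\2', dos_line)))       # '\\begin{document}\r\n'

# Real text after the marker is moved down by both patterns.
print(repr(non_space.sub(r'\1\n\n\2', mid_line)))       # '\\begin{document}\n\nExtra\n'
```

Python 3 translates "\r\n" to "\n" when reading in text mode by default, so this effect would mainly show up with Python 2 on a non-Windows system or when reading in binary mode.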

Ethan Furman · Mar 22, 2011

John said:
Here's a script that illustrates the problem. Any help would be
appreciated!:

Here's the important tidbit:

re.sub(r'(\\begin{document})(.+)', r'\1\n\n\2', line)

From the docs:
'.'
(Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character
including a newline.

'+'
Causes the resulting RE to match 1 or more repetitions of the preceding
RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
not match just ‘a’.
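
A quick sketch of what those two doc excerpts mean for this pattern, in Python 3 syntax with made-up sample strings. The DOTALL variant is included only for contrast and is not something the thread recommends:

```python
import re

pattern = re.compile(r'(\\begin{document})(.+)')

# '.' does not match '\n' by default and '+' needs at least one character,
# so a marker that already sits at the end of its line is not matched.
print(repr(pattern.sub(r'\1\n\n\2', '\\begin{document}\n')))
# '\\begin{document}\n'

# Text on the same line is captured by (.+) and moved below a blank line.
print(repr(pattern.sub(r'\1\n\n\2', '\\begin{document}Extra stuff\n')))
# '\\begin{document}\n\nExtra stuff\n'

# With DOTALL, '.' also matches the newline, so the unwanted insertion comes back.
dotall = re.compile(r'(\\begin{document})(.+)', re.DOTALL)
print(repr(dotall.sub(r'\1\n\n\2', '\\begin{document}\n')))
# '\\begin{document}\n\n\n'
```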

And here's the entire program, a bit more pythonically:

8<---------------------------------------------------------------
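import re

outlist = []
myfile = "raw.tex"

# read the whole file into a list of lines
fin = open(myfile, "r")
lineList = fin.readlines()
fin.close()

# push anything that follows \begin{document} on the same line down two lines
for line in lineList:
    line = re.sub(r'(\\begin{document})(.+)', r'\1\n\n\2', line)
    outlist.append(line)

# write the result back over the original file
fou = open(myfile, "w")
for line in outlist:
    fou.write(line)
fou.close()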
8<---------------------------------------------------------------

Hope this helps!

~Ethan~
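
For comparison, the same program could be sketched in Python 3 using with blocks, which close the files automatically. The function name rewrite_tex is hypothetical and the regex is the one from Ethan's post:

```python
import re

PATTERN = re.compile(r'(\\begin{document})(.+)')

def rewrite_tex(path):
    """Rewrite the file in place, moving text that follows the marker below a blank line."""
    # Read every line first so the file can be reopened for writing afterwards.
    with open(path, 'r') as fin:
        lines = fin.readlines()

    # Apply the substitution line by line, as in the original script.
    new_lines = [PATTERN.sub(r'\1\n\n\2', line) for line in lines]

    with open(path, 'w') as fou:
        fou.writelines(new_lines)
```

Calling rewrite_tex("raw.tex") would then process the sample file from the thread.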
