Problem with re module

John Harrington

I'm trying to use the following substitution,

lineList=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList)

I intend this to match any "\begin{document}" that is not immediately
followed by a line ending. In that case, I want to insert two newlines
between the string and the character that follows it.

However, this inserts the newlines even when the string is directly
followed by a line ending. Can someone explain to me why this match is
not behaving as I intend it to, especially the ([^$])?

Also, how can I write a regex that matches what I wish to match, as
described above?

Many thanks,
John
 
John Bokma

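[^$] matches any single character that is not a literal $. Inside a
character class, $ is not an end-of-line anchor.

You might want [^\n], which matches any character other than a newline.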
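
A minimal sketch of the difference between the two character classes. The sample strings are made up for illustration and are not taken from John's file:

```python
import re

# Hypothetical sample inputs (not from the original post).
followed_by_newline = "\\begin{document}\nText on the next line\n"
followed_by_text = "\\begin{document}Extra stuff\n"

# Inside [...] the $ is a literal dollar sign, so [^$] also matches "\n".
dollar_class = re.compile(r'(\\begin{document})([^$])')
# [^\n] matches any single character except a newline.
newline_class = re.compile(r'(\\begin{document})([^\n])')

dollar_hits_newline = dollar_class.search(followed_by_newline) is not None
newline_hits_newline = newline_class.search(followed_by_newline) is not None
fixed = newline_class.sub(r'\1\n\n\2', followed_by_text)

print(dollar_hits_newline)   # True: the newline counts as "not a $"
print(newline_hits_newline)  # False: nothing to substitute
print(repr(fixed))           # '\\begin{document}\n\nExtra stuff\n'
```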
 
Peter Otten

Quoting http://docs.python.org/library/re.html:
"""
Special characters are not active inside sets. For example, [akm$] will
match any of the characters 'a', 'k', 'm', or '$';
"""

John said:
Also, how can I write a regex that matches what I wish to match, as
described above?

I think you want a "negative lookahead assertion", (?!...):
xxx\naaa xxx bbb\nxxx")
aaa bbb xxx
aaa xxx** bbb
xxx
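
The interactive example above lost its opening in this copy of the thread. Here is a sketch that produces the same three output lines, assuming the pattern was xxx(?!$) compiled with re.MULTILINE and the replacement was \g<0>**. Both are guesses, since the original call is not recoverable. The second half applies the same idea to \begin{document} on a made-up input string:

```python
import re

text = "aaa bbb xxx\naaa xxx bbb\nxxx"

# (?!$) is a negative lookahead: with re.MULTILINE it succeeds only
# where "xxx" is NOT immediately followed by the end of a line.
marked = re.compile(r"xxx(?!$)", re.MULTILINE).sub(r"\g<0>**", text)
print(marked)
# aaa bbb xxx
# aaa xxx** bbb
# xxx

# Same technique for the LaTeX case (hypothetical sample input).
tex = "\\begin{document}\nkept as is\n\\begin{document}Extra stuff\n"
separated = re.sub(r"(\\begin{document})(?!$)", r"\1\n\n", tex,
                   flags=re.MULTILINE)
print(separated)
```

Because the lookahead consumes no characters, no second group is needed in the replacement.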
 
John Harrington

Thank you, John.

I thought that when you use "r" before the regex, $ matches an end of
line. But, in any case, if I use "[^\n]" as you suggest I get the
same result.

Here's a script that illustrates the problem. Any help would be
appreciated!:

#BEGIN SCRIPT
import re

outlist = []
myfile = "raw.tex"

fin = open(myfile, "r")
lineList = fin.readlines()
fin.close()

for i in range(0,len(lineList)):
    lineList[i]=re.sub(r'(\\begin{document})([^\n])',r'\1\n\n\2',lineList[i])
    outlist.append(lineList[i])

fou = open(myfile, "w")
for i in range(len(outlist)):
    fou.write(outlist[i])
fou.close()
#END SCRIPT

And the file raw.tex:

%BEGIN TeX FILE
\begin{document}
This line should remain right after the above line in the output, but
doesn't

\begin{document}Extra stuff here should appear below the begin line
and does in the output.
%END TeX FILE
 
Benjamin Kaplan

r before a string has nothing to do with regexes. It signals a raw
string: backslash escape sequences are left as written instead of being
interpreted.

>>> print 'a\tb'
a       b
>>> print r'a\tb'
a\tb

We use raw strings for regexes because otherwise you'd have to
remember to double up all your backslashes, and double up your
doubled-up backslashes when you really want a literal backslash.
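
A short sketch of the same point in Python 3 syntax. The variable names are illustrative only:

```python
import re

plain = 'a\tb'   # \t is turned into one tab character: 3 characters
raw = r'a\tb'    # backslash and t are both kept: 4 characters
print(len(plain), len(raw))  # 3 4

# The same regex written without and with the r prefix.
same_pattern = '(\\\\begin{document})' == r'(\\begin{document})'
print(same_pattern)  # True

# The r prefix does not change what $ means to the regex engine:
# outside a character class it is an end-of-line anchor either way.
anchor_plain = re.search('b$', 'ab') is not None
anchor_raw = re.search(r'b$', 'ab') is not None
print(anchor_plain, anchor_raw)  # True True
```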

Your script works for me. Do you have a space after the \begin{document},
or something similar? If so, that character gets moved to the new line.
You might want to check for non-whitespace characters in the regex
instead of just non-newlines.
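
A sketch of the non-whitespace check suggested here, using \S in place of [^\n]. The sample lines are hypothetical: the thread does not establish what actually follows \begin{document} in the file, so the trailing-space and carriage-return cases are assumptions, included only to show where the two patterns differ:

```python
import re

pattern_newline = re.compile(r'(\\begin{document})([^\n])')
pattern_nonspace = re.compile(r'(\\begin{document})(\S)')

# Hypothetical variants of the line; none is confirmed by the thread.
samples = {
    "clean": "\\begin{document}\n",
    "trailing space": "\\begin{document} \n",
    "carriage return": "\\begin{document}\r\n",
    "text": "\\begin{document}Extra stuff\n",
}

# For each sample: (does [^\n] match?, does \S match?)
results = {
    name: (pattern_newline.search(line) is not None,
           pattern_nonspace.search(line) is not None)
    for name, line in samples.items()
}

for name, (newline_hit, nonspace_hit) in results.items():
    print(name, newline_hit, nonspace_hit)
# clean False False
# trailing space True False
# carriage return True False
# text True True
```

Both a space and "\r" count as "not a newline" but are whitespace, so only the [^\n] version would push them onto a new line.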
 
John Harrington

Matching the non-whitespace works, but I'm troubled I can't match a
non-end-of-line. No, there was no space after the string.

Thank you for your help, Ben
 
Ethan Furman

Here's the important tidbit:

re.sub(r'(\\begin{document})(.+)', r'\1\n\n\2', line)

From the docs:
'.'
(Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character
including a newline.

'+'
Causes the resulting RE to match 1 or more repetitions of the preceding
RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
not match just ‘a’.
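
A small sketch of how (.+) behaves on the two kinds of line in raw.tex. The list `lines` stands in for the file contents, and the re.DOTALL run is added only as a contrast; it is not part of the suggested fix:

```python
import re

# Stand-ins for lines read with readlines() (shortened from raw.tex).
lines = ["\\begin{document}\n", "\\begin{document}Extra stuff here\n"]

pattern = r'(\\begin{document})(.+)'

# Default mode: "." does not match "\n", so the first line has nothing
# for (.+) to match and is left unchanged.
default_mode = [re.sub(pattern, r'\1\n\n\2', line) for line in lines]
print([repr(s) for s in default_mode])

# With re.DOTALL, "." also matches "\n", so even the first line changes.
dotall_mode = [re.sub(pattern, r'\1\n\n\2', line, flags=re.DOTALL)
               for line in lines]
print([repr(s) for s in dotall_mode])
```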


And here's the entire program, a bit more pythonically:

8<---------------------------------------------------------------
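import re

outlist = []
myfile = "raw.tex"

# Read the whole file as a list of lines (each keeps its trailing "\n").
fin = open(myfile, "r")
lineList = fin.readlines()
fin.close()

for line in lineList:
    # Only lines with text after \begin{document} are changed.
    line = re.sub(r'(\\begin{document})(.+)', r'\1\n\n\2', line)
    outlist.append(line)

# Write the result back over the same file.
fou = open(myfile, "w")
for line in outlist:
    fou.write(line)
fou.close()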
8<---------------------------------------------------------------

Hope this helps!

~Ethan~
 
