jl_post
Dear Perl community,
I'm trying to write Perl code that scans through a C/C++ file and
matches string literals. I want to use a regular expression for this,
so that if given these inputs, it will extract these outputs:
input1: before "12 34 56" after
output1: 12 34 56
input2: before "12 34" 56" after
output2: 12 34
input3: before "12 34\" 56" after
output3: 12 34\" 56
input4: before "12 34\\" 56" after
output4: 12 34\\
input5: before "12 34\\\" 56" after
output5: 12 34\\\" 56
input6: before "12 34\\\\" 56" after
output6: 12 34\\\\
Note that inputs 3 through 6 exercise the backslash escape
character: if a double-quote is directly preceded by an unescaped
backslash, then that double-quote should not be interpreted as
the C string terminator.
At first, I came up with this simple regular expression:
m/" (.*) "/x
This puts everything between the first and the last quote into $1.
This works fine for input1, but reads too much with input2.
Then I changed it to be non-greedy, like this:
m/" (.*?) "/x
which works great for inputs 1 and 2, but now fails with input3, as it
doesn't account for escaped-out quotes.
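To make the difference between those two attempts concrete, here is a minimal sketch that runs both against inputs 1 through 3. The capture() helper and the variable names are just mine for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return what the pattern captures in $1, or undef.
sub capture {
    my ($re, $line) = @_;
    return $line =~ $re ? $1 : undef;
}

my $greedy     = qr/" (.*)  "/x;   # first attempt
my $non_greedy = qr/" (.*?) "/x;   # second attempt

# In single-quoted strings, \" stays a backslash followed by a quote.
my @inputs = (
    'before "12 34 56" after',      # input1
    'before "12 34" 56" after',     # input2
    'before "12 34\" 56" after',    # input3
);

for my $line (@inputs) {
    printf "%-28s greedy=[%s] non-greedy=[%s]\n",
        $line, capture($greedy, $line), capture($non_greedy, $line);
}
```

For input2 the greedy version captures 12 34" 56 while the non-greedy one captures 12 34, and for input3 the non-greedy version stops too early and captures 12 34\ (with the trailing backslash).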
So then I added a negative look-behind to ensure that the last
character matched by the parentheses is not a backslash (I could use
[^\\] instead of the negative look-behind, but then it wouldn't match
empty strings):
m/" (.*? (?<!\\) ) "/x
This works with inputs 1 through 3, but fails at input4, since the
quote after the double-backslash should be the terminator, but isn't
treated as such (because it is preceded by a backslash).
So then I reasoned that, after the last non-backslash matched, an
even number of backslashes can be matched (as each pair of backslashes
represents one literal backslash), so I changed the expression to
this:
m/" (.*? (?<!\\) (\\{2})* ) "/x
Now it works for all the inputs I gave. I then added "?:" to the last
set of parentheses (so it wouldn't offset $2, $3, etc. if I decide to
add more later):
m/" (.*? (?<!\\) (?:\\{2})* ) "/x
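Here is a small sketch of what the "?:" changes, assuming I only care about how many capture groups come back. The capture_count() helper is hypothetical and only there for the demonstration:

```perl
use strict;
use warnings;

# Hypothetical helper: how many capture groups does a successful match return?
sub capture_count {
    my ($re, $line) = @_;
    my @captures = $line =~ $re;   # list context yields ($1, $2, ...)
    return scalar @captures;
}

my $capturing     = qr/" (.*? (?<!\\) (\\{2})*   ) "/x;
my $non_capturing = qr/" (.*? (?<!\\) (?:\\{2})* ) "/x;

# A single-quoted heredoc keeps both backslashes of input4 literal.
chomp(my $input4 = <<'END');
before "12 34\\" 56" after
END

printf "capturing: %d group(s), non-capturing: %d group(s)\n",
    capture_count($capturing, $input4),
    capture_count($non_capturing, $input4);
```

The first pattern reports two groups (the inner one holds the backslash pair), the second reports one, so any parentheses I add later start at $2.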
I tested this out with the following code:
m/" (.*? (?<!\\) (?:\\{2})*) "/x and print "$1\n" while <>;
before "12 34 56" after # input 1
12 34 56
before "12 34" 56" after # input 2
12 34
before "12 34\" 56" after # input 3
12 34\" 56
before "12 34\\" 56" after # input 4
12 34\\
before "12 34\\\" 56" after # input 5
12 34\\\" 56
before "12 34\\\\" 56" after # input 6
12 34\\\\
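For anyone who wants to reproduce this without typing the inputs by hand, the same check can be sketched as a self-contained script. The extract_literal() helper and the " => " separator between input and expected output are my own conventions here, not anything standard:

```perl
use strict;
use warnings;

# The final pattern from above, compiled once.
my $c_string = qr/" (.*? (?<!\\) (?:\\{2})*) "/x;

# Hypothetical helper: text inside the first string literal on a line, or undef.
sub extract_literal {
    my ($line) = @_;
    return $line =~ $c_string ? $1 : undef;
}

# Each line is "input => expected". The single-quoted heredoc keeps
# every backslash exactly as typed.
my @cases = map { [ split / => /, $_, 2 ] } split /\n/, <<'END';
before "12 34 56" after => 12 34 56
before "12 34" 56" after => 12 34
before "12 34\" 56" after => 12 34\" 56
before "12 34\\" 56" after => 12 34\\
before "12 34\\\" 56" after => 12 34\\\" 56
before "12 34\\\\" 56" after => 12 34\\\\
END

for my $case (@cases) {
    my ($input, $expected) = @$case;
    my $got = extract_literal($input);
    my $status = (defined $got && $got eq $expected) ? 'ok' : 'FAIL';
    printf "%-4s %s\n", $status, $input;
}
```

This only covers the six inputs from this post, so it shows the pattern agrees with my expectations on those and nothing more.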
So it looks like it works. My question is, even though I came up
with a way of parsing a C string literal, is there a better or simpler
way of doing this?
(Now, I know of the quotewords() function in the Text::ParseWords
module, but I don't think it's what I'm looking for. I'd prefer a
regular expression that extracts the whole string literal (not
individual tokens) and that I can embed in other regular expressions
that look for other pieces of code.)
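To show what I mean by embedding, here is a rough sketch that compiles the pattern with qr// and interpolates it into a larger one. The printf_format() helper and the printf example are made up for illustration; I'm assuming the literal is the first capture group in the larger pattern, so it still lands in $1:

```perl
use strict;
use warnings;

# The literal-matching pattern from this post. A qr// object keeps its
# own /x flag when it is interpolated into another pattern.
my $c_string = qr/" (.*? (?<!\\) (?:\\{2})*) "/x;

# Hypothetical helper: the format string of the first printf call
# on a line, or undef if there is none.
sub printf_format {
    my ($line) = @_;
    return $line =~ / \b printf \s* \( \s* $c_string /x ? $1 : undef;
}

# Single-quoted, so \" and \n stay as backslash sequences.
my $code = 'printf("hello \"world\"\n");';
print printf_format($code), "\n";
```

On that sample line the helper returns the format string with its escapes untouched, that is hello \"world\"\n as it appears in the C source.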
I tried "perldoc -q string", but the best advice I could find was
to use Text::ParseWords, which, as I said above, is probably not what I
need.
Thanks!
-- Jean-Luc