Regex help: delete text only if not within quotation marks

Scott Bass · May 23, 2005

my $string1 = 'titles "statement";';
my $string2 = 'titles "statement"; * comment ;';
my $string3 = "titles '* ; statement';";
my $string4 = "titles '* ; statement'; * comment ;";
$_ = $string4;
s#\*\s*(.+?)\s*\; (?!(["'])(.+?)\1)##;
print "$_\n";

I want to delete text delimited by * ;, but only if * ; does not occur
within quotation marks.

After substitution, strings 1&3 should remain unmodified, strings 2&4 should
equal strings 1&3 respectively. Additional testcase (not listed) is comment
string at the beginning of the string.

I thought I was on the right track with negative lookahead assertion???

Regards,
Scott

Damian James · May 23, 2005

s#\*\s*(.+?)\s*\; (?!(["'])(.+?)\1)##;
...
I want to delete text delimited by * ;, but only if * ; does not occur
within quotation marks.
...
I thought I was on the right track with negative lookahead assertion???

Perhaps others will explain what is going wrong with your regex, but
I'd suggest you look at

perldoc -q delimited

before proceeding further. Treating quotes properly is non-trivial, is
difficult or impossible with a regex alone, and is well supported by a
range of modules. Foremost of these is Text::Balanced.

Aside: Is that in the standard distro yet? I seem to have it and don't
remember installing it, but that doesn't mean much.

--damian

Anno Siegel · May 23, 2005

Damian James said:
s#\*\s*(.+?)\s*\; (?!(["'])(.+?)\1)##;
...
I want to delete text delimited by * ;, but only if * ; does not occur
within quotation marks.
...
I thought I was on the right track with negative lookahead assertion???

Perhaps others will explain what is going wrong with your regex, but
I'd suggest you look at

perldoc -q delimited

before proceeding further. Treating quotes properly is non-trivial, is
difficult or impossible with a regex alone, and is well supported by a
range of modules. Foremost of these is Text::Balanced.

Aside: Is that in the standard distro yet? I seem to have it and don't
remember installing it, but that doesn't mean much.

perldoc perlmodlib | grep Balanced
Text::Balanced

so, yes, it's a standard module.

Anno

Scott Bass · May 29, 2005

Damian James said:
s#\*\s*(.+?)\s*\; (?!(["'])(.+?)\1)##;
...
I want to delete text delimited by * ;, but only if * ; does not occur
within quotation marks.
...
I thought I was on the right track with negative lookahead assertion???

Perhaps others will explain what is going wrong with your regex, but
I'd suggest you look at

perldoc -q delimited

before proceeding further. Treating quotes properly is non-trivial, is
difficult or impossible with a regex alone, and is well supported by a
range of modules. Foremost of these is Text::Balanced.

Aside: Is that in the standard distro yet? I seem to have it and don't
remember installing it, but that doesn't mean much.

--damian

I've looked at:

perldoc -q delimited
Text::Balanced doc
Friedl's *Mastering Regular Expressions*
Lookahead and Lookbehind assertions
Numerous web articles

but still can't figure out how to (in pseudocode):

"delete text delimited by * and ; respectively, including the delimiters,
unless such delimiters are contained within a quoted string"

Most of the articles discuss how to extract a string delimited by quotes;
what I want to do is delete a delimited string NOT further delimited by
quotes.

Negative lookbehind seemed to hold the most promise:

(?<!")bar

matches foobar but not "bar. But, as soon as I change "bar to " bar, it
matches. And negative lookbehind requires fixed length patterns (bummer).
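
A minimal sketch of that behaviour, using three made-up sample strings: the lookbehind only inspects the single character immediately before bar, so one intervening space is enough to defeat it. The helper name bar_matches is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if "bar" occurs somewhere in $text without a
# double quote as the character immediately before it.
sub bar_matches {
    my ($text) = @_;
    return $text =~ /(?<!")bar/ ? 1 : 0;
}

for my $sample ('foobar', '"bar', '" bar') {
    printf "%-8s %s\n", $sample, bar_matches($sample) ? 'matches' : 'no match';
}
```

The third sample matches because the character before bar is a space, not a quote; the lookbehind has no notion of "somewhere earlier in the string".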

Any additional input appreciated.

Thanks,
Scott

Brian McCauley · May 29, 2005

Scott said:
"delete text delimited by * and ; respectively, including the delimiters,
unless such delimiters are contained within a quoted string"

... negative lookbehind requires fixed length patterns (bummer).

A common trick is to reexpress the problem as "delete text delimited by
* and ; respectively, including the delimiters, only if such delimiters
are preceded by an even number of quotes".

use strict;
use warnings;

my @strings =
( 'titles "statement";',
'titles "statement"; * comment ;',
"titles '* ; statement';",
"titles '* ; statement'; * comment ;",
);

for ( @strings ) {
s/^((?:(?:[^']*'){2})*[^']*)\*[^;]*;/$1/;
print "$_\n";
}
__END__

The above only removes one comment per string. To remove all you can
put the s/// in a loop until it returns false or use \G and /g.
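
A minimal sketch of the loop variant, assuming the single-quote-only substitution from this post with its non-capturing groups written out. The helper name strip_star_comments and the trailing-whitespace trim are additions for illustration, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: repeat the one-comment substitution until it fails.
sub strip_star_comments {
    my ($line) = @_;
    # Each pass deletes one "* ... ;" that follows an even number of
    # single quotes; s/// returns false once nothing is left to delete.
    1 while $line =~ s/^((?:(?:[^']*'){2})*[^']*)\*[^;]*;/$1/;
    $line =~ s/\s+$//;    # illustration only: drop blanks left behind
    return $line;
}

print strip_star_comments("titles '* ; statement'; * comment ; * another comment? ;"), "\n";
```

Because the leading part of the pattern is greedy, each pass removes the last remaining comment first; the loop works backwards through the string until only the statement is left.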

But before you can do either you must first consider the implications of
quote characters appearing between the * and ;. In the above there can
be unbalanced quotes between * and ; and the ; is still seen as the end
of the comment. Is this right?

A whole extra level of complexity is introduced if you want to consider
both single and double quote characters as marking strings. And yet
another if there is some way to quote quote characters within quoted
strings (other than doubling).

You need to be clear in your mind what you want to do in all possible
cases before you can implement it.

my @strings =
( 'titles "statement";',
'titles "statement"; * comment ;',
"titles '* ; statement';",
"titles '* ; statement'; * comment ;",
'titles "statement"; * Don\'t comment? ;',
"titles \"* ; is this a comment\";",
"titles '* ; statement'; * comment ; * another comment? ;",
q{titles " * This isn't a comment is it?";},
);

Scott Bass · May 29, 2005

Brian McCauley said:
Scott said:

"delete text delimited by * and ; respectively, including the delimiters,
unless such delimiters are contained within a quoted string"

... negative lookbehind requires fixed length patterns (bummer).

A common trick is to reexpress the problem as "delete text delimited by *
and ; respectively, including the delimiters, only if such delimiters are
preceded by an even number of quotes".

use strict;
use warnings;

my @strings =
( 'titles "statement";',
'titles "statement"; * comment ;',
"titles '* ; statement';",
"titles '* ; statement'; * comment ;",
);

for ( @strings ) {
s/^((?:(?:[^']*'){2})*[^']*)\*[^;]*;/$1/;
print "$_\n";
}
__END__

The above only removes one comment per string. To remove all you can put
the s/// in a loop until it returns false or use \G and /g.

But before you can do either you must first consider the implications of
quote characters appearing between the * and ;. In the above there can be
unbalanced quotes between * and ; and the ; is still seen as the end of
the comment. Is this right?

No. The titles statements consist of:

titles ['"] text ['"] ;

Comments can be delimited by either /* */ or * ; and can either precede or
follow the titles statements.

Within the titles text, any comment delimiter (/* */ * ;) can appear as
text, as well as unbalanced quotes.

Quotes can appear as either:

titles 'text "text" text';
titles "text 'text' text";
titles 'text ''text'' text';
titles "text ""text"" text";

In all these scenarios, I want to remove any comments from the text of the
titles block, but leave the code itself intact.
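
As a rough sketch of how those four forms could be recognised, assuming a doubled quote character inside a literal stands for one literal quote: the pattern name $quoted and the helper first_literal are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# A quoted literal in either quote style. Inside the literal, the other
# quote character is ordinary text and a doubled quote is an escaped one.
my $quoted = qr/"(?:[^"]|"")*"|'(?:[^']|'')*'/;

# Hypothetical helper: return the first quoted literal in a statement.
sub first_literal {
    my ($stmt) = @_;
    my ($literal) = $stmt =~ /($quoted)/;
    return $literal;
}

for my $stmt (
    q{titles 'text "text" text';},
    q{titles "text 'text' text";},
    q{titles 'text ''text'' text';},
    q{titles "text ""text"" text";},
) {
    print first_literal($stmt), "\n";
}
```

Each statement should print its complete literal, including the delimiting quotes.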

Actually, I can live with not covering all scenarios. The most common
"tricky" scenario would be an asterisk in titles text, with either comment
style following, e.g.:

titles "PROC FREQ output of var1*var2"; * var2 may have missing values ;
titles "PROC FREQ output of var1*var2"; /* var2 may have missing values */

BTW, what I am working on is something like POD, but for non-Perl files. My
script extracts structured text and builds a documentation file. Within the
program source code, this structured text can appear within the titles
statements, so the programmer only has to specify the text once: it serves
both the titles statement (executable code) and the "POD" output.

A whole extra level of complexity is introduced if you want to consider
both single and double quote characters as marking strings. And yet
another if there is some way to quote quote characters within quoted
strings (other than doubling).

You need to be clear in your mind what you want to do in all possible
cases before you can implement it.

Brian, thank you *so* much for the code you posted. Much appreciated.

Here is a more realistic test case:

use strict;
use warnings;

my @strings1 = (
'1titles "statement";',
'2titles "statement"; * comment ;',
"3titles '* ; statement';",
"4titles '* ; statement'; * comment ;",
'5titles "* ; statement"; * comment ;',
"* comment ; 6titles '* ; statement'; * comment ;",
'7titles "statement"; * Don\'t comment? ;',
"8titles '* ; statement'; * comment ; * another comment? ;",
q{9titles " * This isn't a comment is it?"; * comment ;},
);

my @strings2 = (
'Atitles "statement"; /* comment */',
"Btitles '/* */ statement';",
"Ctitles '/* */ statement'; /* comment */",
"/* comment */ Dtitles '/* */ statement'; /* comment */",
'Etitles "statement"; /* Don\'t comment? */',
'Ftitles "/* */ is this a comment";',
"Gtitles '/* */ statement'; /* comment */ /* another comment? */",
q{Htitles "/* This isn't a comment is it?" */},
);

for ( @strings1 ) {
1 while s/^((?:(?:[^']*'){2})*[^']*)\*[^;]*;/$1/;
print "$_\n";
}

for ( @strings2 ) {
1 while s/^((?:(?:[^']*'){2})*[^']*)\*[^;]*;/$1/;
print "$_\n";
}
__END__

The code you posted (with the addition of the loop) works for all scenarios
except #5 and #9 (see numbers added to titles statements above). As stated
above, double quotes are also valid delimiters, so I will need to figure out
how to add them to the RE. Scenarios #4 & #5 are the most common; I will
need to code for #5, but can live with some of the other scenarios failing.

I didn't add the scenarios in @strings2 in my previous posts because I was
hoping that, with a solution to @strings1, I could work out how to code
@strings2. My mistake, mea culpa.

The sad fact is, I've come to realize I'm over my head here regarding these
"fancy" regular expressions (as described here
http://www.unix.org.ua/orelly/perl/prog3/ch05_10.htm). I've got the various
O'Reilly Perl books, including Mastering Regular Expressions, but I'm going
to have to read, re-read, re-read, and hack around until I "master regular
expressions" (as the title suggests :-/)

I don't expect you to do my work, so I'll just have to study and hack around
with this until I get it to work.

I really, *really* appreciate your most helpful replies to my recent posts.

Kind Regards,
Scott

Scott Bass · May 29, 2005

"Scott Bass" <usenet739_yahoo_com_au> wrote in message

[snip]

Quotes can appear as either:

titles 'text "text" text'; ###
titles "text 'text' text"; ###
titles 'text ''text'' text';
titles "text ""text"" text";

Sorry, these are also valid:

titles 'text "unbalanced quotes';
titles "text 'unbalanced quotes"; <<<
titles 'text ''unbalanced quotes';
titles "text ""unbalanced quotes";

The ### lines above are the most common, but all of these are valid.
The <<< is most often used with contractions in the title text.

Brian McCauley · May 29, 2005

Scott said:
"Scott Bass" <usenet739_yahoo_com_au> wrote in message

[snip]

Quotes can appear as either:

titles 'text "text" text'; ###
titles "text 'text' text"; ###
titles 'text ''text'' text';
titles "text ""text"" text";

Sorry, these are also valid:

titles 'text "unbalanced quotes';
titles "text 'unbalanced quotes"; <<<
titles 'text ''unbalanced quotes';
titles "text ""unbalanced quotes";

Well, coping with those isn't too hard: quoting quotes within quotes by
doubling is easy to parse.

The following replaces all occurrences of 'target' with 'replace' except
where it appears inside a quoted string.

s/\G((?:[^"']*(["'])*[^\2]*\2)*[^"']*)target/${1}replace/g;

In another branch of this thread I'll now go think about your real /target/.
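
For comparison, here is a sketch of a different technique (not the \G approach above): match quoted strings as one alternative and put them back unchanged, so that the other alternative can only ever see 'target' outside quotes. It avoids backreferences altogether, since inside a bracketed character class \2 is read as an octal escape rather than a backreference. The helper name replace_outside_quotes and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace 'target' only where it is not quoted.
sub replace_outside_quotes {
    my ($text) = @_;
    # Alternative 1 consumes a whole quoted string and is put back as is;
    # alternative 2 is reached only at positions outside any quoted string.
    $text =~ s/("[^"]*"|'[^']*')|target/defined $1 ? $1 : 'replace'/ge;
    return $text;
}

print replace_outside_quotes(q{target "target" 'it''s a target' target}), "\n";
```

A literal with doubled quotes such as 'it''s a target' is simply consumed as two adjacent quoted strings, which is why doubling needs no special handling here.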

Brian McCauley · May 29, 2005

[ Please see my first follow-up (confusingly a bit further down this
thread) first ]

Scott said:
Brian McCauley said:

[...] you must first consider the implications of
quote characters appearing between the * and ;. In the above there can be
unbalanced quotes between * and ; and the ; is still seen as the end of
the comment. Is this right?

Comments can be delimited by either /* */ or * ; and can either precede or
follow the titles statements.

Within the titles text, any comment delimiter (/* */ * ;) can appear as
text, as well as unbalanced quotes.

Yes, but you didn't answer my question about quote characters inside
comments.

'7titles "statement"; * Don\'t comment? ;',

'Etitles "statement"; /* Don\'t comment? */',

I'll assume from these examples the quote characters inside comments are
not treated as quotes.

Thus, setting aside for now the small issue of comment delimiters within
quotes, we have:

s/\/\*.*?\*\/|\*.*?;//g;

However if I join it all together...

s/\G((?:[^"']*(["'])*[^\2]*\2)*[^"']*)(?:\/\*.*?\*\/|\*.*?)/$1/g;

...it crashes the Perl compiler on this box (v5.8.4 built for
MSWin32-x86-multi-thread).

Oh well.

Anno Siegel · May 29, 2005

[...]

However if I join it all together...

s/\G((?:[^"']*(["'])*[^\2]*\2)*[^"']*)(?:\/\*.*?\*\/|\*.*?)/$1/g;

...it crashes the Perl compiler on this box (v5.8.4 built for
MSWin32-x86-multi-thread).

Same here (v5.8.6 built for darwin-2level). It doesn't segfault when the
second appearance of \2 is wrapped in a character class:

s/\G((?:[^"']*(["'])*[^\2]*[\2])*[^"']*)(?:\/\*.*?\*\/|\*.*?)/>$1</g;

That shouldn't change the semantics, but that's little comfort.

Anno

Brian McCauley · May 30, 2005

Anno said:
[...]

However if I join it all together...

s/\G((?:[^"']*(["'])*[^\2]*\2)*[^"']*)(?:\/\*.*?\*\/|\*.*?)/$1/g;

...it crashes the Perl compiler on this box (v5.8.4 built for
MSWin32-x86-multi-thread).

Same here (v5.8.6 built for darwin-2level). It doesn't segfault when the
second appearance of \2 is wrapped in a character class:

s/\G((?:[^"']*(["'])*[^\2]*[\2])*[^"']*)(?:\/\*.*?\*\/|\*.*?)/>$1</g;

That shouldn't change the semantics, but that's little comfort.

Maybe it's more comforting to get rid of the back reference:

s/\G((?:[^"']*(?:"[^"]*"|'[^']*'))*[^"']*)(?:\/\*.*?\*\/|\*.*?;)/$1/g;

(Note: I've put back the missing semicolon that was mysteriously lost in
my previous post).
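
As a hedged alternative to the \G-anchored pattern, the sketch below uses a single alternation: quoted strings are captured and put back, and both comment styles are deleted wherever they start outside a quoted string. It assumes that quote characters inside comments are not treated as quotes, and that any * outside a quoted string begins a comment. The helper name strip_comments and the trailing-whitespace trim are additions for illustration; the sample strings are taken from the test cases earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: delete /* ... */ and * ... ; comments that start
# outside a quoted string, leaving quoted strings untouched.
sub strip_comments {
    my ($line) = @_;
    $line =~ s{
        ( "[^"]*" | '[^']*' )   # a quoted string: captured and kept
      | /\* .*? \*/             # a slash-star comment: deleted
      | \* .*? ;                # a star-semicolon comment: deleted
    }{ defined $1 ? $1 : '' }gsex;
    $line =~ s/\s+\z//;         # illustration only: drop blanks left behind
    return $line;
}

for my $line (
    q{5titles "* ; statement"; * comment ;},
    q{7titles "statement"; * Don't comment? ;},
    q{9titles " * This isn't a comment is it?"; * comment ;},
    q{Ctitles '/* */ statement'; /* comment */},
    q{Gtitles '/* */ statement'; /* comment */ /* another comment? */},
) {
    print strip_comments($line), "\n";
}
```

Because the regex engine always takes the leftmost match, whichever construct starts first (string or comment) wins, so an apostrophe inside a comment never opens a string.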
