Regex help: delete text only if not within quotation marks

Scott Bass

my $string1 = 'titles "statement";';
my $string2 = 'titles "statement"; * comment ;';
my $string3 = "titles '* ; statement';";
my $string4 = "titles '* ; statement'; * comment ;";
$_ = $string4;
s#\*\s*(.+?)\s*\; (?!(["'])(.+?)\1)##;
print "$_\n";

I want to delete text delimited by * ;, but only if * ; does not occur
within quotation marks.

After substitution, strings 1&3 should remain unmodified, strings 2&4 should
equal strings 1&3 respectively. Additional testcase (not listed) is comment
string at the beginning of the string.

I thought I was on the right track with negative lookahead assertion???

Regards,
Scott
 
Damian James

s#\*\s*(.+?)\s*\; (?!(["'])(.+?)\1)##;
...
I want to delete text delimited by * ;, but only if * ; does not occur
within quotation marks.
...
I thought I was on the right track with negative lookahead assertion???

Perhaps others will explain what is going wrong with your regex, but
I'd suggest you look at

perldoc -q delimited

before proceeding further. Treating quotes properly is non-trivial, is
difficult or impossible with a regex alone, and is well supported by a
range of modules. Foremost of these is Text::Balanced.

Aside: Is that in the standard distro yet? I seem to have it and don't
remember installing it, but that doesn't mean much ;).


--damian
 
Anno Siegel

Damian James said:
s#\*\s*(.+?)\s*\; (?!(["'])(.+?)\1)##;
...
I want to delete text delimited by * ;, but only if * ; does not occur
within quotation marks.
...
I thought I was on the right track with negative lookahead assertion???

Perhaps others will explain what is going wrong with your regex, but
I'd suggest you look at

perldoc -q delimited

before proceeding further. Treating quotes properly is non-trivial, is
difficult or impossible with a regex alone, and is well supported by a
range of modules. Foremost of these is Text::Balanced.

Aside: Is that in the standard distro yet? I seem to have it and don't
remember installing it, but that doesn't mean much ;).

perldoc perlmodlib | grep Balanced
Text::Balanced

so, yes, it's a standard module.

Anno
 
Scott Bass

Damian James said:
s#\*\s*(.+?)\s*\; (?!(["'])(.+?)\1)##;
...
I want to delete text delimited by * ;, but only if * ; does not occur
within quotation marks.
...
I thought I was on the right track with negative lookahead assertion???

Perhaps others will explain what is going wrong with your regex, but
I'd suggest you look at

perldoc -q delimited

before proceeding further. Treating quotes properly is non-trivial, is
difficult or impossible with a regex alone, and is well supported by a
range of modules. Foremost of these is Text::Balanced.

Aside: Is that in the standard distro yet? I seem to have it and don't
remember installing it, but that doesn't mean much ;).


--damian

I've looked at:

perldoc -q delimited
Text::Balanced doc
Friedl's *Mastering Regular Expressions*
Lookahead and Lookbehind assertions
Numerous web articles

but still can't figure out how to (in pseudocode):

"delete text delimited by * and ; respectively, including the delimiters,
unless such delimiters are contained within a quoted string"

Most of the articles discuss how to extract a string delimited by quotes;
what I want to do is delete a delimited string NOT further delimited by
quotes.

Negative lookbehind seemed to hold the most promise:

(?<!")bar

matches foobar but not "bar. But, as soon as I change "bar to " bar, it
matches. And negative lookbehind requires fixed length patterns (bummer).
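
A minimal sketch of the behaviour described above, using only core Perl; the helper name matches_bar is made up for illustration. The lookbehind inspects only the single character immediately before bar, so one space between the quote and bar is enough to let the match through.

```perl
use strict;
use warnings;

# Hypothetical helper: true if "bar" occurs somewhere without a double
# quote immediately in front of it.
sub matches_bar {
    my ($text) = @_;
    return $text =~ /(?<!")bar/ ? 1 : 0;
}

for my $text ( 'foobar', '"bar', '" bar' ) {
    printf "%-6s => %s\n", $text, matches_bar($text) ? 'match' : 'no match';
}
```

This is why a fixed-width lookbehind cannot answer "am I inside a quoted string?": that question depends on everything to the left of the match, not on one neighbouring character.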

Any additional input appreciated.

Thanks,
Scott
 
Brian McCauley

Scott said:
"delete text delimited by * and ; respectively, including the delimiters,
unless such delimiters are contained within a quoted string"
... negative lookbehind requires fixed length patterns (bummer).

A common trick is to re-express the problem as "delete text delimited by
* and ; respectively, including the delimiters, only if such delimiters
are preceded by an even number of quotes".

use strict;
use warnings;

my @strings =
( 'titles "statement";',
'titles "statement"; * comment ;',
"titles '* ; statement';",
"titles '* ; statement'; * comment ;",
);

for ( @strings ) {
    s/^((?:(?:[^']*'){2})*[^']*)\*[^;]*;/$1/;
    print "$_\n";
}
__END__

The above only removes one comment per string. To remove all you can
put the s/// in a loop until it returns false or use \G and /g.

But before you can do either you must first consider the implications of
quote characters appearing between the * and ;. In the above there can
be unbalanced quotes between * and ; and the ; is still seen as the end
of the comment. Is this right?

A whole extra level of complexity is introduced if you want to consider
both single and double quote characters as marking strings. And yet
another if there is some way to quote quote characters within quoted
strings (other than doubling).

You need to be clear in your mind what you want to do in all possible
cases before you can implement it.

my @strings =
( 'titles "statement";',
'titles "statement"; * comment ;',
"titles '* ; statement';",
"titles '* ; statement'; * comment ;",
'titles "statement"; * Don\'t comment? ;',
"titles \"* ; is this a comment\";",
"titles '* ; statement'; * comment ; * another comment? ;",
q{titles " * This isn't a comment is it?";},
);
 
Scott Bass

Brian McCauley said:
Scott said:
"delete text delimited by * and ; respectively, including the delimiters,
unless such delimiters are contained within a quoted string"
... negative lookbehind requires fixed length patterns (bummer).

A common trick is to re-express the problem as "delete text delimited by *
and ; respectively, including the delimiters, only if such delimiters are
preceded by an even number of quotes".

use strict;
use warnings;

my @strings =
( 'titles "statement";',
'titles "statement"; * comment ;',
"titles '* ; statement';",
"titles '* ; statement'; * comment ;",
);

for ( @strings ) {
s/^((?:(?:[^']*'){2})*[^']*)\*[^;]*;/$1/;
print "$_\n";
}
__END__

The above only removes one comment per string. To remove all you can put
the s/// in a loop until it returns false or use \G and /g.

But before you can do either you must first consider the implications of
quote characters appearing between the * and ;. In the above there can be
unbalanced quotes between * and ; and the ; is still seen as the end of
the comment. Is this right?

No. The titles statements consist of:

titles ['"] text ['"] ;

Comments can be delimited by either /* */ or * ; and can either precede or
follow the titles statements.

Within the titles text, any comment delimiter (/* */ * ;) can appear as
text, as well as unbalanced quotes.

Quotes can appear as either:

titles 'text "text" text';
titles "text 'text' text";
titles 'text ''text'' text';
titles "text ""text"" text";

In all these scenarios, I want to remove any comments from the text of the
titles block, but leave the code itself intact.

Actually, I can live with not covering all scenarios. The most common
"tricky" scenario would be an asterisk in titles text, with either comment
style following, e.g.:

titles "PROC FREQ output of var1*var2"; * var2 may have missing values ;
titles "PROC FREQ output of var1*var2"; /* var2 may have missing values */

BTW, what I am working on is something like POD, but for non-Perl files. My
script extracts structured text and builds a documentation file. Within the
program source code, this structured text can appear within the titles
statements, so the programmer only has to specify the text once: it serves
both as the titles statement (executable code) and as the "POD" output.

A whole extra level of complexity is introduced if you want to consider
both single and double quote characters as marking strings. And yet
another if there is some way to quote quote characters within quoted
strings (other than doubling).

You need to be clear in your mind what you want to do in all possible
cases before you can implement it.

Brian, thank you *so* much for the code you posted. Much appreciated.

Here is a more realistic test case:

use strict;
use warnings;

my @strings1 = (
'1titles "statement";',
'2titles "statement"; * comment ;',
"3titles '* ; statement';",
"4titles '* ; statement'; * comment ;",
'5titles "* ; statement"; * comment ;',
"* comment ; 6titles '* ; statement'; * comment ;",
'7titles "statement"; * Don\'t comment? ;',
"8titles '* ; statement'; * comment ; * another comment? ;",
q{9titles " * This isn't a comment is it?"; * comment ;},
);

my @strings2 = (
'Atitles "statement"; /* comment */',
"Btitles '/* */ statement';",
"Ctitles '/* */ statement'; /* comment */",
"/* comment */ Dtitles '/* */ statement'; /* comment */",
'Etitles "statement"; /* Don\'t comment? */',
'Ftitles "/* */ is this a comment";',
"Gtitles '/* */ statement'; /* comment */ /* another comment? */",
q{Htitles "/* This isn't a comment is it?" */},
);

for ( @strings1 ) {
    1 while s/^((?:(?:[^']*'){2})*[^']*)\*[^;]*;/$1/;
    print "$_\n";
}

for ( @strings2 ) {
    1 while s/^((?:(?:[^']*'){2})*[^']*)\*[^;]*;/$1/;
    print "$_\n";
}
__END__

The code you posted (with the addition of the loop) works for all scenarios
except #5 and #9 (see numbers added to titles statements above). As stated
above, double quotes are also valid delimiters, so I will need to figure out
how to add them to the RE. Scenarios #4 & #5 are the most common; I will
need to code for #5, but can live with some of the other scenarios failing.

I didn't add the scenarios in @strings2 in my previous posts because I was
hoping that, with a solution to @strings1, I could work out how to code
@strings2. My mistake, mea culpa.

The sad fact is, I've come to realize I'm over my head here regarding these
"fancy" regular expressions (as described here
http://www.unix.org.ua/orelly/perl/prog3/ch05_10.htm). I've got the various
O'Reilly Perl books, including Mastering Regular Expressions, but I'm going
to have to read, re-read, re-read, and hack around until I "master regular
expressions" (as the title suggests :-/)

I don't expect you to do my work, so I'll just have to study and hack around
with this until I get it to work.

I really, *really* appreciate your most helpful replies to my recent posts.

Kind Regards,
Scott
 
Scott Bass

"Scott Bass" <usenet739_yahoo_com_au> wrote in message

[snip]
Quotes can appear as either:

titles 'text "text" text'; ###
titles "text 'text' text"; ###
titles 'text ''text'' text';
titles "text ""text"" text";

Sorry, these are also valid:

titles 'text "unbalanced quotes';
titles "text 'unbalanced quotes"; <<<
titles 'text ''unbalanced quotes';
titles "text ""unbalanced quotes";

The ### lines above are the most common, but all of these are valid.
The <<< is most often used with contractions in the title text.
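
A rough sketch of how one quoted titles string in any of the forms listed above could be matched, assuming that a doubled quote character is the only way to embed the delimiter itself. The names $quoted and first_quoted are invented for this example and are not part of any module.

```perl
use strict;
use warnings;

# Assumption: inside '...' a literal quote is written as '' and inside
# "..." as "". The other quote character is ordinary text.
my $quoted = qr/"(?:[^"]|"")*"|'(?:[^']|'')*'/;

# Hypothetical helper: return the first quoted string found, or undef.
sub first_quoted {
    my ($text) = @_;
    return $text =~ /($quoted)/ ? $1 : undef;
}

for my $line (
    q{titles 'text "text" text';},
    q{titles 'text ''text'' text';},
    q{titles "text 'unbalanced quotes";},
) {
    print first_quoted($line), "\n";
}
```

Each of the three sample lines prints its complete string literal, delimiters included, so the unbalanced inner quote in the last one does not end the string early.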
 
Brian McCauley

Scott said:
"Scott Bass" <usenet739_yahoo_com_au> wrote in message

[snip]

Quotes can appear as either:

titles 'text "text" text'; ###
titles "text 'text' text"; ###
titles 'text ''text'' text';
titles "text ""text"" text";


Sorry, these are also valid:

titles 'text "unbalanced quotes';
titles "text 'unbalanced quotes"; <<<
titles 'text ''unbalanced quotes';
titles "text ""unbalanced quotes";

Well, coping with those isn't too hard - quoting quotes within quotes by
doubling is easy to parse.

The following replaces all occurrences of 'target' with 'replace' except
where it appears inside a quoted string.

s/\G((?:[^"']*(["'])*[^\2]*\2)*[^"']*)target/${1}replace/g;

In another branch of this thread I'll now go think about your real /target/.
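
For comparison, here is a sketch of the same goal written without a back-reference. This is a different technique from the one-liner above, not a cleanup of it: the pattern matches either a whole quoted string, which is put back unchanged, or a bare target, which is replaced. One reason to avoid the back-reference form is that inside a bracketed character class Perl reads \2 as an octal character escape rather than as a reference to group 2. The sub name replace_outside_quotes is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace 'target' wherever it is not inside
# a single- or double-quoted string.
sub replace_outside_quotes {
    my ($text) = @_;
    $text =~ s{
        ( "[^"]*" | '[^']*' )   # a complete quoted string, captured and kept
      | target                  # a bare occurrence outside any quoted string
    }{ defined $1 ? $1 : 'replace' }gex;
    return $text;
}

print replace_outside_quotes(q{target 'target' "a target" target}), "\n";
```

Because the scan runs left to right and tries the quoted-string branch first, a doubled quote such as 'it''s' is simply seen as two adjacent strings, and both are kept.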
 
Brian McCauley

[ Please see my first follow-up (confusingly a bit further down this
thread) first ]

Scott said:
Brian McCauley said:
[...] you must first consider the implications of
quote characters appearing between the * and ;. In the above there can be
unbalanced quotes between * and ; and the ; is still seen as the end of
the comment. Is this right?

Comments can be delimited by either /* */ or * ; and can either precede or
follow the titles statements.

Within the titles text, any comment delimiter (/* */ * ;) can appear as
text, as well as unbalanced quotes.

Yes, but you didn't answer my question about quote characters inside
comments.
'7titles "statement"; * Don\'t comment? ;',
'Etitles "statement"; /* Don\'t comment? */',

I'll assume from these examples the quote characters inside comments are
not treated as quotes.

Thus, leaving aside the small issue of ignoring comments within
quotes, we have:

s/\/\*.*?\*\/|\*.*?;//g;

However if I join it all together...

s/\G((?:[^"']*(["'])*[^\2]*\2)*[^"']*)(?:\/\*.*?\*\/|\*.*?)/$1/g;

...it crashes the Perl compiler on this box (v5.8.4 built for
MSWin32-x86-multi-thread).

Oh well.
 
Anno Siegel

[...]
However if I join it all together...

s/\G((?:[^"']*(["'])*[^\2]*\2)*[^"']*)(?:\/\*.*?\*\/|\*.*?)/$1/g;

...it crashes the Perl compiler on this box (v5.8.4 built for
MSWin32-x86-multi-thread).

Same here (v5.8.6 built for darwin-2level). It doesn't segfault when the
second appearance of \2 is wrapped in a character class:

s/\G((?:[^"']*(["'])*[^\2]*[\2])*[^"']*)(?:\/\*.*?\*\/|\*.*?)/>$1</g;

That shouldn't change the semantics, but that's little comfort.

Anno
 
Brian McCauley

Anno said:
[...]

However if I join it all together...

s/\G((?:[^"']*(["'])*[^\2]*\2)*[^"']*)(?:\/\*.*?\*\/|\*.*?)/$1/g;

...it crashes the Perl compiler on this box (v5.8.4 built for
MSWin32-x86-multi-thread).


Same here (v5.8.6 built for darwin-2level). It doesn't segfault when the
second appearance of \2 is wrapped in a character class:

s/\G((?:[^"']*(["'])*[^\2]*[\2])*[^"']*)(?:\/\*.*?\*\/|\*.*?)/>$1</g;

That shouldn't change the semantics, but that's little comfort.

Maybe it's more comforting to get rid of the back reference:

s/\G((?:[^"']*(?:"[^"]*"|'[^']*'))*[^"']*)(?:\/\*.*?\*\/|\*.*?;)/$1/g;

(Note: I've put back the missing semicolon that was mysteriously lost in
my previous post).
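
A sketch of how that last substitution might be tried against a few of the test strings from earlier in the thread. Two things here are assumptions rather than part of the post above: the wrapper sub strip_comments, and the "1 while" loop around the substitution. The loop is there because, as far as I can tell, the greedy prefix makes a single /g pass remove only the last comment that follows the final quoted string, so repeating until nothing changes picks up the earlier ones.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
sub strip_comments {
    my ($line) = @_;
    # Repeat until no comment outside a quoted string is left.
    1 while $line =~ s/\G((?:[^"']*(?:"[^"]*"|'[^']*'))*[^"']*)(?:\/\*.*?\*\/|\*.*?;)/$1/g;
    $line =~ s/\s+$//;    # assumption: blanks left behind at the end are unwanted
    return $line;
}

for my $line (
    q{5titles "* ; statement"; * comment ;},
    q{8titles '* ; statement'; * comment ; * another comment? ;},
    q{9titles " * This isn't a comment is it?"; * comment ;},
    q{Ctitles '/* */ statement'; /* comment */},
) {
    print strip_comments($line), "\n";
}
```

In each of these four cases only the titles statement should remain. The pattern still treats any * outside a quoted string as the start of a comment, so it is not a general parser for the source language.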
 
