Regex dissection

Peder Ydalus

Hi!
What is the purpose of the following regex:

$f =~ s#/\*.*?\*/##g;

- Peder -
 
Eric Schwartz

Peder Ydalus said:
What is the purpose of the following regex:

$f =~ s#/\*.*?\*/##g;

Deleting the contents of a C-style comment from $f. Probably. I
can't be arsed to come up with legal comments that would fail it, but
there probably are some. At the very least, this value of $f won't
work:

$f = "A /* multi-line\ncomment */";

I don't understand what the .*? bit is supposed to do... '.*' in this
context says, "grab every non-newline character following the original
literal * character", but then following it with ? either means
whoever wrote it didn't understand regexes, or got distracted in the
middle of editing one, or something really subtle is going on. :)

-=Eric
 
Martien Verbruggen

Eric Schwartz said:
Deleting the contents of a C-style comment from $f. Probably. I
can't be arsed to come up with legal comments that would fail it, but
there probably are some. At the very least, this value of $f won't
work:

$f = "A /* multi-line\ncomment */";

The Perl FAQ gives a working solution:

$ perldoc -q comment

Eric Schwartz said:
I don't understand what the .*? bit is supposed to do... '.*' in this
context says, "grab every non-newline character following the original
literal * character", but then following it with ? either means
whoever wrote it didn't understand regexes, or got distracted in the
middle of editing one, or something really subtle is going on. :)

The question mark makes the preceding * non-greedy. Given how broken
the original regex is to match C-style comments, the person who
created it probably just put it in to "fix" one of the bits that this
thing didn't "correctly" match. Or maybe they thought they were being
clever to allow things like:

/* some comment */ foo = 3; /* other comment */

Although, allowing that sort of thing, which would be exceedingly
rare, and failing to allow multi-line comments, which are very common,
is, of course, silly.

Since the whole regex is so broken, it's probably impossible to divine
the original intention of its creator for each part.

Martien
 
Eric Schwartz

Martien Verbruggen said:
The question mark makes the preceding * non-greedy.

Ah, thanks. That's one of the back-of-my-mind regex subtleties I
never use. Generally, if I find myself in a situation where .*? would
work, I rewrite the regex to be clearer. :)

Martien Verbruggen said:
Given how broken
the original regex is to match C-style comments, the person who
created it probably just put it in to "fix" one of the bits that this
thing didn't "correctly" match.

I did say "probably" in my reply-- it's possible, albeit
astronomically unlikely, that this is meant to operate on something
that looks a lot like a C-style comment, but isn't one really. But
yeah, your explanation is the most likely, now that you've reminded me
what the .*? means.

-=Eric
 
Bob Walton

Peder said:
Hi!
What is the purpose of the following regex:

$f =~ s#/\*.*?\*/##g;

- Peder -

That statement is more than a regexp -- it is a substitution. It will
substitute the null string for every occurrence of /* followed by zero
or more of any character that is not a newline followed by */ . Because
the "s" switch is not present, it will not match over a newline. It
looks like it is probably a lame attempt to delete "C"-style commentary.
Lame because it should have included the "s" switch so it would
delete multiline commentary. And because it will also delete stuff
that appears to be commentary but is actually inside quoted strings in
the C program. For how to do it right, see:

perldoc -q comments
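
To make the two complaints above concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration, and adding the "s" switch is only the smallest possible change, not the FAQ's solution:

```perl
use strict;
use warnings;

# Hypothetical sample inputs, not taken from any real program.
my $multi  = "A /* multi-line\ncomment */ B";
my $quoted = 'char *s = "/* not a comment */";';

# Original statement: . cannot match "\n", so the comment survives.
(my $without_s = $multi) =~ s#/\*.*?\*/##g;

# Same statement with the "s" switch added: . now matches "\n" as well.
(my $with_s = $multi) =~ s#/\*.*?\*/##gs;

# Either form still eats comment-like text inside a C string literal.
(my $mangled = $quoted) =~ s#/\*.*?\*/##gs;

print $without_s eq $multi ? "unchanged\n" : "changed\n";  # unchanged
print "[$with_s]\n";                                       # [A  B]
print "$mangled\n";                                        # char *s = "";
```

The last line is why the FAQ entry is the better route: it also has to recognise quoted strings, which a pattern this simple cannot do.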

In gruesome detail here is what the statement does:

$f is a scalar value, from which comes the string to be matched and to
which the substituted string is stored.

=~ is the binding operator; it applies the substitution that follows to $f

The "s" means a substitution follows, which consists of two parts: a
regexp and substitution string.

The "#" is the delimiter to be used in the substitution. There are
three delimiters, and they delimit the regexp and the substitution
string. Any delimiters in the regexp or the substitution string must be
\-quoted.
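
A small sketch of why "#" was probably chosen as the delimiter here: with the usual "/" delimiter, every slash in the pattern needs a backslash. The sample string and variable names are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical input line.
my $src = "x = 1; /* set x */ y = 2;";

# '#' as delimiter: the slashes in the pattern need no escaping.
(my $hash_delim = $src) =~ s#/\*.*?\*/##g;

# '/' as delimiter: each literal slash in the pattern must be \-quoted.
(my $slash_delim = $src) =~ s/\/\*.*?\*\///g;

print $hash_delim eq $slash_delim ? "same result\n" : "different\n";  # same result
print "[$hash_delim]\n";                                              # [x = 1;  y = 2;]
```

Both statements do the same thing; the "#" version is just easier to read.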

/ is a literal, will start out matching the first / in $f

\* is a literal * (must be escaped since * is a regexp metacharacter),
so the regexp so far will match the first /* in $f. It will not match
on something like xxx/xxx*xxx

. will match any character other than a newline (since the "s" switch is
not present after the substitution operator).

The * after the . will cause the . to match zero or more occurrences of
any character other than a newline.

The ? after .* causes the .* to be non-greedy, so it will match the
shortest possible string of any characters other than newline, rather
than the longest possible such string, like it would have if the ? were
omitted. Without the ?, it would have matched the entirety of:

/*xxx*/yyy/*zzz*/

all in one match. With the ?, it will match /*xxx*/
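
The difference can be checked directly. This is only a sketch using the example string from above; the variable names are arbitrary:

```perl
use strict;
use warnings;

my $text = "/*xxx*/yyy/*zzz*/";

# Greedy .* runs to the last */ on the line.
my ($greedy) = $text =~ m#(/\*.*\*/)#;

# Non-greedy .*? stops at the first */ it can reach.
my ($lazy) = $text =~ m#(/\*.*?\*/)#;

print "greedy: $greedy\n";  # greedy: /*xxx*/yyy/*zzz*/
print "lazy:   $lazy\n";    # lazy:   /*xxx*/

# In the substitution, the greedy form would throw away yyy too.
(my $greedy_sub = $text) =~ s#/\*.*\*/##g;
(my $lazy_sub   = $text) =~ s#/\*.*?\*/##g;

print "[$greedy_sub]\n";    # []
print "[$lazy_sub]\n";      # [yyy]
```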

Then the \* and the / are again literals which must be matched exactly.

The ## delimits the substitution string, which means the empty string
will replace the first /*...*/ in $f.

The "g" on the end means the substitution is to be applied globally to
$f -- that means after a successful match and substitution, the match
and substitution will be attempted again from that point in $f, and
repeated until the match eventually fails.
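
To see what the "g" buys you, compare the statement with and without it. Again just a sketch; the input line is the one-line example from earlier in the thread and the variable names are made up:

```perl
use strict;
use warnings;

my $line = "/* some comment */ foo = 3; /* other comment */";

# Without "g": only the first match is replaced.
(my $once = $line) =~ s#/\*.*?\*/##;

# With "g": every match is replaced, and in scalar context the
# substitution returns how many replacements were made.
my $all   = $line;
my $count = ($all =~ s#/\*.*?\*/##g);

print "[$once]\n";         # [ foo = 3; /* other comment */]
print "[$all]\n";          # [ foo = 3; ]
print "count: $count\n";   # count: 2
```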

The ";" terminates the Perl statement.
 
Garry Heaton

Peder said:
Thanks a lot!
This was far more than I could ever have hoped for.

- Peder -

If you want the full gory details try Jeffrey Friedl's "Mastering Regular
Expressions" (2nd edition), pp.272-6.

Garry Heaton
 
