Regular expression question.

L7

In trying to parse a C source file I have the following section of
code:

...
...
case line
when /^.*\/\*.*?\*\/.*$/ # single line comment(s)
  non_comments = line.split(/\/\*.*?\*\//).join # join, since Array#to_s no longer concatenates on Ruby 1.9+
  process_code(non_comments)
when /^.*\/\*\*?[^(\*\/)]*$/ # multi-line start
  comment = true
  next
when /^[^(\/\*)]*\*\/.*$/ # multi-line end
  comment = false
...
...

I am running into a problem with the multi-line comment sections.
While something like:

/*
A comment
*/

will work (i.e. gets properly parsed out)

/* A
* comment */

OR

/* A *
* comment */

will not.
My guess is that it is because of the [^(\*\/)] construct blocking the
leading or trailing '*' character. However, I thought that by placing
the \*\/ within parentheses I avoided the characters being evaluated
individually.
Is there a way to look for the pattern '*/' without having a single '*'
break the search?
As an alternative, I use this:

when /^.*\/\*\*?[^(\*\/)]*\**?$/
  comment = true
  next
when /^.*?[^(\/\*)]*\*\/.*$/
  comment = false

Which *seems* to solve the problem, but I can see where it is brittle:

/* A * comment
for * instance */

Any suggestions?
Thanks in advance.
 
Jeff Cohen

L7 said:
In trying to parse a C source file I have the following section of
code:

...
...
case line
when /^.*\/\*.*?\*\/.*$/ # single line comment(s)
  non_comments = line.split(/\/\*.*?\*\//).join # join, since Array#to_s no longer concatenates on Ruby 1.9+
  process_code(non_comments)
when /^.*\/\*\*?[^(\*\/)]*$/ # multi-line start
  comment = true
  next
when /^[^(\/\*)]*\*\/.*$/ # multi-line end
  comment = false
...
...

I am running into a problem with the multi-line comment sections.

My eyes glaze over with these kinds of expressions, but this might help:

http://www.regularexpressions.info/examplesprogrammer.html

Scroll down to the section on "Comments". They seem to have a simpler
solution; I think the trick is to let . match newlines.

And you can turn on newline matching in Ruby by putting an "m" after the
expression:

/my_pattern_here/m

Hope this helps...?

Jeff
softiesonrails.com
 
L7

Jeff said:
My eyes glaze over with these kinds of expressions, but this might help:

http://www.regularexpressions.info/examplesprogrammer.html

Scroll down to the section on "Comments". They seem to have a simpler
solution; I think the trick is to let . match newlines.

I don't think it applies to this directly. I didn't explicitly mention it,
but the processing happens on a line-by-line basis. To remove all
comments in the above manner I would first have to read the file as a
string, strip the comments, split on newlines, and then parse the code.
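
For what it's worth, the whole-file approach described above might look roughly like the sketch below. The helper name strip_block_comments and the sample input are made up for illustration, and the pattern knows nothing about string literals or // comments, so treat it as a starting point rather than a finished solution. Each comment is replaced with a single space so that tokens on either side do not run together.

```ruby
# Hypothetical helper: remove /* ... */ comments from a whole source string.
# The /m flag lets . match newlines, so one non-greedy pattern can span
# a multi-line comment. String literals and // comments are NOT handled.
def strip_block_comments(source)
  source.gsub(/\/\*.*?\*\//m, " ")
end

source = <<~C
  int a; /* one line */
  /* A *
   * comment */
  int b; /* first */ int c; /* second */
C

# Strip comments first, then go back to line-by-line processing.
code_lines = strip_block_comments(source).lines.map(&:strip).reject(&:empty?)
code_lines.each { |line| puts line }
```

The remaining lines can then be fed to the existing per-line processing without any comment state.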
 
Francis Cianfrocca

L7 said:
In trying to parse a C source file I have the following section of
code:

Remember that in C, nested comment-blocks are not permitted, for the
incredibly good reason that they are not recognizable by
regular-expressions ;-). Why don't you take a pre-pass through your C
file and take out the comments yourself before you run your main
parse? A recursive-descent parser to do the job would probably take
almost no code at all in Ruby.
 
L7

Francis said:
Remember that in C, nested comment-blocks are not permitted, for the
incredibly good reason that they are not recognizable by
regular-expressions ;-).

Agreed. However, something with '*' characters in it is allowed (so
long as they are not preceded or followed directly by '/'), and that is
where I would get clobbered.

Francis said:
Why don't you take a pre-pass through your C
file and take out the comments yourself before you run your main

As I mentioned, that involves a bit of overhead. But with regard to the
project, I assume it is the 'best fix' to what I have.
 
Rod Knowlton

L7 said:
In trying to parse a C source file I have the following section of
code:

...
...
case line
when /^.*\/\*.*?\*\/.*$/ # single line comment(s)
  non_comments = line.split(/\/\*.*?\*\//).join # join, since Array#to_s no longer concatenates on Ruby 1.9+
  process_code(non_comments)
when /^.*\/\*\*?[^(\*\/)]*$/ # multi-line start
  comment = true
  next
when /^[^(\/\*)]*\*\/.*$/ # multi-line end
  comment = false
...
...



L7 said:
Is there a way to look for the pattern '*/' without having a single '*'
break the search?

If I'm not mistaken, what you need is a negative lookahead.

Try /^.*\/\*([^\/]|\/(?!\*))*$/ for multi-line start

and /^([^\*]|\*(?!\/))*\*\/.*$/ for multi-line end.

The key difference from your start pattern is the group ([^\/]|\/(?!\*)),

which breaks down like so:

(
[^\/]     # anything but /
|         # or
\/(?!\*)  # a / not followed by a * (don't eat the character after /, just peek at it)
)

The pattern for multi-line end uses the same technique, but with the
characters reversed.

I'm sure this isn't the be-all and end-all of C comment matching
regexes, but it handles all of the cases you described.

- Rod
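
For anyone who wants to try the lookahead patterns above, here is a minimal sketch that runs them over the sample comments from the question. The driver code_only and the sample lines are hypothetical, not the original poster's code. Two simplifications are assumed: anything after a closing */ on the same line is dropped, and the start pattern relies on the single-line branch being tested first, because by itself it does not reject a */ later on the same line. The bracket expression [^(\*\/)] in the question is a character class, so it rejects (, ), * and / one character at a time, which is why a lone * broke the original match.

```ruby
# Hypothetical driver: returns the non-comment lines of +lines+ using the
# lookahead-based start/end patterns suggested in this thread.
def code_only(lines)
  single   = /\/\*.*?\*\//                # complete comment on one line
  start_re = /^.*\/\*([^\/]|\/(?!\*))*$/  # multi-line start (lookahead version)
  end_re   = /^([^\*]|\*(?!\/))*\*\/.*$/  # multi-line end (lookahead version)

  in_comment = false
  lines.each_with_object([]) do |line, kept|
    line = line.chomp
    if in_comment
      in_comment = false if line =~ end_re  # text after */ is dropped here
    elsif line =~ single
      kept << line.gsub(single, " ").strip
    elsif line =~ start_re
      in_comment = true
    else
      kept << line
    end
  end
end

sample = [
  "int a;",
  "/* A *",
  " * comment */",
  "int b; /* inline */",
  "/* A * comment",
  "for * instance */",
  "int c;"
]

old_start = /^.*\/\*\*?[^(\*\/)]*$/       # class-based pattern from the question
new_start = /^.*\/\*([^\/]|\/(?!\*))*$/

p old_start.match?("/* A *")  # the character class stops at the trailing *
p new_start.match?("/* A *")
p code_only(sample)
```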
 
Tom Copeland

I am intrigued. I believe that a regular expression to find all comments
in C must be very complex, and it is probably not the correct tool. Look
at these snippets:

// /*
if(strcmp(x,"*/")
// "*/
etc. etc.

I'm not sure if it's impossible to parse out C-style comments using a
regular expression, but the various JavaCC grammars I've seen all use
lexical states to do it instead. Another complication is trigraphs (*),
although I think those are unrecognized by default in most C
preprocessors.

Yours,

Tom

(*) http://en.wikipedia.org/wiki/C_trigraph
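
The lexical-state idea mentioned above can be sketched in a few lines of Ruby. This is only an illustration, not taken from any JavaCC grammar: the function name strip_c_comments and the state names are made up, and trigraphs and backslash-newline splicing are deliberately ignored. The point is that once the scanner knows it is inside a string or character literal, a */ or // there is just ordinary text.

```ruby
# Hypothetical helper: remove C comments with a small state machine.
# States: :code, :block (/* */), :line (//), :string ("..."), :char ('...').
# Trigraphs and backslash-newline line splicing are NOT handled.
def strip_c_comments(src)
  out = String.new
  state = :code
  i = 0
  while i < src.length
    ch  = src[i]
    two = src[i, 2]
    case state
    when :code
      if two == "/*"
        state = :block
        out << " "            # a comment counts as one space
        i += 2
        next
      elsif two == "//"
        state = :line
        i += 2
        next
      end
      state = :string if ch == '"'
      state = :char if ch == "'"
      out << ch
    when :block
      if two == "*/"
        state = :code
        i += 2
        next
      end
      out << "\n" if ch == "\n"  # keep line numbers stable
    when :line
      if ch == "\n"
        state = :code
        out << ch
      end
    when :string, :char
      if ch == "\\"
        out << two             # copy the escape sequence untouched
        i += 2
        next
      end
      closer = state == :string ? '"' : "'"
      state = :code if ch == closer
      out << ch
    end
    i += 1
  end
  out
end

src = <<~'C'
  // /*
  if (strcmp(x, "*/")) y = 1; /* gone */
  // "*/
  z = 2;
C

puts strip_c_comments(src)
```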
 
Logan Capaldo

Francis said:
One more point. Someone upthread gave an example similar to this:

/* printf ("*/"); */

Pretty sure this would end up being a syntax error.

Francis said:
Considered strictly as a lexical construction, I think this is regular.
However, I have a funny feeling that this:

/* printf ("/*......*/"); */

This too.

gcc agrees with me at least:

% cat comments.c
#include <stdio.h>

int main(int argc, char **argv) {
    /* printf("*/"); */
    /* printf("/*.......*/"); */
    return 0;
}
% gcc -c comments.c
comments.c: In function 'main':
comments.c:4: error: missing terminating " character
comments.c:5: error: missing terminating " character

Francis said:
is actually context-free. Does anyone know for sure?

As for whether or not it's context-free, I don't know, but I think you
overestimated how hard C tries. /* */ comments are not nestable, for instance.
 
Logan Capaldo

Francis said:
I know these are syntax errors in C. I was talking about a hypothetical
language (not C) that defined such constructs as legal. I'm still not sure
that it's impossible to use a regular language to generate this case:

/* "*/ */

I'm pretty convinced that the other case requires a context-free language.

Well, for empirical evidence one could look at ML. (* comments (* are *)
nestable *).
 
Daniel Martin

Francis Cianfrocca said:
One more point. Someone upthread gave an example similar to this:

/* printf ("*/"); */

Considered strictly as a lexical construction, I think this is regular.
However, I have a funny feeling that this:

/* printf ("/*......*/"); */

is actually context-free. Does anyone know for sure?

So you want to know if a grammar is regular or not? Sounds like you
need the Myhill-Nerode theorem
(http://en.wikipedia.org/wiki/Myhill-Nerode_theorem).

And according to that, a language that allows arbitrary nesting of
comment expressions like this is indeed not regular, and therefore not
parseable with regular expressions as traditionally defined in
computer science. To parse arbitrarily nested constructs you need either
something like Perl's evaluate-code-at-regexp-match-time feature
(which so far as I know exists in no other language), an actual
grammar, or anything else that is computationally at least as powerful
as a pushdown automaton.
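
To make the point above concrete: the extra power needed for nesting is small in practice. A single depth counter, which behaves like a pushdown stack holding only one kind of symbol, is enough. Below is a minimal sketch for ML-style (* ... *) comments as in the example upthread. The function name strip_nested_comments is hypothetical, and string literals are not handled.

```ruby
# Hypothetical helper: strip nestable (* ... *) comments by tracking depth.
# The counter plays the role of the pushdown automaton's stack.
# String literals are NOT handled in this sketch.
def strip_nested_comments(src)
  out = String.new
  depth = 0
  i = 0
  while i < src.length
    two = src[i, 2]
    if two == "(*"
      depth += 1               # entering a (possibly nested) comment
      i += 2
    elsif two == "*)" && depth > 0
      depth -= 1               # leaving one nesting level
      i += 2
    else
      out << src[i] if depth.zero?  # keep text only outside all comments
      i += 1
    end
  end
  raise ArgumentError, "unterminated comment" if depth > 0
  out
end

puts strip_nested_comments("let x = 1 (* comments (* are *) nestable *) in x")
```

A plain regular expression cannot do this because it has no way to remember an unbounded nesting depth.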
 
