replace c-style comments with newlines (regexp)

lex __

I'm trying to use a regexp to replace multi-line C-style comments (like /* this \n */) with \n (newlines).
I tried something like re.sub('/\*(.*)/\*', '\n', file)
but it doesn't work for multiple lines.

Besides that, I want to keep all newlines as they were in the original file, so I can still use the original line numbers (I want to use line numbers as a reference for later use).
I know that will complicate things a bit more, so this is a bit less important.

Background: I'm trying to create an 'intelligent' source-code security analysis tool for C/C++, Python and PHP files, but filtering the comments seems to be the biggest problem. :(

So, if you have an answer to this, please let me know how to do this!

thanks in advance,
- Alex



 
Steven D'Aprano

I'm trying to use a regexp to replace multi-line C-style comments (like /*
this \n */) with \n (newlines). I tried something like
re.sub('/\*(.*)/\*', '\n', file) but it doesn't work for multiple
lines.


Regexes won't cross line boundaries unless you make them multiline with
re.MULTILINE.

Also, I'm no expert on regexes, but it looks to me that your regex is
greedy. I think you need the non-greedy version, which by memory (and
completely untested) is something like this:

rx = re.compile(r'/\*(.*?)\*/', re.MULTILINE)


Have you considered what happens when your C code includes a string
literal containing '/*'?
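
To make that concern concrete, here is a minimal sketch of what goes wrong. The pattern and the sample line of C are made up for illustration and are not taken from the original post; the point is only that a regex which knows nothing about string literals will treat their contents as a comment.

```python
import re

# Naive pattern: everything between /* and */, non-greedy.
# It has no idea what a C string literal is.
naive_comment = re.compile(r"/\*.*?\*/")

# Hypothetical line of C source: the first "/* ... */" sits inside a
# string literal, so it is data, not a comment.
c_line = 'puts("/* not a comment */"); /* real comment */'

stripped = naive_comment.sub("", c_line)
print(stripped)
```

The string literal comes out empty, so the "cleaned" source no longer means the same thing as the original.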


"Some people, when confronted with a problem, think “I know, I’ll use
regular expressions.” Now they have two problems."
-- Jamie Zawinski, in comp.lang.emacs
 
Peter Otten

Regexes won't cross line boundaries unless you make them multiline with
re.MULTILINE.

re.MULTILINE only affects the behaviour of ^ and $; the relevant flag is re.DOTALL.

Also, I'm no expert on regexes, but it looks to me that your regex is
greedy. I think you need the non-greedy version, which by memory (and
completely untested) is something like this:

['a', 'b\nb', 'c/*c']
...     return "\n" * match.group(1).count("\n")
...
'A BB \n CCC '

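
Only fragments of that interactive session are left, so what follows is a reconstruction of the idea and not necessarily the code that was posted. It compiles the pattern with re.DOTALL so that "." also matches newlines, keeps the match non-greedy, and replaces each comment with as many newlines as it contained, which leaves the line numbers of the remaining code unchanged. The pattern, the helper name `keep_newlines` and the sample inputs are assumptions chosen to be consistent with the fragments.

```python
import re

# DOTALL lets "." match "\n"; "*?" stops at the first "*/" rather than the last.
comment = re.compile(r"/\*(.*?)\*/", re.DOTALL)

def keep_newlines(match):
    # Replace a comment with exactly the newlines it contained,
    # so every following line keeps its original line number.
    return "\n" * match.group(1).count("\n")

# Assumed sample input: three comments, one spanning two lines,
# one containing a stray "/*".
sample = "/*a*/ /*b\nb*/ /*c/*c*/"
found = comment.findall(sample)
print(found)

source = "A /*a*/ BB /*b\nb*/ CCC /*c/*c*/"
result = comment.sub(keep_newlines, source)
print(repr(result))
```

Passing a function as the replacement argument is what makes the newline count depend on each individual match. This still does nothing about comment markers inside string literals.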
Have you considered what happens when your C code includes a string
literal containing '/*'?

Indeed.

Peter
 
Neil Cerutti

I'm trying to use a regexp to replace multi-line C-style comments
(like /* this \n */) with \n (newlines). I tried something
like re.sub('/\*(.*)/\*', '\n', file) but it doesn't
work for multiple lines.

Besides that, I want to keep all newlines as they were in the
original file, so I can still use the original line numbers (I
want to use line numbers as a reference for later use). I know
that will complicate things a bit more, so this is a bit
less important.

Background: I'm trying to create an 'intelligent' source-code
security analysis tool for C/C++, Python and PHP files, but
filtering the comments seems to be the biggest problem. :(

So, if you have an answer to this, please let me know how to
do this!

There are free C lexers and parsers available (e.g., gcc). I
recommend them to you. Gluing a real C parser into your Python
code might be easier than writing one. Not that it's impossible
to discover C comments with your own special-purpose, simple
parser (see Exercise 1-23 in K&R _The C Programming Language 2nd
Edition_), but it's not remotely doable with a regex.
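
For reference, a special-purpose scanner of the kind that exercise describes could look roughly like the sketch below. It is an illustration under simplifying assumptions, not a C parser: it handles /* */ comments, double-quoted strings and single-quoted character constants with backslash escapes, and it ignores // comments, trigraphs and line continuations. The function name `blank_c_comments` and the sample input are made up for this example.

```python
def blank_c_comments(source):
    """Replace each /* ... */ comment with the newlines it contains."""
    out = []
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if ch in "\"'":
            # String or character literal: copy it verbatim up to the closing quote.
            quote = ch
            j = i + 1
            while j < n and source[j] != quote:
                # Skip the character after a backslash so \" does not end the literal.
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)
            out.append(source[i:j])
            i = j
        elif source.startswith("/*", i):
            # Comment: find its end; an unterminated comment runs to end of input.
            end = source.find("*/", i + 2)
            j = n if end == -1 else end + 2
            # Keep only the newlines so line numbers stay the same.
            out.append("\n" * source.count("\n", i, j))
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)


code = 'char *s = "/* kept */"; /* dropped\nhere */ int x;\n'
cleaned = blank_c_comments(code)
print(cleaned)
```

Because the scanner tracks whether it is inside a literal, the "/* kept */" text in the string survives while the real comment is reduced to its single newline.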
 
