Remove the comments and excess white space in C source code

Eric Sosman

F.F. said:
Hi guys,
I wrote a program to strip excess white spaces and comments from C
source code. Please check it out. Any comments would be appreciated.

https://github.com/fangfufu/C-unformatter

I took only a brief look, so my remarks may be incomplete.
In no particular order:

- Interchange lines 219-221 with 222-224.

- The tests at line 246 are wrong because of the `char' type.

- I think check_preprocessor_statements() will fail if the
source starts with white space (e.g., newlines) followed by '#'.

- I think lines 168-173 will mess up source constructs like
`x = y / *ptr;', turning them into (unterminated) comments.

- Trigraph sequences and "digraphs" aren't handled properly.
(This could be considered a feature rather than a bug.)

- I don't think white space before what you call "tokens" will
be removed. For example, it looks like `puts ( "Hello" ) ;' will
become `puts ("Hello");' rather than `puts("Hello");'.

- Lines that end with a backslash-newline pair aren't handled
properly.

- Lines 119-141 are a *terrible* idea! One crummy little
I/O error (or bug!), and you can kiss your source code good-bye!

- Speaking of I/O errors, rip_file() is careful to detect
them but not so careful about closing FILE streams afterward.
(In fact, it never closes the overwritten input which is its
principal output, so never gets a chance to detect errors in
closing -- but since the original source is already trashed by
then it may not make much difference. Even if all goes well,
though, rip_file() leaks an open FILE stream for each source it
processes; feed it enough sources and it may well run out.)
(Hmmm: I wonder what happens if you mention the same source
file name twice on the command line ...)

- Higher-level remark: I think the program might be simpler
if re-cast as a state machine, instead of spreading the logic
across a whole bunch of brittle-looking functions. ("Brittle"
because there's always this question about whether the function
has or has not swallowed the current character, and perhaps more;
that's the sort of thing that's easy to lose track of.) This looks
more like a job for one simple loop surrounding a big `switch'
statement, with cases corresponding to the current context.
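
A minimal sketch of the loop-and-switch structure suggested above, limited to comment removal. It is not taken from the reviewed program: the name strip_comments() and the state names are hypothetical, and it assumes the whole source is already in memory. It deliberately ignores trigraphs, backslash-newline splicing, and white-space squeezing, all of which a real tool would have to deal with.

```c
/* Lexical contexts the scanner can be in. */
enum state {
    CODE, SLASH, BLOCK, BLOCK_STAR, LINE, STR, STR_ESC, CHR, CHR_ESC
};

/* Copy src to dst, replacing each comment with a single space.
 * dst must have room for strlen(src) + 1 bytes; the output is never
 * longer than the input.  Hypothetical helper for illustration. */
void strip_comments(const char *src, char *dst)
{
    enum state st = CODE;

    for (; *src != '\0'; src++) {
        char c = *src;
        switch (st) {
        case CODE:
            if (c == '/') {
                st = SLASH;                 /* division or comment start */
            } else {
                if (c == '"')  st = STR;
                if (c == '\'') st = CHR;
                *dst++ = c;
            }
            break;
        case SLASH:
            if (c == '*') {
                *dst++ = ' ';               /* a comment acts as one space */
                st = BLOCK;
            } else if (c == '/') {
                *dst++ = ' ';
                st = LINE;
            } else {
                *dst++ = '/';               /* it was just an operator */
                st = CODE;
                src--;                      /* re-examine c as ordinary code */
            }
            break;
        case BLOCK:
            if (c == '*') st = BLOCK_STAR;
            break;
        case BLOCK_STAR:
            if (c == '/')      st = CODE;   /* end of block comment */
            else if (c != '*') st = BLOCK;
            break;
        case LINE:
            if (c == '\n') { *dst++ = c; st = CODE; }
            break;
        case STR:
            *dst++ = c;
            if (c == '\\')     st = STR_ESC;
            else if (c == '"') st = CODE;
            break;
        case STR_ESC:
            *dst++ = c;                     /* escaped character, copy as is */
            st = STR;
            break;
        case CHR:
            *dst++ = c;
            if (c == '\\')      st = CHR_ESC;
            else if (c == '\'') st = CODE;
            break;
        case CHR_ESC:
            *dst++ = c;
            st = CHR;
            break;
        }
    }
    if (st == SLASH)
        *dst++ = '/';                       /* input ended on a lone slash */
    *dst = '\0';
}
```

With this sketch, `x = y / *ptr;' passes through unchanged, because the slash only becomes a comment opener when the very next character is `*' or `/'. Comment markers inside string and character literals are copied as ordinary text.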
 
Tim Rentsch

F.F. said:
Hi guys,
I wrote a program to strip excess white spaces and comments from C
source code. Please check it out. Any comments would be appreciated.

In addition to Eric Sosman's list (and overlap in some
cases), I would list these problems:

1. Some spaces that can be taken out aren't.

2. In some cases spaces that must be kept are removed,
e.g., `return/**/ 0;'

3. Comments are not removed from preprocessor
directives.

4. Line boundaries are ignored when deciding whether
a '#' starts a preprocessor directive.

5. Preprocessor directives after regular program
text don't have a newline inserted before them.
Or apparently this happens only some of the time; e.g.,

int main(){
#define FOO 1

misbehaves.

6. There needs to be a final newline added if the
last output line is non-empty (which it almost
always will be in real programs).

7. The formatting program generally assumes its
input is well-formed C source, with little or
no effort to detect bad input.

8. The approach is generally too simplistic to be
completely effective, especially if it matters
what happens with spaces in macro expansions,
which it does in some programs because of how
the stringizing operator works.
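
To illustrate item 8: the `#' operator keeps one space wherever the macro argument had white space between tokens, so squeezing spaces out of a macro argument can change a string the program produces. The macro STR and the function below are hypothetical names used only for this sketch.

```c
#include <string.h>

#define STR(x) #x   /* hypothetical helper: stringize the argument */

/* Returns 1 when the two spellings of the same expression
 * stringize to different text. */
int spacing_changes_string(void)
{
    const char *spaced   = STR(a   +   b);  /* yields "a + b" */
    const char *squeezed = STR(a+b);        /* yields "a+b"   */
    return strcmp(spaced, squeezed) != 0;
}
```

Runs of white space inside the argument collapse to a single space, but that space is not dropped, so a stripper that rewrites `a + b' as `a+b' inside a macro invocation alters the resulting string literal.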
 
Tim Rentsch

Eric Sosman said:
[how to process C source to remove spaces, comments, etc]

- Higher-level remark: I think the program might be simpler
if re-cast as a state machine, instead of spreading the logic
across a whole bunch of brittle-looking functions. ("Brittle"
because there's always this question about whether the function
has or has not swallowed the current character, and perhaps more;
that's the sort of thing that's easy to lose track of.) This looks
more like a job for one simple loop surrounding a big `switch'
statement, with cases corresponding to the current context.

That turns out to be a lot harder than it might seem, because of
interactions between the different levels of textual processing
(trigraphs, line splicing, comments, preprocessor lines, etc),
not to mention the question of when adjacent tokens can be
safely agglutinated.
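
One possible way to tame those interactions is to imitate the translation phases as separate passes rather than folding everything into one switch. As a sketch of just the line-splicing pass (the function name splice_lines() is hypothetical, and it assumes plain `\n' line endings and no trigraph spelling of the backslash):

```c
/* Delete every backslash-newline pair, as line splicing does.
 * dst must have room for strlen(src) + 1 bytes.
 * Hypothetical helper for illustration only. */
void splice_lines(const char *src, char *dst)
{
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n')
            src += 2;           /* drop the pair, joining the two lines */
        else
            *dst++ = *src++;    /* everything else is copied unchanged */
    }
    *dst = '\0';
}
```

Running this pass first matters because a comment opener can itself be split across lines: a slash, a backslash-newline, and then an asterisk only look like the start of a comment once the lines have been joined. The price is that the output no longer matches the input's line structure, which a formatting tool may or may not care about.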
 
