How to remove // comments

Mark McIntyre · Oct 23, 2006

Mark said:
No sane person is going to invent a new character encoding
that doesn't include ASCII.
Apparently nobody told IBM.
It's unlikely *now* that anyone would invent a new encoding that's not
based on ASCII.

I'm not even sure that's true. I can see the Chinese deciding on some
totally new encoding scheme more suitable for their needs.

If their needs don't include communicating with the rest of the world

One could argue, that since there's more of them than us, we should
adapt...

or the internet

Puhleeze. There are already many thousands of websites which are paged
entirely exclusively in non-ASCII. In a few years, I predict a
majority of websites will have non-ASCII names.

or using the C, C++, Perl, Java, Ruby, Python, or D
programming languages, then they should go for it.

It may surprise you to learn this, but nations using Western lettering
are in a minority.
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

CBFalconer · Oct 23, 2006

Walter said:
.... snip ...

3) Please explain how C99 makes it possible to make a conforming C
implementation for RADIX50 encoding,
http://en.wikipedia.org/wiki/RADIX-50.

Assuming you meant 'impossible', RADIX-50 can only hold 40
characters, 26 alpha, 10 numeric, space, and three others. No room
for the fundamental C char set.

Jalapeno · Oct 23, 2006

Keith said:
Fascinating. There have been raging arguments about trigraphs both
here and in comp.std.c for years. I think you're the first person
I've seen who actually *uses* them. Maybe mainframe users just don't
post to Usenet very often?

The first? Wow. I can't speak for anyone but myself. I came to Usenet
looking for information and people interested in old hardware, and
"discovered" comp.lang.c as a side effect. C isn't and never was the
most popular way to program mainframes. There are large code bases but
they are minuscule compared to the COBOL and PL/I code bases. In the
early 1980's we started using Pascal but it died fairly quickly. I have
seen a lot of C code on mainframes that is nothing more than "portable
assembler". The specific nature of the coding techniques for the MVS
system would make comp.lang.c fairly useless as a resource to those
programmers, I suppose.

In my own experience, and that of most people here, trigraphs have
caused far more problems than they solve; if a trigraph appears in a C
source file, it's far more likely to be accidental than intentional
(unless the code is deliberately obfuscated). For example:

fprintf(stderr, "Unexpected error, what happened??!\n");

When I first started in C octal numbers caused some subtle bugs. ;o)

Since there is currently no active effort to publish a new C standard,
it looks like we're stuck with the current situation for the
foreseeable future, but some of us are still trying to come up with a
better solution. For example, I've proposed *disabling* trigraphs by
default, but enabling them if there's some unique marker at the top of
the file.

For any change like this, there's a danger of breaking existing code,
but for those of us outside the IBM mainframe world, it would probably
accidentally *fix* more code than it would break.

I have neither a love nor hate for trigraphs. They are just the syntax
used. I originally responded to a poster who said he had never seen
trigraphs outside of a test suite. I have. That doesn't mean I advocate
using them. But they are in use.

Also, why do you use trigraphs rather than digraphs? They were added
in a 1995 update to the standard (I think that's right); you could
write a[8] as a<:8:> rather than as a??(8??).

Any thoughts?

Well, why didn't you tell me in 1995? ;o) Looking at the docs for
the compiler (which is C92 compliant, i.e. ANSI/ISO 9899:1990[1992]
(formerly ANSI X3.159-1989 C)) digraphs are available but the default
compiler switch is NODIGRAPH. So, since apparently nobody who has
worked here knew of digraphs, the compiler switch was never turned on.
IBM claims their newest compiler is C99 compliant, but it requires an
operating system upgrade to at least z/OS 1.7 to use that compiler. We
won't be upgrading the OS for at least another year.

Really, it is all just syntax. I got used to them and can go back and
forth without any trouble. YMMV, of course. Like anything in C, if you
know the pitfalls, it's easier to avoid them.

Jalapeno · Oct 23, 2006

Walter said:
I understand that. My (badly explained) point was that since trigraphs
failed to make C source code portable, trigraphs shouldn't have been
part of the C standard.

I am not sure I understand your point. Portability is supposed to be a
two way street.

On the IBM mainframe, the 3270 terminal (really 91.9% is terminal
emulation on Windows these days) does not have certain characters from
the C basic execution character set. The 3270 has many (IMO) better
characters.

EBCDIC however has, for instance, the '[' and ']' symbols in its set of
characters. They are there. It isn't a translation problem per se. It
is just that when the C standard was being formulated there was no way
to type them from a 3270 terminal.

There was absolutely no problem taking C source code from Unix or
Windows, for example, and translating the ASCII to EBCDIC and compiling
the source. Trigraphs mean that source typed in on a 3270 can be sent
to a Unix system via EBCDIC to ASCII translation and still compile
without having to edit the source. (system specific parts excepted)

I am not advocating trigraphs. I do see your point. There were
realities in the hardware in the 1980's and 1990's that were there. I
am sure IBM had a presence with the Standards committee.

Just understand that my whole existence in this thread is because you
said you had never seen trigraphs outside a test suite. They do exist.
It is legacy code, I know, but it is there. And it is updated
periodically.

jxh · Oct 23, 2006

CBFalconer said:
... snip code ...

If you just want to delete all comments, my public domain uncmnt.c
is considerably shorter. ...

<http://cbfalconer.home.att.net/download/>

Very nice. It doesn't handle other cases besides trigraphs, though.

Keith Thompson · Oct 23, 2006

Mark McIntyre said:
Mark said:

On Sat, 21 Oct 2006 19:35:34 GMT, in comp.lang.c , Keith Thompson

No sane person is going to invent a new character encoding
that doesn't include ASCII.
Apparently nobody told IBM.
It's unlikely *now* that anyone would invent a new encoding that's not
based on ASCII.

I'm not even sure that's true. I can see the Chinese deciding on some
totally new encoding scheme more suitable for their needs.

If their needs don't include communicating with the rest of the world

One could argue, that since there's more of them than us, we should
adapt...

or the internet

Puhleeze. There are already many thousands of websites which are paged
entirely exclusively in non-ASCII. In a few years, I predict a
majority of websites will have non-ASCII names.

Obviously the encodings used for Chinese and/or Japanese characters
are non-ASCII, but are they necessarily *incompatible* with ASCII?
Chinese in particular has *lots* of characters it has to represent;
reserving the first 128 codes for ASCII (including digits and
punctuation marks, which can be used in Chinese text) doesn't seem too
onerous.

Unicode is a superset of ASCII, and it can represent Chinese
characters easily enough. *If* it catches on world-wide, we can
continue to assume that the ASCII subset needed by C will be
available.

Mark McIntyre · Oct 23, 2006

Obviously the encodings used for Chinese and/or Japanese characters
are non-ASCII, but are they necessarily *incompatible* with ASCII?

Quite possibly not, although people have in the past been known to
deliberately write for incompatibility, due to personal, commercial or
nationalistic reasons. This is however probably offtopic in CLC...
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

Keith Thompson · Oct 23, 2006

Jalapeno said:
EBCDIC however has, for instance, the '[' and ']' symbols in its set of
characters. They are there. It isn't a translation problem per se. It
is just that when the C standard was being formulated there was no way
to type them from a 3270 terminal.

Really? My understanding is that there are multiple versions of
EBCDIC, some of which *don't* have '[' and ']' characters. Wikipedia
<http://en.wikipedia.org/wiki/EBCDIC> shows a table of something
called CCSID 500, which does have '[' and ']', along with accented
characters (which, if I understand correctly, "classic" EBCDIC didn't
have).

CBFalconer · Oct 23, 2006

jxh said:
CBFalconer wrote:
.... snip ...

Very nice. It doesn't handle other cases besides trigraphs, though.

What do you see missing? Apart from trigraphs.

Keith Thompson · Oct 23, 2006

Mark McIntyre said:
Quite possibly not, although people have in the past been known to
deliberately write for incompatibility, due to personal, commercial or
nationalistic reasons. This is however probably offtopic in CLC...

It's not entirely off-topic. The future evolution of character sets
could have a major effect on future C standards. If we can't assume,
for example, that the '?' character will always be available, we'll
have to think about alternatives. Though there's probably not much
point in inventing specific solutions until and unless we see an
actual character set that *doesn't* have '?', and that people want to
use to write C programs.

Walter Bright · Oct 23, 2006

Mark said:
One could argue, that since there's more of them than us, we should
adapt...

You can argue that. But don't expect to be taken seriously. The Chinese
and Japanese regularly mix in western letters in their web pages, books,
and magazines.

You're suggesting that we (and the Chinese) should throw out the entire
computer infrastructure, and rewrite/rebuild everything from scratch.

Puhleeze. There are already many thousands of websites which are paged
entirely exclusively in non-ASCII. In a few years, I predict a
majority of websites will have non-ASCII names.

The internet encodings are all supersets of ascii. That is not going to
change.

It may surprise you to learn this, but nations using Western lettering
are in a minority.

How can a C99 compiler work with totally non-western lettering?

Walter Bright · Oct 23, 2006

Jalapeno said:
Just understand that my whole existence in this thread is because you
said you had never seen trigraphs outside a test suite. They do exist.
It is legacy code, I know, but it is there. And it is updated
periodically.

I am not advocating removing trigraphs from the standard - what's done
is done. And I appreciate you joined in to say there are real trigraph
uses in the wild.

Walter Bright · Oct 23, 2006

CBFalconer said:
Walter Bright wrote:
... snip ...

Assuming you meant 'impossible', RADIX-50 can only hold 40
characters, 26 alpha, 10 numeric, space, and three others. No room
for the fundamental C char set.

Exactly. Trigraphs don't make C future proofed against arbitrary future
character encodings that don't have ascii as a subset.

Jalapeno · Oct 24, 2006

Keith said:
Jalapeno said:

EBCDIC however has, for instance, the '[' and ']' symbols in its set of
characters. They are there. It isn't a translation problem per se. It
is just that when the C standard was being formulated there was no way
to type them from a 3270 terminal.

Really? My understanding is that there are multiple versions of
EBCDIC, some of which *don't* have '[' and ']' characters. Wikipedia
<http://en.wikipedia.org/wiki/EBCDIC> shows a table of something
called CCSID 500, which does have '[' and ']', along with accented
characters (which, if I understand correctly, "classic" EBCDIC didn't
have).

Now you're making me get out the archives, I see. ;o) Ok, the oldest
"green card" I have is from when I graduated from college and got my
first job in an IBM mainframe shop. Jan of 1979. I don't know if this
table is what you'd call "classic" EBCDIC but it was the version being
used in the USA for IBM, Amdahl, Hitachi, and National Advanced Systems
mainframes in January of 1979. It clearly shows '[' as decimal 173 and
hex AD, and ']' as decimal 189 and hex BD in the EBCDIC column of the
table. It has no characters in the BCDIC column in that range and
nothing in the ASCII column. My archives don't go back any farther than
that but 1979 is clearly prior to the formation of the standards
committee. So EBCDIC had the characters in its set at least since
01/1979. The 3270 still doesn't have them on its keyboard. However,
having said that, I do not have access to an APL keyboard anymore so it
is possible that EBCDIC having those characters in its set may be
related to APL and its history. Someone else will have to answer that
question. ;o) Wikipedia says this:

http://en.wikipedia.org/wiki/APL_(programming_language)

So based on that article, which clearly shows the '[' and ']' I am
going to guess that by "classic" EBCDIC you may have meant BCDIC.

I never did program in APL on an IBM mainframe but I used to see at
least one or two APL keyboards in every shop I worked in in the '80's
and '90's. I did have one class in APL in college but it was on a
CYBER, not an IBM.

Keith Thompson · Oct 24, 2006

Jalapeno said:
Now you're making me get out the archives, I see. ;o) Ok, the oldest
"green card" I have is from when I graduated from college and got my
first job in an IBM mainframe shop. Jan of 1979. I don't know if this
table is what you'd call "classic" EBCDIC but it was the version being
used in the USA for IBM, Amdahl, Hitachi, and National Advanced Systems
mainframes in January of 1979. It clearly shows '[' as decimal 173 and
hex AD, and ']' as decimal 189 and hex BD in the EBCDIC column of the
table. It has no characters in the BCDIC column in that range and
nothing in the ASCII column. My archives don't go back any farther than
that but 1979 is clearly prior to the formation of the standards
committee. So EBCDIC had the characters in its set at least since
01/1979. The 3270 still doesn't have them on its keyboard. However,
having said that, I do not have access to an APL keyboard anymore so it
is possible that EBCDIC having those characters in its set may be
related to APL and its history. Someone else will have to answer that
question. ;o) Wikipedia says this:

http://en.wikipedia.org/wiki/APL_(programming_language)

So based on that article, which clearly shows the '[' and ']' I am
going to guess that by "classic" EBCDIC you may have meant BCDIC.

I actually have very little idea of what I meant by "classic" EBCDIC;
your guess is probably better than mine. My ignorance on this topic
is vast.

In this context, I suppose the most relevant version is whatever
influenced the ANSI C committee back in the 1980s. But at that time,
I think alternate ASCII-oid codes were at least as significant in
influencing the introduction of trigraphs; some national character
sets replaced some of the ASCII punctuation marks with things like
accented characters and currency symbols. I think these now have
largely been replaced by the 8-bit ISO-8859-* encodings, and by
Unicode et al.

One more data point: Unix-like systems have a command called "dd" that
converts and copies files. Some of the conversions it specifies are:

`ascii'
     Convert EBCDIC to ASCII, using the conversion table specified
     by POSIX. This provides a 1:1 translation for all 256 bytes.

`ebcdic'
     Convert ASCII to EBCDIC. This is the inverse of the `ascii'
     conversion.

`ibm'
     Convert ASCII to alternate EBCDIC, using the alternate
     conversion table specified by POSIX. This is not a 1:1
     translation, but reflects common historical practice for `~',
     `[', and `]'.

     The `ascii', `ebcdic', and `ibm' conversions are mutually
     exclusive.

"dd conv=ebcdic" translates '[' and ']' to 0x4a and 0x5a, respectively.
"dd conv=ibm" translates '[' and ']' to 0xad and 0xbd, respectively.

Jordan Abel · Oct 24, 2006

Keith Thompson said:
It's not entirely off-topic. The future evolution of character sets
could have a major effect on future C standards. If we can't assume,
for example, that the '?' character will always be available, we'll
have to think about alternatives. Though there's probably not much
point in inventing specific solutions until and unless we see an
actual character set that *doesn't* have '?', and that people want to
use to write C programs.

The C standard does not define the graphical representation of any of
the characters the language uses. So for it to be an issue, we would
have to see an actual character set that has fewer than 100 characters.

Suppose a character set lacked ? but had $ - then we could define the
$ character as having the meaning of the ? in C. In source code
interchange, the string literal "Hello?" might become @Hello$@

62  a-zA-Z0-9
29  !#%^&*()[]{};':",.<>/?~\|-_=+
 9  \a\b\f\n\r\t\v \0

62+29+9 = 100 unique values required for C. And since C requires an [at
least] 8-bit type for char anyway, any system that used fewer bits wouldn't
be able to use its native character representation for C purposes anyway.

Jalapeno · Oct 24, 2006

Keith said:
The `ascii', `ebcdic', and `ibm' conversions are mutually
exclusive.

I looked up the CCSID numbers to see what the 500 code page was that
was in the wikipedia article. Based on this link below, I think the
acronym EBCDIC means many things. ;o)

http://www-306.ibm.com/software/globalization/ccsid/ccsid_registered.jsp

I am done now. ;o)

CBFalconer · Oct 24, 2006

jxh said:
CBFalconer wrote:
.... snip ...

Very nice. It doesn't handle other cases besides trigraphs, though.

I posted recently asking where it failed, and got no replies. I
did discover one case and corrected that. The revised code has
been posted at the above URL. It should be easily revised to
convert comments to the portable format, and I plan to do that
sometime real soon now.

As it stands I believe it is useful in generating cloaked source.
It can remove comments, id2id-20 (at same url) can revise names,
and a further utility (justify, not published) can handle the
rest. As it stands justify doesn't detect quoted strings, which
could cause problems. I may create justifyc to handle this, when
cloaking will reduce to a supervisory script. So far I have used
these things to create valid but obscure answers to homework
requests.

One more useful thing for cloaking would be filters to entrigph and
detrigph.

All of this points out the advantage of writing fully portable
source to the C90 standard. Without that you have very few
guarantees that the eventual output source remains valid on the
purchaser's system.

jxh · Oct 25, 2006

CBFalconer said:
I posted recently asking where it failed, and got no replies. ...

It fails the split comment cases, such as these:

/\
* this is a comment */

/\
/ this is a comment too

Also from the previous thread, I learned about not messing with the
preprocessor directives, so both yours and mine failed cases like:

#define COMMENT_START /* blah blah blah
#define COMMENT_END blah blah blah */

Of course, keep in mind corner cases like:

/* hey */ #define FOO \
/* bzzt */

I have fixed my program to properly deal with preprocessor directives.

--
- James

/*
 * cstripc: A C program to strip comments from C files.
 * Usage:
 *   cstripc [file [...]]
 *   cstripc [-t]
 *
 * The '-t' option is used for testing. It prints some pointers
 * to strings that are interlaced with comment characters.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*****************/
/**** GLOBALS ****/
/*****************/

static const char *progname;
static int debug_flag;

/**********************/
/**** MAIN PROGRAM ****/
/**********************/

static void print_usage(void);
static void print_test(void);

static FILE * open_input_file(const char *filename);
static void close_input_file(FILE *infile);
static void parse_input_file(FILE *infile);

int
main(int argc, char *argv[])
{
    progname = argv[0];
    if (progname == 0) {
        progname = "cstripc";
    }

    while (argc > 1) {

        if ((*argv[1] != '-') || (strcmp(argv[1], "-") == 0)) {
            break;
        }

        if (strcmp(argv[1], "-t") == 0) {
            print_test();
            exit(0);
        } else if (strcmp(argv[1], "-d") == 0) {
            debug_flag = 1;
        } else {
            fprintf(stderr, "%s: Unrecognized option '%s'\n",
                    progname, argv[1]);
            print_usage();
            exit(EXIT_FAILURE);
        }

        --argc;
        ++argv;
    }

    if (argc <= 1) {
        parse_input_file(stdin);
        exit(0);
    }

    while (argc > 1) {
        FILE *infile;

        parse_input_file(infile = open_input_file(argv[1]));
        close_input_file(infile);

        --argc;
        ++argv;
    }

    return 0;
}

/**************************/
/**** PRINT USAGE/TEST ****/
/**************************/

static const char *usage_string =
    "%s: A C program to strip comments from C files.\n"
    "Usage:\n"
    "    %s [file [...]]\n"
    "    %s [-t]\n"
    "\n"
    "The '-t' option is used for testing. "
    "It prints some pointers to strings\n"
    "that are interlaced with comment characters.\n"
    ;

static void
print_usage(void)
{
    fprintf(stderr, usage_string, progname, progname, progname);
}

static const char *a;
static const char *b;
static const char *c;

static void
print_test(void)
{
    if (a) puts(a);
    if (b) puts(b);
    if (c) puts(c);
}

/*******************************/
/**** OPEN/CLOSE INPUT FILE ****/
/*******************************/

static const char *input_file_name;

static FILE *
open_input_file(const char *filename)
{
    FILE *infile;

    input_file_name = filename;

    if (filename == 0) {
        return 0;
    }

    if (strcmp(filename, "-") == 0) {
        return stdin;
    }

    infile = fopen(filename, "r");
    if (infile == 0) {
        fprintf(stderr, "%s: Could not open '%s' for reading.\n",
                progname, filename);
    }

    return infile;
}

static void
close_input_file(FILE *infile)
{
    if (infile) {
        if (infile != stdin) {
            if (fclose(infile) == EOF)
                fprintf(stderr, "%s: Could not close '%s'.\n",
                        progname, input_file_name);
        } else {
            clearerr(stdin);
        }
    }
}

/**************************/
/**** PARSE INPUT FILE ****/
/**************************/

typedef struct scan_state scan_state;
typedef struct scan_context scan_context;

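struct scan_context {
    const scan_state *ss;   /* current state */
    char *sbuf;             /* characters held back until classified */
    unsigned sbufsz;        /* capacity of sbuf */
    unsigned sbufcnt;       /* characters currently held in sbuf */
    int bol;                /* nonzero until the line has a non-blank char */
};

struct scan_state {
    const scan_state *(*scan)(scan_context *ctx, int input);
    const char *name;       /* state name, printed when -d is given */
};

static scan_context initial_scan_context;

static void
parse_input_file(FILE *infile)
{
    int c;
    scan_context ctx;

    if (infile == 0) {
        return;
    }

    ctx = initial_scan_context;

    /* Feed one character at a time; each handler returns the next state. */
    while ((c = fgetc(infile)) != EOF) {
        if (debug_flag) {
            fprintf(stderr, "%s\n", ctx.ss->name);
        }
        ctx.ss = ctx.ss->scan(&ctx, c);
    }
}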

/***********************/
/**** STATE MACHINE ****/
/***********************/

/*
 *
 *********************************************************************
 * Assume input is a syntactically correct C program.
 *
 * The basic algorithm is:
 *   Scan character by character:
 *     Treat trigraphs as a single character.
 *     If the sequence does not start a comment, emit the sequence.
 *     Otherwise,
 *       Scan character by character:
 *         Treat trigraphs as a single character.
 *         Treat the sequence '\\' '\n' as no character.
 *         If the sequence does not end a comment, continue consuming.
 *         Otherwise, emit a space, and loop back to top.
 *********************************************************************
 *
 */

#define SCAN_STATE_DEFINE(name) \
    static const scan_state * name##_func(scan_context *ctx, int input); \
    static const scan_state name##_state = { name##_func, #name }

SCAN_STATE_DEFINE(normal);
SCAN_STATE_DEFINE(normal_maybe_tri_1);
SCAN_STATE_DEFINE(normal_maybe_tri_2);
SCAN_STATE_DEFINE(normal_maybe_splice);
SCAN_STATE_DEFINE(string);
SCAN_STATE_DEFINE(string_maybe_tri_1);
SCAN_STATE_DEFINE(string_maybe_tri_2);
SCAN_STATE_DEFINE(string_maybe_splice);
SCAN_STATE_DEFINE(char);
SCAN_STATE_DEFINE(char_maybe_tri_1);
SCAN_STATE_DEFINE(char_maybe_tri_2);
SCAN_STATE_DEFINE(char_maybe_splice);
SCAN_STATE_DEFINE(slash);
SCAN_STATE_DEFINE(slash_maybe_tri_1);
SCAN_STATE_DEFINE(slash_maybe_tri_2);
SCAN_STATE_DEFINE(slash_maybe_splice);
SCAN_STATE_DEFINE(slashslash);
SCAN_STATE_DEFINE(slashslash_maybe_tri_1);
SCAN_STATE_DEFINE(slashslash_maybe_tri_2);
SCAN_STATE_DEFINE(slashslash_maybe_splice);
SCAN_STATE_DEFINE(slashsplat);
SCAN_STATE_DEFINE(slashsplat_splat);
SCAN_STATE_DEFINE(slashsplat_splat_maybe_tri_1);
SCAN_STATE_DEFINE(slashsplat_splat_maybe_tri_2);
SCAN_STATE_DEFINE(slashsplat_splat_maybe_splice);
SCAN_STATE_DEFINE(preproc);
SCAN_STATE_DEFINE(preproc_maybe_tri_1);
SCAN_STATE_DEFINE(preproc_maybe_tri_2);
SCAN_STATE_DEFINE(preproc_maybe_splice);

#define SCAN_STATE(name) (&name##_state)

static scan_context initial_scan_context = {
    SCAN_STATE(normal), 0, 0, 0, 1
};

static void sbuf_append_char(scan_context *ctx, int c);
static void sbuf_append_string(scan_context *ctx, char *s);
static void sbuf_clear(scan_context *ctx);
static void sbuf_emit(scan_context *ctx);

static const scan_state *
normal_func(scan_context *ctx, int input)
{
    switch (input) {
    case '#':   sbuf_emit(ctx);
                putchar(input);
                return ctx->bol ? SCAN_STATE(preproc)
                                : SCAN_STATE(normal);
    case '?':   sbuf_emit(ctx);
                sbuf_append_char(ctx, input);
                return SCAN_STATE(normal_maybe_tri_1);
    case '"':   ctx->bol = 0;
                sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(string);
    case '\'':  ctx->bol = 0;
                sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(char);
    case '/':   sbuf_emit(ctx);
                sbuf_append_char(ctx, input);
                return SCAN_STATE(slash);
    case '\\':  sbuf_emit(ctx);
                sbuf_append_char(ctx, input);
                return SCAN_STATE(normal_maybe_splice);
    case '\n':  ctx->bol = 1;
                /* fallthrough */
    case ' ':
    case '\t':  sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(normal);
    default:    ctx->bol = 0;
                sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(normal);
    }
}

static const scan_state *
normal_maybe_tri_1_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   sbuf_append_char(ctx, input);
                return SCAN_STATE(normal_maybe_tri_2);
    default:    ctx->bol = 0;
                sbuf_emit(ctx);
                return SCAN_STATE(normal)->scan(ctx, input);
    }
}

static const scan_state *
normal_maybe_tri_2_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   ctx->bol = 0;
                putchar(input);
                return SCAN_STATE(normal_maybe_tri_2);
    case '=':   sbuf_emit(ctx);
                putchar(input);
                return ctx->bol ? SCAN_STATE(preproc)
                                : SCAN_STATE(normal);
    case '(':
    case ')':
    case '<':
    case '>':
    case '!':
    case '\'':
    case '-':   ctx->bol = 0;
                sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(normal);
    case '/':   sbuf_append_char(ctx, input);
                return SCAN_STATE(normal_maybe_splice);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(normal)->scan(ctx, input);
    }
}

static const scan_state *
normal_maybe_splice_func(scan_context *ctx, int input)
{
    switch (input) {
    default:    ctx->bol = 0;
                /* fallthrough */
    case '\n':  sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(normal);
    }
}

static const scan_state *
string_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   sbuf_emit(ctx);
                sbuf_append_char(ctx, input);
                return SCAN_STATE(string_maybe_tri_1);
    case '"':   sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(normal);
    case '\\':  sbuf_emit(ctx);
                sbuf_append_char(ctx, input);
                return SCAN_STATE(string_maybe_splice);
    default:    sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(string);
    }
}

static const scan_state *
string_maybe_tri_1_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   sbuf_append_char(ctx, input);
                return SCAN_STATE(string_maybe_tri_2);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(string)->scan(ctx, input);
    }
}

static const scan_state *
string_maybe_tri_2_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   putchar(input);
                return SCAN_STATE(string_maybe_tri_2);
    case '/':   sbuf_append_char(ctx, input);
                return SCAN_STATE(string_maybe_splice);
    case '=':
    case '(':
    case ')':
    case '<':
    case '>':
    case '!':
    case '\'':
    case '-':   sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(string);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(string)->scan(ctx, input);
    }
}

static const scan_state *
string_maybe_splice_func(scan_context *ctx, int input)
{
    switch (input) {
    case '\n':
    default:    sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(string);
    }
}

static const scan_state *
char_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   sbuf_emit(ctx);
                sbuf_append_char(ctx, input);
                return SCAN_STATE(char_maybe_tri_1);
    case '\'':  sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(normal);
    case '\\':  sbuf_emit(ctx);
                sbuf_append_char(ctx, input);
                return SCAN_STATE(char_maybe_splice);
    default:    sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(char);
    }
}

static const scan_state *
char_maybe_tri_1_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   sbuf_append_char(ctx, input);
                return SCAN_STATE(char_maybe_tri_2);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(char)->scan(ctx, input);
    }
}

static const scan_state *
char_maybe_tri_2_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   putchar(input);
                return SCAN_STATE(char_maybe_tri_2);
    case '/':   sbuf_append_char(ctx, input);
                return SCAN_STATE(char_maybe_splice);
    case '=':
    case '(':
    case ')':
    case '<':
    case '>':
    case '!':
    case '\'':
    case '-':   sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(char);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(char)->scan(ctx, input);
    }
}

static const scan_state *
char_maybe_splice_func(scan_context *ctx, int input)
{
    switch (input) {
    case '\n':
    default:    sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(char);
    }
}

static const scan_state *
slash_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   sbuf_append_char(ctx, input);
                return SCAN_STATE(slash_maybe_tri_1);
    case '\\':  sbuf_append_char(ctx, input);
                return SCAN_STATE(slash_maybe_splice);
    case '/':   sbuf_clear(ctx);
                return SCAN_STATE(slashslash);
    case '*':   sbuf_clear(ctx);
                return SCAN_STATE(slashsplat);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(normal)->scan(ctx, input);
    }
}

static const scan_state *
slash_maybe_tri_1_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   return SCAN_STATE(slash_maybe_tri_2);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(normal)->scan(ctx, input);
    }
}

static const scan_state *
slash_maybe_tri_2_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   sbuf_emit(ctx);
                sbuf_append_string(ctx, "??");
                return SCAN_STATE(normal_maybe_tri_2);
    case '/':   sbuf_append_char(ctx, '?');
                sbuf_append_char(ctx, input);
                return SCAN_STATE(slash_maybe_splice);
    case '=':
    case '(':
    case ')':
    case '<':
    case '>':
    case '!':
    case '\'':
    case '-':   sbuf_append_char(ctx, '?');
                sbuf_append_char(ctx, input);
                sbuf_emit(ctx);
                return SCAN_STATE(normal);
    default:    sbuf_append_char(ctx, '?');
                sbuf_emit(ctx);
                return SCAN_STATE(normal)->scan(ctx, input);
    }
}

static const scan_state *
slash_maybe_splice_func(scan_context *ctx, int input)
{
    switch (input) {
    case '\n':  sbuf_append_char(ctx, input);
                return SCAN_STATE(slash);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(normal)->scan(ctx, input);
    }
}

static const scan_state *
slashslash_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   return SCAN_STATE(slashslash_maybe_tri_1);
    case '\\':  return SCAN_STATE(slashslash_maybe_splice);
    case '\n':  ctx->bol = 1;   /* next line may start a directive */
                putchar(' ');
                putchar(input);
                return SCAN_STATE(normal);
    default:    return SCAN_STATE(slashslash);
    }
}

static const scan_state *
slashslash_maybe_tri_1_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   return SCAN_STATE(slashslash_maybe_tri_2);
    default:    return SCAN_STATE(slashslash)->scan(ctx, input);
    }
}

static const scan_state *
slashslash_maybe_tri_2_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   return SCAN_STATE(slashslash_maybe_tri_2);
    case '/':   return SCAN_STATE(slashslash_maybe_splice);
    case '=':
    case '(':
    case ')':
    case '<':
    case '>':
    case '!':
    case '\'':
    case '-':   return SCAN_STATE(slashslash);
    default:    return SCAN_STATE(slashslash)->scan(ctx, input);
    }
}

static const scan_state *
slashslash_maybe_splice_func(scan_context *ctx, int input)
{
    switch (input) {
    case '\n':  return SCAN_STATE(slashslash);
    default:    return SCAN_STATE(slashslash)->scan(ctx, input);
    }
}

static const scan_state *
slashsplat_func(scan_context *ctx, int input)
{
    (void)ctx; /* unused */
    switch (input) {
    case '*':   return SCAN_STATE(slashsplat_splat);
    default:    return SCAN_STATE(slashsplat);
    }
}

static const scan_state *
slashsplat_splat_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   return SCAN_STATE(slashsplat_splat_maybe_tri_1);
    case '\\':  return SCAN_STATE(slashsplat_splat_maybe_splice);
    case '/':   putchar(' ');
                return SCAN_STATE(normal);
    default:    return SCAN_STATE(slashsplat)->scan(ctx, input);
    }
}

static const scan_state *
slashsplat_splat_maybe_tri_1_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   return SCAN_STATE(slashsplat_splat_maybe_tri_2);
    default:    return SCAN_STATE(slashsplat)->scan(ctx, input);
    }
}

static const scan_state *
slashsplat_splat_maybe_tri_2_func(scan_context *ctx, int input)
{
    switch (input) {
    case '/':   return SCAN_STATE(slashsplat_splat_maybe_splice);
    case '=':
    case '(':
    case ')':
    case '<':
    case '>':
    case '!':
    case '\'':
    case '-':   return SCAN_STATE(slashsplat);
    default:    return SCAN_STATE(slashsplat)->scan(ctx, input);
    }
}

static const scan_state *
slashsplat_splat_maybe_splice_func(scan_context *ctx, int input)
{
    switch (input) {
    case '\n':  return SCAN_STATE(slashsplat_splat);
    default:    return SCAN_STATE(slashsplat)->scan(ctx, input);
    }
}

static const scan_state *
preproc_func(scan_context *ctx, int input)
{
    switch (input) {
    case '\\':  sbuf_emit(ctx);
                sbuf_append_char(ctx, input);
                return SCAN_STATE(preproc_maybe_splice);
    case '?':   /* possible ??/ splice: hand off to the trigraph states */
                sbuf_emit(ctx);
                sbuf_append_char(ctx, input);
                return SCAN_STATE(preproc_maybe_tri_1);
    case '\n':  sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(normal);
    default:    sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(preproc);
    }
}

static const scan_state *
preproc_maybe_tri_1_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   sbuf_append_char(ctx, input);
                return SCAN_STATE(preproc_maybe_tri_2);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(preproc)->scan(ctx, input);
    }
}

static const scan_state *
preproc_maybe_tri_2_func(scan_context *ctx, int input)
{
    switch (input) {
    case '?':   putchar(input);
                return SCAN_STATE(preproc_maybe_tri_2);
    case '/':   sbuf_append_char(ctx, input);
                return SCAN_STATE(preproc_maybe_splice);
    case '=':
    case '(':
    case ')':
    case '<':
    case '>':
    case '!':
    case '\'':
    case '-':   sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(preproc);
    default:    sbuf_emit(ctx);
                return SCAN_STATE(preproc)->scan(ctx, input);
    }
}

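static const scan_state *
preproc_maybe_splice_func(scan_context *ctx, int input)
{
    switch (input) {
    case '\n':  sbuf_emit(ctx);
                putchar(input);
                return SCAN_STATE(preproc);
    default:    /* not a splice: rescan so a following '\\' is examined again */
                sbuf_emit(ctx);
                return SCAN_STATE(preproc)->scan(ctx, input);
    }
}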

/*************************/
/**** BUFFER HANDLING ****/
/*************************/

static void
sbuf_append_char(scan_context *ctx, int c)
{
    char *p = ctx->sbuf;

    if (p == 0) {
        p = malloc(ctx->sbufsz = 128);
    } else if (ctx->sbufcnt + 1 >= ctx->sbufsz) {
        /* grow early: one byte must stay free for the '\0' below */
        p = realloc(p, ctx->sbufsz *= 2);
    }
    if (p == 0) {
        fprintf(stderr, "%s: memory allocation failure\n",
                progname);
        exit(EXIT_FAILURE);
    }
    ctx->sbuf = p;

    ctx->sbuf[ctx->sbufcnt++] = c;
    ctx->sbuf[ctx->sbufcnt] = '\0';
}

static void
sbuf_append_string(scan_context *ctx, char *s)
{
    while (*s != '\0') {
        sbuf_append_char(ctx, *s++);
    }
}

static void
sbuf_clear(scan_context *ctx)
{
    ctx->sbufcnt = 0;
    if (ctx->sbuf) {
        ctx->sbuf[ctx->sbufcnt] = '\0';
    }
}

static void
sbuf_emit(scan_context *ctx)
{
    if (ctx->sbuf == 0 || ctx->sbufcnt == 0) {
        return;
    }

    printf("%s", ctx->sbuf);
    sbuf_clear(ctx);
}

/********************/
/**** TEST CASES ****/
/********************/

/* a comment */
/\
* a comment split */
/\
\
* a comment split twice */
/*
block comment
*/
/* comment, trailing delimiter split *\
/
/* comment, trailing delimiter split twice *\
\
/
/* comment, trailing delimiter split once, and again by trigraph *\
??/
/

static const char *a = /* comment in code "*/"Hello, "/**/"World!";
static const char *b = /\
* comment on code line split */ "Hello, " /\
\
* comment on code line split twice */ "World!";

#if 0
??/* this does not start a comment */
#endif

#define FOO1 /* don't touch this */
#define FOO2 \
/* don't touch this */

/* comment */ #define FOO3 /* don't touch this */

#define FOO4 /* don't touch
#define FOO5 this */

#if defined(__STDC__) && (__STDC__ == 1)
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
//*** MORE TEST CASES ***//
/\
/ // comment split
/\
\
/ // comment split twice
static const char *c = // // comment on code line
"Hello, " /\
/ // comment on code line split
"World!" /\
\
/ // comment on code line split twice.
;

#if 0
??// this does not start a comment
#endif

// This is a // comment \
on two lines

#else
static const char *c = "STDC without STDC_VERSION";
#endif
#endif

CBFalconer · Oct 25, 2006

jxh said:
CBFalconer wrote:
.... snip ...

It fails the split comment cases, such as these:

/\
* this is a comment */

/\
/ this is a comment too

Also from the previous thread, I learned about not messing with the
preprocessor directives, so both yours and mine failed cases like:

#define COMMENT_START /* blah blah blah
#define COMMENT_END blah blah blah */

Of course, keep in mind corner cases like:

/* hey */ #define FOO \
/* bzzt */
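
One way to cover the split-comment cases quoted above is to delete every backslash-newline pair first and only then look for comment delimiters. The sketch below illustrates that two-pass idea; it is not the state machine posted earlier in the thread. The names `unsplice` and `strip_comments` are hypothetical, trigraphs such as ??/ are ignored, preprocessor lines get no special treatment, and the caller is assumed to supply an output buffer at least as large as the input.

```c
#include <stddef.h>

/* Pass 1: delete every backslash-newline pair (line splicing).
 * Assumption: trigraphs are not translated in this sketch. */
static void
unsplice(const char *in, char *out)
{
    while (*in != '\0') {
        if (in[0] == '\\' && in[1] == '\n') {
            in += 2;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}

/* Pass 2: replace each comment in already-unspliced text with one
 * space.  String and character literals are copied through as-is. */
static void
strip_comments(const char *in, char *out)
{
    while (*in != '\0') {
        if (in[0] == '/' && in[1] == '*') {
            in += 2;
            while (*in != '\0' && !(in[0] == '*' && in[1] == '/')) {
                in++;
            }
            if (*in != '\0') {
                in += 2;            /* skip the closing delimiter */
            }
            *out++ = ' ';
        } else if (in[0] == '/' && in[1] == '/') {
            while (*in != '\0' && *in != '\n') {
                in++;               /* the newline itself is kept */
            }
            *out++ = ' ';
        } else if (*in == '"' || *in == '\'') {
            char quote = *in;
            *out++ = *in++;
            while (*in != '\0' && *in != quote) {
                if (*in == '\\' && in[1] != '\0') {
                    *out++ = *in++; /* keep the escape character */
                }
                *out++ = *in++;
            }
            if (*in == quote) {
                *out++ = *in++;
            }
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```

Calling `unsplice` on the text and then `strip_comments` on its result turns both split examples into a single space. The price is that the original line structure is lost, which is one reason a single-pass scanner that buffers the pending characters can be preferable.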

Thanks, I will look into those. The #defines don't seem to be a
problem, since the second #define is within the comment and should
be ignored, i.e. you can't do that.
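
The point about the second #define can be checked with a small fragment. Comments are replaced before directives are executed, so the comment opened on the first line swallows the whole second line. This is a minimal sketch; `comment_end_is_defined` is a hypothetical helper added only to make the result observable.

```c
/* The comment opened on the first directive runs to the closing
 * delimiter on the second line, so the preprocessor never sees
 * a second directive. */
#define COMMENT_START /* blah blah blah
#define COMMENT_END blah blah blah */

/* Hypothetical helper: reports whether COMMENT_END got defined. */
static int
comment_end_is_defined(void)
{
#ifdef COMMENT_END
    return 1;
#else
    return 0;
#endif
}
```

With a conforming compiler the helper returns 0: COMMENT_START is defined with an empty replacement list and COMMENT_END is never defined at all.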
