Keith Thompson said:
I know that parsing C is difficult because of the treatment of typedef
names (are there other serious problems?), but what makes writing a
lexer so difficult? Are you referring to the preprocessing phase? I
suppose you really need two lexers, one that's part of the
preprocessor and another that works on the preprocessor's output, but
I don't see that either of them would be overly complex.
(To keep this topical, I'm looking for an answer in terms of something
about the language definition that makes it difficult.)
The preprocessor throws a monkey wrench into things. The preprocessor has a
different idea of what a token is than C does: it deals in preprocessing
tokens, which only get converted to C tokens in a later phase. There are
the phases of translation. There are things like supporting multiple source
character sets.
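For instance, here's a quick sketch of that mismatch: in translation
phase 3, maximal munch makes 0x1E+2 a single preprocessing number,
because a pp-number may absorb an 'E' followed by a sign, and that
pp-number converts to no valid C token:

    int good = 0x1E + 2;    /* three C tokens: 0x1E, '+', 2 (value 32) */
    /* int bad = 0x1E+2; */ /* would not compile: phase 3 lexes "0x1E+2"
                               as ONE pp-number, and a pp-number that is
                               not a valid constant converts to no token */

A lexer written from the C token grammar alone gets this wrong.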
Getting the preprocessor to be standards compliant is a fiendishly
difficult task; to my knowledge, the Digital Mars preprocessor is one of
the very few(*) that is (and this is what, 15 years after the standard was
standardized?). There are weirdities like trigraphs, digraphs, backslash
line splicing, \u identifier characters, token concatenation, stringizing,
#line, varying context-dependent meanings of newlines, etc.
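Several of these show up in one contrived fragment (my own sketch; note
that trigraphs need a conforming mode on modern compilers, since gcc's
default GNU dialect ignores them unless you pass -trigraphs or a strict
-std mode, and C23 drops them entirely):

    /* ??/ is the trigraph for '\', so this directive ends in a
       backslash and is spliced with the next line in phase 2: */
    #define GREET ??/
    "hello"

    #define STR(x)    #x        /* stringizing */
    #define CAT(a, b) a ## b    /* token concatenation */

    int CAT(foo, 42) = 1;        /* declares the identifier foo42 */
    const char *s = STR(1 + 2);  /* the string "1 + 2", not "3" */
    const char *g = GREET;       /* "hello" */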
And overriding all this is the need for speed, because C compilers
normally need to chew through enormous quantities of #include files (the
aggregate often running over a million lines). Any data structures used
need to be scalable over a very wide range.
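One standard trick, sketched minimally here (the names and sizes are
made up; this is not Digital Mars's actual implementation), is to intern
every identifier once so that hot paths like macro lookup compare
pointers instead of re-comparing spellings:

    #include <stdlib.h>
    #include <string.h>

    enum { NBUCKETS = 1 << 16 };  /* power of two: cheap masking */
    static struct sym { struct sym *next; size_t len; char *name; }
        *buckets[NBUCKETS];

    static unsigned hash(const char *s, size_t n) {
        unsigned h = 2166136261u;              /* FNV-1a */
        while (n--) { h ^= (unsigned char)*s++; h *= 16777619u; }
        return h;
    }

    /* Return the canonical copy of s[0..n), allocating on first
       sight.  Error handling omitted for brevity. */
    const char *intern(const char *s, size_t n) {
        struct sym **b = &buckets[hash(s, n) & (NBUCKETS - 1)];
        for (struct sym *p = *b; p; p = p->next)
            if (p->len == n && memcmp(p->name, s, n) == 0)
                return p->name;                /* seen before */
        struct sym *p = malloc(sizeof *p);
        p->name = malloc(n + 1);
        memcpy(p->name, s, n);
        p->name[n] = '\0';
        p->len = n;
        p->next = *b;
        *b = p;
        return p->name;
    }

With interned names, asking "is this identifier an active macro?"
becomes a lookup keyed on a single pointer, which matters when the
headers alone run to a million lines.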
So, writing a compliant and *useful* C lexer is a pretty challenging
task. Writing a lexer for D or JavaScript, comparatively speaking, can be
done over dessert <g>.
-Walter Bright
www.digitalmars.com C, C++, D programming language compilers
(*) Caveat: I haven't done exhaustive testing of other compiler
preprocessors, but one test that often fails on otherwise excellent
implementations is the example in the Standard itself. It's very perverse,
and nobody would expect to see such cases in real code, so the failure
isn't important in practice.