How is a compiler written

T

tuncay tekle

Well, I did not google it, but the best way to write a compiler is
using the tools, "lex" and "yacc" (or also known as - with some
differences - "flex" and "bison"), the first one is a lexer
(scanner-lexical analyzer), and the second one is a parser.

I am sure you can find most of the info in the web, but if you really
want to learn this stuff, I would suggest you to buy the "Dragon Book",
which is formally named "Compilers, Principles, Techniques, and Tools"
by Aho, Sethi and Ullman, this is the de-facto book in the field.
 
K

Keith Thompson

tuncay tekle said:
Well, I did not google it, but the best way to write a compiler is
using the tools, "lex" and "yacc" (or also known as - with some
differences - "flex" and "bison"), the first one is a lexer
(scanner-lexical analyzer), and the second one is a parser.

I am sure you can find most of the info in the web, but if you really
want to learn this stuff, I would suggest you to buy the "Dragon Book",
which is formally named "Compilers, Principles, Techniques, and Tools"
by Aho, Sethi and Ullman, this is the de-facto book in the field.

I take it you haven't been following this newsgroup closely. If you
had, you would know by now that it's unsafe to assume that everyone
can see the article to which you're replying, and that you need to
provide some context.

If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.

And please complain to Google about their broken interface.

As for the original question, parsing C turns out to be fairly
difficult because of the typedef problem. A typedef name is lexically
an identifier, but it acts like a reserved word -- but only in the
scope where it's defined. That means your parser can't analyze the
code without feedback from the symbol table.

Once you've solved this, you've still done only a fraction of the work
necessary to implement a compiler. You should ask yourself whether
it's really worth the effort, given that ther are a number of freeware
C compilers already. I'm not by any means saying you shouldn't do it,
just that it's going to be a lot of work.

There's also a "comp.compilers" newsgroup (which you should browse for
a while before posting, and read the FAQ if they have one).
 
T

tuncay tekle

you are right :) it's my first time posting anything in a group..

it doesn't make sense to me how google ignores this bug though
 
G

Greg Comeau

Well, I did not google it, but the best way to write a compiler is
using the tools, "lex" and "yacc" (or also known as - with some
differences - "flex" and "bison"), the first one is a lexer
(scanner-lexical analyzer), and the second one is a parser.

I am sure you can find most of the info in the web, but if you really
want to learn this stuff, I would suggest you to buy the "Dragon Book",
which is formally named "Compilers, Principles, Techniques, and Tools"
by Aho, Sethi and Ullman, this is the de-facto book in the field.

The Dragon Book is a good one suggestion, but suggesting the
best way to write a compiler is with lex and yacc is not.
This is completely OT and best is a compiler NG, say comp.compilers
 
J

Julian V. Noble

petantik said:
Where can I find a good resource, on the web, that will give me a good
comprehensive idea of how to write a compiler in C for C or maybe
another language.

http://petantik.blogsome.com - A Lucid look at Reality ... maybe?

If you are really interested in this I suggest you get hold of a copy of
"the dragon book" (so-called because it has a dragon on the cover):

"Compilers, principles, techniques, and tools"

by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman
(Addison-Wesley Pub. Co., Reading, Mass., 1986).

Once you have had a look at this you can decide rationally whether you
want to make a recursive-descent, tree- or table-driven compiler.

Then I suggest as an exercise writing a simple formula evaluator in your
language of choice. That is, you type in a formula of the form

b*c/cosh(d)

and have the result displayed on the output. For example, I use the Forth
language that does not have intrinsic formula translation like Fortran, C,
etc. So I wrote such a beast using recursive-descent (Forth is stack-based,
and recursion uses stacks, so that seemed most natural). The translator
will evaluate a formula, as with (this is actual screen output)

fvariable x fvariable y ok
f$" x=3" f$" y=4" ok
f$" x*(x^2+y^2)" f. 75.0000 ok

You can look at the code for this (which should be easily portable to C)
at

http://galileo.phys.virginia.edu/classes/551.jvn.fall01/programs.htm

There are 3 versions of the FORmula TRANslator, reflecting changes in
my thinking (and increased capabilities of what it will translate).

(Before anyone singes me for mentioning Forth on the C newsgroup, let me
note that the inquirer asked for a Web-based resource and this is the only
one I am truly familiar with.)

In particular look at the "docs" link because it discusses the Backus-Naur
specification of this formula parser.


If Forth doesn't please you--and there is no reason why it should, as many
don't like it at all--you can look at Kruse, "Data Structures and Program
Design". There are versions in Pascal, C and C++ . Among other things he
shows you how to write a tree-structured formula evaluator.

If Forth *does* grab you, there exists a generalized program called "gray"
for creating parsers in the Forth language. It is more-or-less the
Forth equivalent of YACC (see below), and was written by Anton Ertl.

http://www.complang.tuwien.ac.at/projects/forth.html

Ertl provides examples: an integer 4-function calculator, a tiny Pascal,
and an Oberon -> Forth translator.

Finally there is YACC --"yet another compiler-compiler"-- a tool in C
used for automating the development of C code for translating any language
you input to any other one, as long as you adhere to its rules. See, e.g.

http://epaperpress.com/lexandyacc/

I hope these resources help.

--
Julian V. Noble
Professor Emeritus of Physics

http://galileo.phys.virginia.edu/~jvn/

"As democracy is perfected, the office of president represents, more and
more closely, the inner soul of the people. On some great and glorious
day the plain folks of the land will reach their heart's desire at last
and the White House will be adorned by a downright moron."

--- H. L. Mencken (1880 - 1956)
 
K

Keith Thompson

tuncay tekle said:
you are right :) it's my first time posting anything in a group..

it doesn't make sense to me how google ignores this bug though

It doesn't make sense to the rest of us either.

One more thing: please don't top-post. Your response goes below, or
interspersed with, any quoted text, so the article can be read
naturally from top to bottom. It's not usually necessary to quote the
entire article, only the portions relevant to your reply.

And the shift key is your friend. Standard capitalization really does
make your text easier to read.

Welcome to the group!
 
J

jacob navia

petantik said:
Where can I find a good resource, on the web, that will give me a good
comprehensive idea of how to write a compiler in C for C or maybe
another language.




http://petantik.blogsome.com - A Lucid look at Reality ... maybe?

To understand how a C compiler works, one of the best books on this
subject is

A Retargetable C Compiler: Design and Implementation, Addison-Wesley
by Fraser and Hanson.

It is the description of the lcc compiler.
You can download the lcc source code from:
ftp://ftp.cs.princeton.edu/pub/packages/lcc/

The book is relatively easy to read, and the source code
is well documented.

jacob
 
T

tuncay tekle

Greg said:
The Dragon Book is a good one suggestion, but suggesting the
best way to write a compiler is with lex and yacc is not.
This is completely OT and best is a compiler NG, say comp.compilers

I know it is still OT but since you made a point I wondered what you
would suggest instead of lex & yacc. I mean, hey, these are powerful
tools. :)
 
S

Skarmander

tuncay said:
I know it is still OT but since you made a point I wondered what you
would suggest instead of lex & yacc. I mean, hey, these are powerful
tools. :)

They're also dinosaurs. Or, as the lex manual puts it, "the asteroid to
kill this dinosaur is still in orbit".

Lex and yacc came first and did a good job. As a result they're now
ubiquitous, ported to just about every platform and remade in just about
any language, but that doesn't mean they're always the best approach to
lexer/parser generation. Their portability is a big win, but they have
some warts that are as annoying as (say) having to use tabs in
makefiles, and they're very barebones for use in something as big as a
compiler.

Check out ANTLR, for example: http://www.antlr.org/. It doesn't generate
C, but it's a lot friendlier than the lex/yacc combo. There are loads
and loads of other lexer/parser generators, many better than lex/yacc
for many purposes.

Lex and yacc will continue to be around in the foreseeable future
because just about any Unix-like platform has them (and a C compiler),
but they're the first word, not the last.

S.
 
T

tuncay tekle

Skarmander said:
Check out ANTLR, for example: http://www.antlr.org/. It doesn't generate
C, but it's a lot friendlier than the lex/yacc combo. There are loads
and loads of other lexer/parser generators, many better than lex/yacc
for many purposes.

Of course, I do know that lex & yacc are the dinasours, but when it
comes to friendliness, I would not conclude lex & yacc are your best
friends (first of all, they are hard to debug) but they are as friendly
as C and UNIX goes.

The point is what do you mean by *better*. What I wonder is in terms of
power, is there any lexer/parser doing sth that lex&yacc cannot do
(easily) - ANTLR did not look promising to me in this context. By the
way, no offense in any means, I am not defending these tools, I am just
curious :)
 
P

petantik

Well its seems like quite a complex and large undertaking. I'm not
about to write a compiler - not yet anyway :) .

I just wanted to satisfy my curiosity especially since i'm doing a
CS/Engineering degree and am taking classes in Compter Architecture and
Design. Today we were discussing why CAD is important and writing
compilers was one of the issues raised by Duncan Smeed, my lecturer.








http://petantik.blogsome.com - keeping it real
 
S

Skarmander

tuncay said:
Of course, I do know that lex & yacc are the dinasours, but when it
comes to friendliness, I would not conclude lex & yacc are your best
friends (first of all, they are hard to debug) but they are as friendly
as C and UNIX goes.
A very apt judgment, I'd say. Each reader can make of that what they
want. :) I'll go on record as saying that settling for the
"friendliness" of C and Unix is something many programmers are too eager
to do.
The point is what do you mean by *better*.

Exactly. All I can say is that for *many* purposes, there are better
tools than lex/yacc. If you don't need to generate C, there's probably a
language-specific tool for you that will be more accommodating than
lex/yacc, and there's no reason not to use that. If you're working with
a language that's not conveniently described as an LALR grammar, there
are better tools than lex/yacc. If you find merging grammar rules and
code unmaintainable, there are better tools than lex/yacc.

All that doesn't mean lex/yacc have had their day and are inferior
tools. They are surprisingly effective in a great number of situations.
Shopping around a bit for potentially better solutions might pay off,
though. There's enough software to go around.
What I wonder is in terms of power, is there any lexer/parser doing
sth that lex&yacc cannot do (easily) - ANTLR did not look promising
to me in this context.

Oh, you literally mean parsing power? The answer is that it hardly matters.

Most parser generators are naturally tuned to producing efficient
parsers for programming languages. Any sane programming language can be
described with an LALR(1) grammar -- the kind yacc and many other tools
can parse. (C++ is a notable example of an "insane" language; people who
have tried to produce a pure LALR(1) grammar generally agree that it's
not worth the bother, and you're better off doing special processing on
a restricted grammar.)

In fact, many programming languages can be parsed with less than full
LALR. ANTLR is a so-called "predicated LL(k) parser", which is less
powerful than LALR(1), but you'll be hard-pressed to come up with a
language that has a practical LALR(1) grammar but no practical LL(k)
grammar. The theoretical limits of parser generators are real, but
matter surprisingly little in practice. It's mostly about how easy it is
to write things down.

And that depends. LR and LL are different flavors of ice cream; most
people find LL far easier to understand (none of that shift-reduce
hoopla) but LR allows some constructs to be written down more naturally
than LL. And vice versa, of course; it's hard to judge these things
objectively. Plus, each parser generator will have its own set of bells
and whistles that will allow you to write down certain common things
easily. It might have exactly what you need or miss the mark.

If you really need parsers with even *more* power, they exist, but even
when optimized for performance, these tend to be too slow to use for
tasks like compiling. For example, for most languages a complete
annotated syntax tree could be built with a context-sensitive parser,
instead of getting a context-free grammar with a tool like yacc and then
computing the context-sensitive bits. But context-sensitive parsing is
not easy to do efficiently (let alone linearly), and writing
context-sensitive grammars can be tricky as compared to the "compute
stuff on a context-free tree", which most programmers find quite natural
to do.

This is not the place to go into a discussion of the various tradeoffs
of parsers; if you don't know this stuff, I recommend getting a good
book on parsing techniques. One of my favorites is unimaginatively
titled "Parsing Techniques - A Practical Guide" by Grune and Jacobs, but
unfortunately it's currently out of print (I saw a second edition is in
the works, though, which is great). It also contains *way* more
information than you'll need for your average compiler, but I felt like
plugging it. :)

Try Googling and asking around in comp.compilers. They should have
archives full of discussion on these matters. (Disclaimer: I've never
been there.) There's also an overview page of parser generators on
Wikipedia: http://en.wikipedia.org/wiki/Compiler-compiler. Look around,
see what you can find.
By the way, no offense in any means, I am not defending these tools,
I am just curious :)

It's highly unlikely you could say anything that would offend me, and it
certainly wouldn't involve opinions on parsing tools. Curiosity is good.

Bottom line: look beyond lex & yacc for fun and profit sometimes. Even
if you're going to stick with them for the rest of your days, it's nice
to know what else is out there.

S.
 
J

Jordan Abel

Well, I did not google it, but the best way to write a compiler is
using the tools, "lex" and "yacc" (or also known as - with some
differences - "flex" and "bison"), the first one is a lexer
(scanner-lexical analyzer), and the second one is a parser.

Well, they're actually generators of such, but yeah. Though i have
been told that most compilers these days use higher-performance
solutions
 
T

the_cacc

hey keith, do you have any suggestions for how to be a prat? you know,
most people learn early in life to grow up. try it, you might like it :)
 
K

Keith Thompson

hey keith, do you have any suggestions for how to be a prat? you know,
most people learn early in life to grow up. try it, you might like it :)

As a matter of fact, I do some suggestions. Just wait around until
you see someone offering sound advice, then barge in and throw a few
insults around. Pointedly ignoring the offered advice at the same
time might be a nice touch.

But I'm sure you don't need any suggestions from me on the topic.
 
E

Erik de Castro Lopo

Jordan said:
Well, they're actually generators of such, but yeah. Though i have
been told that most compilers these days use higher-performance
solutions

Lexing and parsing is a very small portion of what a compiler
does these days hence there is little to be gained in speed
by using hand written code in the lexer and parser when the
lex/yacc solution is already pretty damn good.

Erik
--
+-----------------------------------------------------------+
Erik de Castro Lopo
+-----------------------------------------------------------+
"Islam isn't in America to be equal to any other faith, but to become
dominant. The Koran should be the highest authority in America, and Islam
the only accepted religion on earth." -- Omar Ahmad, board chairman of
Council of American Islamic Relations to an islamic audience in 1998.
 
R

Richard Bos

hey keith, do you have any suggestions for how to be a prat?

I don't know whether Keith does, but I do. It's an easy three-step
program:

1. Sign up for Google Groups Broken Beta
2. Switch off your brain
3. Start posting

Note that step 2. is crucial: without it, you might become only a decent
netizen struggling with a brainless interface, rather than a brainless
netizen inflicting the outfall of the interface on the rest of us, and
we can't have that, can we?

Richard
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,170
Messages
2,570,927
Members
47,469
Latest member
benny001

Latest Threads

Top