Summary: translation units, preprocessing, compiling and linking?

Steven T. Hatton · Jun 2, 2004

Is there anything that gives a good description of how source code is
converted into a translation unit, then object code, and then linked? I'm
particularly interested in understanding why putting normal functions in
header files results in multiple definition errors even when include guards
are used.

Joe Laughlin · Jun 2, 2004

Steven said:
Is there anything that gives a good description of how
source code is converted into a translation unit, then
object code, and then linked. I'm particularly
interested in understanding why putting normal functions
in header files results in multiple definition errors
even when include guards are used.

Perhaps
http://www.cs.washington.edu/orgs/acm/tutorials/dev-in-unix/compiler.html is
what you are looking for... ?

Steven T. Hatton · Jun 2, 2004

Victor said:
Steven T. Hatton wrote:

Now, when you link those two files together, how many definitions of
'foo' will there be? Which one should be used when called from 'main'?

This will explain my current understanding:
http://baldur.globalsymmetry.com/gs-home/images/random-noise/solution.html

I guess what you explained is more or less what I thought, but then I was
told the #pragma interface and #pragma implementation used by GCC prevented
the creation of an object for each translation unit. I can only guess that
what is happening is that the 'global' object is treated as if it appeared
in the object file of each translation unit. That would explain why I
still get the linker errors using the pragmas. I'll see what the GCC folks
say about this.

E. Robert Tisdale · Jun 3, 2004

Steven said:
Is there anything that gives a good description of how source code is
converted into a translation unit, then object code, and then linked.
I'm particularly interested in understanding why
putting normal functions in header files
results in multiple definition errors even when include guards are used.

When you invoke your C++ compiler,
the C preprocessor accepts a source file and emits a translation unit,
the C++ compiler proper accepts the translation unit
and emits assembler code,
the assembler accepts the assembler code and emits machine code and
the link editor accepts the machine code,
loads it into an executable program file
along with the required objects from library archives
and resolves all of the links.
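As a rough illustration of those four stages, here is a tiny translation unit together with commands that would drive each stage separately. The commands assume a GCC-style driver (g++); the file name stages.cpp, the macro SCALE and the function scaled are made up for this sketch, and other toolchains spell the options differently.

```cpp
// stages.cpp: hypothetical file used to watch each stage in isolation.
//
// Assuming a GCC-style driver, each stage can be stopped early:
//   g++ -E stages.cpp -o stages.ii    preprocess only (macros expanded)
//   g++ -S stages.ii  -o stages.s     compile proper: emits assembler code
//   g++ -c stages.s   -o stages.o     assemble: emits an object file
//   g++ stages.o main.o -o program    link (main.o would supply main)

#define SCALE 3   // handled by the preprocessor; no trace of it in stages.ii

// After preprocessing, the body reads "return x * 3;".
int scaled(int x)
{
    return x * SCALE;
}
```

Looking at stages.ii after the first command is an easy way to see exactly what the compiler proper receives.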

When the C preprocessor reads your source file,
it includes the header files in the translation unit,
reads and processes all of the macros
then discards all of the macros when it has finished.
It does *not* remember macro definitions
when it processes the next source files
so, if the next source file includes the same header file,
the header file will be read again
and any external function definition in that header
will be included in the next translation unit as well.
The link editor will discover multiple function definitions
if it tries to link the resulting machine code files together.
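A small sketch of why the include guard does not help here. The guard macro only lives for the duration of one translation unit, so it stops a second copy within that unit but not a copy in the next one. The header contents are pasted inline twice to stand in for two #include lines; ANSWER_HPP and answer are names invented for the example.

```cpp
// ---- first "#include" of the hypothetical header ----
#ifndef ANSWER_HPP
#define ANSWER_HPP
int answer() { return 42; }   // external function definition
#endif

// ---- second "#include" within the SAME translation unit ----
#ifndef ANSWER_HPP            // already defined above, so this copy is skipped
#define ANSWER_HPP
int answer() { return 42; }
#endif

// A different source file starts preprocessing with ANSWER_HPP undefined,
// so it receives its own definition of answer(); the linker then sees two.
```

Without the guard, the second copy would already be a redefinition error at compile time. With it, this one unit compiles, and the trouble is only postponed until link time.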

If, instead, you qualify the function definition as inline or static,
the compiler will label them as "local" links
so the link editor will not complain.
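For illustration, here is what such a header might look like. The names UTIL_HPP, square and cube are invented. One caveat, offered with some caution: on typical implementations only the static version is emitted as a truly local symbol, while an inline function keeps external linkage and the toolchain merges the identical copies. Either way the link succeeds.

```cpp
// util.hpp: hypothetical header that is safe to include from many sources
#ifndef UTIL_HPP
#define UTIL_HPP

// inline: every translation unit may carry an identical definition,
// and the linker keeps a single copy.
inline int square(int x)
{
    return x * x;
}

// static: internal linkage, so each translation unit gets a private copy
// that is never visible to the linker under a shared name.
static int cube(int x)
{
    return x * x * x;
}

#endif // UTIL_HPP
```

The static variant can cost some code size, since every including source file carries its own copy of cube.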

JKop · Jun 3, 2004

Steven T. Hatton posted:

Is there anything that gives a good description of how source code is
converted into a translation unit, then object code, and then linked.
I'm particularly interested in understanding why putting normal
functions in header files results in multiple definition errors even
when include guards are used.

I highly suggest that this be added to the FAQ.

You have Source Code files:

a.cpp
b.cpp
s.cpp

You have 1 Header file:

b.hpp

Both a.cpp and s.cpp include b.hpp.

The three source code files get compiled into object files:

a.obj b.obj s.obj

And they're passed on to the linker.

Suppose b.hpp contains the definition of a function, Monkey. The linker
then sees Monkey in a.obj AND in s.obj, hence a multiple definition.

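So how do you get away with putting inline functions into a header file?
The language makes an exception for them: an inline function may be defined
in every translation unit that uses it, provided the definitions are
identical, and the linker keeps just one copy. Marking a function "static"
also works, but for a different reason: it gives the function internal
linkage, so it isn't presented to the linker at all.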
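The usual way out, sketched with the file names from this post: keep only a declaration of Monkey in b.hpp and put the one definition in b.cpp. The three files are shown back to back in a single listing so that it compiles as one unit. The signature and return value of Monkey are assumptions, since the post does not give them, and UseFromA and UseFromS are likewise invented.

```cpp
// ---- b.hpp: declaration only, so including it defines nothing ----
#ifndef B_HPP
#define B_HPP
int Monkey();
#endif

// ---- b.cpp: the single definition the linker will find ----
int Monkey()
{
    return 7;   // placeholder body
}

// ---- a.cpp and s.cpp: both include b.hpp and merely call Monkey ----
int UseFromA() { return Monkey() + 1; }
int UseFromS() { return Monkey() + 2; }
```

With this layout a.obj and s.obj each contain only an unresolved reference to Monkey, and b.obj supplies the one definition.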

-JKop

Karl Heinz Buchegger · Jun 3, 2004

Steven T. Hatton said:
Is there anything that gives a good description of how source code is
converted into a translation unit, then object code, and then linked. I'm
particularly interested in understanding why putting normal functions in
header files results in multiple definition errors even when include guards
are used.

This is what I posted some time ago. It may be of some help
to you.

--------

First of all let me introduce a few terms and clarify
their meaning:

source code file   A file which contains C or C++ code in the form of
                   functions and/or class definitions.

header file        Another form of source file. Header files are usually
                   used to separate the 'interface' description from the
                   actual implementation, which resides in the source
                   code files.

object code file   The result of feeding a source code file through the
                   compiler. Object code files already contain machine
                   code, the one and only language your computer
                   understands. Nevertheless object code at this stage is
                   not executable. One object code file is the direct
                   translation of one source code file and thus usually
                   lacks the things it refers to externally, eg. the
                   actual implementation of functions which are defined
                   in other source code files.

library file       A collection of object code files. It happens
                   frequently that a set of object code files is always
                   used together. Instead of always listing all those
                   object code files during the link process it is often
                   possible to build a library from them and use the
                   library instead. But there is no magic with a library.
                   A library can be seen as a repository where one can
                   deposit object code files such that the library forms
                   a collection of them.

compiling          The process of transforming the source code files into
                   object code files. C and C++ define the concept of a
                   'translation unit'. Each translation unit (normally:
                   one single source code file) is translated
                   independently of all other translation units.

linking            The process of combining multiple object code files
                   and libraries into an executable. During the linking
                   process all external references of one object code
                   file are examined and the linker tries to find modules
                   which satisfy those external references.

In practice the whole process works as follows:
Say you have 2 source files (with errors, we will return to them later)

main.c
******

int main()
{
  foo();
}

test.c
******

void foo()
{
  printf( "test\n" );
}

and you want to create an executable. The steps are
as shown in the diagram:

       main.c                       test.c
 +----------------+        +-----------------------+
 |                |        |                       |
 | int main()     |        | void foo()            |
 | {              |        | {                     |
 |   foo();       |        |   printf( "test\n" ); |
 | }              |        | }                     |
 +----------------+        +-----------------------+
         |                             |
         |                             |
         v                             v
   ************                  ************
   * Compiler *                  * Compiler *
   ************                  ************
         |                             |
         |                             |
         |                             |
main.obj v                    test.obj v
 +--------------+              +--------------+
 | machine code |              | machine code |
 +--------------+              +--------------+
         |                             |
         |                             |
         +-------------+   +-----------+
                       |   |
                       v   v
                   *************             Standard Library
                   *  Linker   *<----------+--------------------+
                   *************           | eg. implementation |
                         |                 | of printf or the   |
                         |                 | math functions     |
                         |                 |                    |
                         |                 +--------------------+
                main.exe v
            +-------------------------+
            | Executable which can    |
            | be run on a particular  |
            | operating system        |
            +-------------------------+

So the steps are: compile each translation unit (each source file) independently
and then link the resulting object code files to form the executable. To do that,
missing functions (like printf or sqrt) are added by linking in a prebuilt library
which contains the object modules for them.

The important part is:
Each translation unit is compiled independently! So when the compiler compiles
test.c it has no knowledge about what happened in main.c and vice versa. When the
compiler tries to compile main.c it eventually reaches the line
foo();
where main.c tries to call function foo(). But the compiler has never heard about
a function foo! Even if you have compiled test.c prior to it, when main.c is
compiled this knowledge is already lost. Thus you have to inform the compiler
that foo() is not a typing error and that there is indeed a function called foo
somewhere. You do this with a function prototype:

      main.c
 +----------------+
 | void foo();    |
 |                |
 | int main()     |
 | {              |
 |   foo();       |
 | }              |
 +----------------+
         |
         |
         v
   ************
   * Compiler *
   ************
         |

Now the compiler knows about this function and can do its job. In very much the same way
the compiler has never heard about a function called printf(). printf is not part of
the 'core' language. In a conforming C implementation it has to exist somewhere, but
printf() is not on the same level as 'int' is. The compiler knows about 'int' and
what it means, but printf is just a function call and the compiler has to know its
parameters and return type in order to compile a call to it. Thus you have to inform
the compiler of its existence. You could do this in very much the same way as you
did it in main.c, by writing a prototype. But since this is needed so often and
there are so many other functions available, this quickly gets boring and error prone.
Thus somebody else has already provided all those prototypes in a separate file, called
a header file, and instead of writing the prototypes yourself, you simply 'pull in'
this header file and have them all available:

          test.c
 +-----------------------+
 | #include <stdio.h>    |<-+
 |                       |  |
 | void foo()            |  |
 | {                     |  |
 |   printf( "test\n" ); |  |
 | }                     |  |
 +-----------------------+  |
             |              |
             |              |
             v              |
       ************ stdio.h v
       * Compiler *   +-------------------------------------+
       ************   | ...                                 |
             |        | int printf( const char* fmt, ... ); |
                      | ...                                 |
                      +-------------------------------------+

And now the compiler has everything it needs to know to compile test.c.
Since main.c and test.c now compile successfully, they can be linked
into the final executable which can be run. During the process of linking the linker
figures out that there is a call to foo() in main.obj. Thus the linker tries to find
a function called foo. It finds this function by searching through the object
module test.obj. The linker thus inserts the correct memory address for foo
into main.obj and also includes foo from test.obj into the final executable. But
in doing so, the linker also figures out that in function foo() there is a call
to printf. The linker thus searches for a function printf. It finds it in the
standard library, which is always searched when linking a C program. There the
linker finds a function printf and this function thus is included into the
final executable too. printf() by itself may use other functions to do its
work but the linker will find all of them in the standard library and include
them into the final executable.
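Putting the pieces together, the two corrected files might look like this. They are shown in one listing and written as C++ rather than C, and main is replaced by an ordinary function named run_main (an invented name) so that the fragment can be dropped into a larger program. The build commands in the comments assume a GCC-style driver.

```cpp
// ---- test.cpp ----
#include <cstdio>      // brings in the declaration of printf

void foo()
{
    std::printf("test\n");
}

// ---- main.cpp ----
void foo();            // prototype: promises that foo exists somewhere

int run_main()         // stands in for main() in this sketch
{
    foo();             // the linker resolves this call to foo in test.o
    return 0;
}

// Separate compilation, then linking (GCC-style, as an assumption):
//   g++ -c main.cpp -o main.o
//   g++ -c test.cpp -o test.o
//   g++ main.o test.o -o main
```

Dropping the prototype from main.cpp makes the compile step for main.cpp fail. Dropping test.o from the last command makes the link step fail instead, with an undefined reference to foo.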

There is one thing left to talk about. While main.c is correct from a technical
point of view it is still unsatisfying. Imagine that our function foo() has
a much more complicated argument list. Also imagine that your program does not
consist of just those 2 translation units but instead has hundreds of them and
that foo() needs to be called in 87 of them. Thus you would have to write a prototype
in every single one of them. I think I don't have to tell you what that means: All those
prototypes need to be correct and just in case function foo() changes (things like
that happen), all those 87 prototypes need to be updated. So how can you do that?
You already know the solution, you have used it already. You do pretty much
the same as you did in the case of stdio.h. You write a header file and
include this instead of the prototype:

        main.c
 +-------------------+              test.h
 | #include "test.h" |<---------+-------------+
 |                   |          | void foo(); |
 | int main()        |          |             |
 | {                 |          +-------------+
 |   foo();          |
 | }                 |
 +-------------------+
           |
           |
           v
     ************
     * Compiler *
     ************
           |

Now you can include that header file in all the 87 translation units which
need to know about foo(). And if the prototype for foo() needs some update
you do it in one central place: by editing file test.h. All 87 translation
units will pull in this updated prototype when they are recompiled.
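As a sketch of that arrangement with a slightly richer signature: the argument list given to foo below and the helper call_site are assumptions made for the example and are not part of the original program. The three pieces are shown in one listing; in real files the two source files would each start with #include "test.h".

```cpp
// ---- test.h ----
#ifndef TEST_H
#define TEST_H

#include <string>

// The one place where the signature of foo is written down.
std::string foo(const std::string& text, int repeat);

#endif // TEST_H

// ---- test.cpp (would start with #include "test.h") ----
std::string foo(const std::string& text, int repeat)
{
    std::string result;
    for (int i = 0; i < repeat; ++i) {
        result += text;   // append one more copy of text
    }
    return result;
}

// ---- any of the 87 callers (would also start with #include "test.h") ----
std::string call_site()
{
    return foo("ab", 2);
}
```

Note that the header still contains only a declaration, so including it from 87 places creates no definitions for the linker to trip over.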

HTH

Steven T. Hatton · Jun 3, 2004

E. Robert Tisdale said:
If, instead, you qualify the function definition as inline or static,
the compiler will label them as "local" links
so the link editor will not complain.

I think anonymous namespaces will act in a similar way. It seems there are
many ways to bang your thumb when working with C++ #includes. Oh, and then
there's the question of how the compiler knows something is a C++ source
file. If there is no distinction between source and header, then what
tells it that a particular file is a header? As it turns out:

"Compilation can involve up to four stages: preprocessing, compilation
proper, assembly and linking, always in that order. The first three stages
apply to an individual source file, and end by producing an object file;
linking combines all the object files (those newly compiled, and those
specified as input) into an executable file.

"For any given input file, the file name suffix determines what kind of
compilation is done:

file.c
C source code which must be preprocessed.

file.i
C source code which should not be preprocessed.

file.ii
C++ source code which should not be preprocessed.

file.m
Objective-C source code. Note that you must link with the
library libobjc.a to make an Objective-C program work.

file.mi
Objective-C source code which should not be preprocessed.

file.h
C header file (not to be compiled or linked).

file.cc
file.cp
file.cxx
file.cpp
file.c++
file.C
C++ source code which must be preprocessed. Note that in .cxx, the last two
letters must both be literally x. Likewise, .C refers to a literal capital
C."

So I type in 'gcc -ofoo main.cc' and get a bunch of errors, then type
'g++ -ofoo main.cc' and the same code compiles. Go figure!
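A likely explanation, offered as an assumption since the error messages are not shown: gcc does recognize the .cc suffix and compiles the file as C++, but unlike g++ it does not link the C++ standard library by default, so the errors would come from the linker. A minimal file that would show the difference (greeting is an invented name, and a main() that calls it is assumed to be present in the same file):

```cpp
// main.cc: hypothetical reproduction
//
//   gcc -o foo main.cc            compiles as C++, but typically fails at
//                                 link time with undefined references into
//                                 the C++ library
//   g++ -o foo main.cc            links the C++ standard library automatically
//   gcc -o foo main.cc -lstdc++   often works as well with GCC's libstdc++

#include <string>

// Using std::string pulls in symbols that live in the C++ standard library.
std::string greeting()
{
    return std::string("hello, ") + "linker";
}
```

If the errors mention undefined references rather than syntax problems, that points at the link step and not at the file suffix.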

Howard · Jun 3, 2004

"Compilation can involve up to four stages: preprocessing, compilation
proper, assembly and linking, always in that order. The first three stages
apply to an individual source file, and end by producing an object file;
linking combines all the object files (those newly compiled, and those
specified as input) into an executable file.

"For any given input file, the file name suffix determines what kind of
compilation is done:

file.c
C source code which must be preprocessed.

file.i
C source code which should not be preprocessed.

file.ii
C++ source code which should not be preprocessed.

file.m
Objective-C source code. Note that you must link with the
library libobjc.a to make an Objective-C program work.

file.mi
Objective-C source code which should not be preprocessed.

file.h
C header file (not to be compiled or linked).

file.cc
file.cp
file.cxx
file.cpp
file.c++
file.C
C++ source code which must be preprocessed. Note that in .cxx, the last two
letters must both be literally x. Likewise, .C refers to a literal capital
C."

So I type in 'gcc -ofoo main.cc' and get a bunch of errors, then type
'g++ -ofoo main.cc' and the same code compiles. Go figure!

This looks like you've quoted a particular implementation's documentation,
not the standard. If I recall correctly, there is nothing in the standard
that specifies that a file have a particular extension, or any extension at
all, for that matter. As far as I know, you can call your main source file
"Bob", and have it include a "header" file called "Carol". What makes
something a source file is if you instruct your compiler to compile it.
What makes it a header file is if you include it from one or more source
files, but don't directly compile it. (Actually, there is no such thing in
the standard as a "header file", if I recall. The term is used to specify
an include file that contains the declarations for the classes and/or
functions implemented in the source file.) Exactly how you instruct your
compiler to compile a file is up to the compiler vendor. Obviously, some
vendors chose to recognize specific file extensions as compilable (source)
files. (Possibly to make it easier to compile all source files in a given
directory?) Likewise, extensions like .a and .o are totally arbitrary,
although they tend to follow common practice.

-Howard

"All programmers write perfect code.
....All other programmers write crap."

"I'm never wrong.
I thought I was wrong once,
but I was mistaken."
