Removing dead code and unused functions

  • Thread starter Geronimo W. Christ Esq

Geronimo W. Christ Esq

Yes, automated tools that scan code and interpret it will help. But I don't
see the relation between "Where the bugs are" and "Where control flow is
not". The principle "Ain't broke don't fix it" applies here. Dead code ain't
broke. Bugs will lead to investigation of the live code causing them.

It is a perfectly sound analysis, but if you have taken over maintenance
of a large code base it is harder to see the wood for the trees.
 
jacob navia

Martijn said:
And if you do so, most decent compilers (at least GCC does) with the
appropriate warnings enabled will find unreferenced static functions for
you.

Well of course, lcc-win32 will warn you about unreferenced statics.
The problem is when you have a non-static function or variable
in some file that is not referenced in any other file.

The algorithm I used in the IDE is to look at the object files
then look at the public symbols, then make a cross reference of them.

This has the advantage of getting beyond superficial similarities
like:

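file 1:
int foo(void)   /* external linkage, never called from another file */
{
    return 0;
}

file 2:
static int foo(void)   /* unrelated file-local function with the same name */
{
    //...
    return 0;
}

int main(void)
{
    foo();   /* calls the static foo in file 2, not file 1's foo */
    return 0;
}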

I take advantage of the work done by the compiler and the method is
(more or less) compiler independent since the object files are
completely standardized.
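The cross-referencing step described above might be sketched roughly as follows. This is only an illustration, not the code used in the lcc-win32 IDE: it assumes the symbol tables have already been dumped into memory (for example from the output of a tool such as nm), and the struct sym layout, the 'T'/'U' kind letters and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of a symbol table dump. The layout is an assumption made for
   this sketch, not any real object file format. */
struct sym {
    const char *file;   /* object file the entry came from */
    const char *name;   /* symbol name */
    char kind;          /* 'T' = public definition, 'U' = undefined reference */
};

/* Return nonzero if an object file other than def->file refers to def->name. */
static int referenced_elsewhere(const struct sym *syms, size_t n,
                                const struct sym *def)
{
    for (size_t i = 0; i < n; i++) {
        if (syms[i].kind == 'U' &&
            strcmp(syms[i].name, def->name) == 0 &&
            strcmp(syms[i].file, def->file) != 0)
            return 1;
    }
    return 0;
}

/* Store pointers to public definitions that no other file references in
   out[] (at most max_out of them) and return how many were stored. */
size_t find_unreferenced(const struct sym *syms, size_t n,
                         const struct sym **out, size_t max_out)
{
    size_t found = 0;

    assert(syms != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (syms[i].kind != 'T')
            continue;               /* only public definitions are candidates */
        if (strcmp(syms[i].name, "main") == 0)
            continue;               /* entry point is called by startup code */
        if (!referenced_elsewhere(syms, n, &syms[i]) && found < max_out)
            out[found++] = &syms[i];
    }
    return found;
}
```

Because a static function never shows up as a public definition, the file 2 foo in the example above is ignored and only the file 1 foo would be reported. A symbol used solely through a function pointer table or from a shared library would still need a manual check.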
 
Phlip

Geronimo said:
It is a perfectly sound analysis, but if you have taken over maintenance
of a large code base it is harder to see the wood for the trees.

Each time I did, it came with a list of bugs, and a boss to prioritize them.

Geronimo said:
I've got just under a million lines of code here that have just come
into my possession. I'd love to believe that a few minutes would allow
me to create a suite of tests proving that the program generated from
that codebase worked the same before and after any changes, but I remain
somewhat cynical.

Writing a test for each bug taught me a lot about the structure. I'm aware
there are other ways to learn, but if there's a lot of code (there was) and
the bugs cluster in a very small area (they did), then spotting the dead
code rapidly becomes trivial.

Next, you can write a characterization test for a large codebase very
easily - but it's not really a test, just a thing to help refactoring.
Increase the logging until the program is repulsively verbose. Write 3 or 4
test cases that feed high-level data in, collect the results, collect the
logs, and compare them to golden copies. Trivial byte comparisons are fine.
Now run these test-like things after every 1 to 10 edits. If they fail, don't
inspect why. (The byte comparison will be _very_ hard to track back to a
failing unit.) If they fail, use Undo to return to the passing state.
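The "trivial byte comparison" against a golden copy could look something like the sketch below. The function name and the return convention are made up for illustration; how the program's output and logs get captured into files is left out.

```c
#include <assert.h>
#include <stdio.h>

/* Compare a freshly captured output or log file against its golden copy.
   Returns 1 if the files are byte-for-byte identical, 0 if they differ,
   and -1 if either file cannot be opened. A read error is treated like
   end of file, which is good enough for a sketch. */
int files_identical(const char *actual_path, const char *golden_path)
{
    assert(actual_path != NULL && golden_path != NULL);

    FILE *actual = fopen(actual_path, "rb");
    FILE *golden = fopen(golden_path, "rb");
    int result = -1;

    if (actual != NULL && golden != NULL) {
        int ca, cg;
        do {
            ca = fgetc(actual);
            cg = fgetc(golden);
        } while (ca == cg && ca != EOF);
        /* The loop stops at the first mismatch or when both files end. */
        result = (ca == cg);
    }
    if (actual != NULL)
        fclose(actual);
    if (golden != NULL)
        fclose(golden);
    return result;
}
```

A result of 0 says only that something changed, not where, which is why the advice above is to undo the edit rather than investigate.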

If you have a big legacy codebase, and you seek to remove some code to see
what's left, you won't get far, and every mistake will add a bug. Yes, the
code looks crappy, but most of it is used, and some of it "might" be used.
Suppose your automated tool helped you remove 10% (and suppose the tool was
somehow magically better at catching mistakes than a test). You still have
90% left to deal with. I don't see the gain. You still have to climb the
mountain.

Geronimo said:
Timescales and budgets do not presently permit me to sit down and write
tests for a huge body of code which I am not completely familiar with. I
have no doubts about the wisdom or long term benefits of doing it, but I
don't possess the resources at the moment.

Don't automate writing tests for everything, or manually write tests for
everything.

Write tests for each bug. They will not prevent bugs, and they won't spot
dead code. They will, however, force test attention on the areas of the code
that have the bugs. They will document and service that area. Long term,
development will get faster and faster. Without tests, no matter how clean
the code, over time development will get slower. The benefits will appear
very soon.
 
Kevin Bagust

Geronimo said:
Are there any scripts or tools out there that could look recursively
through a group of C/C++ source files, and allow unreferenced functions
or values to be easily identified?

LXR is handy for indexing source code, and for a given function or
global variable it can show you all the places where it is referenced.
It would be really nice to have a tool that would simply list all of the
unreferenced functions, so that you could go through and remove them.

PC-Lint will list unused functions, variables and headers. A free Lint
may do the same but I do not know if that is the case.

Kevin.
 
Alan Balmer

Do you have time and resources to debug?

You can leverage tests like that to replace many long hours of debugging
with a few short minutes of writing tests.

The idea that automated testing requires an "infinite budget" is a myth.

(And if you indeed have a short deadline, why bother removing harmless but
unused code?)

For one thing, so you don't have to write unit tests for it ;-)
 
Geronimo W. Christ Esq

Kevin said:
PC-Lint will list unused functions, variables and headers. A free Lint
may do the same but I do not know if that is the case.

Finally, an answer that I can use :) I'm very appreciative Kevin. An
initial check confirms that PC-Lint does indeed appear to do exactly
what I'm looking for. I will make some enquiries.
 
Dan Henry

Kevin Bagust said:
PC-Lint will list unused functions, variables and headers. A free Lint
may do the same but I do not know if that is the case.

My input regarding "A free Lint"...

I have grown accustomed to my PC-Lint doing this and when a client
hesitated to purchase PC-Lint at my recommendation, I tried Splint --
a freebie. FWIW, as of my attempt ~1 year ago, it would not announce
unreferenced functions. My client purchased a LAN license for PC-Lint
and everyone is now happy.

Gimpel's FlexeLint presumably has the same features.
 
Greg

I'm not sure but "nm" could be useful here.

Linkers typically do not exclude functions in the user program that are
unused. They only do that with libraries.

More useful would be one of the many tools that generate call graphs.

-- Richard

The Metrowerks linker, as an example, strips all unreferenced,
unexported functions from a build by default, and does so no matter
where such functions are found. What would be the point of a linker
leaving unreachable code and inaccessible data in a binary? And why
would programmers want to perform this tedious chore by hand themselves
rather than let the linker do it in a few seconds?

The algorithm to strip unused code is well-understood. All the linker
has to do is calculate the "transitive closure" for the set of
functions in the object code to be linked, that includes main(). In
fact calculating the transitive closure is no doubt how Apple was able
to add the "-dead_strip" switch to GNU's ld linker on OS X; and the
reason they did so is clear: many developers are understandably
reluctant to use a linker that bloats their final builds.
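As a rough illustration of the reachability calculation described above, the sketch below marks every function that can be reached from a root such as main() in a small call graph. It is not the Metrowerks or Apple implementation. The adjacency-matrix representation, the MAX_FUNCS limit and the function name are assumptions, and strictly speaking it computes only the part of the transitive closure that starts at the root.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FUNCS 16   /* arbitrary limit for this sketch */

/* calls[i][j] != 0 means function i references function j.
   On return, reachable[i] is 1 for every function that can be reached
   from root and 0 otherwise. The return value is the number of reachable
   functions; everything else is a candidate for stripping. */
size_t mark_reachable(unsigned char calls[MAX_FUNCS][MAX_FUNCS],
                      size_t nfuncs, size_t root,
                      unsigned char reachable[MAX_FUNCS])
{
    size_t worklist[MAX_FUNCS];
    size_t top = 0;
    size_t count = 0;

    assert(nfuncs <= MAX_FUNCS && root < nfuncs);

    for (size_t i = 0; i < nfuncs; i++)
        reachable[i] = 0;

    reachable[root] = 1;
    worklist[top++] = root;

    /* Each function is pushed at most once, so the worklist cannot overflow. */
    while (top > 0) {
        size_t f = worklist[--top];
        count++;
        for (size_t j = 0; j < nfuncs; j++) {
            if (calls[f][j] && !reachable[j]) {
                reachable[j] = 1;
                worklist[top++] = j;
            }
        }
    }
    return count;
}
```

Note that a dead function calling a live one does not make the dead one reachable. A real linker would also have to treat exported symbols and functions whose address is taken as additional roots.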

Greg
 
Geronimo W. Christ Esq

Greg said:
The Metrowerks linker, as an example, strips all unreferenced,
unexported functions from a build by default, and does so no matter
where such functions are found. What would be the point of a linker
leaving unreachable code and inaccessible data in a binary?

They just do it because the linker authors don't put sufficient priority
on dealing with the matter properly. The GNU linker for example only
garbage collects sections rather than individual functions, so if you
use one function in a large object file, the whole object will get linked.

Greg said:
And why would programmers want to perform this tedious chore by hand
themselves rather than let the linker do it in a few seconds?

If the codebase is full of cruft it is harder to maintain.
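On the section-level garbage collection mentioned earlier in this post: GCC documents the options -ffunction-sections and -fdata-sections, and GNU ld documents --gc-sections, which together let the linker discard individual unreferenced functions by giving each one its own section. The sketch below is only an assumption-laden illustration. The file names and helper functions are hypothetical, and whether the unused function really disappears depends on the target and toolchain version.

```c
/*
 * Build sketch (file names are assumptions):
 *   gcc -ffunction-sections -fdata-sections -c helpers.c
 *   gcc -Wl,--gc-sections main.o helpers.o -o prog
 *
 * With each function in its own section, the linker can drop
 * triple_unused() if nothing in the program refers to it.
 */
#include <assert.h>

/* Called from main.o in this hypothetical program, so it is kept. */
int scale_percent(int value, int percent)
{
    assert(percent >= 0 && percent <= 100);
    return value * percent / 100;
}

/* Never referenced anywhere: a candidate for removal at link time. */
int triple_unused(int value)
{
    return value * 3;
}
```

This reduces the size of the binary but leaves the cruft in the source, so it does not help with the maintenance problem.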
 
Gordon Burditt

Geronimo said:
They just do it because the linker authors don't put sufficient priority
on dealing with the matter properly. The GNU linker for example only
garbage collects sections rather than individual functions, so if you
use one function in a large object file, the whole object will get linked.

There may be insufficient information to TELL whether a particular
piece of a compilation is used or not. For example, no law says that
a particular machine instruction generated by the compiler can be
identified as being part of exactly one function. Functions might
share code. And the linker might not be able to TELL that functions
are sharing code.

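enum { OK, MAYBE };              /* assumed result codes; the post does not define them */
int validate(const char *arg);   /* assumed prototype; defined elsewhere */

int check1arg(char **argv)
{
    int i;

    i = validate(argv[1]);
    /* common */
    if (i == OK)
        return 1;
    else if (i == MAYBE)
        return 0;
    else
        return -1;
}

int check2arg(char **argv)
{
    int i;

    i = validate(argv[2]);
    /* common */
    if (i == OK)
        return 1;
    else if (i == MAYBE)
        return 0;
    else
        return -1;
}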

For example, the same copy of the code below /* common */ may be
shared between check1arg() and check2arg(). And possibly, check2arg()
is unused. Can the code below /* common */ be omitted? No.
But how does the linker know this? Decompiling compiler output?
Possibly, but that seems to be a lot of extra effort.

Gordon L. Burditt
 
Walter Roberson

Then it is clear that you do not understand unit testing.

Perhaps you could explain how you would "unit test" a function
that applies a chaotic formula to a pair of doubles?
How do you know if it is the -right- formula? If it is
used as part of an authentication process, how do you know
that there aren't any "back doors" in it that would allow
greatly reduced cost to break in?


There's a simple transform function that has been studied a
fair bit; I don't recall its proper name. It goes like this:
If the input is even, divide it by 2.
If the input is odd, multiply it by 3 and add 1.
Loop back and repeat using the previous output as input.

As far as I know, it is an open question as to whether every
starting integer will eventually get drawn to the loop
1 -> 4 -> 2 -> 1 . Suppose, though, you had a hypothesis
that there were some values that did not get caught in that
loop, and suppose you further hypothesized that such numbers
would have some particular set of properties. In order to
"unit test" the section that tests the properties, you need
a valid input to feed the section -- but you don't know
yet what the valid inputs *are* because you haven't found
a non-looping number yet. How do you proceed?
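The transform described above (commonly known as the Collatz or 3n+1 iteration) is easy to write down, which makes the testing problem concrete. The sketch below uses a hypothetical function name and an arbitrary step budget. A unit test can confirm its behaviour for inputs whose answer is already known, but a return value of -1 means only "not seen to reach 1 within the budget" and is not a counterexample.

```c
#include <assert.h>
#include <limits.h>

/* Count how many applications of the transform it takes for n to reach 1.
   Returns -1 if 1 is not reached within max_steps, or if the next value
   would overflow unsigned long long. */
long collatz_steps(unsigned long long n, long max_steps)
{
    long steps = 0;

    assert(n >= 1 && max_steps >= 0);

    while (n != 1) {
        if (steps == max_steps)
            return -1;                      /* budget exhausted */
        if (n % 2 == 0) {
            n /= 2;                         /* even: divide by 2 */
        } else {
            if (n > (ULLONG_MAX - 1) / 3)
                return -1;                  /* 3n + 1 would overflow */
            n = 3 * n + 1;                  /* odd: multiply by 3, add 1 */
        }
        steps++;
    }
    return steps;
}
```

Testing code that examines a hypothesised non-looping number is a different matter: no known input exercises that path, which is exactly the difficulty raised above.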


In order to unit test without exhaustive search, you have to know some
valid inputs. Not just one either -- you need different inputs
that together cover all branch conditions. If you test small sections
in isolation without sufficient context, you might miss a
combination of conditions that is important, such as "sleeper" code
that only activates under particular combinations of circumstances.
You might have to back-solve a cryptographic puzzle in order to
determine whether or not a particular combination of circumstances
*can* occur.


You are, I would suggest, too closely focused on situations in
which the "right answer" is known and testable. If you are
working on scientific or mathematical problems, then you
don't always know, and the only way to test might be to
execute the code.
 
Geronimo W. Christ Esq

Gordon said:
There may be insufficient information to TELL whether a particular
piece of a compilation is used or not. For example, no law says that
a particular machine instruction generated by the compiler can be
identified as being part of exactly one function. Functions might
share code. And the linker might not be able to TELL that functions
are sharing code.

<snip>

The example you gave is nothing to do with linking, because a linker
never examines *within* functions to determine whether they are
redundant or not. On the other hand, the GCC compiler does (when the
optimizer is turned on) look for similar pieces of generated machine
code and "compress" them by replacing them with one copy and some pointers.

It would be very useful if the GNU linker would remove unused functions,
but at the moment it doesn't.
 
Gordon Burditt

Geronimo said:
<snip>

The example you gave is nothing to do with linking, because a linker
never examines *within* functions to determine whether they are
redundant or not.

I didn't say it did. I said that if you have two functions compiled
in a (object) file, and one of them isn't needed, there's no guarantee that
the linker can determine what is part of the needed function (and possibly
the other one also) to keep, and what is NOT part of the needed function
(to delete).

You don't get to conclude that function A starts here, and function
B starts here, so everything between those two addresses is function
A, and none of what's between those two is also part of function
B or C, even if I'm only talking about the so-called code segment of
both functions.

Geronimo said:
On the other hand, the GCC compiler does (when the
optimizer is turned on) look for similar pieces of generated machine
code and "compress" them by replacing them with one copy and some pointers.

So in that situation, you can have functions that share code, and
object code where there is no contiguous block of code where the
linker can determine "this is function A, and all of function A, and
none of any other function".

Geronimo said:
It would be very useful if the GNU linker would remove unused functions,
but at the moment it doesn't.

The point here is that it may not have the information required to
remove unused functions even if they can be determined to be unused.
The object format may not even PERMIT passing the information required
to determine what code is part of what function(s).

Gordon L. Burditt
 
Geronimo W. Christ Esq

Gordon said:
You don't get to conclude that function A starts here, and function
B starts here, so everything between those two addresses is function
A,

I've difficulty picturing how any of the code inside a function can ever
be used in any way if the function is never invoked. I don't see how a
linker would be making an unsafe decision by removing a function that is
never invoked.

I imagine that when a compiler spots repetitive sections of code it
takes them out of the function's object code into a common area of the
object, and has the function point to them. That way redundant functions
could be safely removed.
 
Paul Groke

Gordon Burditt said:
So in that situation, you can have functions that share code, and
object code where there is no contiguous block of code where the
linker can determine "this is function A, and all of function A, and
none of any other function".

The point here is that it may not have the information required to
remove unused functions even if they can be determined to be unused.
The object format may not even PERMIT passing the information required
to determine what code is part of what function(s).

Gordon L. Burditt

In that case the object format should be changed :)
 
Gordon Burditt

Geronimo said:
I've difficulty picturing how any of the code inside a function can ever
be used in any way if the function is never invoked. I don't see how a
linker would be making an unsafe decision by removing a function that is
never invoked.

Given an object file containing two functions, one used and one
not, resulting from a single compilation, what makes you think that
the linker can remove anything and be sure that it has not removed
a piece of the function that *IS* used? Object file formats that
I have seen do not have labels that say this byte is part of function
a, this byte is part of function b and q, and this byte is part of
functions a, b, j, n, and z.

Geronimo said:
I imagine that when a compiler spots repetitive sections of code it
takes them out of the function's object code into a common area of the
object, and has the function point to them. That way redundant functions
could be safely removed.

And what makes you think that function1, function2, and "common area"
are labelled in a way that the linker can identify them? Sure,
the entry points are labelled. That's likely to be all the info
available.

Gordon L. Burditt
 
Gordon Burditt

Paul Groke said:
In that case the object format should be changed :)

Using that standard, can you name any object format that should
NOT be changed? One in actual use, with an actual compiler that
generates it?

Gordon L. Burditt
 
Dave Vandervies

Gordon Burditt said:
I didn't say it did. I said that if you have two functions compiled
in a (object) file, and one of them isn't needed, there's no guarantee that
the linker can determine what is part of the needed function (and possibly
the other one also) to keep, and what is NOT part of the needed function
(to delete).

How hard can it be?

I mean, all you have to do is solve the halting problem...


dave
 
Walter Roberson

Phlip wrote:
Each time I did, it came with a list of bugs, and a boss to prioritize them.

Each time I have taken over a large code base, the original authors
have no longer been available; there has been no list of bugs;
it has been up to me to figure out how the program is -intended-
to work and how it -really- works; it has been up to me to
create the list of bugs; and for the most part it has been up to
me to do the talking to the users to figure out what the
priorities of the various bugs (and missing features) are; it
has been up to me to do any necessary mathematical analysis to
figure out whether the formulae are correct (especially near the
boundary conditions); and it has been up to me to do any necessary
rewriting and restructuring and optimization.

If that sounds like, "Here's a big project: Fix it!", then
yeah, there's a fair bit of truth to that.

Most people don't seem to have the knack of tearing apart a fair-sized
program and rebuilding it, so such projects get left for me. There are
a lot of good programmers who can do very nice work on constructing
-new- code or debugging something they wrote (or which there is good
documentation for); it's a different skill-set to "reengineer" large
poorly-documented programs.
 
Dave Thompson

Using that standard, can you name any object format that should
NOT be changed? One in actual use, with an actual compiler that
generates it?

The object file format used on Tandem^WCompaq^WHP NonStop in TNS
(legacy) mode has completely disjoint code blocks (also data blocks),
with a copy of interroutine references sorted by target, so you need
only look at a single field to determine a routine is unreferenced.
There are still supported and used compilers (and runtimes) for at
least C, Fortran, and COBOL, and a cfront-based (less than Standard)
C++; there used to be Pascal, but I don't think it's still supported.
(The newer 'native' RISC tools are ELF, and full C++. The newest
Itanium ones I haven't seen yet.)

- David.Thompson1 at worldnet.att.net
 
