Bounds checked arrays

B

Ben Pfaff

Keith Thompson said:
The declaration of arr and ptr, the initialization of ptr, and the
evaluation of ptr[15] can be almost arbitrarily complex and can be
widely separated, even in separate source files. If you want to do
reliable bounds checking on ptr[15], you have to carry the bounds
information along with the pointer.

In other words, you need fat pointers.

An interesting paper on this subject was recently published as
Ruwase and Lam, "A Practical Dynamic Buffer Overflow Detector",
at NDSS 2004. It's available from
http://suif.stanford.edu/papers/tunji04.pdf
Typical performance penalty appears to be about 25% and it only
handles string buffers (because those are typical sources of
security problems) not general arrays.
 
M

Martin Dickopp

Keith Thompson said:
Perhaps, rather than a "#pragma", bounds-checked pointers should be
specified with a new type qualifier.

That implies that the programmer is careful to apply the new qualifier.
But if the programmer is that careful, why doesn't s/he write correct
code in the first place?

IOW, I wouldn't find bounded pointers useful in new, carefully designed
code, but I would rather like to be able to apply them to all the sloppy
code already out there.

Martin
 
K

Keith Thompson

Martin Dickopp said:
That implies that the programmer is careful to apply the new qualifier.
But if the programmer is that careful, why doesn't s/he write correct
code in the first place?

IOW, I wouldn't find bounded pointers useful in new, carefully designed
code, but I would rather like to be able to apply them to all the sloppy
code already out there.

Hmm. I tend to think that all software has bugs. If you've found a
way to eliminate all bugs by being careful, I'm very impressed.

What I'd like to suggest is adding a qualifier that specifies a
pointer type is *not* bounds-checked (probably with a #pragma and/or
command-line option to override all bounds-checking), but I suspect it
can't be done compatibly.
 
M

Martin Dickopp

Keith Thompson said:
Hmm. I tend to think that all software has bugs. If you've found a
way to eliminate all bugs by being careful, I'm very impressed.

Unfortunately, I haven't. :)

(Well, I could argue that it is trivial to eliminate bugs by being
careful, but impossible to always be that careful...)
What I'd like to suggest is adding a qualifier that specifies a
pointer type is *not* bounds-checked

I'd find that much more useful than /enabling/ bound checkings by a new
qualifier. Primarily, my point is that I want to be able to recompile
existing code and have bounds checking.

Martin
 
J

Joona I Palaste

Nick Landsberg said:
One could even argue that compiling differently (with and
without bounds checks) that it is a different language
being compiled, with bounds checks being "not C"
(or is that !C ? ... is !C == D ?, ...
or should that be P instead of D ?, but I digress. :)

I thought we weren't supposed to talk about those nasty off-topic !C's.
<g, d & r>
 
P

Phil Tregoning

Yes, but I don't think anything in the C standard /forbits/ an
implementation to check array bounds. From the point of view of the
standard, accessing an out of bounds array element causes undefined
behavior, so the implementation is free to (e.g.) terminate the program.

<OT>
FWIW, there is or was an attempt to implement bounds checking in the
GNU C compiler. I don't know what the current state is.
</OT>

FWIW, the VMS C compiler offers bounds checking as a compiler
option. It only works on real arrays (not pointers). There is
a description of usage and limitations here:

http://h71000.www7.hp.com/commercial/c/docs/5492p002.html#bounds_check_sec

They can be summed up as:

o Only works on real arrays.
o Allows address one-past-the-end to be taken.
o Disables checks on arrays in a struct of size one (to allow the
"struct hack").
o Each separate subscript is checked in multidimensional arrays
(so "int a[10][10]; a[0][12] = 0;" counts as out-of-bounds).

If an out-of-bounds access is discovered during compilation the
compiler emits a warning and continues.

If an out-of-bounds access is discovered during run-time the
program exits with a "SYSTEM-F-SUBRNG, arithmetic trap, subscript
out of range at PC..." error (which counts as a SIGFPE signal and
can be trapped).

Because it doesn't work on pointers, I don't find it very useful.

Phil T
 
D

Dan Pop

In said:
As everybody knows, the C language lacks
a way of specifying bounds checked arrays.

And there is a fundamental reason for that: the C's ability to alias
everything with character pointers renders the problem practically
insolvable.

Consider the trivial example:

char a[5][3], *p = (char *)a + 10;

What should be the limits of p and why? Things get even cloudier when
dealing with pointers that alias arrays embedded into structures:

struct foo {
int i;
char a[10];
double x;
} bar;

char *p = (char *)&bar + offsetof(struct foo, a) + 5;

Is p bounded by the array a or by the struct foo? If the compiler doesn't
read the programmer's mind properly, it will either generate an undesired
alert or quietly allow an out of bounds access.

Or how about unions:

union foo {
char a[10];
int i[10];
float x[10];
} bar;

char *p = (char *)&bar + 5;

Bound checking is well defined on most languages which either don't
support pointers at all (Fortran <= F77) or have a very restricted notion
of pointers (Fortran >= F90 and Pascal).

In C, limiting the bound checking to expressions containing the array
identifier itself is far from providing any kind of safety (think buffer
overflows a la gets) and going beyond this is incredibly difficult for
reasons partly described above.

There is also the library issue: each implementation would have to come
with two versions of the libraries: with bound checking and without.
Otherwise, inserting checks only in the user code is not enough (again,
think gets).

Sorry, but C isn't the language for people needing bound checking.
Unfortunately, far too many such people do program in C... And even if
bound checking is eventually introduced, most of those people would
be the last ones to enable it in their code ;-)

But feel free to experiment with it in your compiler and see what
happens.

Dan
 
C

CBFalconer

osmium said:
n is, in general, unknown in functions that have arrays passed
to them. This presents serious problems to your scheme.

This can be cured, with heavy run time expense, by making all
pointers indirect such that they describe an area, its limits, and
an offset within that area. The pointers themselves need not grow
excessively, since one field can index a description array that
need not be unique to that pointer, and another field can hold the
offset. All this requires no changes in the standard. I have
outlined such a system before, with no overwhelming response.
Something recent, either here or in comp.arch.embedded in the past
week or so, indicated that IBM created some such machine name
???400???.

For example, a stack oriented machine will have an entry for the
stack as a whole, for each stack frame, and for each declared
object in any frame. Things will grow because even single byte
objects must have such an entry in order to be able to create and
pass pointers to them. Similarly anything created by malloc will
require at least one entry.

Exit from a stack frame (assuming again a stack machine) will
probably need to destroy entries, or at least mark them invalid.
This brings up the problem of dangling references, and detection
thereof.

Whereever you look, the C practice of brandishing pointers
indiscriminately poses apparently insoluble barriers to the
creation of accurate self-checking code.

My conclusion is that the language should be used in its original
mode - as structured assembly - and that critical code should be
written in better suited languages that have been designed for the
task.
 
C

CBFalconer

jacob said:
osmium said:
n is, in general, unknown in functions that have arrays passed
to them. This presents serious problems to your scheme.

This must be changed. You write:

int fn(int array[2][3]);

Meaning that this function accepts only arrays of
dimensions 2 lines three columns. Inside the
function the dimensions are known.
Or you use *bounded* pointers.
You write:
int fn(size_t lines, size_t cols,int array[lines][cols]);

This information must be passed around.

This is built into the language, if you simply use Pascal, Modula,
or Ada. C will never be altered in directions that seriously
break existing code.

Halfway improvements are probably even more dangerous than no
improvements, because they give programmers a false sense of
security.
 
M

Malcolm

CBFalconer said:
My conclusion is that the language should be used in its original
mode - as structured assembly - and that critical code should be
written in better suited languages that have been designed for the
task.
Or maybe run the bounded pointer implementation for a debug and testing
session.
The problem you face is that if you have a buffer overrun then an error has
already occured, because buffer overflows happen for a reason.
 
K

Keith Thompson

Sorry, but C isn't the language for people needing bound checking.
Unfortunately, far too many such people do program in C... And even if
bound checking is eventually introduced, most of those people would
be the last ones to enable it in their code ;-)

I'm curious: which people *don't" need bounds checking?
 
K

Keith Thompson

CBFalconer said:
This can be cured, with heavy run time expense, by making all
pointers indirect such that they describe an area, its limits, and
an offset within that area. The pointers themselves need not grow
excessively, since one field can index a description array that
need not be unique to that pointer, and another field can hold the
offset. All this requires no changes in the standard. I have
outlined such a system before, with no overwhelming response.
Something recent, either here or in comp.arch.embedded in the past
week or so, indicated that IBM created some such machine name
???400???.

AS/400?

Can you provide a pointer to your proposal? I'd be interested in
seeing it.

Indexing into a description array saves space in each pointer, but
since each pointer dereference has to refer to the bounds information,
putting it in the pointer itself might make for faster execution.

[...]
Whereever you look, the C practice of brandishing pointers
indiscriminately poses apparently insoluble barriers to the
creation of accurate self-checking code.

My conclusion is that the language should be used in its original
mode - as structured assembly - and that critical code should be
written in better suited languages that have been designed for the
task.

It might be an interesting exercise to design a new language (as close
to C as possible, but no closer) that incorporates this kind of bounds
checking. If it can't be done without breaking existing code, then
there's little chance of the new language being called C, but there
seems to be a market advantage in designing new languages with C-like
syntax, even if they're incompatible (see Java and Perl, for example).
If a lot of existing C code can be ported to the new language without
too much effort, it might even catch on. It would probably need to
have a mechanism for working with C-style skinny pointers so it could
interface to existing C code (OS interfaces, for example); such a
mechanism should be very explicit so programmers aren't tempted to use
it by default (perhaps a type qualifier called "dangerous").

And I think I've just jumped off the cliff of topicality into the icy
fjord of speculative language design.
 
R

Rob Thorpe

In said:
As everybody knows, the C language lacks
a way of specifying bounds checked arrays.

And there is a fundamental reason for that: the C's ability to alias
everything with character pointers renders the problem practically
insolvable.

Consider the trivial example:

char a[5][3], *p = (char *)a + 10;

What should be the limits of p and why? Things get even cloudier when
dealing with pointers that alias arrays embedded into structures:

struct foo {
int i;
char a[10];
double x;
} bar;

char *p = (char *)&bar + offsetof(struct foo, a) + 5;

Is p bounded by the array a or by the struct foo? If the compiler doesn't
read the programmer's mind properly, it will either generate an undesired
alert or quietly allow an out of bounds access.

Or how about unions:

union foo {
char a[10];
int i[10];
float x[10];
} bar;

char *p = (char *)&bar + 5;

Bound checking is well defined on most languages which either don't
support pointers at all (Fortran <= F77) or have a very restricted notion
of pointers (Fortran >= F90 and Pascal).

In C, limiting the bound checking to expressions containing the array
identifier itself is far from providing any kind of safety (think buffer
overflows a la gets) and going beyond this is incredibly difficult for
reasons partly described above.

There is also the library issue: each implementation would have to come
with two versions of the libraries: with bound checking and without.
Otherwise, inserting checks only in the user code is not enough (again,
think gets).

Just because it is impossible to make bounds checking bulletproof
doesn't mean it is not a good idea. I think most of us who write C
would be thankful for it if it. I would be even if it only catches a
few mistakes during coding.

The problem is explaining to people that the bounds checking is not
watertight. Even this shouldn't pose too much of a problem.
Sorry, but C isn't the language for people needing bound checking.
Unfortunately, far too many such people do program in C... And even if
bound checking is eventually introduced, most of those people would
be the last ones to enable it in their code ;-)

But feel free to experiment with it in your compiler and see what
happens.

Exactly, lets wait and see what the problems turn out to be.
 
R

Rob Thorpe

jacob navia said:
As everybody knows, the C language lacks
a way of specifying bounds checked arrays.

This situation is intolerable for people that know
that errors are easy to do, and putting today's
powerful microprocessor to do a few instructions
more at each array access will not make any
difference what speed is concerned.

Not all C applications are real-time apps.

Besides, there are the viruses
and other malicious software that are using
this problem in the C language to do their dirty
work.

Security means that we avoid the consequences
of mistakes and expose them as soon as possible.

It would be useful then, if we introduced into C

#pragma STDC bounds_checking(ON/OFF)

When the state of this toggle is ON, the compiler
would accept declarations (like now)

int array[2][3];

The compiler would emit code that tests
each index for a well formed index.
Each index runs from zero to n-1, i.e.
must be greater than zero and less than
"n".

In arrays of dimension "n", the compiler would
emit code that tests "n" indices, before using
them.

Obviously, optimizations are possible, and
good compilers will optimize away many tests
specially in loops. This is left unspecified.

Important is to know that the array updates
can't overflow in neighboring memory areas.

How many machine instructions does this cost?

Each test is a comparison of an index with a
constant value, and a conditional jump. If the
compiler only emits forward branches, the
branch predictor can correctly predict that in
most cases the branch will NOT be taken.

In abstract assembly this is 4 instructions:
test if index >= 0
jump if not "indexerror"
test if index < "n"
jump if not "indexerror"

where "n" is a compile time constant.

We have something like 4 cycles then, what
a 2GHZ machine does in 0,000 000 004 seconds.

Yes, table access is a common operation but
it would take millions of those to slow the program
a negligible quantity of time. We are not in the
PDP-11 any more.

This would make C a little bit easier to program,
and the resulting programs of better quality.
Buffer overflows happen of course, but the language
limits the consequences by enforcing limits.

By default the behavior is to stop the program.
The user can override this, and different schemas
can be specified by him/her to take actions when
a buffer overflow happens.

A simple strategy is to just do nothing.

int fn(char *input)
{
char tmpbuf[BUFSIZ];
int i=0;
bool result = false;

while (*input) {
tmpbuf[i++] = *input++;
}
// Do things with the input
// set result
return result;
indexerror:
return false;
}

This function uses the built-in error checking
to avoid any bad consequence for an overflow.
If the input data is too long, it is a mal-formed
input that should be discarded.

This frees the programmer from the tedious task
of writing
if (i >= sizeof(tmpbuf)) goto indexerror;

at EACH array access. This can be done better
by a machine and the compiler.

Because a program like that today
***assumes*** the input length
can't be bigger than BUFSIZ.

This is always *implicitely* assumed and
nowhere *enforced* by the way. The current
state implies that catastrophic errors can happen
if the index starts overwriting separate memory
areas like the return address...

Everyone knows this. Let's do something to
stop it. Something simple, without too much
fuzz.

In this case the compiler generates code that
in case of index error
jumps to this label and does what the programmer
specifies.

The motto of C is that: Trust the programmer.

We have just to allow him/her to specify what to do
in case of overflow.

Trust the programmer doesn't mean that we trust
that he never does a mistake of course. It means
that the programmer can specify what actions
to take in case of error and provide sensible
defaults.

Default is then, to finish the program like the
assert() macro, another useful construct.

Note that this proposal doesn't change anything
in the language. No new constructs, even if
compilers could provide arrangements like the
one proposed above.

I propose then:

#pragma STDC bounds_checking(ON/OFF)

that should be written outside a function scope.

That's all.

This proposal is an invitation to
brain-storming..:)

I know that anyone using C is aware of this.
So, let's fix it.

Sounds like a good idea. Since nothing gets standardised without
someone doing it first, why not implement it in LCC, then see how many
problems are encountered.

Before you do read:
http://www.doc.ic.ac.uk/~phjk/BoundsChecking.html

this is how it was done in TCC.

Perhaps try to make it work the same way as TCC to gets things started
without initial compatibility problems.
 
N

Nick Landsberg

Rob Thorpe wrote:

The problem is explaining to people that the bounds checking is not
watertight. Even this shouldn't pose too much of a problem.

I was with you until this statement.

The problem is explaining this to Managementcritters,
who don't know one language from another, have never
written a line of code in their lives, and latch on
to the latest buzzword as a panacea for everything
that's wrong with software development.

"We're gonna have bounds checking in C! That means
we can save time by eliminating the code reviews!
We can get the product out the door faster!"

And yes, the innuendo that they are not really people was
intentional. :)
 
J

Joe Wright

Keith said:
I'm curious: which people *don't" need bounds checking?
Dan can speak for himself, of course, but I don't need it and I suppose
you don't need it. We know what C is and we do our own bounds checking
when we write our programs. People *need* bounds checking by the
Language/Compiler presumably because they are unwilling or unable to do
it themselves.
 
N

Nick Landsberg

Joe said:
Keith said:
(e-mail address removed) (Dan Pop) writes:
[...]
Sorry, but C isn't the language for people needing bound checking.
Unfortunately, far too many such people do program in C... And even if
bound checking is eventually introduced, most of those people would
be the last ones to enable it in their code ;-)

I'm curious: which people *don't" need bounds checking?

Dan can speak for himself, of course, but I don't need it and I suppose
you don't need it. We know what C is and we do our own bounds checking
when we write our programs. People *need* bounds checking by the
Language/Compiler presumably because they are unwilling or unable to do
it themselves.

Or have not been trained in the school of professional paranoia. :)

As I mentioned upthread, on the systems on which I work, it is
unacceptable to have the implementation take what I would
presume to be the default action of stopping the program
if a compiler generated bounds check were exceeded.
This may not be your environment, but it is mine.

Even when a language that has bounds checking is used in
our stuff, care is taken to ensure that no run-time
bounds checks are exceeded. The program logs an
error, ditches the the transaction, and goes on to
the next. I choose to call this "programming discipline."

Thus, I see no real need for compiler-generated bounds
checking in production code. I see it as a very
useful debugging tool, tho (e.g. during my testing,
I discover that I have exceeded array bounds. I will
than put in the code which I forgot which checks the
array bounds).
 
R

Rob Thorpe

Nick Landsberg said:
Rob Thorpe wrote:



I was with you until this statement.

The problem is explaining this to Managementcritters,
who don't know one language from another, have never
written a line of code in their lives, and latch on
to the latest buzzword as a panacea for everything
that's wrong with software development.

"We're gonna have bounds checking in C! That means
we can save time by eliminating the code reviews!
We can get the product out the door faster!"

And yes, the innuendo that they are not really people was
intentional. :)

You are undoubtably right that many people will jump to incorrect
conclusions. Probably some stupider managers will misunderstand it.
However, if your managers jump on a minor programming language
development as a reason to change the development process then they
are idiots. And if that is the case you have bigger fish to fry.
 
R

Richard Bos

Keith Thompson said:
I'm curious: which people *don't" need bounds checking?

People who have already done explicit bounds checking on their data the
moment they get it from the outside, and know that all data that's
passed the tests can be trusted.

Richard
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,141
Messages
2,570,813
Members
47,357
Latest member
sitele8746

Latest Threads

Top