Pointers
The set of all pointers in the program is initialized at startup. They are either NULL or they point
to valid addresses, established in the raw data of the program. For instance:
int a,*pint = &a;
The set of valid addresses is established by the compiler at startup. When control arrives at main()
all pointers are valid.
Not startup. Pointer variables (objects) can be static (including all
file scope) in which case the objects exist at startup; or local,
created when the block (usually function) is entered, even
recursively, and "vanish" when it exits; or dynamic, created and
released as directed by program code. Static pointer variables are
initialized, to null by default or to the stated initializer, but it
is not necessarily obvious if the value is valid, consider:
extern int a [ /* bound not given, defined in other t.u. */ ];
int * p = &a[37]; /* have to look there to validate this */
Automatic pointer variables are indeterminate if not initialized, and
dynamic pointer variables always indeterminate until assigned unless
calloc is used and all-bits-zero is valid, as it is on many machines.
The set of valid addresses = pointer values, or equivalently of
objects that can validly be pointed to, similarly varies: it
includes static variables, local auto but not register variables, and
dynamically allocated aka heap space; the latter two have lifetimes
shorter (often much shorter) than the whole program execution.
Undefined behavior (what pointers concerns) is when any pointer is used that
1) Has not been initialized to point to an existing valid object.
or
2) Is NULL or points somewhere else than
Object start address <= p < (start address)+sizeof(Object)
Assuming by 'used' you mean dereferenced, almost; in the case of
pointing within a composite (array or struct) object, we further need
the pointer to be to a valid subobject (element or field), or any
character=byte. For a pointer (value) that is only computed, stored,
passed, returned, etc., we also allow == startaddr + size.
Note that 'existing' requires the object existed when the address was
taken, and the *same* object *still* exists when the pointer is used;
this is trivial for a static target, but not for the other durations.
We can distinguish two types of pointers:
A) Unbounded pointers, i.e. pointers where the calculation of sizeof(Object) is impossible
B) Bounded pointers where sizeof(Object) is known and can be checked at run time.
Detecting this class of UB is called bounds checking and is done in many languages.
C is notoriously lacking this facility. Worse, the machine is not used to automatically test the
programmer's assumptions and all pointers are considered unbound.
'Bounds checking' usually means checking of subscripts for arrays,
which are in (most) compiled languages the only actual objects that
can have differing sizes at runtime. C99 'flexible' structs are
stored in, and can access, differing space, but the type itself has
the fixed size, and layout, of the fixed part, only the array part is
variable. Various polymorphic pointers and references can point to
different types of object with different sizes, but each actual type
has a fixed size. Pascal variant records can have different sizes,
but each variant still has a fixed layout. C does not actually
prohibit runtime checking, but it does largely prevent the
optimizations that make it (mostly) practical for other languages.
Lisp, APL, and many other languages check array accesses and avoid memoy corruption. C doesn't, and
we are plagued by memory corruption and obscure bugs.
An improvement would be to encourage the automatic checking of object accesses and discouraging the
usage of unbounded pointers. Instead of writing:
void matmult(int n,int m, double *pmat)
we would write:
void matmult(int n, int m, double mat[n][m]);
Such a proto would allow to check in the calling program that the buffer passed has enough space as
declared, and in the called function it would be possible to check that no index is being misused.
Yes. And I believe you realize, but other readers may not, that this
syntax is standard in C99, but bounds checking (still) isn't.
(I would expect a function named matmult to multiply matrices, and
that requires three matrix arguments, or two and one return "value",
which of course in C must be done by returning a pointer to allocated
or reused memory or by wrapping a fixed-size array in a struct.)
In APL it is conventional to do whole-array operations whenever
reasonable, so fewer subscripts need actually be checked. This is
also supported and encouraged in HPF/Fortran>=90 and PL/I.
Most of this attitude comes because the Pascal language has this facility, and many C people see
Pascal as something quite horrible.
Pascal does not have quite the semantics you show; it (optionally) has
a form which binds(!) each bound (and in Pascal both lower and upper
bounds are specified, not just upper as in C) to an array formal:
procedure mat_something (mat: array [a:b, c:d] of real);
which can then be called with any 2-D array of real, and a,b,c,d used
(if necessary) to obtain the actual bound values. I'm not sure if you
can have multiple formals with the same variable-size (or as they call
it 'conformant') array type, and I'm pretty sure you can't have
mutiple formals with interdependent types, as e.g. actual matmult
needs: if A is [Xrange, Yrange], B must be [Yrange, Zrange] for some
Zrange, and result is [Xrange, Zrange].
Fortran has (since F77 IIRC) the form you describe, except slightly
different syntax, and since F90 also the equivalent of the Pascal form
(assumed-shape) except that the bounds are unnamed and instead
obtained using special builtin functions. I will call that 'attached'
since the bounds are carried with the array argument not written
separately, and your form 'separate'. I'm pretty sure PL/I has both.
Ada has attached (unconstrained array) and AFAICT prevents separate
except by instantiating a generic (with a different syntax).
I think that was a good feature of Pascal. I miss this in C and I see each day the consequences in
array overruns, obscure bugs, and many other problems. Encouraging the use of bounded pointers would
introduce some hygienical concepts isn't it?
Nobody is proposing banning unbounded pointers. They should remain for special uses or in old
software. Encouraging the use of bounded pointers will make them slowly les and less frequent,
that's all.
The sizeof calculation is very problematic in C because of the refusal of passing this information
in array prototypes by the standards comitee. Of course this has historical reasons, but I just do
not understand why in 2003 we still want to save us the few machine cycles that that would cost, and
spare the users and the programmers the stack overruns, memory corruption and other problems!
An array decays in C, to an unbounded pointer when passed to a subroutine. All sizeof information is
not passed along. This is (maybe) efficient but it is a problem for checking the bounds of array
indexes!
You can't pass bounds for arrays only and not pointers, because array
parameters *are* pointers; this is fundamental to the language, and
cannot be changed while still calling the result C. An implementation
can compute, store and pass (and check) bounds for *all* pointers, at
extra cost which most people apparently consider too high. You might
also want to store and check bounds for C99 'flexible' structs, either
hidden in the struct or in (all!) pointer-to-struct.
Or, you could design a callling sequence which passes bounds for
'array' arguments, similar to the way most Fortrans pass a 'hidden'
length of (there fixed-size) character-string arguments, but not keep
bounds in explicit pointers, so where a pointer has passed through a
variable or out-of-line function, and probably certain kinds of
conversions/casts, the bound information would become 'unknown --
can't check this one'. And I suspect so much C code does (and would
still do) this as to make the feature worth little, though legal.
In C++ you can create a bounded_ptr (probably template) which
keeps/passes (and checks) a bound in exactly the cases you specify a
bounded_ptr, and no others, and I think, without actually working out
the details, that you can have the (visible) code optimize away in
cases determined at compile time. Or you can write a template
function that deduces the bound of an array parameter where known, or
uses 'unknown' for a pointer; to avoid code bloat you could (and I
would) have it inline into code that passes the bound explicitly to a
shared implementation, but then AFAICT you can't prevent a caller from
cheating and directly specifying a wrong bound.
- David.Thompson1 at worldnet.att.net