What's the deal with size_t?

  • Thread starter Tubular Technician

Malcolm McLean

Richard Heathfield said:
ITYM non-negative.
Only if you count from zero, which computers do but mathematicians don't.
I wish all mine did, too. Mostly, however, they do. When they don't, I
consider it to be my fault, not C's fault. In almost all circumstances, C
provides a mechanism for getting any C feature right, so that getting it
wrong is my problem, not C's problem. In the few circumstances where this
is not the case, I will tend to avoid that particular feature (gets() is
an example, as are VLAs (although of course this is not the *only* reason
I avoid VLAs)).
You shouldn't use the overflow trap as the intended error-handling mechanism
of your program. However if you make a mistake, it's there as a last line of
defence.
Except in a few cases, like video games, the worst thing you can do is
return wrong but plausible-seeming results, not to terminate.
 

Richard Heathfield

Malcolm McLean said:
Or used to derive index calculations. Which almost certainly you are
doing with the string lengths. That's what I mean by "ultimately".

Then we have different meanings for the word "derive" (as is further
demonstrated by your later claim that argc is used to derive index
calculations).

if you say
if(argc == 3)
    printf("%s %s\n", argv[1], argv[2]);
else
    printf("must have 2 arguments\n");

you might want to argue that it isn't being used to derive the 1 and the
2 indices.

Indeed, and that's what I mean about us using the word "derive" in
different ways.

Counts of things in memory are almost always used ultimately to index
things.

Again, I don't agree. For example, in my longer example, wcount was used to
count hits in a search. The results were displayed as they were found, and
wcount was just used to count them, so that I could write this:

printf("\n%lu word%s matched.\n",
       (unsigned long)wcount,
       wcount == 1 ? "" : "s");

at the end of the report. At no time was wcount used to index anything.
Now, *sometimes*, yes, counts are used to limit indexing loops, and I
don't think anyone disputes that, but I think you are overstating the case
when you say "almost always".
No, that should give you enough to go on. I've marked with an asterisk
everything I would say is not an index. Obviously I can't see your code,
so I've put in a few question marks, and the range trackers may not be
used to derive indices, but I find this very difficult to believe.

Well, I can't do much about your beliefs, but I did pick out your question
marks (I spotted two), and went back to check them. One, 'Count', is
used to calculate a percentage. No indexing involved. The other, 'width'
(horizontal space tracker), is used like this:

width = 0;
while(q != NULL)
{
    printf("%s ", q->Word);
    ++wcount;
    width += len + 1;
    if(width > ScreenWidth - 2 * len)
    {
        width = 0;
        printf("\n");
    }
    q = q->Next;
}

Again, I see no indexing calculations there.

Maybe a better way of saying what I am getting at would be "given that every
array in this program could potentially take up all available memory, how
many of my variables need to be size_t's?" The answer, as suggested by
your code, which has been well-written, is the vast majority. Now ask,
"how many are size_t's?" In your case, the vast majority. But what of
lesser programmers?

Your words are kind, indeed overly kind, but I think we need to be careful
about catering for the lowest common denominator. C isn't about pandering
to "lesser programmers", but about providing power, control, and
portability to those who can use it.
 

Richard Heathfield

Malcolm McLean said:
Only if you count from zero, which computers do but mathematicians don't.

In comp.lang.c, we can reasonably assume that we're talking about C, not
mathematics, unless a clear statement to the contrary is made.

You shouldn't use the overflow trap as the intended error-handling
mechanism of your program.

I agree.
However if you make a mistake, it's there as a
last line of defence.

I disagree. It is not guaranteed to be there.
Except in a few cases, like video games, the worst thing you can do is
return wrong but plausible-seeming results, not to terminate.

The *best* thing you can do is to present information to the user that is
adequate for him or her to work out what's wrong and take appropriate
corrective action. This might involve re-entering data, or checking and
fixing file inputs, or even contacting the program supplier (i.e. you!)
with a bug report that is sufficiently detailed to enable the supplier to
find the precise cause of the fault.
 

Charlie Gordon

Richard Heathfield said:
Charlie Gordon said:

<test info snipped - see upthread>

Test datum #1: index variables: 0% (sample size: 255 lines)
Test datum #2: index variables < 35% (sample size: around 600 lines)



No, they aren't. Colloquially, "most" means "nearly all", which is clearly
not true, and strictly speaking, "most" means "more than half", and not
even /that/'s true. In one of the samples, the number was a big fat zero,
and in the second, it was considerably less than half, indeed only
slightly over one third.

You conveniently snipped all the evidence, so here it is:

First example:
int rc: return code
size_t len: string length (i.e. count of char objects)
size_t longest: measure of longest string constructed (i.e. count of char
objects)
size_t maxlinelen: measure of longest line encountered (i.e. count of char
objects)
size_t n: line count
int first: flag

NONE of these objects is used as an index into an array.

len, longest, maxlinelen are string lengths: they measure a count of objects
in an array, which is why they were made size_t. Unless n is intended to
count lines for allocating an array, size_t is not the proper type for it:
either you know that there are never more than 65535 lines in the file and
unsigned int is sufficient, or you don't and long or unsigned long should be
your choice.

Total: at least 50% of integers are size_t because they measure an array or
index into one.

Second example:
int Status: return code
size_t ThisPattern: used as index
size_t len: line length (i.e. object count)
int Found: flag
size_t SpinnerControl: used as index
int LineCount: line count
size_t len: used as index (this is in a different function to the other
len)
size_t pattern: used as pointer offset, which we'll count as an index
int Status: return code
size_t ThisPattern: used as index
size_t len: used as pointer offset, i.e. index
int Found: flag
size_t wcount: word count
size_t width: keeps track of how much horizontal space an output line
takes up
int Hit: flag
int Status: return code
size_t ThisPattern: used as index
size_t len: current line length
size_t wcount: word count
size_t width: horizontal space tracker
size_t idx: used as index
size_t j: used as index
int done: flag
size_t curr: used as index
size_t i: used as index
size_t Size: tracks current buffer size
size_t BytesRead: tracks number of input bytes
int Status: return code
size_t pos: records position of a letter in the alphabet
size_t ThisEntry: used as index
size_t ThisByte: used as index
size_t Count: counter
size_t pos: records position of a letter in the alphabet
size_t Freq: used as index
size_t ch: used as index
size_t Start: tracks starting position
size_t End: tracks ending position
size_t RangeStart: tracks start of range
size_t RangeEnd: tracks end of range
size_t LineLength: tracks line length
size_t ch: used as index
size_t Start: tracks starting position
size_t End: tracks ending position
size_t RangeStart: tracks start of range
size_t RangeEnd: tracks end of range
size_t LineLength: tracks line length

46 total variables.
16 variables used as index per your own admission.
7 variables used to measure strings or parts of strings, i.e. counts of objects
in arrays.
wcount and pos are counts of objects for which size_t is not the proper
type.
RangeStart and RangeEnd track ranges: cannot tell from context if they
qualify as offsets in an array or as counts of objects unrelated to memory,
such as lines, for which you chose the type int anyway.
Of course argc also qualifies as the count of objects in the argv array.

Again at least 50% of integer variables are size_t because they measure an
array or index into one.
Unless of course *you* are playing with words, and claiming that, if less
is more, then fewest is most?

I'm just trying to use common sense to explain what I think Malcolm means in
his assertion: type size_t is only sensible for variables that represent
sizes, indices, and counts of objects in arrays in memory. A substantial
amount of variables in C programs are used for such a purpose and should be
made size_t instead of int or unsigned int if the arrays they refer to can
be of arbitrary size. Aside from those uses, variables used for counts of
objects with no relation to memory should use a type appropriate for their
assumed maximum value, not necessarily size_t.
I don't see how.

I hope to have made this clearer, not precisely with his assertion, but with
my interpretation of what he means. "Most" is probably too much, but to say
that integer variables are "primarily" used for such purposes as indexing or
measuring object counts in memory arrays seems right to me.
 

Charlie Gordon

Richard Heathfield said:
Malcolm McLean said:



As I understand it, the basis you advance for your claims is two-fold:

(a) the size_t name is ugly;
(b) the size_t type is unsigned.

I have already pointed out that I agree, pretty much, with (a), but that I
don't consider it to be a particularly persuasive or meritorious argument
for abandoning or deprecating size_t. It might make a reasonable argument
for suggesting a name change, although of course the chance of getting ISO
to agree on a name change is, in reality, zero.

Sad but true.
The fact that size_t is unsigned fits naturally with its role (as
demonstrated by the cases in which the standard library uses it) as a way
for storing object sizes and object counts. A negative size for an object
is meaningless, as is a negative count of the number of objects (in the C
sense of the word). So I don't consider this argument to be particularly
persuasive or meritorious either, because the unsignedness of size_t is
perfectly natural and sensible, given the nature of its intended purpose.

The reason it is unsigned is not so much what you describe, but more likely to
enable object sizes up to the largest number that fits in a given word size.
Of course object sizes are non-negative, but unsigned arithmetic has a
singularity at 0 that causes hard-to-find bugs, especially in expressions
where signed quantities are mixed with unsigned quantities.
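
A minimal sketch of the two failure modes just described, assuming nothing beyond standard C. The function names sum_backwards and less_than_len are hypothetical, made up for illustration, and do not come from any code discussed in this thread.

```c
#include <stddef.h>

/* Sums a[n-1] .. a[0], walking backwards with an unsigned index.
 * The tempting form `for (i = n - 1; i >= 0; i--)` never terminates,
 * because a size_t is always >= 0 and 0 - 1 wraps to SIZE_MAX.
 * Testing before the decrement means the wrapped value is never used,
 * and n == 0 is handled without a special case. */
long sum_backwards(const int *a, size_t n)
{
    long total = 0;
    size_t i = n;

    while (i-- > 0)
        total += a[i];
    return total;
}

/* Mixed signed/unsigned comparison: in `delta < len` with an int delta
 * and a size_t len, delta is converted to size_t wherever size_t is at
 * least as wide as int, so -1 becomes a huge value and the test fails.
 * Checking the sign separately keeps the intended meaning. */
int less_than_len(int delta, size_t len)
{
    return delta < 0 || (size_t)delta < len;
}
```

With these, sum_backwards(a, 0) simply returns 0, and less_than_len(-1, 5) is true, which is what a reader of the signed expression would expect.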
To write

size_t i;

for(i=0;i<N;i++)
{
    ptr[i]++;
}

is misleading, because it implies that i is a "size type" when it is
nothing of the sort. It is an index.


It's an object count. It measures the distance, expressed in object units,
between the start of the array and the point in that array where can be
found the object that we care about. This is entirely consistent with the
usage of size_t in functions such as fread, fwrite, and calloc.


It is very misleading to describe size_t as measuring distances: the
distance between two pointers is a *signed* quantity of type ptrdiff_t, that
may not even fit in this type. It is possible to have this:

size_t i = <some large value>;
size_t j = <some other value>;

``i < j'' implies ``&array[i] < &array[j]'', but not necessarily
``&array[j] - &array[i] > 0''.

This fundamental inconsistency is problematic. Constraining SIZE_MAX or
making size_t a signed type would fix it, but would be incompatible with
current usage.
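
A small sketch of how one might guard against that inconsistency, assuming a C99 implementation that provides <stdint.h>. The helper name fits_in_ptrdiff is hypothetical. It only reports whether an element count is small enough that the difference between two pointers into an array of that many chars is representable as a ptrdiff_t.

```c
#include <stddef.h>
#include <stdint.h>   /* PTRDIFF_MAX, uintmax_t (C99) */

/* Returns 1 if the count n is representable as a ptrdiff_t, 0 otherwise.
 * Both sides are widened to uintmax_t so the comparison is correct
 * whichever of size_t and ptrdiff_t happens to be the wider type. */
int fits_in_ptrdiff(size_t n)
{
    return (uintmax_t)n <= (uintmax_t)PTRDIFF_MAX;
}
```

On many implementations SIZE_MAX is roughly twice PTRDIFF_MAX, so large counts fail this check, but the standard does not promise that relationship.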
 

Richard Heathfield

Charlie Gordon said:
You conveniently snipped all the evidence

It is appropriate to snip material unless I am directly commenting on it.
In this case, I was commenting not on the material I snipped but on your
claim that I was playing with words. If you meant "conveniently"
literally, fine, and I'm glad I was able to present my reply in a way that
you found convenient. But if you meant it ironically, it's a baseless
slur.
, so here it is:

First example:


len, longest, maxlinelen are string lengths: they measure a count of
objects in an array, which is why they were made size_t.

len is populated like this: len = strlen(rest) + 1;

and used like this:

if(len > longest)
{
    size_t prev;
    longest = len;
    while(prev = longest, longest &= (longest - 1))
    {
        continue;
    }
    longest = prev * 2;
}

None of them is used for indexing into an array.
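
For readers puzzling over the loop above: as far as can be told from the snippet alone, it clears the lowest set bit on each pass until nothing is left, so prev ends up holding the highest set bit, and doubling that gives the smallest power of two strictly greater than len. A sketch of the same logic as a standalone function follows; the name next_pow2_above is hypothetical.

```c
#include <stddef.h>

/* Returns the smallest power of two strictly greater than len, using
 * the same bit trick as the snippet above: `len &= (len - 1)` clears
 * the lowest set bit, and prev remembers the value before the last
 * clearing, i.e. the highest set bit of the original len.
 * Edge cases: returns 0 for len == 0, and wraps to 0 if the top bit of
 * len is already set (unsigned arithmetic is modulo SIZE_MAX + 1). */
size_t next_pow2_above(size_t len)
{
    size_t prev;

    while (prev = len, len &= (len - 1))
        continue;
    return prev * 2;
}
```

For example, a len of 5 gives 8 and a len of 8 gives 16. The result is a size, not an index, which is consistent with the point being made here.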
Unless n is intended to
count lines for allocating an array, size_t is not the proper type for
it: either you know that there are never more than 65535 lines in the
file and unsigned int is sufficient, or you don't and long or unsigned
long should be your choice.

Agreed. Amazing what turns up in these discussions, isn't it? It should be
unsigned long (and now is).
Total: at least 50% of integers are size_t because they measure an array
or index into one.

Sure, but that's not what Malcolm is saying. He's saying that most integers
are used as indices, or are ultimately used to derive indices. What you
are trying to show is that those integers that I have deliberately chosen
to be of type size_t are used for object counts or sizes. Yes, *some* of
those size_t are used for indexing into an array, but his claim was that
most (i.e. at the very least, more than half) integral type objects are
used for indexing into arrays. My data showed that this claim was not true
in the given arbitrary sample. If you want to argue that more than half my
size_t objects were used for indexing into an array, well, I would not be
surprised if that were true, but it turns out that it isn't true in that
example.
Second example:


46 total variables.
16 variables used as index per your own admission.

Oh, please don't be so dramatic. It's not a question of admitting this and
confessing that, but a question of examining Malcolm's claim that most
integers are used as indices into arrays.
7 variables used to measure strings or parts of strings, i.e. counts of
objects in arrays.

Yes. Counts are not indices, however.
wcount and pos are counts of objects for which size_t is not the proper
type.

I agree about wcount (which I've now fixed to be unsigned long).
RangeStart and RangeEnd track ranges: cannot tell from context if they
qualify as offsets in an array or as counts of objects unrelated to
memory, such as lines, for which you chose the type int anyway.

They mark the lower and upper limits of a loop whose counter is used for
indexing into an array. I agree that the loop counter is used for indexing
(and I recorded it as such), but I do not agree that the limits are used
for indexing.
Of course argc also qualifies as the count of objects in the argv array.

Yes. It's a count. It ought to be size_t.
Again at least 50% of integer variables are size_t because they measure
an array or index into one.

Again, that's not Malcolm's claim. He says that most integers are used for
indexing arrays. I will agree that size_t is often used for indexing
(which is really another way of saying "counting objects"), but there are
many more uses for integers than mere indexing, important as that use
undoubtedly is.
I'm just trying to use common sense to explain what I think Malcolm means
in his assertion: type size_t is only sensible for variables that
represent sizes, indices, and counts of objects in arrays in memory.

I agree that those are the proper uses of size_t - and they describe how I
use it myself (except when I mistakenly use it for other things, and
you've spotted a couple of those yourself). But I don't think that's what
Malcolm means at all.

"Most" is probably too much,
Right.

but to
say that integer variables are "primarily" used for such purposes as
indexing or measuring object counts in memory arrays seems right to me.

Fine, but it doesn't seem right to me. It seems to me that it would be
closer to the mark to say that indexing is one of the very many important
uses to which integers are put. If he means that integers are used more
frequently for indexing than for any other single purpose, then I might
even agree (or at least not bother to disagree), but to claim that this
frequency exceeds 50% seems to me to be an exaggeration.
 

Richard Heathfield

Charlie Gordon said:
"Richard Heathfield" wrote:
[An array index is] an object count. It measures the distance,
expressed in object units, between the start of the array and
the point in that array where can be found the object that we
care about. This is entirely consistent with the usage of
size_t in functions such as fread, fwrite, and calloc.

It is very misleading to describe size_t as measuring distances:

I was actually describing an ***array index***, and that's precisely what
it does - it gives you the number of objects between 0 and 'this' object.
Thus, if "object units" can be considered a unit of measurement, an array
index measures an offset in object units.
 

Richard Tobin

[An array index is] an object count. It measures the distance,
expressed in object units, between the start of the array and
the point in that array where can be found the object that we
care about. This is entirely consistent with the usage of
size_t in functions such as fread, fwrite, and calloc.
It is very misleading to describe size_t as measuring distances:
I was actually describing an ***array index***, and that's precisely what
it does - it gives you the number of objects between 0 and 'this' object.

C doesn't have array indexes. In an expression like ptr[i], i is the
second operand of the subscripting operator, whose other operand is a
pointer, not an array. It measures the signed displacement from ptr
in object units.

size_t is only suitable for this when the displacement happens to be
always positive (which of course must be the case when the first
operand results from conversion of an array).
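
A short sketch of what a signed displacement looks like in practice. The function names element_before and element_at_offset are hypothetical. Both are valid C only when the pointer plus the offset still lands inside the same array.

```c
#include <stddef.h>

/* Returns the element just before the one p points at. The subscript
 * is negative, which is fine as long as p points into an array and
 * p - 1 is still inside that array. */
int element_before(const int *p)
{
    return p[-1];
}

/* The same idea with a variable offset: ptrdiff_t can carry a negative
 * displacement, whereas a size_t parameter could not. */
int element_at_offset(const int *p, ptrdiff_t offset)
{
    return p[offset];
}
```

Given int a[] = {10, 20, 30}, the call element_before(&a[1]) yields 10, as does element_at_offset(&a[2], -2).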
Thus, if "object units" can be considered a unit of measurement, an array
index measures an offset in object units.

"offset" is better, because it doesn't have the implication of
unsignedness that "distance" and "number of objects" have.

-- Richard
 

James Kuyper

Richard said:
Richard Heathfield writes in flowery prose that sometimes appears to be
designed to confuse non native speakers from what I can gather.

I cannot address the question of his intent; only he knows for sure what
he intended. But I know from personal experience that confusing
non-native speakers is not especially difficult, and needn't be ascribed
to deliberate intent. I'm married to one, and work with several others,
and avoiding confusion requires constant effort; preventing confusion
completely seems impossible. My own prose is sufficiently complicated
that I frequently unintentionally confuse native English speakers.
Isn't that like saying "when he thinks he's right he thinks he's right"?
Or is my parser now broekn?

No. Saying "I disagree" in this kind of context is basically equivalent
to saying "Your argument has insufficient merit". Saying that "Your
argument has no merit" is a much stronger assessment. IMO, it was also a
correct assessment in this case.
 

Richard

[An array index is] an object count. It measures the distance,
expressed in object units, between the start of the array and
the point in that array where can be found the object that we
care about. This is entirely consistent with the usage of
size_t in functions such as fread, fwrite, and calloc.
It is very misleading to describe size_t as measuring distances:

A length of a string is a distance. So is the size of a malloc.

But in most API functions that return size_t we are always talking about
number of bytes. Not number of elements.
C doesn't have array indexes. In an expression like ptr[i], i is the


Of course C has array indices. They do not have a specific type, but
it does feature indexing into arrays.
 

Richard Tobin

C doesn't have array indexes. In an expression like ptr[i], i is the
Of course C has array indices. They do not have a specific type, but
it does feature indexing into arrays.

Of course you can *do* array indexing, but you do it with a more
general mechanism in which the subscript is not inherently unsigned.

-- Richard
 

Richard

C doesn't have array indexes. In an expression like ptr[i], i is the

Of course C has array indices. They do not have a specific type, but
it does feature indexing into arrays.

Of course you can *do* array indexing, but you do it with a more
general mechanism in which the subscript is not inherently unsigned.

-- Richard

I was pointing out that the statement "C doesn't have array indexes" is
somewhat misleading.
 

Flash Gordon

Richard Heathfield wrote, On 12/11/07 09:42:
Malcolm McLean said:


In comp.lang.c, we can reasonably assume that we're talking about C, not
mathematics, unless a clear statement to the contrary is made.

<snip>

Actually, positive means greater than zero in mathematics, so if as
Malcolm suggested one is using the maths definition you (Richard) were
still correct in your correction. The same applies in English according
to a brief check of a few dictionaries.
 

Flash Gordon

Malcolm McLean wrote, On 11/11/07 22:16:
Ultimately you have to come to some sort of end, preferably with agreed
facts.

For instance one of my cases against size_t is that it is a major change
to the language, because most integers are ultimately used as indices.
When challenged on that I give a little bit of statistical evidence, but
when challenged on that evidence, I give up.

There are major problems with the evidence you presented, such as
excluding large application domains that C is used for thus making it
clearly not representative of C usage.
There's no helping some
people.

If you want to convince people you have to provide a convincing argument
and be prepared to address any issues people raise.
The assertion about indices is fairly easily verified - as long
as you know what you are talking about.

If it is so easy why can you not present evidence which does not have
major flaws?
Go through a sample of C code, and count every instance of variables
declared as int, long, short, long long, arguably unsigned or signed
char, and derivatives or aliases of these types. Also count pointers
such as int *, int **, and the like. Then see how many times the
variable, or the variable pointed to, is used to ultimately derive index
calculations. If it is, score it as index / size_t, if not, score it as
non-index, non-size_t. Note that if we have

OK, first file I check has 35 ints of which only 9 have anything to do
with indexing or size. That was in a 1326 line file. On a quick scan of
some other functions in the project I did not see any functions with
many indexing operations, and most of the integer variables were not
used in deriving those indexes or counting objects.
int add(int *x, int N)
{
    int i;
    int answer = 0;
    for(i=0;i<N;i++)
        answer += x[i];
    return answer;
}

we'd have two index / size_t variables (N and i) and two non-indexes, x
and answer, unless the variables are used elsewhere in index
calculations. If we index into an array on the result, x and answer
would need to be scored as index.


So by your count it is hardly overwhelming on this example function.
You might say "what happens if add() is called twice, once to derive an
index once not?" You'll also note that it is possible to write code in
such a way as to skew the results. We could do while(N--) to get rid of
an index variable, for example. These aren't worth worrying about.

Assuming that all arrays can grow until the computer runs out of memory,
that tells you how many size_t's you need in the program. Of course that
assumption doesn't necessarily hold. But that's a slightly different
argument.

Now is that "pure assertion?" does it constitute a "near pure
assertion?" or is it actually a testable claim and a coherent argument?

Well, it is testable but it fails the test when I apply it to the code I
have here.

Looking at a small amount of the code for clamav I saw closer to 50% of
integer variables having something to do with indexing/size (including
variables relating to indexing/size of files), but it did not appear to
be quite up to 50%.

However, it is normally the person making the claim that is expected to
provide the evidence. You are still asserting what we will find if we do
the test, not providing the evidence to back up your claim.
 

Chris Torek

All indices must ultimately be positive.

You mean "nonnegative" (as I think someone else has already noted).
The problem is that intermediate values can be negative, which
doesn't happen often, but not so infrequently as not to be a problem.

So just use an unsigned type. The "negative" numbers will be large
positive numbers, but because you are working in a ring mod 2-sup-K
for some K, they work EXACTLY THE SAME AS "negative" numbers (at
least for everything you will do with them).

The test for "is some variable notionally negative" is simply "x
greater than LIM" for some constant LIM. If you set this limit to
cause approximately half of the numbers to be "negative", you merely
need to squint a bit to realize you have achieved "two's complement"
arithmetic.

For a concrete example, if you are using "unsigned int" and UINT_MAX
is 65535, the "negative half" of the space is all those values in
the range [32768 .. 65535]. If UINT_MAX is 4294967295, the "negative
half" of the space is those values in [2147483648 .. 4294967295].

(This works fine for addition and subtraction, but requires a fixup
step for multiplication and division, when thinking of half of the
space as "negative". Anyone who has ever coded multiply and divide
routines for CPUs that lack the instructions should be familiar
with this.)
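
The idea above can be sketched in a few lines of C, assuming only <limits.h>. The function names are hypothetical, and LIM is taken to be UINT_MAX / 2, which matches the "negative half" ranges given in the concrete example.

```c
#include <limits.h>

/* Treats the upper half of the unsigned range as "negative":
 * x is notionally negative when it is greater than LIM, where LIM is
 * chosen here as UINT_MAX / 2. */
int notionally_negative(unsigned int x)
{
    return x > UINT_MAX / 2;
}

/* Adds a notionally signed delta to a base. Unsigned arithmetic is
 * defined to wrap modulo UINT_MAX + 1, so "adding -3" is the same as
 * adding UINT_MAX - 2, and the sum comes out right. */
unsigned int add_wrapped(unsigned int base, unsigned int delta)
{
    return base + delta;
}
```

So add_wrapped(10u, 0u - 3u) gives 7u, even though the second argument is a very large positive number when looked at directly.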
A trap on overflow is not bad behaviour incidentally. It is good behaviour.

So you have an implementation in which:

int i;
...
i = INT_MAX - 4;
...
i += 32; /* result should be INT_MAX + 31, ie, "overflow" */

causes a runtime trap? Those are, alas, all too rare. Can you
name your implementation?
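
Since a trap on signed overflow cannot be relied upon, one portable alternative is to test before adding. The following is only a sketch; checked_add is a hypothetical helper, not a standard library function.

```c
#include <limits.h>

/* Adds b to *sum only if the result is representable as an int.
 * Returns 1 on success, 0 if the addition would overflow, in which
 * case *sum is left untouched. The checks themselves cannot overflow:
 * INT_MAX - b is only evaluated for b > 0, INT_MIN - b only for b < 0. */
int checked_add(int *sum, int b)
{
    if ((b > 0 && *sum > INT_MAX - b) ||
        (b < 0 && *sum < INT_MIN - b))
        return 0;
    *sum += b;
    return 1;
}
```

Applied to the example above, with i set to INT_MAX - 4, the call checked_add(&i, 32) returns 0 and leaves i alone, so the program can report the problem instead of depending on what the implementation happens to do.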
 

Keith Thompson

C doesn't have array indexes. In an expression like ptr[i], i is the
second operand of the subscripting operator, whose other operand is a
pointer, not an array. It measures the signed displacement from ptr
in object units.

[...]

The section in the standard that describes the [] operator is titled
"Array subscripting". One of the operands must be a pointer, but it
must be a pointer to an array object (where a single object is treated
as an array of one element).
 

Keith Thompson

Flash Gordon said:
Richard Heathfield wrote, On 12/11/07 09:42:

<snip>

Actually, positive means greater than zero in mathematics, so if as
Malcolm suggested one is using the maths definition you (Richard) were
still correct in your correction. The same applies in English
according to a brief check of a few dictionaries.

I don't think anyone was disagreeing over the meanings of the terms
"positive" and "non-negative". (Positive values are greater than
zero; non-negative values are greater than or equal to zero.)

Somebody (attribution lost) wrote that "All indices must ultimately be
positive". RH corrected that to "non-negative", since 0 is a valid
index. Malcolm IMHO muddied the waters by bringing a non-C context
into the discussion.
 

Richard Heathfield

Flash Gordon said:

Actually, positive means greater than zero in mathematics, so if as
Malcolm suggested one is using the maths definition you (Richard) were
still correct in your correction.

Well, his original claim was not that 0 is positive but that all indices
are positive, which is true from many mathematicians' point of view, since
they tend to label indices starting from 1. Having said that, I've seen a
good few mathematical works in which indices have been labeled from 0.
 

Malcolm McLean

Flash Gordon said:
Looking at a small amount of the code for clamav I saw closer to 50% of
integer variables having something to do with indexing/size (including
variables relating to indexing/size of files), but it did not appear to be
quite up to 50%.

However, it is normally the person making the claim that is expected to
provide the evidence. You are still asserting what we will find if we do
the test, not providing the evidence to back up your claim.
People can't count them right. Maybe because the idea of "ultimately used to
derive indices" is a bit woolly.
If we say


for(i=start;i<=end;i++)
    array[i] = 0;

start and end hold index values, though they are not actually used as the
indexing variable themselves. They are intermediates.

On the other hand, if we write

cmp = strcmp(argv[1], "-x");
if(cmp == 0)
    outfile = argv[2];
else
    outfile = argv[1];

we wouldn't say that argv[1] is really an "index string" and cmp an index
intermediate. The difference is obvious but takes a little bit of sensitivity
to apply.
 
