Short-circuiting in C

J

James Kuyper

Hey CLC,

I was wondering if there is a way to avoid the C short circuiting for
boolean expressions such as the following:

if(a && b && c) {
/* Do stuff */
}

I am currently doing GPU programming and it would be better to evaluate
the entire condition, even if a is 0, since branches are slow.

I assume that, if you're willing to have the entire condition evaluated,
then neither b nor c represent expressions with side-effects. In
particular, neither of them is or involves the value of an object
defined with volatile.

Then there's two basic cases that need to be distinguished:
1. The compiler can figure out that neither b nor c are expressions with
side effects.
If this is the case, then any sufficiently good compiler targeting your
platform should generate code that evaluates both b and c, and then
discarding their values if not needed; at least, it should do so
whenever it makes sense to do so. This is allowed by the C standard,
even though it says explicitly that they should only be evaluated if
needed. That's because of the fact that they have no side-effects. As a
result, there's no way for a strictly conforming program to determine
whether or not the compiler is doing this, so the as-if rule allows the
compiler to do this.
If your compiler isn't smart enough to perform this optimization
automatically, there's no way in C to tell it explicitly to perform it.

2. The compiler cannot figure out that neither b nor c have side-effects.
About the only way this could be true is if one or both of those
expressions involve a function call. If all three expressions have
integral type, try:

if(a & b & c)
{
}

The use of '&' rather than '&&' removes the short-circuiting behavior,
but the net result is the same, as long as integer types are involved.
If any of the expressions has a floating point or pointer type, you'll
have to insert !=0, and that will probably be converted into a branch,
which is precisely what you're trying to avoid.
 
K

Keith Thompson

James Kuyper said:
2. The compiler cannot figure out that neither b nor c have side-effects.
About the only way this could be true is if one or both of those
expressions involve a function call. If all three expressions have
integral type, try:

if(a & b & c)
{
}

The use of '&' rather than '&&' removes the short-circuiting behavior,
but the net result is the same, as long as integer types are involved.

And as long as they all have values of 0 or 1.

[...]
 
B

blmblm

[...]
C has no way to express the computation you desire. C's abstract
machine has only one flow of control, if we ignore the rather mysterious
effects of asynchronous signals and `volatile' variable. There is no
C construct that will evaluate a,b,c in parallel execution paths and
then somehow combine the three results into a single test. You've
chosen the wrong implementation language.

Hesitant though I am to argue with someone who generally seems
far more expert than I am ....

"My blushes, Watson." (And don't be too awed by my expertise;
the only reason I'm in the top ten of this group -- on good days --
is that the top twenty have stopped posting.)

Well, all right, if that's too much flattery, how about

s/far more expert than I am/very clueful/

?
I don't get the impression at all that the OP wants a, b, and
c evaluated in parallel. As I understand it, he just wants to
be sure that there is no branching involved in the evaluation
of the condition (a&& b&& c). I admit I don't have a clear
understanding of the platform he's targeting, but it sounds like
it somehow involves multiple processing elements [*] executing the
same compiled code, and under those circumstances I can vaguely
imagine that performance might be better if all these processing
elements execute exactly the same instructions (i.e., no branches).

[*] The ones that put the "MD" in "SIMD" is what I'm thinking here,
though admittedly not with as much clarity as I might like.

He didn't say anything about parallel execution at first, which
is why I gave a rather different response on the initial go-around.
But later he wrote that "On the GPU I am using, threads execute
simultaneously [provided...]," and I understood that as a request for
a C construct that expresses the "provided" and guarantees parallel
execution.

I think what I'm not getting is why you're interpreting his words as
a request to have *a, b, and c* evaluated ....

Oh, maybe we're talking at cross purposes here! I understand you to
be saying that he wants the three parts of the conditional executed
concurrently, presumably each in its own thread, and that's what
didn't/doesn't make sense to me.

But maybe in fact you're asking about how he hopes to have multiple
threads concurrently evaluating the whole expression (a && b && c)?
Now *that* would at least make sense as something one might want the
program to do.

But in general I understand him to be saying that the parallelism
(multiple threads operating concurrently, each evaluating the whole
expression) is being provided by some other part of the platform,
and he just wants to know how to come up with an expression whose
evaluation doesn't involve branching. That seems to me to be entirely
topical for this group and something one might want to do for other
reasons (such as improved performance on a platform where any kind
of branching is slow), though admittedly some of those alleged reasons
might be ill-advised.
C has conditional constructs that may (*may*) involve no
branches, given smooth seas and a following wind, but has nothing that
can guarantee or even talk about parallel execution, leaving aside
signals and volatiles.

Sure, but I don't think that's what he's asking ....
'course, it's possible I've misunderstood the O.P. or blundered
in some other way, and will be summarily drummed out of the top ten.
Again. (Anybody recall the Brain Trust of the TV series "Scrubs?"
Them's my guys ...)

No room for error, eh? :)?
 
B

BGB

I don't get the impression at all that the OP wants a, b, and
c evaluated in parallel. As I understand it, he just wants to
be sure that there is no branching involved in the evaluation
of the condition (a&& b&& c). I admit I don't have a clear
understanding of the platform he's targeting, but it sounds like
it somehow involves multiple processing elements [*] executing the
same compiled code, and under those circumstances I can vaguely
imagine that performance might be better if all these processing
elements execute exactly the same instructions (i.e., no branches).

I was (initially) under the impression that he wanted consistency between
CPU and GPU performance, and GPUs can't do short-circuit evaluation.

On a SIMD architecture (e.g. a GPU), you only have one program counter for
all processing units, so a branch where the test result can vary between
units cannot be implemented in the conventional manner (by modifying the
program counter). An if-else construct involves executing both branches on
all units; each unit ignores the instructions in one of the branches
depending upon whether the test is true or false on that particular unit.

OTOH, the OP may have just misunderstood what "branches are slow" means in
the context of GPU programming. They're slow in the sense that branching
over a set of instructions takes just as long as executing them. That
doesn't mean that tricks which eliminate branches will necessarily be any
faster.

AFAIK, there may be a point of difference between nVidia and ATI GPUs
here, where nVidia GPUs will in many cases skip over the non-evaluated
code and ignore the instructions, and ATI GPUs will actually perform the
branch, but typically go down only one side of the branch at a time
(having different threads branch differently may mean first evaluating,
say, all threads where the jump was true, followed by all branches where
it was false, ...).

also, the GPUs are normally organized into a large number of "cores"
which operate independently of each other, but each core may operate
some number of threads in parallel, each thread in turn having to move
in lock-step with the others.


granted, I am no particular expert on GPUs, and most of my personal
experience has been with OpenGL and GLSL, rather than OpenCL and its
modified C-subset (and, sadly, the inability to use function pointers
would very much hinder many of my usual coding practices). I may use it
eventually though.

I think I remember hearing at some point about projects trying to run
both Java-ByteCode and ActionScript on the GPU as well, ...
 
B

BGB

Oh, I get it now. The branches referred to are in the '&&' operator.
Thanks.

hmm, maybe:
!((!a)|(!b)|(!c))

I don't know if any have mentioned this one.
reasoning: (!a) will reduce the value to 0/1, allowing plain bitwise
operators; if any of a/b/c is false, the result of (!a)/(!b)/(!c) is 1
(rather than 0), which will then be reversed by the final "!".


but, who knows which optimizations the GPU-specific compiler may have
done?...

a compiler may well observe that if all of the operations are clean /
non-destructive, then the as-if rule will allow optimizing away the
short-circuit.

or, knowing about this, they may well have just "reinterpreted" the &&
and || operators for their particular target, since what they have isn't
strictly C anyways, and so there is no particular need for them to
follow pedantic interpretations of the standard either, for that matter.
 
J

James Kuyper

And as long as they all have values of 0 or 1.

Sorry, you're right. I was in rush when I wrote that, and I was thinking
of || and |, rather than && and &.
 
H

Herbert Rosenau

Hey CLC,

I was wondering if there is a way to avoid the C short circuiting for
boolean expressions such as the following:

if(a && b && c) {
/* Do stuff */
}

I am currently doing GPU programming and it would be better to evaluate
the entire condition, even if a is 0, since branches are slow.
if (!(a | b | c)) {
/* do stuff on neither a, b and c are true */
} else {
/* either a, b or c are false */
}

--
Tschau/Bye
Herbert

Visit http://www.ecomstation.de the home of german eComStation
eComStation 2.0 ist da!
 
M

Michael Press

Eric Sosman said:
[...]
C has no way to express the computation you desire. C's abstract
machine has only one flow of control, if we ignore the rather mysterious
effects of asynchronous signals and `volatile' variable. There is no
C construct that will evaluate a,b,c in parallel execution paths and
then somehow combine the three results into a single test. You've
chosen the wrong implementation language.

Hesitant though I am to argue with someone who generally seems
far more expert than I am ....

"My blushes, Watson." (And don't be too awed by my expertise;
the only reason I'm in the top ten of this group -- on good days --
is that the top twenty have stopped posting.)
I don't get the impression at all that the OP wants a, b, and
c evaluated in parallel. As I understand it, he just wants to
be sure that there is no branching involved in the evaluation
of the condition (a&& b&& c). I admit I don't have a clear
understanding of the platform he's targeting, but it sounds like
it somehow involves multiple processing elements [*] executing the
same compiled code, and under those circumstances I can vaguely
imagine that performance might be better if all these processing
elements execute exactly the same instructions (i.e., no branches).

[*] The ones that put the "MD" in "SIMD" is what I'm thinking here,
though admittedly not with as much clarity as I might like.

He didn't say anything about parallel execution at first, which
is why I gave a rather different response on the initial go-around.
But later he wrote that "On the GPU I am using, threads execute
simultaneously [provided...]," and I understood that as a request for
a C construct that expresses the "provided" and guarantees parallel
execution. C has conditional constructs that may (*may*) involve no
branches, given smooth seas and a following wind, but has nothing that
can guarantee or even talk about parallel execution, leaving aside
signals and volatiles.

'course, it's possible I've misunderstood the O.P. or blundered
in some other way, and will be summarily drummed out of the top ten.
Again. (Anybody recall the Brain Trust of the TV series "Scrubs?"
Them's my guys ...)

I have two coins worth thirty cents total.
One of them is not a nickel.
 
B

BGB

Eric Sosman said:
[...]
C has no way to express the computation you desire. C's abstract
machine has only one flow of control, if we ignore the rather mysterious
effects of asynchronous signals and `volatile' variable. There is no
C construct that will evaluate a,b,c in parallel execution paths and
then somehow combine the three results into a single test. You've
chosen the wrong implementation language.

Hesitant though I am to argue with someone who generally seems
far more expert than I am ....

"My blushes, Watson." (And don't be too awed by my expertise;
the only reason I'm in the top ten of this group -- on good days --
is that the top twenty have stopped posting.)
I don't get the impression at all that the OP wants a, b, and
c evaluated in parallel. As I understand it, he just wants to
be sure that there is no branching involved in the evaluation
of the condition (a&& b&& c). I admit I don't have a clear
understanding of the platform he's targeting, but it sounds like
it somehow involves multiple processing elements [*] executing the
same compiled code, and under those circumstances I can vaguely
imagine that performance might be better if all these processing
elements execute exactly the same instructions (i.e., no branches).

[*] The ones that put the "MD" in "SIMD" is what I'm thinking here,
though admittedly not with as much clarity as I might like.

He didn't say anything about parallel execution at first, which
is why I gave a rather different response on the initial go-around.
But later he wrote that "On the GPU I am using, threads execute
simultaneously [provided...]," and I understood that as a request for
a C construct that expresses the "provided" and guarantees parallel
execution. C has conditional constructs that may (*may*) involve no
branches, given smooth seas and a following wind, but has nothing that
can guarantee or even talk about parallel execution, leaving aside
signals and volatiles.

'course, it's possible I've misunderstood the O.P. or blundered
in some other way, and will be summarily drummed out of the top ten.
Again. (Anybody recall the Brain Trust of the TV series "Scrubs?"
Them's my guys ...)

I have two coins worth thirty cents total.
One of them is not a nickel.

errm... "what is a quarter?...".

but, yeah, the GPU is weird that way, as it tends to execute things in
parallel (rather than the more usual serial execution).
 
E

Edward A. Falk

Hey CLC,

I was wondering if there is a way to avoid the C short circuiting for
boolean expressions such as the following:

if(a && b && c) {
/* Do stuff */
}

I am currently doing GPU programming and it would be better to evaluate
the entire condition, even if a is 0, since branches are slow.

Unless a, b, and c are all simple register values, you'll have a hard
time convincing me that simple short branches are slower than evaluating
them.
 
E

Edward A. Falk

The code that I presented always evaluates all of 'a', 'b', and
'c', which is what the OP said that he wanted.

A sufficiently smart compiler would realize that if y and z have no
side-effects, then their computation can be deferred until they're needed.
The above code would be optimized back to the original.

Now if y and z do have side-effects, then this would indeed generate
different code, forcing them to be evaluated when they might not have
been in the original.

Note that OP wanted to force evaluation of y and z in order to remove
some expensive branches. However, I think that any analysis would show
that the branches are cheaper than evaluating y and z in almost all cases.
 
E

Edward A. Falk

On the GPU I am using, threads execute simultaneously as long as they
all execute the exact same instructions (ie, they all take the branch).
If one of the threads takes a different execution path, then each
thread has to run in serial which means the parallel speed gain is lost.

Wow. I'm curious as to what GPU this is, and what compiler you're
using that takes advantage of this.

<woolgathering>
I worked on Sun's "Majik" architecture (I may have the spelling wrong)
back in the day. It effectively had four 32-bit cpus, running in tandem.
The first cpu was more general-purpose than the other three, and had a full
set of control and branch instructions. The other three could only do
arithmetic, and operated in lock-step with the primary cpu.

If the calculations of a, b, and c in the original code were all
identical, then the compiler would typically arrange for them to
be computed in parallel and the three secondary cpus and the results
combined at the end. I wouldn't really call these separate threads,
but they were definitely parallel computations.
</woolgathering>

Sounds like you're describing some sort of similar vector processor,
in which case, yes, I see why you would want the compiler to
compile (a && b && c) without short-circuiting.

Of course, a sufficiently smart compiler would recognize that b and c can
be computed without harm even if their values wouldn't be needed after
all.

Have you actually compiled your code and determined that the compiler
is not using parallelism?
 
E

Edward A. Falk

Sounds like you're describing some sort of similar vector processor,
in which case, yes, I see why you would want the compiler to
compile (a && b && c) without short-circuiting.

Ahh, never mind. Saw your other post. I guessed right.
 
E

Edward A. Falk

C has no way to express the computation you desire. ...

See my previous post. OP is working with a cpu that doesn't
have multiple threads of execution, per se, but does have
parallel threads of computation.

Imagine:

a = x1 * x2 + x3;
b = y1 * y2 + y3;
c = z1 * z2 + z3;

I've worked on a CPU that could compute those three expressions
in perfect parallel. We made sure our compiler recognized cases
like this and optimized accordingly.

I think OP's problem is that the (a && b && c) short-circuiting
is preventing the compiler from computing those values in
parallel. That's really a failure on the compiler's part, but
OP is hoping there's a way to hint to the compiler that
short-circuiting is not necessary.

And yes, without extending the language, there's no way to
*force* the compiler to compute that stuff in parallel.
 
N

Nizumzen

Note that OP wanted to force evaluation of y and z in order to remove
some expensive branches. However, I think that any analysis would show
that the branches are cheaper than evaluating y and z in almost all cases.

Not on a GPU.

Branching on a GPU kills everything. Don't do it unless you really,
really, really need too.

Although this begs the question why he is using C as both OpenCL and
CUDA have their own C like languages for handling the code that
actually runs on the GPU neither of which follows C's rules exactly as
far as I am aware (most likely for this very reason).
 
E

Eric Sosman

See my previous post. OP is working with a cpu that doesn't
have multiple threads of execution, per se, but does have
parallel threads of computation.

C has no way to express "parallel threads of computation"
(except in the restrictive senses of volatile variables and
asynchronous signals).
Imagine:

a = x1 * x2 + x3;
b = y1 * y2 + y3;
c = z1 * z2 + z3;

I've worked on a CPU that could compute those three expressions
in perfect parallel. We made sure our compiler recognized cases
like this and optimized accordingly.

Fine: Your compiler made promises that are not made by C and
are not expressible in C. There's nothing preventing you from
making such promises, and in the right circumstances such promises
are useful and laudable. But they're *your* promises, not C's,
and compilers other than your own are under no obligation to pay
the slightest heed to them. C has no way to tell those other
compilers about a desire for parallelism, nor any means to detect
a promise broken.
I think OP's problem is that the (a&& b&& c) short-circuiting
is preventing the compiler from computing those values in
parallel. That's really a failure on the compiler's part, but
OP is hoping there's a way to hint to the compiler that
short-circuiting is not necessary.

And yes, without extending the language, there's no way to
*force* the compiler to compute that stuff in parallel.

Exactly. The O.P. needs something other than C, or something
in addition to C.
 
M

Michael Press

BGB <[email protected]> said:
[...]
I have two coins worth thirty cents total.
One of them is not a nickel.

errm... "what is a quarter?...".

but, yeah, the GPU is weird that way, as it tends to execute things in
parallel (rather than the more usual serial execution).

The riddle I recounted was used in a memorable
episode of _Scrubs_ featuring the Brain Trust.
 
B

BGB

BGB<[email protected]> said:
[...]
'course, it's possible I've misunderstood the O.P. or blundered
in some other way, and will be summarily drummed out of the top ten.
Again. (Anybody recall the Brain Trust of the TV series "Scrubs?"
Them's my guys ...)

I have two coins worth thirty cents total.
One of them is not a nickel.

errm... "what is a quarter?...".

but, yeah, the GPU is weird that way, as it tends to execute things in
parallel (rather than the more usual serial execution).

The riddle I recounted was used in a memorable
episode of _Scrubs_ featuring the Brain Trust.

ok. not seen that show...

would insert another random TV quote, except I can't think of much...
may as well then return to having nano-machines gather up Santa-knowledge...

results in picture of Jabba The Hut with a Santa hat:
"hooo... hooo... hooo... I be Santa Claus..."
little green thing: "te he he he he... Santa Claus...".

gives out a pizza...
"hooo... hooo... hooo... now I be Pizza Hut..."
....


or such...
 
B

BGB

Unless a, b, and c are all simple register values, you'll have a hard
time convincing me that simple short branches are slower than evaluating
them.

actually, as far as performance is concerned, simple computations are
faster than a (mispredicted) jump over them often even on plain x86
(including with floating point, albeit with the partial exception of
slow/complex operations like "sqrt()" or "sin()", or large/bulky
calculations).

so, in my experience it is often a little faster just to do math which
works over the entire range, rather than having to add conditionals into
the mix.

granted, it depends some, as many algorithms are expressed better with
conditionals, or need them, so it is not usually worthwhile to try to
micro-optimize them out.


this situation also causes "switch()" when using a jump table to be a
surprisingly expensive operation, which can matter somewhat when using a
switch with a decent number of cases in a tight loop (for a small number
of cases, compilers will often optimize this by using a slightly faster
if-else chain internally, due mostly to branch-predictor magic...).


also, memory accesses are very fast when in-cache (almost as fast as
with plain registers), and IME it means that having values in registers
is not necessarily better performance-wise, since often trying to keep
values in registers may lead to a larger total number of loads and
stores (due to register spills, ...) so often it will make more sense to
leave variables on-stack or similar.

IME, there has also been little difference performance with between
simple and complex memory addresses, as apparently the processor
calculates complex memory addresses more or less for free.

actually, the above also results in a possible optimization of
calculating integer expressions like "x*4+y" using a "lea" instruction.


or such...
 
E

Eric Sosman

actually, as far as performance is concerned, simple computations are
faster than a (mispredicted) jump over them often even on plain x86
(including with floating point, albeit with the partial exception of
slow/complex operations like "sqrt()" or "sin()", or large/bulky
calculations).

In a purely sequential model, only the accuracy of the branch
prediction will make any difference. Either way you write the code,
you will take exactly one branch if there's an `else', zero or one
branches if not. You'll take the same number of branches with `&&'
as with `&' or a variant; the only difference is in how many branches
you'll execute but not take. If they're usually not taken but usually
predicted to be taken, all I can say is your compiler doesn't predict
very well and should get a job doing weather forecasts.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Staff online

Members online

Forum statistics

Threads
474,078
Messages
2,570,570
Members
47,204
Latest member
MalorieSte

Latest Threads

Top