James Harris
....
That form can cause an overflow when the numbers are high. Otherwise
(high+low)/2 is marginally faster.
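To make the overflow concrete, here is a small C sketch of the two forms. It assumes 32-bit unsigned values so that the wraparound is well defined; the names mid_naive and mid_safe are made up for illustration and do not come from the thread.

```c
#include <stdint.h>

/* (low + high) / 2: the 32-bit sum wraps once it exceeds UINT32_MAX. */
uint32_t mid_naive(uint32_t low, uint32_t high) {
    return (uint32_t)(low + high) / 2u;
}

/* low + (high - low) / 2: no intermediate value exceeds high,
   assuming low <= high. */
uint32_t mid_safe(uint32_t low, uint32_t high) {
    return low + (high - low) / 2u;
}
```

With low = 3000000000 and high = 4000000000, mid_naive returns 1352516352 because the sum wraps, while mid_safe returns 3500000000.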
Understood, thanks. There's an interesting point here. The
straightforward code to implement that expression might be
    mov rmid, rhigh
    sub rmid, rlow
    shr rmid, 1
    add rmid, rlow
but that could be improved to
    mov rmid, rlow
    add rmid, rhigh
    rcr rmid, 1
because rcr rotates that 'lost' carry bit back in as the new top bit
and, IIRC, rcr by one bit is generally as fast as the shifts and the
rotates which don't include the carry. (Rotates through the carry are
not so fast when they move more than one bit position at a time.)
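The arithmetic of the add/rcr sequence can be modelled in C roughly as follows. This is only a sketch of the result, not of the instruction timing; it assumes 32-bit unsigned operands, and mid_with_carry is a hypothetical name.

```c
#include <stdint.h>

/* Models "add" followed by "rcr ..., 1": the carry out of the
   addition becomes the top bit of the halved result. */
uint32_t mid_with_carry(uint32_t low, uint32_t high) {
    uint32_t sum = (uint32_t)(low + high);  /* add: may wrap */
    uint32_t carry = (sum < low) ? 1u : 0u; /* carry flag: 1 if the add wrapped */
    return (sum >> 1) | (carry << 31);      /* carry enters at bit 31 */
}
```

For low = 3000000000 and high = 4000000000 this gives 3500000000, the same as low + (high - low) / 2.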
In the case in point, since the top bits of rlow and rhigh will not be
set (so their sum cannot carry out of the register), it can be
improved to
    lea rmid, [rlow + rhigh]
    shr rmid, 1
on most CPUs. The Intel Atom is the only one I am aware of for which
this would be slowed down by a stall if either rlow or rhigh were
generated just before the lea instruction. For the Atom, I think the
code above including rcr might be best (or rcr could be replaced by
shr).
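The top-bits-clear case can be sketched in C like this. It assumes 32-bit unsigned values; mid_small is a hypothetical name, and the assert only spells out the precondition under which the plain sum cannot wrap.

```c
#include <assert.h>
#include <stdint.h>

/* Plain add-then-shift, valid only when both inputs have bit 31 clear. */
uint32_t mid_small(uint32_t low, uint32_t high) {
    assert(low <= 0x7FFFFFFFu && high <= 0x7FFFFFFFu);
    return (low + high) >> 1;  /* the sum fits in 32 bits, so no carry is lost */
}
```

The largest allowed inputs, 0x7FFFFFFF and 0x7FFFFFFF, sum to 0xFFFFFFFE, which still fits, so the shift alone gives the right answer.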
James