Malcolm said:
If you were a games programmer you'd have something called a DMA engine,
which will do block transfers and clears in parallel.
DMAmemcpy() is one of the first things to implement / learn how to use.
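
(For anyone who hasn't met one: the usual shape of that
interface is start the transfer, keep the CPU busy on
something else, then synchronize.  The dma_* names below
are invented for illustration, not any particular
console's API, and the start call is stubbed with plain
memcpy() so the sketch compiles and runs anywhere; on
real hardware it would program the DMA controller and
return immediately.)

#include <stdio.h>
#include <string.h>

/* Stand-ins for a platform's DMA interface.  On real hardware
 * dma_copy_start() would kick the controller and return at once,
 * and dma_wait() would block until it signals completion. */
static void dma_copy_start(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

static void dma_wait(void)
{
    /* real code would poll or sleep on a completion flag here */
}

int main(void)
{
    static char src[1 << 16] = "hello";
    static char dst[1 << 16];

    dma_copy_start(dst, src, sizeof dst);  /* transfer runs alongside the CPU */

    /* ... CPU keeps working while the copy is in flight ... */

    dma_wait();                            /* synchronize before touching dst */
    printf("%s\n", dst);
    return 0;
}
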
Don't teach your grandmother to suck eggs. I was
writing parallel code before C was invented, running
simultaneous programs in the CPU and in the hardware of
two independent I/O channels, and getting the whole thing
to synchronize. Real-time streaming quadraphonic audio
from disk to a D-to-A converter, using 1960s hardware.
(Full disclosure: The original version of this program
was written by someone else, but I wound up owning it for
a couple of years and extended it in various ways. I
learned a lot by reading that other guy's code; he was a
really good craftsman.)
Oddly enough, some of the now-unfashionable languages
and libraries of the time were able to overlap I/O with
processing, a capability lacking in C's simple model.
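
You can get some of that overlap back on a modern POSIX
system with asynchronous I/O and a pair of buffers: read
the next block with aio_read() while the CPU chews on the
one you already have.  What follows is a rough sketch, not
anybody's production code; process_block() is a
placeholder, error handling is minimal, and older Linux
systems need -lrt at link time.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLK 65536

static void process_block(const char *buf, ssize_t n)
{
    (void)buf;
    (void)n;                       /* placeholder for the real work */
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    static char buf[2][BLK];
    struct aiocb cb;
    off_t off = 0;
    int cur = 0;

    memset(&cb, 0, sizeof cb);             /* kick off the first read */
    cb.aio_fildes = fd;
    cb.aio_buf    = buf[cur];
    cb.aio_nbytes = BLK;
    cb.aio_offset = off;
    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    for (;;) {
        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)  /* wait for block in flight */
            aio_suspend(list, 1, NULL);
        ssize_t n = aio_return(&cb);
        if (n <= 0)                            /* EOF or error */
            break;

        int done = cur;
        off += n;
        cur = 1 - cur;

        memset(&cb, 0, sizeof cb);     /* start the next read first... */
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur];
        cb.aio_nbytes = BLK;
        cb.aio_offset = off;
        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        process_block(buf[done], n);   /* ...then process while it runs */
    }

    close(fd);
    return 0;
}
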
Nowadays we try to parallelize by foisting the problem
off on the O/S; this is effective to some extent, but
involves some compromises in throughput. I recently
troubleshot a customer problem that arose entirely from
relying on the O/S to do the program's buffering for it;
if the program's I/O model had allowed it to control its
own buffering, the problem would never have arisen.
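
For what it's worth, the one knob portable C does give a
program is setvbuf(): hand stdio a buffer the program
owns, of a size the program chooses, before the first
operation on the stream, rather than taking whatever the
runtime and O/S hand out.  It doesn't reach down into the
kernel's own caching, but it is the part a C program can
control.  A minimal sketch; the file name and the 1 MiB
size are arbitrary examples, not recommendations.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("data.log", "w");   /* file name is just an example */
    if (!f) {
        perror("fopen");
        return 1;
    }

    static char iobuf[1 << 20];          /* buffer the program owns */
    if (setvbuf(f, iobuf, _IOFBF, sizeof iobuf) != 0) {
        fprintf(stderr, "setvbuf failed\n");
        return 1;
    }

    /* writes now accumulate in iobuf and are flushed on our terms */
    for (int i = 0; i < 100000; i++)
        fprintf(f, "record %d\n", i);

    fflush(f);      /* explicit flush at a point the program chooses */
    fclose(f);
    return 0;
}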