Implementing my own memcpy

Nils Weller

I have no idea what it goes with, because your previous article was
much too long to read. :) However, you have still goofed, because
there is no such standard function as 'read'. Look up fread, which
IS portable.

And nobody claimed that read() is a standard C function. I explicitly
commented the code as being Unix-specific in the previous, too long
post. Moreover, the macro that triggered this sub-thread has also been
pointed out to be Unix-specific, and there has been some talk about Unix
kernel implementation and compatibility system software.

Perhaps an OT tag was missing, but I think it is clear that we aren't
talking about standard C anymore.
 
Dave Thompson

On 2005-06-25 11:45:13 -0400, Netocrat <[email protected]> said:

Of course there is; in fact, there are several:

Assuming a and b are of the same complete type, any of the following
will copy the contents of a into b:

#include <stdlib.h>

Not actually needed for anything in this code. (size_t is in string.h)
#include <string.h>

/*1*/ b = a;

For complete _nonarray_ types.
/*2*/ memcpy(&b, &a, sizeof b);
/*3*/ memmove(&b, &a, sizeof b);
/*4*/ const unsigned char *src = (const unsigned char*)&a;
unsigned char *dst = (unsigned char*)&b;
for(size_t i=0; i<sizeof b; ++i)
{
    dst[i] = src[i];
}


The rest (2 through 4) work for all complete types, arrays included.
And if you can determine a size by some means other than sizeof, they
work even for objects declared but not defined, with incomplete types.
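
As a minimal sketch of the four methods above, the following copies an
object each way and checks the result. The type `pair`, the helper
`byte_copy` and both function names are made up for illustration; they
are not from the posts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type, used only for this illustration. */
struct pair { int x; double y; };

/* Method 4 as a helper: byte-by-byte copy of any object of known size. */
static void byte_copy(void *to, const void *from, size_t n)
{
    const unsigned char *src = from;
    unsigned char *dst = to;
    size_t i;

    assert(to != NULL && from != NULL);
    for (i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Returns how many of the four methods reproduced a in b. */
int count_working_copies(void)
{
    struct pair a = { 42, 2.5 };
    struct pair b;
    int ok = 0;

    memset(&b, 0, sizeof b);
    b = a;                            /* 1: assignment, non-array types only */
    ok += (b.x == a.x && b.y == a.y);

    memset(&b, 0, sizeof b);
    memcpy(&b, &a, sizeof b);         /* 2: memcpy */
    ok += (b.x == a.x && b.y == a.y);

    memset(&b, 0, sizeof b);
    memmove(&b, &a, sizeof b);        /* 3: memmove */
    ok += (b.x == a.x && b.y == a.y);

    memset(&b, 0, sizeof b);
    byte_copy(&b, &a, sizeof b);      /* 4: unsigned char loop */
    ok += (b.x == a.x && b.y == a.y);

    return ok;
}

/* Arrays cannot be assigned, but methods 2 through 4 still apply. */
int array_copy_works(void)
{
    int src[3] = { 1, 2, 3 };
    int dst[3] = { 0, 0, 0 };

    memcpy(dst, src, sizeof dst);
    return memcmp(dst, src, sizeof dst) == 0;
}
```

The fields are compared individually rather than with memcmp because
plain assignment is not required to copy padding bytes.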

- David.Thompson1 at worldnet.att.net
 
Dave Thompson

The void * type can point at arbitrary things, and a size_t can
specify a size on any machine. But to use void* you have to
convert to other types, thus:

void *dupmem(void *src, size_t sz)
{
unsigned char *sp = src;
unsigned char *dst;

if (dst = malloc(sz)) /* memory is available */
while (sz--) *dst++ = *sp++; /* copy away */
return dst; /* will be NULL for failure */

return dst - sz, unless all your callers will (and must) adjust down
the pointer before using it to access the memory, and free() it.
} /* dupmem, untested */

Note how src is typed into sp, without any casts. Similarly the
reverse typing for the return value of dupmem. The usage will be,
for p some type of pointer:
Although it would be more informative, and convenient for some
call(er)s, to declare src and sp as pointer to const void/uchar.
if (p = dupmem(whatever, howbig)) {
/* success, carry on */
}
else {
/* abject failure, panic */
}

- David.Thompson1 at worldnet.att.net
 
Dave Thompson

My thinking is perhaps colored by too many years of assembly coding
and instruction sets that include "decrement and branch if nonzero":

test r3
bz Laround
Lloop:
mov (r1)+,(r2)+
sobgtr r3,Lloop # cheating (but this is OK)
Laround:
and so on. (The first loop is VAX assembly, and "cheating" is OK
because r1 and/or r2 should never cross from P0/P1 space to S space,
nor vice versa, so the maximum block size never exceeds 2 GB; <snip>

Not movb? Isn't the default word=long? Or is this some overambitious
assembler that you (have to) tell about value types?

Most (I think all but first two or so) models of PDP-11 also had
sub-1-brback-ne (only) which they managed to publish as SOB before
marketing caught them. PDP-6/10 already had a whole series of SOB*,
but only SOBN or SOBG would do what you wanted here not SOB.
(All 16 dyadic booleans are implemented, but SKIP doesn't; JUMP
doesn't; the fastest jump varies but is never JUMP*; etc., etc.)

ISTR 68k, which you also mentioned (snipped), also had a mildly
offcolor opcode, somewhere else.

- David.Thompson1 at worldnet.att.net
 
Dave Thompson

Also C90 and C89 seem to be interchangeable terms - correct?

Effectively. C89 was the document developed "by" (under) ANSI, then
submitted to "ISO" (already JTC1?) and adopted with technically
identical contents but different numbering scheme and (I believe) some
of the boilerplate about copyright, authority, and such. Thus if you
want to refer to a clause number, as we fairly often do, you need to
specify which; and if you had a lawsuit turning on compliance to one
or the other standard you might have to produce that exact document to
support your case. But as far as what a C implementation is required
or permitted to do, and thus what a program(mer) can rely on or
expect, they are interchangeable.

In contrast C99 was voted first by "ISO" (as I understand it really
SC22), and adopted as-is by ANSI (really NCITS? INCITS?).

Finally I understand that C90/C89 had some modifications made prior to C99
- where are those detailed?

See FAQ 11.1 and .2 -- at least in the text version posted and online
at usual places; the webized http://www.eskimo.com/~scs/C-faq/top.html
has been out-of-date the last few times I checked and this is one of
the points that has changed. But:
- the statement about the Rationale was for only the original ANSI
version C89, which is no longer (realistically) available;
- it says Normative Addendum which I'm pretty sure should be
Amendment; C90 plus that amendment is sometimes called C95
- (several!) drafts of an updated Rationale for C99, as well as drafts
of C99 itself (through n869) and C0X (n1124) can be gotten from the WG
site which is now (renamed?) www.open-std.org/JTC1/SC22/WG14 .
(As well as other stuff you might be interested in, for that matter.)

And for your further delectation and enjoyment, you could get the
~1600-page e-book by Derek M Jones discussed in another thread, which
AFAICT-so-far exegizes the standard process, the resulting document,
and the language specified in it, and more.

If you actually want C90 instead of or in addition to C99, ANSI
apparently no longer sells it, but webstore.ansi.org (still) lists DIN
and AS adoptions of 9899:1990, and I'm guessing the latter might be
available to you more conveniently.

- David.Thompson1 at worldnet.att.net
 
Chris Torek

(Off-topic drift warning :) )

Not movb? Isn't the default word=long? Or is this some overambitious
assembler that you (have to) tell about value types?

No, just a goof; it should have been "movb".

ISTR 68k, which you also mentioned (snipped), also had a mildly
offcolor opcode, somewhere else.

I do not recall any from the 680x0 series, but the 1802 had several.

Each register was 16 bits (I am almost certain, despite the 8-bit
claim on the page referenced below), but the 8-bit opcodes could
address only the high or low half of each register, so there was
a "put low" and "put high" to write to each half, and the corresponding
pair of "get"s. This meant the 1802 had GHI, the "get high"
instruction.

The 1802 also had two special registers named P (program counter)
and X (index). However, neither P nor X were actual registers;
instead, they were register *numbers*, pointing to one of the 16
general-purpose registers. You had to use a "set p" or "set x"
instruction to point the P and X indirection at the appropriate
register. These had three-letter assembler mnemonics; the first
was SEP, and the second was the now-obvious.

(See also <http://shop-pdp.kent.edu/ashtml/as1802.htm>.)
 
CBFalconer

Dave said:
return dst - sz, unless all your callers will (and must) adjust down
the pointer before using it to access the memory, and free() it.

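That still doesn't fix my goof above: by the time the loop exits, sz no
longer holds the original size (the final, failing sz-- wraps it past
0), so dst - sz is not the start of the block. Try this: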

#include <stdlib.h> /* malloc */

void *dupmem(void *src, size_t sz)
{
    unsigned char *sp = src;
    unsigned char *dst, *p;

    if (p = dst = malloc(sz)) /* memory is available */
        while (sz--) *p++ = *sp++; /* copy away */
    return dst; /* will be NULL for failure */
}
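
Following Dave's suggestion of a const-qualified source, a variant might
look like the sketch below. The names `dupmem_c` and `dupmem_c_demo` are
invented here to avoid clashing with the version above. Note that
malloc(0) may legitimately return NULL, so a NULL result for sz == 0 is
not necessarily a failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical const-correct variant of dupmem; the caller must free(). */
void *dupmem_c(const void *src, size_t sz)
{
    const unsigned char *sp = src;
    unsigned char *dst, *p;

    assert(src != NULL || sz == 0);
    dst = malloc(sz);
    if (dst != NULL)                  /* memory is available */
        for (p = dst; sz != 0; --sz)  /* copy away, dst stays put */
            *p++ = *sp++;
    return dst;                       /* NULL on failure */
}

/* Example use: duplicate a string, terminator included.
   Returns 1 on a faithful copy, 0 on mismatch, -1 if allocation failed. */
int dupmem_c_demo(void)
{
    const char text[] = "hello";
    char *copy = dupmem_c(text, sizeof text);
    int same;

    if (copy == NULL)
        return -1;
    same = (strcmp(copy, text) == 0);
    free(copy);
    return same;
}
```

With a const source, callers can pass string literals and const objects
without a cast, and the returned pointer still converts to any object
pointer type without one.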
 
BGreene

I apologize to the group but i haven't heard "decrement and branch if not
zero" in many a year.
 
Netocrat

On Sat, 25 Jun 2005 19:58:19 +0000, Chris Torek wrote:

[a memcpy function in response to my buggy version]
void *like_memcpy(void *restrict dst0, const void *restrict src0,
                  size_t n) {
    unsigned char *restrict dst = dst0;
    const unsigned char *restrict src = src0; /* const: src0 points to const */

    if (n)
        do
            *dst++ = *src++;
        while (--n != 0);
    return dst0;
}




My thinking is perhaps colored by too many years of assembly coding and
instruction sets that include "decrement and branch if nonzero":

<snip discussion to which I responded in a later post>

I was spurred to actually benchmark the different approaches on my
machine. It's a little over the top, but my belief is that it's not
really possible to predict which approach will be faster - even knowing
the machine's architecture you can't know what the compiler will do. So
to me these sort of things are really a matter of personal preference.
So here is my attempt to back up that intuition at least on my machine.

I used the function quoted above, as well as the quoted proposed
alternative, and my function as fixed by Kevin Bagust:
void *mem_cpy( void *dest, const void *src, size_t bytes ) {
    unsigned char *destPtr = dest;
    unsigned char const *srcPtr = src;
    unsigned char const *srcEnd = srcPtr + bytes;

    while ( srcPtr < srcEnd ) {
        *destPtr++ = *srcPtr++;
    }
    return dest;
}

I compiled at four of gcc's optimisation levels (none, -O1, -O2, -O3),
and at each level ran two tests: with and without -march=pentium4 (my
machine's architecture). Each test ran many iterations of the copy at
sizes of 0, 1, 2, 8, 25 and 80 bytes, and I timed the duration using
clock().

And the results?

At the unoptimised level, both of Chris's alternatives were equal.

In every other case the first of Chris's alternatives far outperformed the
second (by a minimum of 14% and maximum of 21%).

So I modified the 'alternative' expression from
while (n--) *dst++ = *src++;
to
while (n) {
*dst++ = *src++;
n--;
}

This brought the alternative function back close to the performance of the
original. I don't know why the degradation was occurring; presumably
something to do with one or more of the variables being decremented or
incremented one more time than necessary.
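
One concrete difference between the two forms is where n ends up:
`while (n--)` still decrements on the final, failing test, so an unsigned
n wraps around to (size_t)-1, whereas the rewritten loop leaves n at 0.
Whether that explains the timing gap is not established here. The sketch
below (function names invented for illustration) only demonstrates the
difference in final values.

```c
#include <assert.h>
#include <stddef.h>

/* Copies n bytes with the compact loop and returns the final value of n. */
size_t n_after_postdec_loop(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    while (n--)
        *dst++ = *src++;
    return n;   /* wrapped to (size_t)-1 by the final, failing test */
}

/* Same copy with the decrement moved into the body. */
size_t n_after_split_loop(unsigned char *dst, const unsigned char *src,
                          size_t n)
{
    while (n) {
        *dst++ = *src++;
        n--;
    }
    return n;   /* 0: nothing is decremented once n reaches 0 */
}

/* Runs both loops on a small buffer; returns 1 if both copied correctly. */
int compare_loop_counts(void)
{
    const unsigned char src[4] = { 1, 2, 3, 4 };
    unsigned char a[4] = { 0 }, b[4] = { 0 };

    assert(n_after_postdec_loop(a, src, sizeof src) == (size_t)-1);
    assert(n_after_split_loop(b, src, sizeof src) == 0);
    return a[3] == 4 && b[3] == 4;
}
```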

In the unoptimised case, my function outperformed Chris's functions by
about 15%. In all of the optimised cases, they were roughly equal -
varying from his performing 3% better than mine to mine performing 2%
better than his.

So even though it's platform-specific I think that this test shows that
choosing between these loop constructions should be based on personal
preference as to readability - a performance benefit can't be assumed for
any particular style - unless you are developing for a particular system
for which you know one style is more performant than the others.
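
The timing approach described above can be sketched roughly as follows.
This is not the harness actually used for these results; `copy_fn`,
`mem_cpy_loop` and `time_copy` are hypothetical names, the 1024-byte
buffer limit is an arbitrary choice, and the iteration count is left to
the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature for the copy routines under test. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* One candidate: a pointer-comparison byte loop. */
void *mem_cpy_loop(void *dest, const void *src, size_t bytes)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    const unsigned char *end = s + bytes;

    while (s < end)
        *d++ = *s++;
    return dest;
}

/* Library memcpy behind the same signature, for comparison. */
void *mem_cpy_lib(void *dest, const void *src, size_t bytes)
{
    return memcpy(dest, src, bytes);
}

/* Calls fn `iterations` times on `bytes` bytes and returns the CPU time
   used, in seconds, as measured by clock(). */
double time_copy(copy_fn fn, size_t bytes, unsigned long iterations)
{
    static unsigned char src[1024], dst[1024];
    volatile unsigned char sink = 0;
    clock_t start, end;
    unsigned long i;

    assert(bytes <= sizeof src);
    start = clock();
    for (i = 0; i < iterations; ++i) {
        src[0] = (unsigned char)i;  /* vary the input between calls */
        fn(dst, src, bytes);
        sink ^= dst[0];             /* use the result so it is not dropped */
    }
    end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing, say, time_copy(mem_cpy_loop, 80, iterations) against
time_copy(mem_cpy_lib, 80, iterations) at each optimisation level gives
figures of the kind quoted in this thread. Calling through a function
pointer keeps the routines from being inlined, which may itself change
the ranking.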
 
Chris Croughton

In every other case the first of Chris's alternatives far outperformed the
second (by a minimum of 14% and maximum of 21%).

So I modified the 'alternative' expression from
while (n--) *dst++ = *src++;
to
while (n) {
*dst++ = *src++;
n--;
}

This brought the alternative function back close to the performance of the
original. I don't know why the degradation was occurring; presumably
something to do with one or more of the variables being decremented or
incremented one more time than necessary.

Some odd optimisation?

Incidentally, if you still have the test code around, could you also try

while (n) {
*dst = *src;
++src;
++dst;
--n;
}

(And is there a difference between n--; and --n; on your system?)

Just to get the results from the same system as used for your original
results. (Incidentally, how did they compare with the system-supplied
memcpy? I believe gcc inlines that to assembler at some optimisation
levels...)

So even though it's platform-specific I think that this test shows that
choosing between these loop constructions should be based on personal
preference as to readability - a performance benefit can't be assumed for
any particular style - unless you are developing for a particular system
for which you know one style is more performant than the others.

Indeed. And bear in mind that it may change completely with the next
version of the compiler, or switching to another compiler on the same
platform. I've found that trusting the compiler and library writers to
have picked the best optimisations is right most of the time...

Chris C
 
Netocrat

Some odd optimisation?

Anything's possible.

Incidentally, if you still have the test code around, could you also try

while (n) {
*dst = *src;
++src;
++dst;
--n;
}

I retested and included this modification that you suggested. Your
modification is always faster than the original while(n--) loop and is
roughly the same across all of the optimisation levels as the modification
that I made (worst performance is 17% slower than my mod at -O1 - an
aberration since for all other cases their separation is a few percent -
and best performance is 5% faster at -O3 -march=pentium4).

(And is there a difference between n--; and --n; on your system?)

I'm not sure about the general case - but I tested your modification above
with n-- and --n. There is a small variation that differs between the
optimisation levels - neither is consistently faster. The biggest
separation I found was post-decrement being about 3% faster at -O3
-march=pentium4. I repeated this test a few times to check that it wasn't
a one-off error due to system loading and the result was consistently
within the bounds of .05% and 3%. The initial 3% result is probably not
accurate but there's no doubt that in this case the compiler generates
slightly faster code for post-decrement.

Just to get the results from the same system as used for your original
results. (Incidentally, how did they compare with the system-supplied
memcpy? I believe gcc inlines that to assembler at some optimisation
levels...)

Its execution time doesn't vary between the sizes I originally tested as
much as the other functions' times do. Nor is its performance affected by
optimisation level. With or without optimisations, it is always the
slowest function for sizes of 0..8 bytes. Without optimisations, from
about 16 bytes it starts consistently performing far better - eg at 40
bytes it is 150% faster than any other function. With optimisations it's
"in the mix" - not much better or worse than the others up to roughly 40
bytes and from then on it consistently beats them.

I tested for larger sizes at all optimisation levels:

At 80 bytes the library function was a minimum of 34% faster than any
other function (340% faster when optimisation switches were not used).

At 1024 bytes it was at least 270% faster (1400% faster without
optimisations).

At 10 kilobytes it was at least 400% faster.

At 100 kilobytes it was at least 65% faster. Also optimisations changed
its performance - it was fastest without optimisations and at -O1 it was
twice as slow as without optimisations.

At 1 megabyte things had evened out and it was roughly the same as the
others and in some cases slightly slower. It performed the same at all
optimisation levels.

At 10 and 100 megabytes I only tested for -O3 -march=pentium4 and again it
was roughly the same as the other functions.

Indeed. And bear in mind that it may change completely with the next
version of the compiler, or switching to another compiler on the same
platform. I've found that trusting the compiler and library writers to
have picked the best optimisations is right most of the time...

Agreed - and if you _really_ need specific hard-core optimisations, don't
rely on the compiler except perhaps to use its output as a base - go with
assembly. That way the results aren't dependent on things beyond your
control like compiler code-generation.
 
