memcmp() - compare buffer for 0

Mark Adler · Oct 11, 2011

For example:

int isallz(char *buf, size_t len)
{
long *wide;

while ((size_t)buf & (sizeof(long) - 1))
if (len--, *buf++)
return 0;
wide = (long *)buf;
while (len >= sizeof(long))
if (len -= sizeof(long), *wide++)
return 0;
buf = (char *)wide;
while (len)
if (len--, *buf++)
return 0;
return 1;
}

And of course, there's an error in there as an exercise for the reader
to find. Anyway, it illustrates the approach.

Mark

Ben Bacarisse · Oct 11, 2011

bartek szurgot said:
Just another few data points: I see the plain loop version as being 14
times *slower* than the memcmp version for 512MB, provided I don't
optimise! At gcc -O1 and above the memcmp takes twice the time the
plain loops does. The reason seems to be that gcc replaces memcmp with
a repz cmpsb instruction (plus housekeeping gubbins) and these string
instructions have a reputation for being slow.

The memcmp method is still four times faster than the loop (on this
data, on this machine, with the library, with... etc) provided gcc can
be told not to inline it to cmpsb instruction.

To summarise:
for loop memcmp
gcc unoptimised 13.333s 0.932s
gcc optimising 4.160s 8.250s

I'd post the CPU, gcc version and so on but I think the real messages is
*measure*. You can still be surprised after <mumble> years in this
field.

Click to expand...

i was surprised with your results even more than with my own. i
already removed test code i've wrote before posting my last message and
so i had to rewrite it once more... and bang! it looks like i had a bug
(probably #ifndef instead of #ifdef for switching implementations with
-D), because now i'm getting similar computation times i got before, but
both versions are faster for memcmp(). namely:

for() memcmp()
-g3 ~1.2 ~0.11
~~-O3 ~0.4 ~~~0.11~~~~

It would seem that your compiler is not replacing memcmp when
optimising. Taking this into account, the two sets of results agree
pretty well -- there's a nearly constant factor between your results and
mine.

I could leave it at that and pretend that my machine is old and ten
times slower than yours, but I'll come clean -- to get accurate times I
execute the test 10 times and I did not divide by the 10! I usually
don't bother to divide because I normally only care about ratios, not
actual MB/s.

i think this can be ok, since it would be reasonable that call to ext.
library lasts the same amount of time, regardless of release/debug
builds and hand-made code to be faster, when optimized.

source code for this is:
http://pastebin.com/sP9hexnW#
(note: this code is only a PoC - it does not check all errors)

Click to expand...

It's C++. Probably makes no difference. I still get slower times when
optimising memcmp:

for loop memcmp
g++ 1.4 0.10
g++ -O2 0.43 0.83

can you check your code as well, please?

Click to expand...

I think the code is alright -- the units were wrong.
http://pastebin.com/cGZxpSia

MaciekL · Oct 12, 2011

Is there another way to compare the input buffer for being equal to
some pattern (all zeros in this case), i.e. ti avoid having additional
array zerobuf?

Click to expand...

I'm surprised that no one offered the obvious answer that uses memcmp()
without another array:

if (MAX_LEN> 0&& buf[0] == 0&& memcmp(buf, buf + 1, MAX_LEN - 1) == 0)
puts("buf is all zeros");
else
puts("buf has at least one non-zero value");

Though it's probably faster to compare each entry to zero, or even
faster to recast to an array of large integers and compare those to
zero.

Click to expand...

Usual alignment caveats apply!

int is_all_bytes_set(void const * data, int value, unsigned int size)
{
memset(&value, value, sizeof(value));
if (size > sizeof(value))
{
char const * ptr = (char const *)data;
return !(memcmp(ptr, &value, sizeof(value)) ||
memcmp(ptr, ptr + sizeof(value), size - sizeof(value)));
}
else
{
return !memcmp(&value, data, size);
}
}

bartek szurgot · Oct 12, 2011

It would seem that your compiler is not replacing memcmp when
optimising. Taking this into account, the two sets of results agree
pretty well -- there's a nearly constant factor between your results and
mine.

I could leave it at that and pretend that my machine is old and ten
times slower than yours, but I'll come clean -- to get accurate times I
execute the test 10 times and I did not divide by the 10! I usually
don't bother to divide because I normally only care about ratios, not
actual MB/s.

ok - this explains a lot.

It's C++. Probably makes no difference. I still get slower times when
optimising memcmp:

for loop memcmp
g++ 1.4 0.10
g++ -O2 0.43 0.83

I think the code is alright -- the units were wrong.
http://pastebin.com/cGZxpSia

after compiling & comparing execution times of both codes i still
couldn't understand the results. why, in (logically) identical programs
for()-version gives similar times (buf vs. buf++ was the only
difference), while memcmp() differs about 6x! :/

after doing some digging and exercising i found what looks like a gcc
performance problem.
http://pastebin.com/nMcD7f6z#
when compiling with "inline", time is far different then when no inline
is present. it does not affect debug version though. in short:

inline non-inline
-g3 ~0.8 ~0.8
-O3 ~0.14 ~0.8

i got this results for gcc-4.6. gcc-4.5 compiled code runs always
(deb/rel) in ~0.14 (i.e. fine). you probably observe similar effect
on your machine.

BartC · Oct 12, 2011

bartek szurgot said:
On 10/11/2011 07:31 PM, Ben Bacarisse wrote:

after doing some digging and exercising i found what looks like a gcc
performance problem.
http://pastebin.com/nMcD7f6z#
when compiling with "inline", time is far different then when no inline
is present. it does not affect debug version though. in short:

Benchmarking with gcc always gave me problems, especially when comparing
with other languages.

Anything over -O1 optimisation tended to remove essential bits of code whose
execution time I was trying to compare! So I only used -O1.

Although it's difficult with your essentially one-line program to see
exactly what it could leave out...

bartek szurgot · Oct 12, 2011

Anything over -O1 optimisation tended to remove essential bits of code
whose execution time I was trying to compare! So I only used -O1.

it is common for the compilers to remove dead code. it is actually a
nice feature - why computing something not used at all? there is a
simple trick to overcome this during measurements - output some
(randomly chosen) part of the result. then compiler cannot optimize-out
code, in such a situation, since it is used (printed on the screen). on
the other hand such a simple operation (usually) does not influence
measurements significantly.

Ben Bacarisse · Oct 12, 2011

bartek szurgot said:
after compiling & comparing execution times of both codes i still
couldn't understand the results. why, in (logically) identical programs
for()-version gives similar times (buf vs. buf++ was the only
difference), while memcmp() differs about 6x! :/

I thought I explained that when I posted the first times. I certainly
had an explanation that for what was happening on my machine with my
version of gcc.

after doing some digging and exercising i found what looks like a gcc
performance problem.
http://pastebin.com/nMcD7f6z#
when compiling with "inline", time is far different then when no inline
is present. it does not affect debug version though. in short:

inline non-inline
-g3 ~0.8 ~0.8
-O3 ~0.14 ~0.8

i got this results for gcc-4.6. gcc-4.5 compiled code runs always
(deb/rel) in ~0.14 (i.e. fine). you probably observe similar effect
on your machine.

Click to expand...

No, I don't, but I don't think it matters. For one thing, you are using
C++ and I am using C. We might have different versions of gcc and/or
the C library. Who knows? The key points, for me, are:

(a) Often, there is no useful notion of "faster" code. The same code
will perform differently in different environments or even with
different compiler flags.

(b) Even in one environment, what's fast will depend on the data. We'd
probably be having a different discussion if you'd chosen a 30 byte
buffer.

(c) It's useful to have a feel for the relative costs of certain basic
operations, but clarity of expression will often trump speed, unless
measurement shows that readability must be sacrificed to get a program
performing as required.[1]

(d) In most cases, it's algorithmic changes that will get you really
valuable performance gains.

[1] I've just spent some time writing a slow Reed-Solomon decoder. All
the existing code I looked at was super fast but, for me, almost
unintelligible. That was probably the right thing for those authors to
do, since there are many situations where fast decoding is essential.
But I wanted to be able to see what the code was really doing, so I have
de-optimised like mad. The result is probably 10 (maybe more!) times
slower than the fastest software decoders, but I now largely understand
it, and it is fast enough because the program spends a tiny faction of
its run time decoding a small amount of data

Jorgen Grahn · Oct 12, 2011

Benchmarking with gcc always gave me problems, especially when comparing
with other languages.

Anything over -O1 optimisation tended to remove essential bits of code whose
execution time I was trying to compare! So I only used -O1.

So you'd rather have meaningless results? If your benchmark doesn't
survive the optimizer, it's not relevant to real code.

You *have* to make sure the things you are calculating actually get
used, as far as the compiler can tell. One way is feeding it into some
void touch(void*) function in another translation unit -- there's no
way around that as far as I can see.

/Jorgen

Noob · Oct 12, 2011

Jorgen said:
You *have* to make sure the things you are calculating actually get
used, as far as the compiler can tell. One way is feeding it into some
void touch(void*) function in another translation unit -- there's no
way around that as far as I can see.

Some optimizations can be deferred until link-time ;-)

e.g.
http://gcc.gnu.org/wiki/LinkTimeOptimization
http://llvm.org/docs/LinkTimeOptimization.html

Regards.

Jorgen Grahn · Oct 13, 2011

Some optimizations can be deferred until link-time ;-)

e.g.
http://gcc.gnu.org/wiki/LinkTimeOptimization
http://llvm.org/docs/LinkTimeOptimization.html

True, but that is not widely used yet. The gcc version seemed (last
time I looked) as an experimental feature.

/Jorgen

buffer size	3	May 4, 2009
optimizing the switch statement in Duff's Device (casting a label, label abuse)	10	Oct 10, 2007
Generic programming in C	46	Apr 17, 2010
Weird Behavior with Rays in C and OpenGL	4	Feb 13, 2024
Using the containers library	0	Aug 3, 2011
huffman encoder	16	Dec 21, 2005
feedback on code design	23	May 30, 2012
Macro NULL or 0	5	May 25, 2009

memcmp() - compare buffer for 0

Mark Adler

Ben Bacarisse

MaciekL

bartek szurgot

BartC

bartek szurgot

Ben Bacarisse

Jorgen Grahn

Noob

Jorgen Grahn

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads