string comparison

C

CBFalconer

Netocrat said:
.... snip ...

So I tested your idea of using alignment, knowing that my
processor likes to operate on 4 bytes at a time, being equivalent
to sizeof(int).

int memcmp_using_int(const void *a, const void *b, size_t n) {
int num_ints = n / sizeof(int);
int extra_bytes = n % sizeof(int);
unsigned int *ua = (unsigned int *)a;
unsigned int *ub = (unsigned int *)b;
unsigned int *ua_max = ua + num_ints;
unsigned char *uac;
unsigned char *ubc;
unsigned char *uac_max;
Right here you run into trouble. What if the original a and b are
not suitably aligned for ints?. So you need a preamble to get them
so aligned, and that requires that you know what defines alignment
for ints. That last is only known to the implementor. Similarly
you need a postamble to handle the last few bytes.
while (ua < ua_max) {
if (*ua > *ub) .... snip code ...

This version is approximately 3 times faster than glibc and my
original version, which surprised me. Is there any reason to
_not_ make the assumption that I made - that we should operate on
units of sizeof(int) rather than sizeof(char)? It may not
necessarily improve performance - it certainly improves it on my
setup - but could it degrade it?

Are there any other ways to work out - without knowing in advance
- an appropriate unit for your system? I happened to know that
my machine's wordsize is sizeof(int), but can you work out or
approximate what it is from standard includes/definitions?

What you can't accurate work out is the alignment requirements.
Even if you are the implementor, and know all these things, you may
very well find that you have optimized the rare large comparison at
the cost of slowing down the more common small comparison.

Similar considerations apply to memmov, memcpy, strlen, strcmp,
strcpy, strcat etc.

If you know that the original arguments were created by malloc, you
then know that they are suitably aligned for ints, and need only
worry about the postamble. But then you have the highly probable
(and possibly undetectable on your hardware) error of passing an
argument not created by malloc.

Remember KISS.

--
Some informative links:
http://www.geocities.com/nnqweb/
http://www.catb.org/~esr/faqs/smart-questions.html
http://www.caliburn.nl/topposting.html
http://www.netmeister.org/news/learn2quote.html
 
N

Netocrat

Right here you run into trouble. What if the original a and b are not
suitably aligned for ints?.

On my system, it doesn't cause an incorrect result - the tests I'm running
are passing in pointers that are incremented one memory location at a
time. Whether or not it affects performance I don't know. I'm sure many
on this group would be able to answer such a question regarding an x86 but
it's something that has never concerned me before.

However, I posted this code to find out how generally applicable it was,
and as you point out this is a general concern that my code doesn't deal
with.
So you need a preamble to get them so
aligned, and that requires that you know what defines alignment for ints.
That last is only known to the implementor.

OK, that's the sort of thing I wanted to know - you've confirmed my
suspicion that it's not possible to in the general case discover those
sorts of details through standard functions.
Similarly you need a
postamble to handle the last few bytes.

Which I had; although correct me if you don't think it works as it should
(it tested fine)...
What you can't accurate work out is the alignment requirements. Even if
you are the implementor, and know all these things, you may very well
find that you have optimized the rare large comparison at the cost of
slowing down the more common small comparison.

What you say makes sense, although alignment isn't an issue in my case. I
wanted to see if, as you suggest is possible, the small comparison is
degraded by my code as it seems it would be. So I re-tested and instead
of setting the length of the random memory to 100 bytes I varied this
length from 1 to 10 bytes. Anything over 3 bytes was faster; less than
that and the library was faster.

Length Speed of new version vs library
1 3 x slower
2 2 x slower
3 2 x slower
4 2.5 x faster
5 2.5 x faster
6 2 x faster
7 slightly faster
8 2.5 x faster
9 2 x faster
10 2 x faster
11 2 x faster

It may be debatable whether the implementors made the right choice -
what's the average length of data that memcmp() is used for? - I suspect a
lot more than 3 bytes - but I don't suppose it's really on topic here.
Similar considerations apply to memmov, memcpy, strlen, strcmp, strcpy,
strcat etc.

If you know that the original arguments were created by malloc, you then
know that they are suitably aligned for ints, and need only worry about
the postamble. But then you have the highly probable (and possibly
undetectable on your hardware) error of passing an argument not created
by malloc.

Remember KISS.

It's a good principle. I like to think of efficiency and speed too but
they often are synonymous with KISS.
 
R

Randy Howard

On my system, it doesn't cause an incorrect result

That's not the point. In clc, the goal is to demonstrate portable,
correct code, not something that happens to work on your chosen platform.
The "but it works on my system" defense is quite common, so it gets a
lot of (negative) attention.
What you say makes sense, although alignment isn't an issue in my case.

It *is* an issue, you simply decided not to care. Hopefully you won't
leave presents for future maintainers of your code down the road.
 
N

Netocrat

That's not the point. In clc, the goal is to demonstrate portable,
correct code, not something that happens to work on your chosen platform.
The "but it works on my system" defense is quite common, so it gets a lot
of (negative) attention.

I understand that... you have quoted the first part of my answer to the
question but you have chosen to ignore the rest which 'aligns' with the
goal of clc as far as I can tell:
However, I posted this code to find out how generally applicable it was,
and as you point out this is a general concern that my code doesn't deal
with.

It *is* an issue, you simply decided not to care. Hopefully you won't
leave presents for future maintainers of your code down the road.

I do care, which is why I'm posting to this group for feedback. You
have taken what I said out of context. To make it clear: I acknowledge
that alignment is an issue in general and in clc. I wouldn't use that
code, even if I intended to use it only on my system, given that it's been
pointed out that alignment is an issue. The code that I posted is not
being maintained, it is code created solely for testing and for feedback
here.

Let me put the context back into my statement. CBFalconer wrote: '[e]ven
if you are the implementor, and know all these things..' and I responded
with '...alignment isn't an issue in my case...'; meaning that if I were
the implementor, in my case alignment wouldn't be an issue. At least
that's what I intended; obviously it can be interpreted other ways: the
way you have interpreted it is without that "if" that I intended to be
implicit.
 
R

Randy Howard

I understand that... you have quoted the first part of my answer to the
question but you have chosen to ignore the rest which 'aligns' with the
goal of clc as far as I can tell:

I didn't ignore it, I simply didn't disagree with the rest. I try not
to include entire articles unless absolutely necessary. I did come on a
bit strong however, for which I apologize. I really reacted to the
hundreds of posts over the years that say "but they work on my system",
and pointed it all at you, which I should not have done.
To make it clear: I acknowledge
that alignment is an issue in general and in clc. I wouldn't use that
code, even if I intended to use it only on my system, given that it's been
pointed out that alignment is an issue. The code that I posted is not
being maintained, it is code created solely for testing and for feedback
here.

Fair enough.
 
N

Netocrat

I didn't ignore it, I simply didn't disagree with the rest. I try not to
include entire articles unless absolutely necessary. I did come on a bit
strong however, for which I apologize. I really reacted to the hundreds
of posts over the years that say "but they work on my system", and pointed
it all at you, which I should not have done.

No probs, I'm a new poster so you don't have much reference on my
perspective. It was good of you to be so gracious.
 
O

Old Wolf

Lew said:
That /should/ have been
int compare_strings(char *a, char *b)
{
while (*b && (*a++ == *b++)) ;
return (int)(*a-*b);
}

What is the purpose of that cast?
 
B

Ben Pfaff

Netocrat said:
Actually that's useless as a fix per your words - instead:
return *a == *b ? 0 : *a > *b ? 1 : -1;

There are probably cheaper ways to do this knowing details about the
implementation but this is hopefully a strictly compatible general
version.

Here's my favorite version. I got it from someone here on clc
but don't remember whom:
return *a < *b ? -1 : *a > *b;
 
C

CBFalconer

P

Peter Nilsson

CBFalconer said:
Try:
int compare_strings(char *a, char *b)
{
unsigned char *ua = a, *ub = b;
...
... No overflow. No casts.

Just a required diagnostic from the constraint violation. ;)
 
A

Al Bowers

Ben said:
Here's my favorite version. I got it from someone here on clc
but don't remember whom:
return *a < *b ? -1 : *a > *b;

It's ok, but I don't believe you would need the last comparison
if you check equaity(requiring a return of 0) within the loop.

while(*a++ == *b++)
if(*a == '\0') return 0;
return *a<*b?-1:1;
 
A

Al Bowers

Al said:
It's ok, but I don't believe you would need the last comparison
if you check equaity(requiring a return of 0) within the loop.

while(*a++ == *b++)
if(*a == '\0') return 0;
return *a<*b?-1:1;

That should be a for loop.
for( ; *a == *b; a++,b++)
if(*a == '\0') return 0;
return *a<*b?-1:1;
 
R

Rajan

Yadur,
I have written the strcmp function , let me just paste it here.
You don't need to do a seperate length comparison , you just need to
check for each of the arguments.

int strcmp(const char* a1, const char* b1)
{
while (*a1 == *b1)
{
if ((*a1 == '\0') && (*b1 == '\0'))
{
return 0;
}
a1++;
b1++;
}
return(*a1 - *b1);
}
 
P

pete

Rajan said:
Yadur,
I have written the strcmp function , let me just paste it here.
You don't need to do a seperate length comparison , you just need to
check for each of the arguments.

int strcmp(const char* a1, const char* b1)
{
while (*a1 == *b1)
{
if ((*a1 == '\0') && (*b1 == '\0'))

There's no point in checking both *a1 and *b1
for equality with zero,
since you already know that *a1 == *b1 at this point in code.
{
return 0;
}
a1++;
b1++;
}
return(*a1 - *b1);
}

That's wrong.
Any character compared with strcmp
has to give the same result as if compared by memcmp.

int str_cmp(const char *s1, const char *s2)
{
const unsigned char *p1 = (const unsigned char *)s1;
const unsigned char *p2 = (const unsigned char *)s2;

while (*p1 == *p2) {
if (*p1 == '\0') {
return 0;
}
++p1;
++p2;
}
return *p2 > *p1 ? -1 : 1;
}

N869
7.21.4 Comparison functions
[#1] The sign of a nonzero value returned by the comparison
functions memcmp, strcmp, and strncmp is determined by the
sign of the difference between the values of the first pair
of characters (both interpreted as unsigned char) that
differ in the objects being compared.
 
R

Rajan

Pete,
I think you should do a "man strcmp" and check because if whenever you
compare let's say "Raj" and "Rajendra" it will always give you the
difference as ascii value of '\0' - ascii value of 'e' which comes to
-101.
It's not the same as memcmp. memcmp compares the number of bytes for
any data type unlike strcmp which is specifically for const char*.
Write a simple program using strcmp function and check the retval.
 
P

pete

Ben said:
Here's my favorite version. I got it from someone here on clc
but don't remember whom:
return *a < *b ? -1 : *a > *b;

It wasn't me.

I usually write that as:
return *b > *a ? -1 : *a != *b;
Sometimes as:
return *b > *a ? -1 : *a > *b;

because I like to write sorting functions
in terms of only a GT(,) macro.

#define GT(A, B) (*(A) > *(B))
/* The above macro is for arithmetic types only */

static int comp(const void *arg1, const void *arg2)
{
return GT((e_type*)arg2, (e_type*)arg1)
? -1 : GT((e_type*)arg1, (e_type*)arg2);
}

The above comp function,
would be for a function with a qsort interface,
which is comparing arithmetic types.

For functions which only can sort one dimensional arrays,
the GT(,) macro can be used directly.

void s3sort(e_type *array, size_t nmemb)
{
e_type temp, *i, *j, *k, *after;

after = array + nmemb;
if (nmemb > (size_t)-1 / 3 - 1) {
nmemb = nmemb / 3 - 1;
} else {
nmemb = (3 * nmemb + 1) / 7;
}
while (nmemb != 0) {
i = array + nmemb;
do {
j = i - nmemb;
if (GT(j, i)) {
k = i;
temp = *k;
do {
*k = *j;
k = j;
if (nmemb + array > j) {
break;
}
j -= nmemb;
} while (GT(j, &temp));
*k = temp;
}
++i;
} while (i != after);
nmemb = (3 * nmemb + 1) / 7;
}
}

The GT(,) macro can be made more complicated to do other things
like to sort an array string pointers by string length.

#define GT(A, B) (lencomp((A), (B)) > 0)

int lencomp(const void *a, const void *b)
{
const size_t a_len = strlen(*(char **)a);
const size_t b_len = strlen(*(char **)b);

return b_len > a_len ? -1 : a_len != b_len;
}
 
P

pete

Rajan said:
Pete,
I think you should do a "man strcmp"

I think you should realise that the C standard defines the language,
and that "man" doesn't define the language.
The contents of your man pages are only
topical on this newsgroup to the extent
that they are right if they agree with the standard,
and that they are wrong if they disagree wiht the C standard.

ISO/IEC 9899:1999(E)
7.21.4 Comparison functions
1 The sign of a nonzero value returned by the comparison
functions memcmp, strcmp, and strncmp is determined by
the sign of the difference between the values of the first
pair of characters (both interpreted as unsigned char)
that differ in the objects being compared.
 
N

Netocrat

Now that's rare. Calling any of us "gracious". We usually only get
one-half of that number of letters.

But hang on, how do any of you get any programming done if you're all
Chief Information Officers.... oh, OK, not half of _those_ letters...
 
L

Lawrence Kirby

On Thu, 23 Jun 2005 05:46:57 +1000, Netocrat wrote:

....
Which I had; although correct me if you don't think it works as it should
(it tested fine)...

This makes big assumptions abut the representation of integers in the
platform. It won't work if they contain any padding bits or use anything
other than big endian byte order. And technically speaking C's type
aliasing rules disallow this.

Were you comparing equal data?

....
What you say makes sense, although alignment isn't an issue in my case. I
wanted to see if, as you suggest is possible, the small comparison is
degraded by my code as it seems it would be. So I re-tested and instead
of setting the length of the random memory to 100 bytes I varied this
length from 1 to 10 bytes. Anything over 3 bytes was faster; less than
that and the library was faster.

Length Speed of new version vs library
1 3 x slower
2 2 x slower
3 2 x slower
4 2.5 x faster
5 2.5 x faster
6 2 x faster
7 slightly faster
8 2.5 x faster
9 2 x faster
10 2 x faster
11 2 x faster

It may be debatable whether the implementors made the right choice -
what's the average length of data that memcmp() is used for? - I suspect a
lot more than 3 bytes - but I don't suppose it's really on topic here.

The average length of data being compared is not important for non-equal
data, what is important is the position of the first character that
differs. For "random" data this will be the first character tested the
great majority of the time.

These are techniques available for the implementation to use for
standard library functions if the implementor chooses. They are not
normally techniques a program should use.

Lawrence
 
L

Lawrence Kirby

Just a required diagnostic from the constraint violation. ;)

Fix that and add const because the string aren't being modified and we
might be getting somewhere. :)

Lawrence
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,166
Messages
2,570,907
Members
47,448
Latest member
DeanaQ4445

Latest Threads

Top