string comparison

Y

yadurajj

Hello i am newbie trying to learn C..I need to know about string
comparisons in C, without using a library function,...recently I was
asked this in an interview..I can write a small program but I was told
that wouldn't it be wise to first get the length of the strings..if it
doesn't match then they are not the same..I agreed...then he said..but
again that would be an overhead first measuring the length...and then
doing a character by character comparison...I was confused and was
wondering if anybody has an answer to this theory. I am only trying to
undestand...

thanks
 
L

Lew Pitcher

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello i am newbie trying to learn C..I need to know about string
comparisons in C, without using a library function,...recently I was
asked this in an interview..I can write a small program but I was told
that wouldn't it be wise to first get the length of the strings..if it
doesn't match then they are not the same..I agreed...then he said..but
again that would be an overhead first measuring the length...and then
doing a character by character comparison...I was confused and was
wondering if anybody has an answer to this theory. I am only trying to
undestand...

Consider this: in C, a string is just an array of char, with zero or more
significant characters, and a '\0' character which terminates the string.

The length of the string is the count of the significant characters, up to
(but not including) the '\0' character.

It is true that, if the length of one string is different from the length of
another string, then the strings are different.

However, all that is really saying is that, in a character-by-character
comparison, where one string has a '\0', the other does not.

So, yes, it would be ideal to determine the lengths of the strings, but you
can only do so through a character-by-character inspection of each string.
If you are already going to examine the strings character-by-character,
comparing each character position to '\0' (to find the end, and thus the
length, of each string), there is no reason why you should not compare each
character of one string to the correspondingly placed character of the other.

Indeed, identical strings will compare identically up to and including the
'\0' character of each string. Different strings will compare identical up to
the first character difference, which may include the '\0' of one string
comparing unequal to the corresponding character of another string.


A visual example might help

String 1 A B C \0
: : : : Identical
String 2 A B C \0

or

String 1 A B C \0
: : : x Different
String 2 A B C D \0

or

String 1 A B Q \0
: : x Different
String 2 A B C \0

or even

String 1 A B Q \0
x Different
String 2 Z Y X \0

- --
Lew Pitcher

Master Codewright & JOAT-in-training | GPG public key available on request
Registered Linux User #112576 (http://counter.li.org/)
Slackware - Because I know what I'm doing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFCuM2nagVFX4UWr64RApU+AJ4lI1+ui+FjPPN29OMYukiDF/QVsQCeIv9k
EhRxPiJaixr9YT9PivVRUkw=
=EpUj
-----END PGP SIGNATURE-----
 
S

SM Ryan

(e-mail address removed) wrote:
# Hello i am newbie trying to learn C..I need to know about string
# comparisons in C, without using a library function,...recently I was
# asked this in an interview..I can write a small program but I was told
# that wouldn't it be wise to first get the length of the strings..if it

The interviewer is either a jackass (not uncommon) or trying to trick
you (also not uncommon). In general the only way to measure strings
in C is to scan them from the first character to the null character.
If you measure the strings first you're going to end up doing the same
scan twice.

In ANSI C, a string comparison function can look something like
int compare(char *a,char *b) {
int cc = 0; while (cc==0 && *a && *b) cc = *a++ - *b++;
if (cc!=0) ;
else if (*a) cc = 1;
else if (*b) cc = -1;
return cc;
}
Sometimes you can compare words in machine dependent fashion and get
a four or eight time speed up.
 
L

Lew Pitcher

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

SM Ryan wrote:
[snip]
In ANSI C, a string comparison function can look something like
int compare(char *a,char *b) {
int cc = 0; while (cc==0 && *a && *b) cc = *a++ - *b++;
if (cc!=0) ;
else if (*a) cc = 1;
else if (*b) cc = -1;
return cc;
}

Or even

int compare_strings(char *a, char *b)
{
while (*a++ == *b++) ;
return (int)(*a-*b);
}

- --
Lew Pitcher

Master Codewright & JOAT-in-training | GPG public key available on request
Registered Linux User #112576 (http://counter.li.org/)
Slackware - Because I know what I'm doing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFCuNMSagVFX4UWr64RAmRkAKDbXBeL3v3qA5Q0zWfjee9pgU4LGgCg4u5y
yykp27lAPLGzgEtoXL7eTYw=
=FAF0
-----END PGP SIGNATURE-----
 
K

kaby

I suggest you may use *long* instead of *char* to get full capability
of register.
and keep one eye on snaping 32bit and 8bit.
 
P

Peter Nilsson

Lew said:
SM Ryan wrote:
[snip]
In ANSI C, a string comparison function can look something like
int compare(char *a,char *b) {
int cc = 0; while (cc==0 && *a && *b) cc = *a++ - *b++;
if (cc!=0) ;
else if (*a) cc = 1;
else if (*b) cc = -1;
return cc;
}

Or even

int compare_strings(char *a, char *b)
{
while (*a++ == *b++) ;
return (int)(*a-*b);
}

I'd mark both of these as wrong. Neither replicates strcmp(). ;)
 
N

Netocrat

Lew said:
SM Ryan wrote:
[snip]
In ANSI C, a string comparison function can look something like
int compare(char *a,char *b) {
int cc = 0; while (cc==0 && *a && *b) cc = *a++ - *b++; if (cc!=0) ;
else if (*a) cc = 1;
else if (*b) cc = -1;
return cc;
}
}
Or even

int compare_strings(char *a, char *b) {
while (*a++ == *b++) ;
return (int)(*a-*b);
}
}
I'd mark both of these as wrong. Neither replicates strcmp(). ;)

The bug in the second is easy to spot, and from what I can tell the first
compare() differs from strcmp() in that strcmp() treats its arguments as
unsigned char *, whereas compare() treats them as they are when passed in:
char *.

Is that what you were referring to? If so, the problem seems to be solved
by changing the first line and adding two:
int compare(char *a,char *b) {
becomes
int compare(char *_a,char *_b) {
unsigned char *a = (unsigned char *)_a;
unsigned char *b = (unsigned char *)_b;

I'm just going by my gcc implementation - I don't know whether the
standard requires strcmp to behave this way...
 
L

Lawrence Kirby

Hello i am newbie trying to learn C..I need to know about string
comparisons in C, without using a library function,...recently I was
asked this in an interview..I can write a small program but I was told
that wouldn't it be wise to first get the length of the strings..if it
doesn't match then they are not the same.

That's a possible strategy if you are only testing for equality. But if
you also need to determine which of 2 unequal strings is greater then it
doesn't help.
.I agreed...then he said..but
again that would be an overhead first measuring the length...and then
doing a character by character comparison...I was confused and was
wondering if anybody has an answer to this theory. I am only trying to
undestand...

You have to consider whether the strings are likely to differ near the
start. If you have a 100 character string and they differed in the 2nd
character then determining the length would be a very costly operation
compared to just comparing the character sequences. And strings usually
do differ in the first couple of characters unless there is a strong
relationship between them already.

Lawrence
 
L

Lew Pitcher

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Lew said:
SM Ryan wrote:
[snip]


Or even

int compare_strings(char *a, char *b)
{
while (*a++ == *b++) ;
return (int)(*a-*b);
}

Gak!!

That's what I get for off-the-cuff coding when I'm tired.

That /should/ have been
int compare_strings(char *a, char *b)
{
while (*b && (*a++ == *b++)) ;
return (int)(*a-*b);
}

- --

Lew Pitcher, IT Specialist, Enterprise Data Systems
Enterprise Technology Solutions, TD Bank Financial Group

(Opinions expressed here are my own, not my employer's)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)

iD8DBQFCuVG5agVFX4UWr64RAvPTAJwKHRRb0JTB+zxtf8iHgrAAfRUmpwCfYm8O
QFVzBgyX10UnnW68nJ6iOeU=
=JwjB
-----END PGP SIGNATURE-----
 
W

websnarf

Hello i am newbie trying to learn C..I need to know about string
comparisons in C, without using a library function,...recently I was
asked this in an interview..I can write a small program but I was told
that wouldn't it be wise to first get the length of the strings..if it
doesn't match then they are not the same..I agreed...then he said..but
again that would be an overhead first measuring the length...

It is *highly* unlikely that this comment was carried to its logical
conclusion. If he's talking about "performance", then there is a lot
more to this question than this.
[...] and then
doing a character by character comparison...I was confused and was
wondering if anybody has an answer to this theory. I am only trying to
undestand...

Ok, first of all, if the lengths are intrinsically available to you in
the first place (in most situations, when you are in control of all the
code and data formats, this is generally trivial to guarantee), then
sure, you can and perhaps should first compare the lengths, afterwhich
you want to perform the equivalent of a memcmp().

If you don't have the length, and your strings are just '\0' terminated
char * strings, then performing strlen()'s first and precomparing will
never be better in terms of performance. The reason is that the whole
cost of string comparison is the loop overhead, and memory traversal.
By calling strlen twice, you are basically (roughly) tripling the loop
overhead, and doubling the memory traversal cost.

In terms of amount of work done, the moment you have enough information
to know the two lengths are different, is the same moment you can know
that the substance of the two strings are different.

Ok, so here are some remaining questions:

1) How should one implement a standard strcmp? -- as I recall, this is
like 3 lines of code.

2) How might one implement a standard memcmp? (On the assumption that
lengths are always available to you, since it basically costs nothing
for this to be the case.) Doing this naively is easy; but trying to
take advantage of low-level hardware characteristics, such as
alignment, this can be hard if you are really intent on squeezing out
the maximum performance.

3) Is this string comparison isolated? For example, let us say that
you are performing a string comparison for the purpose of inserting
into a hash table to avoid inserting duplicates. Well in this case a
scalar "trace" of the string is precomputed anyways in the form of its
hash function mapping. So if you are store the hash values along with
each string then doing this pre-compare (like the length pre-compare,
except you should expect this to have far fewer false positives) will
improve average comparison performance.
 
L

Lawrence Kirby

I suggest you may use *long* instead of *char* to get full capability
of register.
and keep one eye on snaping 32bit and 8bit.

Unless the OP is already familiar with the technique I think you are
talking about this isn't likely to mean very much. I am familiar with
techniques for handling character data a word at a time, but even then
this doesn't make an awful lot of sense. :)

I wouldn't worry about this for the purposes of an interview question,
especially when the issue is whether to determine the length of the string
first.

Lawrence
 
L

Lawrence Kirby

Lew said:
SM Ryan wrote:
[snip]
In ANSI C, a string comparison function can look something like
int compare(char *a,char *b) {
int cc = 0; while (cc==0 && *a && *b) cc = *a++ - *b++; if (cc!=0) ;
else if (*a) cc = 1;
else if (*b) cc = -1;

These don't take into account the relative values of *a and *b so are
wrong.
The bug in the second is easy to spot,

Which of the bugs in the second are you referring to? :)

It will walk off the end of identical strings.

There are 2 instances of an off-by-one index error in the return
statement.

C doesn't guarantee that char has a smaller range than int so *a-*b could
overflow. You can usually get away with this in practice but at least a
comment is in order.

Since char values can be negative it has a curious property that a
longer string can compare less than a shorter string that is a prefix of
it (assuming a trivial fix for bug 2).
and from what I can tell the first
compare() differs from strcmp() in that strcmp() treats its arguments as
unsigned char *, whereas compare() treats them as they are when passed in:
char *.

True. And using unsigned char eliminates the "curious property".
Is that what you were referring to? If so, the problem seems to be
solved
by changing the first line and adding two:
int compare(char *a,char *b) {
becomes
int compare(char *_a,char *_b) {
unsigned char *a = (unsigned char *)_a; unsigned char *b = (unsigned
char *)_b;

I'm just going by my gcc implementation - I don't know whether the
standard requires strcmp to behave this way...

strcmp() is defined to act on unsigned char values.

Lawrence
 
N

Netocrat

Lew Pitcher wrote:
SM Ryan wrote:
[snip]
In ANSI C, a string comparison function can look something like
int compare(char *a,char *b) {
int cc = 0; while (cc==0 && *a && *b) cc = *a++ - *b++; if (cc!=0)
; else if (*a) cc = 1;
else if (*b) cc = -1;

These don't take into account the relative values of *a and *b so are
wrong.

At the risk of sounding clueless... I'm uncertain of firstly what you are
referring to by 'these' and secondly in any event how the relative values
aren't being taken into account. Perhaps by the second you mean that *a
and *b need to be treated as unsigned char rather than char, as you
confirm is the case later in your post?
Which of the bugs in the second are you referring to? :)

It will walk off the end of identical strings.

That's where I stopped looking...
There are 2 instances of an off-by-one index error in the return
statement.

C doesn't guarantee that char has a smaller range than int so *a-*b
could overflow. You can usually get away with this in practice but at
least a comment is in order.

Since char values can be negative it has a curious property that a
longer string can compare less than a shorter string that is a prefix of
it (assuming a trivial fix for bug 2).

Here's a fix then (if this were for real-life code I'd rewrite it so the
parameters were a and b rather than _a and _b but this way involves
minimal change...):

int compare_strings(char *_a, char *_b) {
unsigned char *a = (unsigned char *)_a;
unsigned char *b = (unsigned char *)_b;
while (*a && *b && *a == *b) {
a++;
b++;
}
return (int)*a-(int)*b;
}
True. And using unsigned char eliminates the "curious property".

Truth be told that "curious property" is the means by which I discovered
the bug. For any string of printable characters the original function is
fine but I didn't doubt that Peter had a reason for quibbling with it, so
I dug deeper...

[snip]
strcmp() is defined to act on unsigned char values.

Cheers for the confirmation.
 
N

Netocrat

Lew said:
SM Ryan wrote:
[snip]
In ANSI C, a string comparison function can look something like
int compare(char *a,char *b) {
int cc = 0; while (cc==0 && *a && *b) cc = *a++ - *b++; if (cc!=0) ;
else if (*a) cc = 1;
else if (*b) cc = -1;
return cc;
}
}

Or even

int compare_strings(char *a, char *b) {
while (*a++ == *b++) ;
return (int)(*a-*b);
}
}
Gak!!

That's what I get for off-the-cuff coding when I'm tired.

That /should/ have been

Hate to do this to you, but I think you need more sleep :p

The increments should occur within the loop, not in the test, and as has
been discussed in other posts, the comparison should be done treating the
chars as unsigned. Finally Lawrence Kirby pointed out that strictly
speaking char is not guaranteed to have smaller range than int, so the
return statement is not necessarily correct, but usually will be.
int compare_strings(char *a, char *b) {
while (*b && (*a++ == *b++)) ;
return (int)(*a-*b);
}

Perhaps:

int compare_strings(char *a, char *b) {
unsigned char *ua = (unsigned char *)a;
unsigned char *ub = (unsigned char *)b;
while (*ub && *ua == *ub) {
ua++;
ub++;
}
return (int)*ua-(int)*ub;
}
 
N

Netocrat

C doesn't guarantee that char has a smaller range than int so *a-*b
could overflow. You can usually get away with this in practice but at
least a comment is in order.

Here's a fix then [snip]
return (int)*a-(int)*b;

Actually that's useless as a fix per your words - instead:
return *a == *b ? 0 : *a > *b ? 1 : -1;

There are probably cheaper ways to do this knowing details about the
implementation but this is hopefully a strictly compatible general
version.
 
N

Netocrat

Ok, so here are some remaining questions:

1) How should one implement a standard strcmp? -- as I recall, this is
like 3 lines of code.

I doubt it can be done in 3 lines of code. For a start the arguments to
strcmp are consts so you need at least two lines of variable declarations.
There are a few different implementations - not using const parameters -
scattered through this thread.
2) How might one implement a standard memcmp? (On the assumption that
lengths are always available to you, since it basically costs nothing
for this to be the case.)

No need for assumptions, length is one of the parameters to memcmp.
Doing this naively is easy; but trying to
take advantage of low-level hardware characteristics, such as alignment,
this can be hard if you are really intent on squeezing out the maximum
performance.

If it were 'standard' surely it wouldn't have knowledge of low-level
hardware characteristics. Perhaps I misunderstand you. Anyhow on my
platform (Pentium 4 running Linux with gcc 3.3.5) you can squeeze quite a
big performance gain by knowing that the word-size of the machine is equal
to sizeof(int).

The, as you put it naive, but portable attempt:

int memcmp(const void *a, const void *b, size_t n) {
unsigned char *ua = (unsigned char *)a;
unsigned char *ub = (unsigned char *)b;
unsigned char *ua_max = ua + n;

while (ua < ua_max) {
if (*ua > *ub)
return 1;
else if (*ua < *ub)
return -1;
else {
ua++;
ub++;
}
}
return 0;
}

I compared the speed of this implementation to the glibc memcmp by running
memcmps of 50 million (less 100) pairs of different memory locations
containing random data of length 100 bytes.

It turns out that the glibc version is roughly 10-15% faster - no big deal.

So I tested your idea of using alignment, knowing that my processor likes
to operate on 4 bytes at a time, being equivalent to sizeof(int).

int memcmp_using_int(const void *a, const void *b, size_t n) {
int num_ints = n / sizeof(int);
int extra_bytes = n % sizeof(int);
unsigned int *ua = (unsigned int *)a;
unsigned int *ub = (unsigned int *)b;
unsigned int *ua_max = ua + num_ints;
unsigned char *uac;
unsigned char *ubc;
unsigned char *uac_max;

while (ua < ua_max) {
if (*ua > *ub)
return 1;
else if (*ua < *ub)
return -1;
else {
ua++;
ub++;
}
}
uac = (unsigned char *)ua;
ubc = (unsigned char *)ub;
uac_max = uac + extra_bytes;
while (uac < uac_max) {
if (*uac > *ubc)
return 1;
else if (*uac < *ubc)
return -1;
else {
uac++;
ubc++;
}
}
return 0;
}

This version is approximately 3 times faster than glibc and my original
version, which surprised me. Is there any reason to _not_ make the
assumption that I made - that we should operate on units of sizeof(int)
rather than sizeof(char)? It may not necessarily improve performance - it
certainly improves it on my setup - but could it degrade it?

Are there any other ways to work out - without knowing in advance - an
appropriate unit for your system? I happened to know that my machine's
wordsize is sizeof(int), but can you work out or approximate what it is
from standard includes/definitions?
 
C

CBFalconer

Netocrat said:
On Wed, 22 Jun 2005 07:55:37 -0400, Lew Pitcher wrote:
[snip]
Gak!!

That's what I get for off-the-cuff coding when I'm tired.

That /should/ have been

Hate to do this to you, but I think you need more sleep :p

The increments should occur within the loop, not in the test, and
as has been discussed in other posts, the comparison should be
done treating the chars as unsigned. Finally Lawrence Kirby
pointed out that strictly speaking char is not guaranteed to have
smaller range than int, so the return statement is not necessarily
correct, but usually will be.
int compare_strings(char *a, char *b) {
while (*b && (*a++ == *b++)) ;
return (int)(*a-*b);
}

Perhaps:

int compare_strings(char *a, char *b) {
unsigned char *ua = (unsigned char *)a;
unsigned char *ub = (unsigned char *)b;
while (*ub && *ua == *ub) {
ua++;
ub++;
}
return (int)*ua-(int)*ub;
}

Still wrong. Try:

int compare_strings(char *a, char *b)
{
unsigned char *ua = a, *ub = b;

while ((*ua == *ub) && *ua) {
ua++; ub++;
}
return (*ua > *ub) - (*ub > *ua);
}

notice the placement of the test for '\0'. No overflow. No casts.

--
Some informative links:
http://www.geocities.com/nnqweb/
http://www.catb.org/~esr/faqs/smart-questions.html
http://www.caliburn.nl/topposting.html
http://www.netmeister.org/news/learn2quote.html
 
N

Netocrat

Still wrong. Try:

int compare_strings(char *a, char *b) {
unsigned char *ua = a, *ub = b;

while ((*ua == *ub) && *ua) {
ua++; ub++;
}
return (*ua > *ub) - (*ub > *ua);
}
}
notice
the placement of the test for '\0'.

I see no reason for this rearrangement... logically it has the same
result. Whether the loop ends because *ua != *ub or because *ua == '\0'
or because as in the case of the original *ub == '\0', the return code
does the same thing... so why does it matter?
No overflow.

Yes, I noted elsewhere that I stuffed the overflow protection up. Your
solution is elegant; mine requires less computation - ignoring the
effects of optimisation. It was (with different variable names):
return *a == *b ? 0 : *a > *b ? 1 : -1;
No casts.

I use gcc's -Wall option.

Without the casts I get:

warning: pointer targets in initialization differ in signedness

which I'd rather not see, but point taken, they're unnecessary.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,166
Messages
2,570,907
Members
47,448
Latest member
DeanaQ4445

Latest Threads

Top