How is strlen implemented?

K

Keith Thompson

Stan Milam said:
Keith said:
Mark McIntyre said:
On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"

But from my experimental results, it seems
that strlen can still return the number of characters of a char array. [...]
I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky
No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

So, you are saying this is a poorly implemented compiler?

Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

An implementation of strlen() that was able to detect the case where
the argument points to the first element of an array that doesn't
contain any '\0' characters would most likely add significant overhead
to *all* operations. The obvious way to implement it is to make all
pointers "fat", so each pointer includes both the base address and
bounds information; strlen() would then have to check the bounds.
 
J

James McIninch

<posted & mailed>

By definition, a character array without a null terminator is not a string.

Calling strlen on somthing that isn't a string will cause undefined behavior
(an error).
 
M

Mark McIntyre

strlen() is almost certainly finding a zero byte immediately after his
array. I'd expect that to be a very common manifestation of the
undefined behavior in this case.

that comes under my definition of 'random' - its by chance finding a
null just shortly after the string, possibly due to some debugging
mode 'helpfulness'.

Of course, if the string were zero length, then....
:)
 
E

Emmanuel Delahaye

roy wrote on 23/04/05 :
Thanks. Maybe my question should be "what if the input is a char array
without a null terminator". But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
I am just not sure whether I am just lucky or sth else happened inside
strlen.

If the string is malformed (missing terminating 0), the behaviour is
undefined. Any thing could happen.

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

..sig under repair
 
E

Emmanuel Delahaye

Stan Milam wrote on 24/04/05 :
So, you are saying this is a poorly implemented compiler?

What would be a better implementation ? If the limit is not here,
anything happens. Blame the coder, not the compiler.

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

"Clearly your code does not meet the original spec."
"You are sentenced to 30 lashes with a wet noodle."
-- Jerry Coffin in a.l.c.c++
 
E

Emmanuel Delahaye

Joe Estock wrote on 23/04/05 :
Interesting seeing \0 so widely in use. On most systems, NULL is defined as
\0, however there are a few special cases where it is not. Shouldn't we be
using NULL instead of \0?

No, because here, we are talking about the null character that is 0 or
'\0' (but I'm too lazy to type the latter).

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

"C is a sharp tool"
 
G

Gregory Pietsch

I checked my libraries, and the following may be faster than the above:

#include <string.h>
#ifndef _OPTIMIZED_FOR_SIZE
#include <limits.h>
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#ifdef _ALIGN
#define UNALIGNED1(X) ((long)X&(sizeof(long)-1))
#else
#define UNALIGNED1(X) 0
#endif

/* Macros for detecting endchar */
#if ULONG_MAX == 0xFFFFFFFFUL
#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)
#elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
/* Nonzero if X (a long int) contains a NULL byte. */
#define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) &
0x8080808080808080)
#else
#define _OPTIMIZED_FOR_SIZE
#endif

#ifdef DETECTNULL
#define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
#endif

#endif
/* strlen */
size_t (strlen)(const char *s)
{
const char *t = s;
#ifndef _OPTIMIZED_FOR_SIZE
unsigned long *aligned_addr;

if (!UNALIGNED1(s)) {
aligned_addr = (unsigned long *) s;
while (!DETECTNULL(*aligned_addr))
aligned_addr++;
/* The block of bytes currently pointed to by aligned_addr
contains a null. We catch it using the bytewise search. */
s = (const char *) aligned_addr;
}
#endif
while (*s)
s++;
return (size_t) (s - t);
}

/* Gregory Pietsch */
 
G

Gregory Pietsch

NULL is usually reserved for the null pointer. Here, we're checking for
the null character, '\0'.

Gregory Pietsch
 
F

Flash Gordon

Gregory said:
I checked my libraries,

Do you mean your personal libraries or your implementations. Remember
that the implementation is allowed to do things you are not allowed to do.
> and the following may be faster than the above:

What above? Please quote enough of the message you are replying to for
us to see what you are talking about. There is an option that gets
Google to do the right thing and if you search the group I'm sure you
will find the instructions. It's in someone's sig, but I can't remember who.
#include <string.h>
#ifndef _OPTIMIZED_FOR_SIZE

An implementation could declare that or not for any reason it wants.
#include <limits.h>
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#ifdef _ALIGN

Again, a compiler could declare that or not as it saw fit.
#define UNALIGNED1(X) ((long)X&(sizeof(long)-1))

There is no guarantee that this will tell you if it is aligned. Some
people around here have worked on word addressed systems where the byte
within the word was flagged in the *high* bits of the address.
#else
#define UNALIGNED1(X) 0
#endif

/* Macros for detecting endchar */
#if ULONG_MAX == 0xFFFFFFFFUL
#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

Misleading name, I initially read that as a screwy attempt to detect a
NULL pointer. DETECTNULCHAR would be better.
#elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
/* Nonzero if X (a long int) contains a NULL byte. */
#define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) &
0x8080808080808080)
#else
#define _OPTIMIZED_FOR_SIZE

Isn't that macro you are defining in the implementation name space?
Anything could happen.
#endif

#ifdef DETECTNULL
#define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
#endif

#endif
/* strlen */
size_t (strlen)(const char *s)
{
const char *t = s;
#ifndef _OPTIMIZED_FOR_SIZE
unsigned long *aligned_addr;

if (!UNALIGNED1(s)) {
aligned_addr = (unsigned long *) s;
while (!DETECTNULL(*aligned_addr))
aligned_addr++;

The above could read bytes off the end of a properly nul terminated
string. For example,
size_t len = strlen("a");
/* The block of bytes currently pointed to by aligned_addr
contains a null. We catch it using the bytewise search. */
s = (const char *) aligned_addr;
}
#endif
while (*s)
s++;
return (size_t) (s - t);

No need to cast the result of the subtraction. The compiler already
knows is is returning a size_t so will do the conversion anyway.
 
L

Lawrence Kirby

Stan Milam said:
Keith said:
On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
[...]

But from my experimental results, it seems
that strlen can still return the number of characters of a char array. [...]
I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky
No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

So, you are saying this is a poorly implemented compiler?

Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

It is part of the implementation which covers both compiler and library.
Many compilers can generate their own inline code for strlen() in which
case the "library" as a separate concept has little to do with it.

Lawrence
 
G

Gregory Pietsch

Flash said:
Do you mean your personal libraries or your implementations. Remember
that the implementation is allowed to do things you are not allowed
to do.

It was my implementation, based on unravelling the "while(*s)s++" loop.
What above? Please quote enough of the message you are replying to for
us to see what you are talking about. There is an option that gets
Google to do the right thing and if you search the group I'm sure you
will find the instructions. It's in someone's sig, but I can't remember who.


An implementation could declare that or not for any reason it wants.

If _OPTIMIZED_FOR_SIZE is declared, the implementation tries to unravel
the "while(*s)s++" loop somewhat.
Again, a compiler could declare that or not as it saw fit.

There's no way to portably detect whether a pointer-to-char is aligned
on a long boundary, is there?
There is no guarantee that this will tell you if it is aligned. Some
people around here have worked on word addressed systems where the byte
within the word was flagged in the *high* bits of the address.

I bet that makes for some funky internal pointer arithmetic!
Misleading name, I initially read that as a screwy attempt to detect a
NULL pointer. DETECTNULCHAR would be better.


Isn't that macro you are defining in the implementation name space?
Anything could happen.

I tried two types of optimizations, one for time (try to unravel the
loop) and one for size. If I don't get a kind of system where casting
a pointer-to-char to a pointer-to-unsigned-long doesn't make much
sense, #defining _OPTIMIZED_FOR_SIZE allows me to leave out code that
wouldn't work in that situation.
The above could read bytes off the end of a properly nul terminated
string. For example,
size_t len = strlen("a");

I'm testing for having a null character somewhere among the characters
that make up the area that aligned_addr points to. If I don't get a
sane environment (as indicated by the _OPTIMIZED_FOR_SIZE macro), this
code isn't even compiled in.

Here's the general idea: suppose, for example, sizeof(unsigned long) is
4. I can freely cast a pointer-to-char to a pointer-to-unsigned-long. I
don't care if *aligned_addr is big-end-aligned or little-end-aligned.
Oh, well, is there a better way to unravel "while(*s)s++"?
No need to cast the result of the subtraction. The compiler already
knows is is returning a size_t so will do the conversion anyway.

The cast is only for my eyes. ;-)

Gregory Pietsch
 
K

Keith Thompson

Lawrence Kirby said:
It is part of the implementation which covers both compiler and library.
Many compilers can generate their own inline code for strlen() in which
case the "library" as a separate concept has little to do with it.

You're right. I should have said that strlen() is *typically
implemented as* part of the runtime library, not part of the compiler.
(I don't know how many compilers generate inline code, and therefore
how accurate "typically" is.)
 
C

Christian Bau

"roy said:
Thanks. Maybe my question should be "what if the input is a char array
without a null terminator". But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
I am just not sure whether I am just lucky or sth else happened inside
strlen.

You are not lucky, you are unlucky.

If you were lucky, your program would crash as soon as try this, and
then you would know there is a bug that needs fixing. If you are
unlucky, you get a result that doesn't show the bug.
 
T

Tim Prince

Keith Thompson said:
You're right. I should have said that strlen() is *typically
implemented as* part of the runtime library, not part of the compiler.
(I don't know how many compilers generate inline code, and therefore
how accurate "typically" is.)
Several common compilers, both commercial and free software, have both
in-line and library implementations, as provided for in standard C (both C89
and C99). In normal usage, not allowing for both possibilities would open
up the possibility of Undefined Behavior.
 
C

Chris Torek

There's no way to portably detect whether a pointer-to-char is aligned
on a long boundary, is there?

No (at least, not if by "portable" you mean what we usually do in
comp.lang.c :) ... there are versions that are "portable" to those
systems that define an alignment function or macro, such as all
the BSD variants).

[code using things like]
I tried two types of optimizations, one for time (try to unravel the
loop) and one for size. ...
Here's the general idea: suppose, for example, sizeof(unsigned long) is
4. I can freely cast a pointer-to-char to a pointer-to-unsigned-long. I
don't care if *aligned_addr is big-end-aligned or little-end-aligned.
Oh, well, is there a better way to unravel "while(*s)s++"?

Maybe, maybe not. It is quite CPU-dependent.

For whatever it is worth (perhaps not much at this point), I tried
the above trick in SPARC assembly code when I was writing the 4.4BSD
C library routines for the SPARC. (I wrote many of the "portable"
routines as well; we set things up so that when you built for VAX,
Tahoe, or SPARC, you got either the machine-specific version or the
generic, depending on whether we had written a machine-specific
version.)

The result was that the fancy version using "four byte at a time"
scans (on aligned pointers) was significantly *slower* than the
dumb, simple, one-byte-at-a-time version, even for relatively long
strings. I was a bit surprised; and the results might be different
on a more modern CPU (this was back in 1991 or so).

(I wrote the whole thing in assembly -- well, in C at first, compiled
to assembly, then hand-edited -- so I know it was not the compiler
doing anything tricky, either.)

It turns out that in most C programs, most strings are very short.
The "Dhrystone" tests that many people used to use to compare C
library implementations use strings that are significantly longer
than average, and overemphasize the time behavior of strlen(),
strcpy(), and strcmp() on relatively long strings. Even for these
longer strings, the "optimized" strlen() was still slower.

Of course, this "most C strings are short" rule of thumb may come
about because most C libraries are optimized for short strings
because most strings are short because most C libraries are optimized
for short strings, etc. :) In other words, if you have a lot of
long strings, and you do program optimization, you will avoid
calling strlen() on them so much.

Even if one breaks this initial chicken-and-egg loop (by calling
strlen() repeatedly on long strings), and then optimizes the heck
out of strlen(), one can probably still speed up one's programs by
fixing the repeated calls to strlen(). There is another rule of
thumb that applies beyond just C programming, or even computers:

The shortest, fastest, cheapest, and most reliable parts of
any system are the ones that are not there.

(This is another way of putting the "KISS" principle. Of course,
marketing usually gets in the way of this idea. :) )
 
P

Peter Ammon

Keith said:
Stan Milam said:
Keith said:
On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"

[...]


But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
[...]
I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

So, you are saying this is a poorly implemented compiler?


Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

An implementation of strlen() that was able to detect the case where
the argument points to the first element of an array that doesn't
contain any '\0' characters would most likely add significant overhead
to *all* operations. The obvious way to implement it is to make all
pointers "fat", so each pointer includes both the base address and
bounds information; strlen() would then have to check the bounds.

A simpler way would be to insert a padding byte containing zero after
every char array.

-Peter
 
C

CBFalconer

Chris said:
.... snip ...

Even if one breaks this initial chicken-and-egg loop (by calling
strlen() repeatedly on long strings), and then optimizes the heck
out of strlen(), one can probably still speed up one's programs by
fixing the repeated calls to strlen(). There is another rule of
thumb that applies beyond just C programming, or even computers:

The shortest, fastest, cheapest, and most reliable parts of
any system are the ones that are not there.

(This is another way of putting the "KISS" principle. Of course,
marketing usually gets in the way of this idea. :) )

My suggestion is to try to return the length from most routines
that uncover it, in place of an insipid pointer to the original
string. strlcpy and strlcat follow this practice. So do printf
and sprintf.
 
P

pete

Stan said:
Okay guys, that was a joke.

No, it wasn't.
Your posts in the "C FAQ 3.1" thread show that you don't see
the beauty of the concept of undefined behavior.

If you're going to write bad code,
then the C standard committee doesn't care about
what happens as a consequence.

This philosophy was in C originally,
and is maintained in the current C99 standard.

It's not that R was in too much of a hurry specifying C,
so that he didn't have enough time
to also specify what garbage code should do,
but rather it's the case that compiler writers
are in too much of a hurry writing compilers
to want to care about how to translate garbage code.
 
P

pete

Gregory said:
There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(char *s)
{
char *p = s;

while (*p != '\0')
p++;
return (size_t)(p - s);
}

The ptrdiff_t type of (p - s) disqualifies this code
from being an example of portable C code.

If the following description of undefined behavior doesn't
apply to your code, then it doesn't apply to anything.


N869
6.5.6 Additive operators
[#9] When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference
of the subscripts of the two array elements. The size of
the
result is implementation-defined, and its type (a signed
integer type) is ptrdiff_t defined in the <stddef.h> header.
If the result is not representable in an object of that
type, the behavior is undefined.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,164
Messages
2,570,901
Members
47,439
Latest member
elif2sghost

Latest Threads

Top