How is strlen implemented?

Keith Thompson · Apr 24, 2005

Stan Milam said:
Keith said:

Mark McIntyre said:

On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"

But from my experimental results, it seems
that strlen can still return the number of characters of a char array. [...]
I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky

Click to expand...

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

Click to expand...

So, you are saying this is a poorly implemented compiler?

Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

An implementation of strlen() that was able to detect the case where
the argument points to the first element of an array that doesn't
contain any '\0' characters would most likely add significant overhead
to *all* operations. The obvious way to implement it is to make all
pointers "fat", so each pointer includes both the base address and
bounds information; strlen() would then have to check the bounds.

James McIninch · Apr 24, 2005

<posted & mailed>

By definition, a character array without a null terminator is not a string.

Calling strlen on somthing that isn't a string will cause undefined behavior
(an error).

Mark McIntyre · Apr 24, 2005

strlen() is almost certainly finding a zero byte immediately after his
array. I'd expect that to be a very common manifestation of the
undefined behavior in this case.

that comes under my definition of 'random' - its by chance finding a
null just shortly after the string, possibly due to some debugging
mode 'helpfulness'.

Of course, if the string were zero length, then....

Emmanuel Delahaye · Apr 24, 2005

roy wrote on 23/04/05 :

Thanks. Maybe my question should be "what if the input is a char array
without a null terminator". But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
I am just not sure whether I am just lucky or sth else happened inside
strlen.

If the string is malformed (missing terminating 0), the behaviour is
undefined. Any thing could happen.

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

..sig under repair

Emmanuel Delahaye · Apr 24, 2005

Stan Milam wrote on 24/04/05 :

So, you are saying this is a poorly implemented compiler?

What would be a better implementation ? If the limit is not here,
anything happens. Blame the coder, not the compiler.

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

"Clearly your code does not meet the original spec."
"You are sentenced to 30 lashes with a wet noodle."
-- Jerry Coffin in a.l.c.c++

Emmanuel Delahaye · Apr 24, 2005

Joe Estock wrote on 23/04/05 :

Interesting seeing \0 so widely in use. On most systems, NULL is defined as
\0, however there are a few special cases where it is not. Shouldn't we be
using NULL instead of \0?

No, because here, we are talking about the null character that is 0 or
'\0' (but I'm too lazy to type the latter).

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

"C is a sharp tool"

Stan Milam · Apr 24, 2005

Stan said:
Keith Thompson wrote:

So, you are saying this is a poorly implemented compiler?

Okay guys, that was a joke.

Gregory Pietsch · Apr 25, 2005

I checked my libraries, and the following may be faster than the above:

#include <string.h>
#ifndef _OPTIMIZED_FOR_SIZE
#include <limits.h>
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#ifdef _ALIGN
#define UNALIGNED1(X) ((long)X&(sizeof(long)-1))
#else
#define UNALIGNED1(X) 0
#endif

/* Macros for detecting endchar */
#if ULONG_MAX == 0xFFFFFFFFUL
#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)
#elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
/* Nonzero if X (a long int) contains a NULL byte. */
#define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) &
0x8080808080808080)
#else
#define _OPTIMIZED_FOR_SIZE
#endif

#ifdef DETECTNULL
#define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
#endif

#endif
/* strlen */
size_t (strlen)(const char *s)
{
const char *t = s;
#ifndef _OPTIMIZED_FOR_SIZE
unsigned long *aligned_addr;

if (!UNALIGNED1(s)) {
aligned_addr = (unsigned long *) s;
while (!DETECTNULL(*aligned_addr))
aligned_addr++;
/* The block of bytes currently pointed to by aligned_addr
contains a null. We catch it using the bytewise search. */
s = (const char *) aligned_addr;
}
#endif
while (*s)
s++;
return (size_t) (s - t);
}

/* Gregory Pietsch */

Gregory Pietsch · Apr 25, 2005

NULL is usually reserved for the null pointer. Here, we're checking for
the null character, '\0'.

Gregory Pietsch

Flash Gordon · Apr 25, 2005

Gregory said:
I checked my libraries,

Do you mean your personal libraries or your implementations. Remember
that the implementation is allowed to do things you are not allowed to do.

> and the following may be faster than the above:

What above? Please quote enough of the message you are replying to for
us to see what you are talking about. There is an option that gets
Google to do the right thing and if you search the group I'm sure you
will find the instructions. It's in someone's sig, but I can't remember who.

#include <string.h>
#ifndef _OPTIMIZED_FOR_SIZE

An implementation could declare that or not for any reason it wants.

#include <limits.h>
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#ifdef _ALIGN

Again, a compiler could declare that or not as it saw fit.

#define UNALIGNED1(X) ((long)X&(sizeof(long)-1))

There is no guarantee that this will tell you if it is aligned. Some
people around here have worked on word addressed systems where the byte
within the word was flagged in the *high* bits of the address.

#else
#define UNALIGNED1(X) 0
#endif

/* Macros for detecting endchar */
#if ULONG_MAX == 0xFFFFFFFFUL
#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

Misleading name, I initially read that as a screwy attempt to detect a
NULL pointer. DETECTNULCHAR would be better.

#elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
/* Nonzero if X (a long int) contains a NULL byte. */
#define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) &
0x8080808080808080)
#else
#define _OPTIMIZED_FOR_SIZE

Isn't that macro you are defining in the implementation name space?
Anything could happen.

#endif

#ifdef DETECTNULL
#define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
#endif

#endif
/* strlen */
size_t (strlen)(const char *s)
{
const char *t = s;
#ifndef _OPTIMIZED_FOR_SIZE
unsigned long *aligned_addr;

if (!UNALIGNED1(s)) {
aligned_addr = (unsigned long *) s;
while (!DETECTNULL(*aligned_addr))
aligned_addr++;

The above could read bytes off the end of a properly nul terminated
string. For example,
size_t len = strlen("a");

/* The block of bytes currently pointed to by aligned_addr
contains a null. We catch it using the bytewise search. */
s = (const char *) aligned_addr;
}
#endif
while (*s)
s++;
return (size_t) (s - t);

No need to cast the result of the subtraction. The compiler already
knows is is returning a size_t so will do the conversion anyway.

Lawrence Kirby · Apr 25, 2005

Stan Milam said:
Stan Milam said:

Keith said:

On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
[...]

But from my experimental results, it seems
that strlen can still return the number of characters of a char array. [...]
I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky
No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

Click to expand...

So, you are saying this is a poorly implemented compiler?

Click to expand...

Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

It is part of the implementation which covers both compiler and library.
Many compilers can generate their own inline code for strlen() in which
case the "library" as a separate concept has little to do with it.

Lawrence

Gregory Pietsch · Apr 25, 2005

Flash said:
Do you mean your personal libraries or your implementations. Remember

that the implementation is allowed to do things you are not allowed

to do.

It was my implementation, based on unravelling the "while(*s)s++" loop.

What above? Please quote enough of the message you are replying to for
us to see what you are talking about. There is an option that gets
Google to do the right thing and if you search the group I'm sure you

will find the instructions. It's in someone's sig, but I can't remember who.

An implementation could declare that or not for any reason it wants.

If _OPTIMIZED_FOR_SIZE is declared, the implementation tries to unravel
the "while(*s)s++" loop somewhat.

Again, a compiler could declare that or not as it saw fit.

There's no way to portably detect whether a pointer-to-char is aligned
on a long boundary, is there?

There is no guarantee that this will tell you if it is aligned. Some
people around here have worked on word addressed systems where the byte
within the word was flagged in the *high* bits of the address.

I bet that makes for some funky internal pointer arithmetic!

Misleading name, I initially read that as a screwy attempt to detect a
NULL pointer. DETECTNULCHAR would be better.

Isn't that macro you are defining in the implementation name space?
Anything could happen.

I tried two types of optimizations, one for time (try to unravel the
loop) and one for size. If I don't get a kind of system where casting
a pointer-to-char to a pointer-to-unsigned-long doesn't make much
sense, #defining _OPTIMIZED_FOR_SIZE allows me to leave out code that
wouldn't work in that situation.

The above could read bytes off the end of a properly nul terminated
string. For example,
size_t len = strlen("a");

I'm testing for having a null character somewhere among the characters
that make up the area that aligned_addr points to. If I don't get a
sane environment (as indicated by the _OPTIMIZED_FOR_SIZE macro), this
code isn't even compiled in.

Here's the general idea: suppose, for example, sizeof(unsigned long) is
4. I can freely cast a pointer-to-char to a pointer-to-unsigned-long. I
don't care if *aligned_addr is big-end-aligned or little-end-aligned.
Oh, well, is there a better way to unravel "while(*s)s++"?

No need to cast the result of the subtraction. The compiler already
knows is is returning a size_t so will do the conversion anyway.

The cast is only for my eyes. ;-)

Gregory Pietsch

Keith Thompson · Apr 25, 2005

Lawrence Kirby said:
It is part of the implementation which covers both compiler and library.
Many compilers can generate their own inline code for strlen() in which
case the "library" as a separate concept has little to do with it.

You're right. I should have said that strlen() is *typically
implemented as* part of the runtime library, not part of the compiler.
(I don't know how many compilers generate inline code, and therefore
how accurate "typically" is.)

Christian Bau · Apr 25, 2005

"roy said:
Thanks. Maybe my question should be "what if the input is a char array
without a null terminator". But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
I am just not sure whether I am just lucky or sth else happened inside
strlen.

You are not lucky, you are unlucky.

If you were lucky, your program would crash as soon as try this, and
then you would know there is a bug that needs fixing. If you are
unlucky, you get a result that doesn't show the bug.

Tim Prince · Apr 26, 2005

Keith Thompson said:
You're right. I should have said that strlen() is *typically
implemented as* part of the runtime library, not part of the compiler.
(I don't know how many compilers generate inline code, and therefore
how accurate "typically" is.)

Several common compilers, both commercial and free software, have both
in-line and library implementations, as provided for in standard C (both C89
and C99). In normal usage, not allowing for both possibilities would open
up the possibility of Undefined Behavior.

Chris Torek · Apr 27, 2005

There's no way to portably detect whether a pointer-to-char is aligned
on a long boundary, is there?

No (at least, not if by "portable" you mean what we usually do in
comp.lang.c

... there are versions that are "portable" to those
systems that define an alignment function or macro, such as all
the BSD variants).

[code using things like]

I tried two types of optimizations, one for time (try to unravel the
loop) and one for size. ...

Here's the general idea: suppose, for example, sizeof(unsigned long) is
4. I can freely cast a pointer-to-char to a pointer-to-unsigned-long. I
don't care if *aligned_addr is big-end-aligned or little-end-aligned.
Oh, well, is there a better way to unravel "while(*s)s++"?

Maybe, maybe not. It is quite CPU-dependent.

For whatever it is worth (perhaps not much at this point), I tried
the above trick in SPARC assembly code when I was writing the 4.4BSD
C library routines for the SPARC. (I wrote many of the "portable"
routines as well; we set things up so that when you built for VAX,
Tahoe, or SPARC, you got either the machine-specific version or the
generic, depending on whether we had written a machine-specific
version.)

The result was that the fancy version using "four byte at a time"
scans (on aligned pointers) was significantly *slower* than the
dumb, simple, one-byte-at-a-time version, even for relatively long
strings. I was a bit surprised; and the results might be different
on a more modern CPU (this was back in 1991 or so).

(I wrote the whole thing in assembly -- well, in C at first, compiled
to assembly, then hand-edited -- so I know it was not the compiler
doing anything tricky, either.)

It turns out that in most C programs, most strings are very short.
The "Dhrystone" tests that many people used to use to compare C
library implementations use strings that are significantly longer
than average, and overemphasize the time behavior of strlen(),
strcpy(), and strcmp() on relatively long strings. Even for these
longer strings, the "optimized" strlen() was still slower.

Of course, this "most C strings are short" rule of thumb may come
about because most C libraries are optimized for short strings
because most strings are short because most C libraries are optimized
for short strings, etc.

In other words, if you have a lot of
long strings, and you do program optimization, you will avoid
calling strlen() on them so much.

Even if one breaks this initial chicken-and-egg loop (by calling
strlen() repeatedly on long strings), and then optimizes the heck
out of strlen(), one can probably still speed up one's programs by
fixing the repeated calls to strlen(). There is another rule of
thumb that applies beyond just C programming, or even computers:

The shortest, fastest, cheapest, and most reliable parts of
any system are the ones that are not there.

(This is another way of putting the "KISS" principle. Of course,
marketing usually gets in the way of this idea.

)

Peter Ammon · Apr 27, 2005

Keith said:
Stan Milam said:

Keith said:

On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"

[...]

But from my experimental results, it seems
that strlen can still return the number of characters of a char array.

Click to expand...

[...]

I am just not sure whether I am just lucky or sth else happened inside
strlen.

lucky

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

Click to expand...

So, you are saying this is a poorly implemented compiler?

Click to expand...

Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

An implementation of strlen() that was able to detect the case where
the argument points to the first element of an array that doesn't
contain any '\0' characters would most likely add significant overhead
to *all* operations. The obvious way to implement it is to make all
pointers "fat", so each pointer includes both the base address and
bounds information; strlen() would then have to check the bounds.

A simpler way would be to insert a padding byte containing zero after
every char array.

-Peter

CBFalconer · Apr 27, 2005

Chris said:
.... snip ...

Even if one breaks this initial chicken-and-egg loop (by calling
strlen() repeatedly on long strings), and then optimizes the heck
out of strlen(), one can probably still speed up one's programs by
fixing the repeated calls to strlen(). There is another rule of
thumb that applies beyond just C programming, or even computers:

The shortest, fastest, cheapest, and most reliable parts of
any system are the ones that are not there.

(This is another way of putting the "KISS" principle. Of course,
marketing usually gets in the way of this idea. )

My suggestion is to try to return the length from most routines
that uncover it, in place of an insipid pointer to the original
string. strlcpy and strlcat follow this practice. So do printf
and sprintf.

pete · Apr 28, 2005

Stan said:
Okay guys, that was a joke.

No, it wasn't.
Your posts in the "C FAQ 3.1" thread show that you don't see
the beauty of the concept of undefined behavior.

If you're going to write bad code,
then the C standard committee doesn't care about
what happens as a consequence.

This philosophy was in C originally,
and is maintained in the current C99 standard.

It's not that R was in too much of a hurry specifying C,
so that he didn't have enough time
to also specify what garbage code should do,
but rather it's the case that compiler writers
are in too much of a hurry writing compilers
to want to care about how to translate garbage code.

pete · Apr 28, 2005

Gregory said:
There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>
size_t (strlen)(char *s)
{
char *p = s;

while (*p != '\0')
p++;
return (size_t)(p - s);
}

The ptrdiff_t type of (p - s) disqualifies this code
from being an example of portable C code.

If the following description of undefined behavior doesn't
apply to your code, then it doesn't apply to anything.

N869
6.5.6 Additive operators
[#9] When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference
of the subscripts of the two array elements. The size of
the
result is implementation-defined, and its type (a signed
integer type) is ptrdiff_t defined in the <stddef.h> header.
If the result is not representable in an object of that
type, the behavior is undefined.

If(strcmp(str, "") == 0) - What does this line of code mean?	0	Aug 8, 2022
WIN32 - Update Text in a Window in order to show its size in Pixels and coordinates	0	Oct 4, 2023
array-size/malloc limit and strlen() failure	26	Apr 2, 2014
C pipe	1	Dec 9, 2021
For loop and strlen	8	Sep 20, 2009
How to accept text and put each letter into a 2d matrix?	0	Jun 3, 2022
STRING - Remove small letters from string	1	Jan 20, 2023
Single put routine overlapping words during iteration	4	Jan 2, 2023

How is strlen implemented?

Keith Thompson

James McIninch

Mark McIntyre

Emmanuel Delahaye

Emmanuel Delahaye

Emmanuel Delahaye

Stan Milam

Gregory Pietsch

Gregory Pietsch

Flash Gordon

Lawrence Kirby

Gregory Pietsch

Keith Thompson

Christian Bau

Tim Prince

Chris Torek

Peter Ammon

CBFalconer

pete

pete

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads