pemo said:
I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, so that if a programmer wants to
know the length of such a string, the answer can be found very quickly.
Yes, but most Pascal implementations I am aware of have a string length
limitation of 255 characters (the length is stored in a single byte).
In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires comparing each char in the string to
'\0', of course.
Correct. Any code that contains strlen(), strcat() or other "traverse
to compute the length" operations is definitely suboptimal. I.e., if
someone claims some code is built for performance and they are calling
one of those functions in the critical path, you know they are wrong.
Therefore, C's string functions (most of those I can think of require
some 'end searching') are necessarily slower than their Pascal
counterparts.
Yes, that's certainly true for anything that includes string length
computations.
Which seems to go against a bit of the C philosophy - speed is everything!
Since when has C's philosophy been speed is everything? C was designed
by AT&T hackers specifically for the purpose of making UNIX -- it
didn't really have any other design purpose. UNIX is not an operating
system with any significant focus on strings, and it should not be
surprising that C's poor performance on strings has no effect on UNIX's
performance.
Of course, if a C implementation chose to keep track of a C string's
length and store it at, say, a negative offset, e.g. strlen of s =
((unsigned long int *)s)[-1], this would result in faster string
operations.
This is not going to be helpful when used in conjunction with pointer
arithmetic. What do you expect to happen here:
strcat (dest + prefix, src); /* suppose we know strlen(dest) > prefix */
if prefix is just some offset into the middle of dest? There is no
length header at dest + prefix, only string contents, and if you write
a length into the bytes just before dest + prefix, it will overwrite
contents that dest points to.
However, this would impose a limit on a string's length - but, surely, a
practical one!?
Well, even '\0'-delimited strings are limited in size (essentially by
SIZE_MAX from C99's <stdint.h>, which may actually one day be defined
on your platform.) On real platforms an int is easily large enough to
hold any practical string length. You could choose size_t as the
length type if you really want to use every ounce of memory assignable
to a single object in C.
Would such an implementation break the C standard?
Yes. None of the pointer arithmetic could work the same as it
currently does.
Does anyone know of an implementation that perhaps does something like this?
Try "The Better String Library" (http://bstring.sf.net/). It uses
length prefix semantics, and derives a large average performance
advantage because of it, just as you suspect. But there are other
things in the library that enhance performance. You can see benchmark
data here: http://bstring.sf.net/features.html#benchmarks . Hopefully
these benchmarks will suggest to you that "speed is everything" is
not a philosophy that has actually been carried out in the C language.
A common complaint about other string libraries which are not char *
based (and one sometimes erroneously cited as a reason to avoid
Bstrlib) is that they don't support pointer arithmetic, and therefore
lose out on some performance tricks because of it. Bstrlib actually
supports a
feature superior to pointer arithmetic, which you could call "segment
arithmetic". Basically, with pointer arithmetic, you can generate a
quick reference to any tail of any string which you can use as
parameter to the C string library functions. With Bstrlib, you can
reference any string segment (head, tail, or any interior segment) and
use it as a source string parameter to any Bstrlib function. I.e.,
hacked-up tricks that you could pull with C strings are now
well-defined activities, and more generalized ones, in Bstrlib. To
pull the same trick in C, you would have to (even if only temporarily)
modify the source string by inserting '\0's, which means that you
could only have one set of non-nested substrings active at once (there
is no such limitation in Bstrlib).
Bstrlib also comes with functions like biseq() which compares a pair of
strings for equality without computing their ASCII ordering. This
allows the massive performance accelerator of first checking for a
length difference. (It also checks for content aliasing, something
that could be done in standard C libraries but generally is not.) The
architecture of C's '\0'-terminated strings means this performance
enhancement simply is not available to it.
Oh yeah, and it doesn't break the ANSI C standard -- it defines a new
string type, and new complement of functions for it.
There are other features having to do with functionality, ease of use,
interoperability with char * strings, semantic portability, and safety
that make Bstrlib a compelling library for string manipulation as
well, which you can read about at the website.