Strings in C are less optimal than in (say) Pascal - correct?

J

jacob navia

Keith said:
jacob navia said:
pemo said:
I believe that in languages like Pascal, the length of a string is
encoded in some sort of 'header' data structure, and that if a
programmer wants to know the length of such a string, the resultant
discovery is therefore very fast.
[...]
Does anyone know of an implementation that perhaps does something
like this?

Yes, if you download lcc-win32 you will find an implementation of
strings using length prefixed strings. They are treated just like
normal strings, and a replacement for the string functions in the C
library is provided. The library source is distributed with lcc-win32.

http://www.cs.virginia.edu/~lcc-win32


Can this implementation be used with compilers other than lcc-win32?
Yes. Instead of writing
a[2] = 'a';

you should write

setString(a,2,'a');

and similar things. But in general yes, you could do it.

The objective of the library is to make porting old code very easy
and this can't be done in standard C.

I hope in the new standard edition operator overloading will be allowed,
together with generic functions. Only those two changes are needed to
allow for easy implementation of length delimited strings.
 
N

N. B. (substitute bay for gulf to e-mail)

jacob said:
I hope in the new standard edition operator overloading will be allowed,
<flame war='on'>You're absolutely insane.</flame>
 
D

Dave Thompson

I believe that speed *was* a goal - wasn't it used to originally write an
OS? One would assume that speed was certainly a goal in that case -
wouldn't one?
A goal but not a major one. Most Unix functionality was (still is)
concerned with I/O and similar external things (like clock ticks)
which are not (remotely) compute-bound. And the Unix kernel did very
little string processing (mostly buffers and blocks instead) -- the
only thing that springs to mind is filename/pathname processing on
open(), creat(), unlink(), etc., which are usually fairly infrequent.

Much more important AFAICT were low-level semantics down to direct
hardware access, (self) hosting on a fairly small system, and "you
only pay for, and get, what you see".


- David.Thompson1 at worldnet.att.net
 
D

David Mathog

Rob said:
pemo wrote:


Not genereally. Strlen is slower. Most of the others iterate across
the array of chars looking for the '\0' character at the end, rather
than finding it with strlen first.

strcat() can be awful in this regard. In many programs a file is
read one line at a time into a smallish temp buffer, cleaned up somehow,
and then what's left is concatenated into one huge buffer. Over the
years I've seen this in many, many applications, and it can cause
problems because a lot of grad student programmers do it the naive way
like this is:


/* your choice of begin loop structure */
read_chars=read_a_line(temp_buffer);
/* error checking for overflow, end of file, etc. omitted */
/* remove spaces and tabs (for instance) from temp_buffer */
strcat(huge_buffer,temp_buffer);
/* your choice of end loop structure */

The performance of this code is abysmal when reading in a lot of text
because strcat has to scan to the end of huge_buffer each time it wants
to append a temp_buffer. It's possible to work around this by keeping
track of the number of characters read and passing huge_buffer from that
point like:

read_chars=read_a_line(temp_buffer);
huge_buffer[total_chars]='\0';
strcat(&huge_buffer[total_chars],temp_buffer);
total_chars += read_chars;

but it's even easier to not call strcat at all and just append the
characters from temp_buffer to huge_buffer in a loop.

This is definitely a case where the lack of a "count" on C strings
combines with a standard function to produce really slow code.

With 20/20 hindsight it probably would have been better if C had
provided a string type with these 3 parts:

size_t allocated;
size_t count;
char *array;

That would have allowed strcat to have handled the preceding example
efficiently.

Regards,

David Mathog
(e-mail address removed)
 
K

Keith Thompson

David Mathog said:
This is definitely a case where the lack of a "count" on C strings
combines with a standard function to produce really slow code.

With 20/20 hindsight it probably would have been better if C had
provided a string type with these 3 parts:

size_t allocated;
size_t count;
char *array;

That would have allowed strcat to have handled the preceding example
efficiently.

Yes, that would have made that one example much more efficient, but
I'm sure there are plenty of cases that would be slowed down (because
it needs to update information that's never used), and others that
would become much more difficult (like treating a pointer into the
middle of a string as a pointer to the tail of the string).

It's easy enough to define your own string-like type with whatever
information you want. Perhaps the best thing would have been to
provide the relatively simple nul-terminated strings we have now, plus
a standard header with a more sophisticated dynamic string type.
There could even be a special kind of string literal, perhaps
something like
D"This is a dynamic string literal"
and
DL"This is a dynamic wide string literal"
or just a carefully defined set of implicit conversions.
 
M

Mabden

But you can "roll your own". Don't saddle me with your junk - just write
it for yourself and move on.

It's easy enough to define your own string-like type with whatever
information you want. Perhaps the best thing would have been to
provide the relatively simple nul-terminated strings we have now, plus
a standard header with a more sophisticated dynamic string type.

Or move to the newer languages like C# or, at least C++, which have
built in tools for handling strings. In C, we use char arrays. If you
don't prefer that, and want a string function, use a language that has
one.
 
W

websnarf

That looks somehow familliar ...
Yes, that would have made that one example much more efficient, [...]

And hundreds of others ... in fact, any "whole string" or "parts of
strings" kinds of algorithms are always faster (and a lot easier to
implement).
[...] but
I'm sure there are plenty of cases that would be slowed down (because
it needs to update information that's never used),

Usually, only algorithms that are character by character based, and in
which the compiler's optimizer fails to factor out redundant header
operations and the programmer doesn't intervene themselves. Its
possible to just straddle your L1 from extra headers, but that's a
pretty marginal situation.
[...] and others that
would become much more difficult (like treating a pointer into the
middle of a string as a pointer to the tail of the string).

In Bstrlib, this is *easier* not harder (you don't use pointers,
instead use a localled defined header). Its also faster, and has more
useful semantics that C's "point to tail" semantic just doesn't compare
with.
It's easy enough to define your own string-like type with whatever
information you want. Perhaps the best thing would have been to
provide the relatively simple nul-terminated strings we have now, plus
a standard header with a more sophisticated dynamic string type.

You don't say ...
There could even be a special kind of string literal, perhaps
something like
D"This is a dynamic string literal"
and
DL"This is a dynamic wide string literal"
or just a carefully defined set of implicit conversions.

Well yeah, that's convenient, however you can compare this to the
bsStatic() and bsStaticBlkParms() macros in Bstrlib. Part of the
problem is that C doesn't have "constructors" or "destructors", so such
strings are not actually dynamic.

In Bstrlib, if you want a writable bstring created and initialized in a
scope you just do one of these:

bstring b = blk2bstr (bsStaticBlkParms ("This is a writable dynamic
bstring"));
bstring c = bfromcstr ("This is a writable dynamic bstring");

(The second is slower, but usually the strings are too small for it to
matter.) And obviously you could define a macro like D() to wrap
those, but I wouldn't want to intrude into the namespace that harshly.
If the string is not writable, then the following is preferable:

struct tagbstring d = bsStatic ("This is a non-writable bstring");
 
R

Richard Tobin

David Mathog said:
With 20/20 hindsight it probably would have been better if C had
provided a string type with these 3 parts:

size_t allocated;
size_t count;
char *array;

[Off-topic]

This sort of thing is useful for dynamically sized vectors in general,
not just strings. I have a library of macros and functions that I use
for this, which work by storing the count and allocated values before
the start of the array. Functions that just access the vectors,
rather than extending them, can treat them as ordinary C arrays. In
particular, a vector of char can be used as a string.

I haven't given much thought to making this standard-conformant
(rather than just portable to the systems I care about). Perhaps some
of the experts in this group might comment on whether and how that can
be done.

Here's an example of the use of the library:

#include <stdio.h>
#include "ltvector.h"

int main(void)
{
int c;
LTVector(char, string);

LTVectorInit(string);

while((c = getchar()) != EOF)
LTVectorPush(string, c);
LTVectorPush(string, 0);

printf("%s\n", string);

return 0;
}

Here are definitions of some of the macros:

Declare a vector:

#define LTVector(type, name) type *name

Initialize the vector to empty (lt_vector_empty is a void * pointing
to the end of a static int[2] - I should be using size_t, not int):

#define LTVectorInit(v) (v) = lt_vector_empty

Here are the macros that access the size and allocation (I'm assuming
certain things about pointer implementation and alignment here):

#define LTVectorCount(v) (((int *)(v))[-1])

#define LTVectorAllocated(v) (((int *)(v))[-2])

These values are both zero for lt_vector_empty.

Push a value onto the vector, extending if necessary. Returns 1
normally, 0 if allocation failed:

#define LTVectorPush(v, value) \
(LTVectorEnsureAllocated((v), LTVectorCount(v)+1) ? \
((v)[LTVectorCount(v)++] = (value), 1) : 0)

The LTVectorEnsureAllocated macro does nothing if there's enough
space, otherwise it calls a function to realloc the vector (and that
reallocation also makes assumptions about alignment):

#define LTVectorEnsureAllocated(v, n) \
((n) <= LTVectorAllocated(v) || \
LTVectorEnsureAllocatedSlow(&(v), (n), sizeof((v)[0])))

You can't free these vectors with free(), because you aren't pointing
to the start of the allocated space:

#define LTVectorFree(v) LTFree((v) == lt_vector_empty ? 0 : (int *)(v)-2)

-- Richard
 
R

Rob Thorpe

Keith said:
Yes, that would have made that one example much more efficient, but
I'm sure there are plenty of cases that would be slowed down (because
it needs to update information that's never used), and others that
would become much more difficult (like treating a pointer into the
middle of a string as a pointer to the tail of the string).

Pretty much all the operations can be done faster with storage of
length instead of null termination. The key is that if storage of
length is used then a string must be a separate abstract type.
Once this is done all kinds of optimizations can be done. The situation
you mention above can be done without copying, strings can also be
clipped from their beginning as well as their end without copying.
Implementations of languages with length labelled strings often do
these things.

If C were designed again then strings with length would be a better
solution. But it is unlikely to be designed again.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,173
Messages
2,570,938
Members
47,474
Latest member
VivianStuk

Latest Threads

Top