Increasing efficiency in C

D

Dan Pop

In said:
As everybody knows, C uses a zero delimited unbounded
pointer for its representation of strings.

If I didn't already know what a C string is, I wouldn't have been able
to make any sense out of this sentence.
This is extremely inefficient because at each query of the
length of the string, the computer starts an unbounded
memory scan searching for a zero that ends the string.

It doesn't hurt to use your common sense in validating your opinions.
If C strings were "extremely inefficient", that would have been a much
bigger problem 30 years ago, when computers were orders of magnitude
slower than today. Yet, no one produced a fix then. No alternate
string libraries designed and implemented for C since then have
acquired any kind of popularity.

Since C programmers aren't the last people to care about efficiency,
what conclusion can you draw?

Dan
 
N

Nils Petter Vaskinn

"Richard Bos" <[email protected]> a écrit dans le message de


There is no free lunch.
Yes, it will cost at least size_t bytes more than a zero terminated
string. So what?
Many applications can afford this.

There it is.

Many != all.

Your proposal would impose that cost on all applications. Leaving things
as they are would allow applications to store the length alongside the
string if they choose to. (or link against a library that provides such a
string type and replacement string manipulation functions)
 
R

Richard Bos

=?ISO-8859-1?Q?Tor_Husab=F8?= said:
Meaning you had to keep track of the length in a separate variable?

Meaning original Pascal programs had to be written for fixed-length
strings, and each different fixed length would need a different set of
functions. Ditto for arrays of other types. And verily, it suckéd
mightily, and there was much wailing and gnashing of teeth. Forwhich
conformant array schemas were invented. And verily, it did still suck
mightily, but not nearly as mightily as before.

Richard
 
R

Richard Bos

jacob navia said:
Richard Bos said:
struct string {
size_t length;
char data[];
};
But in that case, you'll hit the second case: now, you're using four
bytes, maybe eight, to address a single string, AOT C's one-byte null
terminator. Not a problem, perhaps, if all your strings are dozens of
kilobytes, but most applications seem to use lots and lots of small
strings, every single one potentially costing you seven bytes extra, and
only a few large ones.

There is no free lunch.
Yes, it will cost at least size_t bytes more than a zero terminated string.
So what?
Many applications can afford this.

This is the attitude that gave us Access and Outlook. And people wonder
why modern programs are bloated and sluggardly...
idea is to use a string type that is length prefixed and
allows run-time checking against UB: undefined
behavior.

How? What if I want a 30-char array, of which the first 20 to 29 chars
contain a string, and the last char contains a checksum? Modifying the
checksum wouldn't be undefined behaviour at all, but it would write
beyond data[length-1].

You can implement check sums in my schema.

Not if you want to use run-time checks on the _string_ length rather
than on the _array_ length.
Current strings are used using the [ ] notation. This strings
could have the same privilege isn't it?

No, they couldn't. Puzzle for you: how would you extend the
interoperation between strings and char pointers, using your
length-strings?

I have started a library

Library, schmiblary. In C, it is a simple matter of assigning a pointer.
Who wants to use a schema that only makes serious programming more
difficult?

And frankly, that sums up my reaction to this idea. It would make
programming with strings in C considerably more hassle-ful, to, as far
as I can tell, an entirely ephemeral speed advantage, which is more than
likely to be offset by the more complicated and time-consuming code
needed to handle these strings in any other circumstance than finding
their length - an operation which isn't performed as often as you'd
tihnk, in quality code.

Richard
 
J

jacob navia

Dan Pop said:
In <[email protected]> "jacob navia"

If I didn't already know what a C string is, I wouldn't have been able
to make any sense out of this sentence.

Zero delimited means a zero is written in the last byte to signal the end of
the string Dan
I wanted to emphasize "unbounded" because there is no way to know if
the zero is not there where the pointer will end pointing to...

It doesn't hurt to use your common sense in validating your opinions.
If C strings were "extremely inefficient", that would have been a much
bigger problem 30 years ago, when computers were orders of magnitude
slower than today. Yet, no one produced a fix then. No alternate
string libraries designed and implemented for C since then have
acquired any kind of popularity.

There are many Dan. Just search in Google and you will find zig libraries
that implement this with different emphasis in different objectives.
The objective of this discussion is to see why the *language* doesn't
support any other schema for implementing strings.
Since C programmers aren't the last people to care about efficiency,
what conclusion can you draw?

Since language support doesn't encourage the use of bounded pointers
C string handling is much more error prone than it should be.

Never had the traps because of the missing zero?

The failure modes of the string functions in the library like strcpy
are just horrible. Memory corruption is guaranteed unless a zero
is found...
 
L

Leor Zolman

Common, and used, by whom? C++ programmers, I suspect

Well, yes...why would C progammers bother coming up with pet names for
/any/ particular usage style of C++?
. _My_ term for
such a hybrid monstrum would be "A Bloody Mess".

If you want C++, be a man and program in C++. Don't go pretend that
you're almost using C.

I think it makes perfect sense, if you're going to make the move from C to
C++, to start by learning one feature (or just a few features) at a time.
What's wrong with the first one to focus on being, say, std::string ?
-leor

Leor Zolman
BD Software
(e-mail address removed)
www.bdsoft.com -- On-Site Training in C/C++, Java, Perl & Unix
C++ users: Download BD Software's free STL Error Message
Decryptor at www.bdsoft.com/tools/stlfilt.html
 
D

Dan Pop

In said:
Meaning you had to keep track of the length in a separate variable?
Sounds unlikely...

Meaning that the string length was declared explicitly (the size of the
character array containing the string) or implicitly (the number of
characters in the string literal) and it was the compiler that kept
track of it. Also meaning that the original Pascal strings were a
lot less flexible than C strings. Particular Pascal implementations
(e.g. Turbo Pascal) extended the flexibility of the Pascal strings,
by introducing a character count.

The canonical example of language using counted strings is BASIC, not
Pascal. But, even in BASIC's case, it was not part of the (original)
language specification.

Dan
 
N

Nils Petter Vaskinn

There are many Dan. Just search in Google and you will find zig libraries
that implement this with different emphasis in different objectives.
The objective of this discussion is to see why the *language* doesn't
support any other schema for implementing strings.

Because then the language and/or compiler writers would have to choose one
of those implementations, instead of allowing the language user the choice.

Having one or more string implementations with replacement string handling
functions can be a good idea while choosing one as part of the language or
standard library isn't.
 
D

Dan Pop

Zero delimited means a zero is written in the last byte to signal the end of
the string Dan

Nope, it means a zero before the string and another zero after. Delimited
is not a synonym for terminated.
I wanted to emphasize "unbounded" because there is no way to know if
the zero is not there where the pointer will end pointing to...

You don't know where the pointer will end pointing to. Your wording
simply didn't make any sense to anyone but you.

The representation of a string in C is the sequence of characters, up to
and including the null terminator. No kind of pointer is involved in the
representation of a C string.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are many Dan. Just search in Google and you will find zig libraries
that implement this with different emphasis in different objectives.

Are you reading impaired or what? Which of them qualifies as popular?
The objective of this discussion is to see why the *language* doesn't
support any other schema for implementing strings.

No other scheme proved to by better in a GENERAL PURPOSE context.
As you admit yourself, the alternate libraries are designed for well
defined goals, rather than as universal replacements for the C strings.

And the very existence of these libraries proves that the C language DOES
support alternate schemes. So, your point is moot.
Since language support doesn't encourage the use of bounded pointers
C string handling is much more error prone than it should be.

1. This is not a performance issue.

2. This is a *general* problem of C: most C features are error prone in
the hands of the incompetent.
Never had the traps because of the missing zero?
Nope.

The failure modes of the string functions in the library like strcpy
are just horrible. Memory corruption is guaranteed unless a zero
is found...

Dynamic memory allocation has exactly the same problems: write beyond
a dynamically allocated memory bolck (in either direction) and memory
corruption will (most likely) bite you, sooner or later. What is your
better replacement for malloc and friends?

C is a sharp tool *by design*. People who can't use sharp tools or are
afraid of them, should not use C. There are plenty of other programming
languages designed for them so there is no need to turn C into a less
sharp tool (and, therefore less effective in the hands of the competent
programmers) and annoy C's *intended* user base.

There are many ways in which C needs to be extended, but adding more
string formats is not one of them. You're wasting your time trying to
fix something that isn't broken.

Dan
 
R

Rob Thorpe

Mike Wahler said:
You needn't use all of it. You seem to be wanting
a 'real' string type, which C++ has, and it's not
at all difficult to use. Actually, if such a type
is needed, imo that's a good enough reason to use C++
(even if everything else is only the common subset
of the two languages).


So don't use 'em.


I use the parts I find useful, discard the rest.


Or ignore it. Simple, huh?

The problem is you just can't ignore them, because on a project of any
size others will start to use them.

Do you think it's practical to use a subset of C++ for anything
outside of your own personal code, or maybe code developed by a couple
of people. Maybe in an organisation with very strict procedures.
 
A

Alan Balmer

This is extremely inefficient because at each query of the
length of the string, the computer starts an unbounded
memory scan searching for a zero that ends the string.

But asking for length is not the only thing we do with strings, or
even the most common.
A more efficient representation is:

struct string {
size_t length;
char data[];

Perhaps more efficient for some uses, but less efficient for others.

As you probably know, there are languages which use this type of
string representation. And of course, there is nothing to prevent you
writing your own string-handling library based on this representation.
In fact, that would be an excellent exercise.
 
B

Ben Pfaff

You might get away with this on systems where sizeof(size_t) ==
sizeof('\0'). IOW, an embedded device, most likely.

sizeof(size_t) == sizeof('\0') is very common. For example, it
is true on most "32-bit" systems.

Maybe you meant sizeof(size_t) == sizeof(char).
 
C

CBFalconer

Dik T. Winter said:
Eh? The "Pascal-style"? In the only Pascal I ever used (on the
CDC Cyber, the original Pascal from ETH Zuerich), a string was
implemented as a sequence of characters, and that was it. Nearly
the same as in C, except that there was no terminator.

And that all "strings" have fixed size, known to the compiler at
compile time. They also are indexed 1..length. This still
applies to ISO7185 level 0, but ISO10206 has an extended string
type.
 
D

Derk Gwen

# As everybody knows, C uses a zero delimited unbounded
# pointer for its representation of strings.
#
# This is extremely inefficient because at each query of the
# length of the string, the computer starts an unbounded
# memory scan searching for a zero that ends the string.
#
# A more efficient representation is:

Been there, someone else's done that, and all I got is this lousy T-Shirt
with Tcl_Obj* on it.

There are lots of libraries that provided counted strings, and you can
load these with your code.
 
J

jacob navia

={0x67dffdff,0x9aa9aa6a,0xa77ffda9,0x7da6aa6a,0xa67f6aaa,0xaa9aa9f6,0x1f6},*
p=
b,x,i=24;for(;p+=!*p;*p/=4)switch(x=*p&3)case 0:{return 0;for(p--;i--;i--)case
2:{i++;if(1)break;else default:continue;if(0)case[/QUOTE]
1:putchar(a[i&15]);break;}}}

This is wrong Ben, since you get a "missing return value" warning :)

I would change the sig to this:
char a[]="\n .CJacehknorstu";int putchar(int);int main(void){unsigned long
b[]
={0x67dffdff,0x9aa9aa6a,0xa77ffda9,0x7da6aa6a,0xa67f6aaa,0xaa9aa9f6,0x1f6},*
p=
b,x,i=24;for(;p+=!*p;*p/=4)switch(x=*p&3)case 0:{return
0;for(p--;i--;i--)case
2:{i++;if(1)break;else default:continue;if(0)case
1:putchar(a[i&15]);break;}}return 0;}

See the return 0 at the end???

jacob
 
B

Ben Pfaff

jacob navia said:
={0x67dffdff,0x9aa9aa6a,0xa77ffda9,0x7da6aa6a,0xa67f6aaa,0xaa9aa9f6,0x1f6},*
p=
b,x,i=24;for(;p+=!*p;*p/=4)switch(x=*p&3)case 0:{return 0;for(p--;i--;i--)case
2:{i++;if(1)break;else default:continue;if(0)case
1:putchar(a[i&15]);break;}}}

This is wrong Ben, since you get a "missing return value" warning :)[/QUOTE]

So? It's C99.
 
P

Peter Ammon

jacob said:
As everybody knows, C uses a zero delimited unbounded
pointer for its representation of strings.

This is extremely inefficient because at each query of the
length of the string, the computer starts an unbounded
memory scan searching for a zero that ends the string.

A more efficient representation is:

struct string {
size_t length;
char data[];
};

The length operation becomes just a memory read.
This would considerably speed the programs. The basic
idea is to use a string type that is length prefixed and
allows run-time checking against UB: undefined
behavior.

Comparing strings is speeded up also because when
testing for equality, the first length comparison tells
maybe the whole story with just a couple of
memory reads.

A string like the one described above is not able to
resize itself. Any pointers to it would cease to be valid
when it is resized if the memory allocator is forced to
move memory around. The block where that string was
allocated is bounded by another blocks in memory, and
it is not possible to resize it.

A pointer ( an indirect representation) costs a sizeof(void *)
but allows to resize strings without invalidating the pointers
to them.

struct string {
size_t length;
char *data;
};

There is no compelling reason to choose one or the other.
It depends on the application. In any case, the standard
library could be complemented by
Strcmp
Strcpy
etc., all using length prefixed strings.

Syntactic sugar.

I have added some sugar to this coffee. I always liked coffee
with a bit of sugar. I feel that is too acid without it.

Current strings are used using the [ ] notation. This strings
could have the same privilege isn't it?

The language extension I propose is that the user has the right to
define the operation [ ] for any data type he/she wishes.

Not a big deal for today's compilers.

Length checked strings can then use:

String s;
...
s[2] = 'a';

I think I am proposing the obvious.

Do you agree?

jacob

I don't understand why everyone is comparing this to C++. The obvious
parallel is to Pascal, which used exactly this sort of string
representation.
 
J

jacob navia

I don't understand why everyone is comparing this to C++. The obvious
parallel is to Pascal, which used exactly this sort of string
representation.

Yes. Indeed. And it was a good thing, since bound checked strings
are NECESSARY, specially in the development phase of the program.

jacob
 
D

Default User

Dan Pop wrote:
Particular Pascal implementations
(e.g. Turbo Pascal) extended the flexibility of the Pascal strings,
by introducing a character count.

As I recall, Turbo Pascal stored the size of the string in the first
character slot, thus limiting the max string length to 255 (on the
DOS-based platform I used).

Not very handy if you needed longer strings than that.


Brian Rodenborn
 
A

Arthur J. O'Dwyer

Eh? The "Pascal-style"? In the only Pascal I ever used (on the CDC
Cyber, the original Pascal from ETH Zuerich), a string was implemented
as a sequence of characters, and that was it. Nearly the same as in C,
except that there was no terminator.

For the record and FWIW, I was thinking of Turbo Pascal's strings;
TP3.0 was one of the first languages I learned.

-Arthur
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,141
Messages
2,570,817
Members
47,362
Latest member
ChandaWagn

Latest Threads

Top