
Friedrich Dominicus

CBFalconer said:
Something nobody seems to bother to notice is that C strings tend
to be short, so that execution of strlen() and similar on them is
not a bind. Somebody might care to instrument their actual use of
strlen so as to report the average (and possibly maximum) length at
program run conclusion. I am not about to bother to do so.
You are probably right about that. However, you can check for
yourself: just write a small string and add to it one char after the
other. Of course you will "tune" it, e.g. so that you allocate in larger
chunks or the like, but on the other hand building up a string is
probably not something you do seldom, especially now in the context of
"Web" programming. So you can probably burn a lot of time in the strxxx
functions.
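
The experiment described here can be sketched in a few lines. This is a minimal illustration only; the struct and function names are my own, not from any poster's code:

```c
#include <stdlib.h>
#include <string.h>

/* Grow a string one character at a time, reallocating in doubling
   chunks so the realloc() cost is amortized. Because the length is
   tracked, no strxxx() scan is needed per append. */
struct growstr {
    char *buf;    /* NUL-terminated contents */
    size_t len;   /* current length, excluding the NUL */
    size_t cap;   /* allocated capacity */
};

static int growstr_append(struct growstr *s, char c)
{
    if (s->len + 1 >= s->cap) {          /* need room for c and the NUL */
        size_t ncap = s->cap ? s->cap * 2 : 16;
        char *nbuf = realloc(s->buf, ncap);
        if (nbuf == NULL)
            return -1;
        s->buf = nbuf;
        s->cap = ncap;
    }
    s->buf[s->len++] = c;
    s->buf[s->len] = '\0';
    return 0;
}
```

Appending with strcat() instead would re-scan the whole string on every call, which is where the time in the strxxx functions goes.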

Regards
Friedrich
 

Richard Heathfield

CBFalconer said:
Something nobody seems to bother to notice is that C strings tend
to be short, so that execution of strlen() and similar on them is
not a bind.

To be more precise, strings in general tend to be short, and C strings are
no exception to this. I suspect that the distribution *range* is actually
very wide indeed, but that the distribution itself is extraordinarily
skewed towards the low end.
 

Andrew Poelstra

Robert Latest wrote:

Well, I say that strlen will be always slower with zero terminated
strings and almost a NOP with length delimited strings.

strlen must always scan the characters to find the zero.

A length delimited string can just return the length immediately.
Since we're making assumptions about how functions work, here's an
idea: suppose that your length variable is stored on disk in swap;
loading that variable is going to take longer than the processor
using its predefined instructions (such as CMPSB) blowing through
a string.

Also, you have to consider how much more memory it takes to store
the lengths of strings, when it has been shown that in most cases
the length is /irrelevant/.
 

jacob navia

Andrew Poelstra wrote:
Also, you have to consider how much more memory it takes to store
the lengths of strings, when it has been shown that in most cases
the length is /irrelevant/.

Irrelevant of course. Completely irrelevant. All buffer overflows were
coded that way:

Who cares about how long that string is exactly?

It will fit in a 512 byte buffer anyway!
 

Herbert Rosenau

Universally, by simply downloading and compiling:

<http://cbfalconer.home.att.net/download/strlcpy.zip>

and I mean universally. Those are written in standard C,
deliberately do not use any routines in the standard library, are
re-entrant, so are quite suitable for the most resource limited
embedded systems as well as anything else.
My compiler comes with 2 different standard libraries:
- not thread-safe, designed for single-threaded apps
- completely thread-safe, because most apps will need that
for their requirements, being multithreaded

So you only tell the linker which implementation you use, single- or
multithreaded. That does NOT stop the compiler from inlining functions
from the standard library whenever possible.

--
Tschau/Bye
Herbert

Visit http://www.ecomstation.de the home of german eComStation
eComStation 1.2 Deutsch ist da!
 

Herbert Rosenau

You are probably right about that. However, you can check for
yourself: just write a small string and add to it one char after the
other. Of course you will "tune" it, e.g. so that you allocate in larger
chunks or the like, but on the other hand building up a string is
probably not something you do seldom, especially now in the context of
"Web" programming. So you can probably burn a lot of time in the strxxx
functions.

No. I would simply not use a single one of the str...() functions for
that, and I do not do so in the editor I develop, because when I have
to split a string into multiple ones, or concatenate multiple strings
into a single one, it is done by touching each character only once in
the whole run, instead of wobbling forwards and backwards multiple
times through the same bytes.

While I read in the data I have to handle, I know the maximum size I
need; setting pointers to the right areas while scanning through the
data makes access to it easy and really quick while working on it. The
same occurs when building the internal, well-normed output as it gets
written to external media.

getc()/putc() are nice functions that overcome the need to handle each
single character by copying inside the system, inside the C runtime,
and inside a thousand other functions. So there is no need to fiddle
around with ...get...(), fread()/fwrite(), fprintf() and so on: getc(),
and when the character read does not fit on the current pass, ungetc()
and go back to the higher level that will find it on its next get and
do whatever is needed further. No buffer that can overflow gets filled
with unknown data, no need to read a number of unknown bytes of unknown
length, not even the need for multiple ?alloc() calls with blindly
calculated size parameters.

Get the size of the file, malloc() a buffer of the size needed to hold
it, malloc() the structs needed to maintain the encoded data, and start
reading the file byte by byte. The runtime will internally reserve
enough buffer space to move the blocks/sectors/or whatever the medium
uses to organise a data stream from the external medium into memory as
quickly as possible.

When data has to be written, the same goes on: write byte after byte,
encoding and formatting on the fly. There is nothing that gives you
more control over a stream, and is done more quickly, than reading and
writing on the fly byte by byte, instead of copying, with or without
the help of a format string, from one location to another, only to copy
it from there to somewhere else, only to copy it again......
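
The getc()/ungetc() style described here can be sketched as follows. This is a minimal illustration of the technique, not Herbert's code; read_number is an invented example:

```c
#include <ctype.h>
#include <stdio.h>

/* Consume digits from the stream to build a number, and push the first
   non-digit back with ungetc() so the caller sees it on its next
   getc(). No intermediate buffer is filled, so nothing can overflow. */
static long read_number(FILE *fp)
{
    long n = 0;
    int c;
    while ((c = getc(fp)) != EOF && isdigit((unsigned char)c))
        n = n * 10 + (c - '0');
    if (c != EOF)
        ungetc(c, fp);   /* not ours: leave it for the higher level */
    return n;
}
```

One character of pushback is all ISO C guarantees for ungetc(), which is exactly what this single-pass style needs.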

 

Herbert Rosenau

Andrew Poelstra wrote:

Irrelevant of course. Completely irrelevant. All buffer overflows were
coded that way:

Who cares about how long that string is exactly?

You're speaking about the Pascal programs I had to rewrite _in C_
because buffer overflow was the cause of 93% of all the coredumps they
had produced.

A crazy programmer is in no way hindered from producing buffer
overflows, even in languages that check each and every statement for
errors.
It will fit in a 512 byte buffer anyway!

Really, you're speaking about Pascal or Fortran.

When you have a problem with that, simply use a language that checks
each and every thing that can be erroneous, like addition, subtraction,
multiplication, and division of int; these are the most common causes
of unwanted and incorrect results a program can produce. Buffer overflow
in C is always only the result of a dumb programmer who was unable to
do his job right, like Jacob Navia, who has never learned how to
program failsafe.

 

Andrew Poelstra

Andrew Poelstra wrote:

Irrelevant of course. Completely irrelevant. All buffer overflows were
coded that way:

Who cares about how long that string is exactly?

It will fit in a 512 byte buffer anyway!
That is the worst coding I have ever seen, mainly
because it is so common. However, fitting a string
into a buffer doesn't require you to know the string
length; it merely requires you to allocate more
memory every time you overflow the buffer.

Knowing a string length requires having a string,
which in turn requires memory already having been
allocated for it. You can't have the length before
the string, so your scenario has nothing to do with
the issue at hand.
 

Keith Thompson

jacob navia said:
Robert Latest wrote:

Well, I say that strlen will be always slower with zero terminated
strings and almost a NOP with length delimited strings.

strlen must always scan the characters to find the zero.

A length delimited string can just return the length immediately.

And you need extra space to store the length, and you need to update
the length every time it might change.

Carefully written C code can avoid re-scanning strings in many cases,
and can also avoid storing the length when it's not needed.

I'm sure that storing the length of each string can improve
performance in some cases. This improvement would be greatest in
naively written code, things like:

for (i = 0; i < strlen(s); i++) {
    ...
}
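
For contrast, a sketch of the hoisted form; the operation on the string (upper-casing in place) is an invented example, not from the thread:

```c
#include <stddef.h>
#include <string.h>

/* Hoisting strlen() out of the loop by hand: the string is scanned
   once before the loop instead of once per iteration. */
static void upper_in_place(char *s)
{
    size_t i, len = strlen(s);   /* computed once, outside the loop */
    for (i = 0; i < len; i++) {
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] = s[i] - 'a' + 'A';
    }
}
```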

I also suspect that, for many applications, the overhead of
maintaining the length for each and every string, even when it's not
needed, could exceed the improvement from avoiding re-scanning,
particularly when (a) the application is carefully written, (b) most
strings are short, (c) the compiler is clever enough to hoist some
length calculations out of loops, and/or (d) the hardware is optimized
to work with NUL-terminated strings.

You seem to be assuming that storing the length for every string will
improve performance *by definition*. I don't believe that's true.

Have you *measured* the difference with any real applications? If
not, you're only guessing. (So am I, but I admit it.)
 

Malcolm

jacob navia said:
Robert Latest wrote:

Well, I say that strlen will be always slower with zero terminated strings
and almost a NOP with length delimited strings.

strlen must always scan the characters to find the zero.

A length delimited string can just return the length immediately.
That's obviously right.
I can believe that, for non-trivial string processing, length-delimited
strings are better.

However, there are certain problems. For instance, say we want to
implement a sub-string finder. strstr() returns a pointer to the first
occurrence of the substring in its argument. How do we replace this by
returning a length-delimited string?

Do we make a local copy of the string? That seems pretty useless, and
expensive. Do we return a length-delimited string that points to the same
memory? The snag there is that when you extend or shrink the master string,
the sub-string has to update. You could say that lendelimited_strstr()
returns an integer offset of the first instance of the sub-string. That's
maybe the answer, but then it puts the burden of constructing a usable
substring on the calling programmer.
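
The offset-returning option could look like this. A sketch only: the lstr type and lstr_find are hypothetical, not from lcc-win or any real library:

```c
#include <stddef.h>
#include <string.h>

/* A hypothetical length-delimited string: pointer plus count. */
struct lstr {
    const char *data;
    size_t len;
};

/* Returns the offset of the first occurrence of needle in hay, or -1.
   No new string is constructed, so nothing dangles if the master
   string is later extended or shrunk. */
static long lstr_find(struct lstr hay, struct lstr needle)
{
    size_t i;
    if (needle.len > hay.len)
        return -1;
    for (i = 0; i + needle.len <= hay.len; i++) {
        if (memcmp(hay.data + i, needle.data, needle.len) == 0)
            return (long)i;
    }
    return -1;
}
```

The burden Malcolm mentions is visible here: the caller must turn the offset back into a usable substring itself.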

By junking NUL-terminated strings, essentially you've destroyed the
cleanliness of the C string library.

The main problem with this whole venture, however, is political. If you were
Supreme Emperor of the West you could force the whole world to use lcc-win.
Maybe that would be a good thing. But since you are not, people don't want
to get locked into a standard that hardly anyone yet uses, and which may
easily fail. Also, there is a perception that adopting lcc-win would cede you
the status of Supreme Emperor. People will show submissive behaviour to
socially-dominant individuals, but not easily.
 

toby

jacob said:
...
Length delimited strings are INHERENTLY faster than zero terminated ones.

This fuzzy statement may give a clue to why you oversell them.

In fact, only "finding the length of" is faster. Any per-character
operation (copying, translating, searching, examples abound) is no
faster at all.

Why is finding the length faster? Because you've cached it. That option
is equally available to terminated string users. None of this is rocket
surgery...
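
The caching option can be as simple as carrying the computed length alongside an ordinary C string. The struct and names here are illustrative, not from any real library, and cstr_append assumes the underlying buffer is large enough:

```c
#include <stddef.h>
#include <string.h>

/* An ordinary NUL-terminated string together with its cached length:
   "finding the length" becomes a field read, exactly as with a
   length-delimited string. */
struct cstr {
    char *s;      /* ordinary NUL-terminated string */
    size_t len;   /* cached strlen(s) */
};

static struct cstr cstr_wrap(char *s)
{
    struct cstr c;
    c.s = s;
    c.len = strlen(s);   /* the one and only scan */
    return c;
}

/* Mutations must keep the cache honest: update len at the same time.
   Assumes the buffer behind c->s has room for the tail. */
static void cstr_append(struct cstr *c, const char *tail)
{
    size_t tlen = strlen(tail);
    memcpy(c->s + c->len, tail, tlen + 1);   /* +1 copies the NUL */
    c->len += tlen;
}
```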
 

Gordon Burditt

Obviously you think that scanning memory for the terminating zero is
vastly more efficient than accessing it directly with the string length.

Obviously you think that all operations for strings require the
length of the string, and that allocating memory and copying a
string costs nothing.

Describe how you'd implement the length-delimited-string version
of strchr() and strrchr(). They are not allowed to run out of
memory, calling malloc() is likely to be expensive, and memory leaks
are not allowed.

If p and p2 are strings, describe how you'd implement in
length-delimited-string what is done in C now with:

p[n] = '\0';
p2 = p+n+1;

That is, split a string creating two new ones.
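
For comparison, one way the split could look with a hypothetical length-delimited representation; the lstr type and names are illustrative only, not a claim about any particular library:

```c
#include <stddef.h>

/* A hypothetical length-delimited string: pointer plus count. */
struct lstr {
    const char *data;
    size_t len;
};

/* Split s at index n (n < s.len): left = s[0..n), right = s[n+1..len),
   mirroring p[n] = '\0'; p2 = p + n + 1;. Both halves alias the
   original buffer: no allocation, no leak, no NUL byte written. */
static void lstr_split(struct lstr s, size_t n,
                       struct lstr *left, struct lstr *right)
{
    left->data  = s.data;
    left->len   = n;
    right->data = s.data + n + 1;
    right->len  = s.len - n - 1;
}
```

The Nth character is simply s.data[N], so that part of the challenge costs the same in either representation.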

How do you intend to access the Nth character of a string,
(given that you already know the string is that long),
currently done with:
p[N]
?
Each time you access the length of a zero terminated string you must
start that unbounded memory scan, source of countless errors. Operations
like strcat depend on the length of the first string, that must be
recalculated over and over.

Each time you need to create a substring of a length-delimited-string,
and keep a copy of the original, you need to allocate more memory
(and not leak it), a source of even more countless errors.

Obviously you have a different concept for "efficiency" than I do.

Length delimited strings are INHERENTLY faster than zero terminated ones.

For certain operations.

Gordon L. Burditt
 

jacob navia

Gordon Burditt wrote:
Obviously you think that all operations for strings require the
length of the string, and that allocating memory and copying a
string costs nothing.

Describe how you'd implement the length-delimited-string version
of strchr() and strrchr(). They are not allowed to run out of
memory, calling malloc() is likely to be expensive, and memory leaks
are not allowed.

Let's do strrchr, OK?

Here is the code for Strrchr:
StringpA EXPORT overloaded Strrchr(StringA &string, int element)
{
    int i;
    char *p;
    StringpA result;

    if (!StrvalidA(string))
        return invalid_stringpA;
    p = string.content + string.count - 1;
    for (i = string.count; i > 0; --i) {
        if (*p-- == element) {
            result.count = string.count - i;
            result.content = string.content + i;
            result.parent = &string;
        }
    }
    return invalid_stringpA;
}

This function returns a structure describing a position within a String.
This structure is defined as follows:
typedef struct _stringpA {
    size_t count;
    char *content;
    StringA *parent;
} StringpA;

This "fat" pointer defines a position within its parent string, which is
left untouched. Note that the field "content" is redundant, since it must
always be the "content" field of the parent string plus the count. It is
easier this way, though. Probably I will take this field away, but for
now it doesn't matter.

Now consider this test program:
#include <str.h>
#include <stdio.h>
extern int _stdcall GetTickCount(void);
int main(void)
{
    char buffer[2+64*1024];
    int i, t1;

    for (i = 0; i < 64*1024; i++)
        buffer[i] = 'M';
    buffer[i++] = 'Z';
    buffer[i] = 0;
    String &str = new_string(buffer);

    t1 = GetTickCount();
    for (i = 0; i < 10000; i++) {
        Strchr(str, 'Z');
    }
    printf("Strchr takes %d ms\n", GetTickCount() - t1);

    t1 = GetTickCount();
    for (i = 0; i < 10000; i++) {
        strchr(buffer, 'Z');
    }
    printf("strchr takes %d ms\n", GetTickCount() - t1);
}

This program builds a 64K buffer and at the end puts the character 'Z'.

The output of this program is:
Strchr takes 672 ms
strchr takes 1313 ms

Why?

Note that EACH CALL to Strrchr makes an expensive call to StrvalidA to
validate its arguments. Still, Strchr is TWICE as fast.

Why? Since Strchr knows the length of the string, it can start
AT THE END of the buffer and scan backwards! strchr, however, can't do
this and must start at the beginning, working all the way to the end of
the buffer.

OF COURSE this is a very artificial example, but it shows that
knowing the length of the string can be very useful in many situations
besides just strlen.

In the normal case, however, strchr could be slightly faster, especially
if the strings are short.

jacob
 

Malcolm

jacob navia said:
Here is the code for Strrchr:
StringpA EXPORT overloaded Strrchr(StringA &string, int element)
{
    int i;
    char *p;
    StringpA result;

    if (!StrvalidA(string))
        return invalid_stringpA;
    p = string.content + string.count - 1;
    for (i = string.count; i > 0; --i) {
        if (*p-- == element) {
            result.count = string.count - i;
            result.content = string.content + i;
            result.parent = &string;
        }
    }
    return invalid_stringpA;
}

This function returns a structure describing a position within a String.
This structure is defined as follows:
typedef struct _stringpA {
    size_t count;
    char *content;
    StringA *parent;
} StringpA;

This "fat" pointer defines a position within its parent string, which is
left untouched. Note that the field "content" is redundant, since it must
always be the "content" field of the parent string plus the count. It is
easier this way, though. Probably I will take this field away, but for
now it doesn't matter.

Now consider this test program:
#include <str.h>
#include <stdio.h>
extern int _stdcall GetTickCount(void);
int main(void)
{
    char buffer[2+64*1024];
    int i, t1;

    for (i = 0; i < 64*1024; i++)
        buffer[i] = 'M';
    buffer[i++] = 'Z';
    buffer[i] = 0;
    String &str = new_string(buffer);

    t1 = GetTickCount();
    for (i = 0; i < 10000; i++) {
        Strchr(str, 'Z');
    }
    printf("Strchr takes %d ms\n", GetTickCount() - t1);

    t1 = GetTickCount();
    for (i = 0; i < 10000; i++) {
        strchr(buffer, 'Z');
    }
    printf("strchr takes %d ms\n", GetTickCount() - t1);
}

This program builds a 64K buffer and at the end puts the character 'Z'.

The output of this program is:
Strchr takes 672 ms
strchr takes 1313 ms

The example, as you admit, is contrived to show the superiority of the
length-delimited string, by passing in a 64K argument with the target at
the end. Even so, you get less than an order of magnitude of speed-up;
less than 2 times, in fact.

What I would conclude from this is that attempts to speed up the string
library are rather a waste of time, and the focus should be on safety and
usability.
 

jacob navia

Malcolm wrote:
The example, as you admit, is contrived to show the superiority of the
length delimited string, by passing in a 64K argument with the target at the
end. However you have less than an order of magnitude speed up, less than 2
times, in fact.

What I would conclude from this is that attempts to speed up the string
library are rather a waste of time, and the focus should be on safety and
usability.

Yes, you are right. I just wanted to say that strrchr is an example
where length-delimited strings make a more performant algorithm possible,
but we are discussing here about some nanoseconds' worth of CPU time.

I agree that the emphasis should be on safety and usability.

jacob
 

SM Ryan

# Malcolm wrote:
# >
# > The example, as you admit, is contrived to show the superiority of the
# > length delimited string, by passing in a 64K argument with the target at the
# > end. However you have less than an order of magnitude speed up, less than 2
# > times, in fact.
# >
# > What I would conclude from this is that attempts to speed up the string
# > library are rather a waste of time, and the focus should be on safety and
# > usability.
#
# Yes, you are right. I just wanted to say that strrchr is an example
# where length-delimited strings make a more performant algorithm possible,
# but we are discussing here about some nanoseconds' worth of CPU time.
#
# I agree that the emphasis should be on safety and usability.

Unrestrained use of strcat can turn O(n) to O(n^2). Other string
functions don't change the order of magnitude; even if you have
to scan a string twice, that's a 2x slow down, not nx.
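
The strcat blow-up described here, and its usual fix, can be seen in a small sketch. Names are illustrative, and both functions assume the destination buffer is large enough for all the pieces:

```c
#include <stddef.h>
#include <string.h>

/* O(n^2) overall: strcat re-scans dst from the start on every call. */
static void join_slow(char *dst, const char *pieces[], int n)
{
    int i;
    dst[0] = '\0';
    for (i = 0; i < n; i++)
        strcat(dst, pieces[i]);
}

/* O(n) overall: remember where the string currently ends. */
static void join_fast(char *dst, const char *pieces[], int n)
{
    int i;
    char *end = dst;
    for (i = 0; i < n; i++) {
        size_t len = strlen(pieces[i]);
        memcpy(end, pieces[i], len);
        end += len;
    }
    *end = '\0';
}
```

Keeping the end pointer is exactly the "cache the length" option discussed earlier, applied to concatenation.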
 

Malcolm

jacob navia said:
Here is the code for Strrchr:
StringpA EXPORT overloaded Strrchr(StringA &string, int element)
{
    int i;
    char *p;
    StringpA result;

    if (!StrvalidA(string))
        return invalid_stringpA;
    p = string.content + string.count - 1;
    for (i = string.count; i > 0; --i) {
        if (*p-- == element) {
            result.count = string.count - i;
            result.content = string.content + i;
            result.parent = &string;
            ^^^^^^^^^^^^^^^^^^^^^^^^^
            return result; /* !!!!!!!!!!!!!! */
        }
    }
    return invalid_stringpA;
}

The function is bugged. That was why it was giving an impossibly poor
performance gain on the problem tuned for it.
 

jacob navia

Malcolm wrote:
^^^^^^^^^^^^^^^^^^^^^^^^^
return result; /* !!!!!!!!!!!!!! */



The function is bugged. That was why it was giving an impossibly poor
performance gain on the problem tuned for it.
It can be that it is bugged.
Can you specify what is bugged?

jacob
 

SM Ryan

# "jacob navia" <[email protected]
# >
# Here is the code for Strrchr:

char *strrchar(char *s, char c) {
    char *r = 0;
    for (;;) {
        if (*s == c) r = s;
        if (!*s) break;
        s++;
    }
    return r;
}

Scanning backwards will only give a 2x speed-up (if c is equally
likely anywhere in s, on average strlen(s)/2 characters would
have to be scanned instead of strlen(s) characters this way). Both
scanning directions are still O(strlen(s)), assuming no thrashing.

Scanning very long strings (megabyte strings are not uncommon
in some modern programs) backwards risks thrashing, since VM pagers
are not designed with backward paging heuristics as often as
forward paging heuristics. Also, if there's hardware to do the
scanning, it more likely runs forward only.
 
