strcmp but with '\n' as the terminator

Dan Pop

In said:
This can be trivially avoided by using sprintf instead of strcat :)

Errm, what?

Say you have...

List *scan = NULL;
char buf[4096]; /* we "know" this is long enough */

buf[0] = 0;
scan = beg;
while (scan)
{
strcat(buf, scan->data);
scan = scan->next;
}

...how does sprintf() help? Ok, so you can do something like...

ptr = buf;
while (scan)
{
ptr += sprintf(ptr, "%s", scan->data); /* assume sprintf() has an ISO
* return value*/

We normally assume that standard library functions return what the
standard says they do. Without this assumption, the standard library
becomes (next to) useless.
scan = scan->next;
}

...but then you might as well just do...

ptr = buf;
while (scan)
{
size_t len = strlen(scan->data);

memcpy(ptr, scan->data, len);
ptr += len;

scan = scan->next;
}

Except that it requires more code and is, therefore, less readable and
that it requires one more statement, after the loop, to properly terminate
the string.
...and after you do that more than once you realize that you want...

char *my_stpcpy(char *dst, const char *src)

You can simply name it stpcpy(), especially since this is the name you
use below :)
{
size_t len = strlen(src);

memcpy(dst, src, len + 1); /* len + 1 so the terminating '\0' is copied too */
dst += len;

return (dst);
}

ptr = buf;
while (scan)
{
ptr = stpcpy(ptr, scan->data);
scan = scan->next;
}

...at which point you've just _reinvented the wheel_ for about the
millionth time, creating your own clumsy string API.

Which is pointless, considering that the sprintf-based solution achieves
the same thing, with the same source code complexity, while staying with
the standard API.
All because the c
library string APIs are deficient ... which is pretty much what was
argued.

The only deficiency I can see is that strcpy and strcat (and friends)
have a (mostly) useless return value. For the rare cases when this is
a problem, sprintf provides a solution without needing to reinvent
anything and without having to take the overhead of repetitive strcat()
calls (sprintf has its own overhead, but it is constant per call).

Dan
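
[Editorial sketch] The sprintf-based loop discussed above can be written out in full as follows. This is only a sketch under stated assumptions: the List layout is a guess modeled on the snippet in the thread, join_list is a made-up name, and snprintf (C99) is swapped in for sprintf so that the "we know this is long enough" assumption is checked rather than trusted. The accumulation idea is unchanged: each call returns the number of characters produced, so the write position advances without rescanning the buffer.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical list node, modeled on the List type used in the thread. */
typedef struct List {
    const char *data;
    struct List *next;
} List;

/*
 * Concatenate every node's data into buf, whose capacity cap includes
 * room for the terminator. Returns the resulting length, or -1 if cap
 * is 0, snprintf reports an error, or the text does not fit.
 */
static int join_list(char *buf, size_t cap, const List *scan)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    buf[0] = '\0';                      /* an empty list yields "" */

    for (; scan != NULL; scan = scan->next) {
        /* snprintf returns the length it would have written. */
        int n = snprintf(buf + used, cap - used, "%s", scan->data);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;                  /* output error or truncation */
        used += (size_t)n;              /* advance; no rescan of buf */
    }
    return (int)used;
}
```

Unlike the memcpy loop, the buffer is terminated after every step, so no extra statement is needed after the loop.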
 
Paul Hsieh

(e-mail address removed) says...

By what standards can you say any of that? Buffer overflows are the #1
occurring bug, and the vast majority of them occur in the C string
library.

Accidents are the #1 cause of death for people under the age of 35 in
the United States[1][2], and the vast majority of them are motor vehicle
accidents[3]. Does that mean we should stop using motor vehicles?

But there is only so much you can do about people who have accidents.
Furthermore things *ARE* done to minimize them. That's why cars have
bumpers, crumple zones, air bags and seat belts. That's why microwave
ovens can't operate without the door being closed. That's why razors
have bizarrely shaped enclosures around the blade. The infrastructure
evolves around the need to minimize accidents even if you could argue
that the accident was really the fault of the victim.

Compare this with the C language. In order to make it work and be
adopted, in 1989, compromises were made and lots of questionable
practices were rubber stamped. Ok fine -- for 1989 it was a good
decision because it allowed the language to be rapidly and widely
implemented and adopted. But in the 20+ year lifetime of this language,
we now know this language has serious problems. Nearly every hack,
most general program failures and every buffer overflow->stack hijack
attack can be traced back to the C standard.

Ok -- so what is to be done about this sad state of affairs? Simple,
do *something* whenever there is a standards revision. 1999 was the C
committee's perfect opportunity to do something, *ANYTHING* to try to
mitigate these problems. Even the single solitary act of deprecating
gets() would have at least been a signal that they were thinking about
these issues.

But no, they added in complex numbers that worsen C++ compatibility,
and numerous other irrelevancies to codify "standard practice" for no
good reason. Not surprisingly, C99 has gotten no serious support from
any major vendor -- the closest thing is gcc, and they are still
working on it.
Can you give an example of a nontrivial string manipulation that doesn't
already have O(n) time?

My claim is that there is an *additional* O(n) paid. For those in
theoretical Comp. Sci., this may mean nothing to you if the operation
is O(n) anyways (especially if we ignore the fact that many operations
have an "m" as well as "n"), but Buffer Overflows, paging, and cache
thrashing probably don't mean anything to you either. In which case
real world performance won't mean anything to you either.
[...] I strongly suspect that anything you can come up
with could be done with the C string library with no additional overhead.

I don't claim there is no additional overhead. But all the overhead
is O(1).
C still exposes the best core speed for someone willing to work around
the compiler and pretty much the only useful language with inline assembly
language. So I am stuck with it.

Really? If you go right to assembly, you stop having to work around
the compiler (since there's no longer a compiler to work around), [...]

Look, I don't care whether or not you understand why C (+ assembly
sometimes) is the only real option for writing maintainable and high
performance software.
 
Paul Hsieh

Richard Heathfield said:
Kevin said:
Concatenation of a string of length m with a string of length n is
O(n+m) using strcat, but O(m) if you use a string type that has its
length explicitly stored, rather than indicated by a sentinel.

Using no additional overhead [1], remember

Remember?!?!? Remember where? In one of your processor's 6 precious
registers? You also have to *remember* how much memory you have
allocated and make sure you don't spill over as well, BTW. Oh yes,
and if you are communicating with a library are you going to pass
these remembered quantities around along with the string data? Or
will you let it work it all out with strlen by itself? Of course it's
kind of hard to deduce the actual amount of memory from this
information so you either have to figure it all out from the caller
(thus duplicating some of the logic of the library) or you have to
pass it as a parameter (burning an additional register or stack.)

Or you could screw it and just buffer overflow like everyone else
does.
This can all be managed perfectly satisfactorily using C strings and
temporary variables.

Which my library (and others) is living proof of, of course. Of
course trying to do it all by hand yourself without a centralized
library ... well you read about the weekly buffer overflow attacks
that get reported to www.securityfocus.com or Risks Digest or
www.news.com to see what happens when you try to do that.
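
[Editorial sketch] The complexity argument above (O(n+m) for strcat versus O(m) when the length is stored) can be made concrete with a small counted-string type. Everything here is hypothetical: cstr and cstr_append are illustrative names and do not represent the API of bstring or any other library mentioned in the thread. The point is only that an append copies the new bytes directly to data + len, with the capacity check done in one place.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted-string type; not any particular library's API. */
typedef struct {
    char  *data;  /* kept NUL-terminated so str* functions can read it */
    size_t len;   /* bytes in use, excluding the terminator */
    size_t cap;   /* bytes allocated, including room for the terminator */
} cstr;

/*
 * Append n bytes from src, growing the buffer geometrically.
 * src must not point into s->data. Returns 0 on success, -1 on
 * overflow or allocation failure (the string is left unchanged).
 */
static int cstr_append(cstr *s, const char *src, size_t n)
{
    if (n >= SIZE_MAX - s->len)
        return -1;                       /* len + n + 1 would overflow */

    size_t need = s->len + n + 1;
    if (need > s->cap) {
        size_t cap = s->cap ? s->cap : 16;
        while (cap < need) {
            if (cap > SIZE_MAX / 2) { cap = need; break; }
            cap *= 2;
        }
        char *p = realloc(s->data, cap);
        if (p == NULL)
            return -1;
        s->data = p;
        s->cap = cap;
    }
    memcpy(s->data + s->len, src, n);    /* O(n): no scan for the old end */
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}
```

A string starts out as `cstr s = {NULL, 0, 0};` and is released with `free(s.data)`. The cost is the two extra size_t fields per string that the thread argues about.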
 
Paul Hsieh

James Antill said:
But this says nothing about how good or bad the C-library string API is.

memcpy: Useful, but requires the programmer to keep track of metadata for
dst.
memmove: Same as memcpy.
strcpy: Most commonly used for buffer overflows, as with all the str*
functions to create data the two inputs cannot be the same.
strncpy: Most broken interface ever
strcat: O(n)
strncat: O(n) Plus dst must be a valid NIL terminated c style string
memcmp: Useful, but requires the programmer to keep track of metadata for
both arguments and properly merge them (you can "fix" having to
merge the metadata by using strncpy() but I wouldn't recommend
this).
strcmp: Useful, assuming you have valid c style strings.
strcoll: Same as strcmp
strncmp: Same as memcmp
strxfrm: Can be used as a non-broken strncpy() if you don't mind confusing
everyone (and you don't use LC_COLLATE).
memchr: Same as memcpy
strchr, strcspn, strpbrk, strrchr, strspn, strstr, strlen: Same as strcmp
strtok: Often used badly, destroys its input ... sometimes even horribly
abused as a side band parameter to functions.
memset: Same as memcpy
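
[Editorial sketch] The strncpy entry in the list above refers to a well-known trap: when the source is at least n characters long, strncpy writes exactly n bytes and no terminator, and when it is shorter, it pads the rest of the destination with zeros. A bounded copy that always terminates and reports truncation might look like this; copy_bounded is a hypothetical helper name, not a standard function.

```c
#include <string.h>

/*
 * Hypothetical helper: copy src into dst, whose capacity cap includes
 * room for the terminator. dst is always terminated when cap > 0.
 * Returns 0 if src fit completely, -1 if it was truncated or cap is 0.
 */
static int copy_bounded(char *dst, size_t cap, const char *src)
{
    size_t len = strlen(src);

    if (cap == 0)
        return -1;                 /* nowhere to put even a terminator */
    if (len >= cap) {
        memcpy(dst, src, cap - 1); /* keep as much as fits */
        dst[cap - 1] = '\0';
        return -1;                 /* caller can detect the truncation */
    }
    memcpy(dst, src, len + 1);     /* includes the terminator */
    return 0;
}
```

Note that this still requires the caller to track the destination's capacity, which is the metadata burden the list attributes to memcpy.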

Oooh! Nice list. I wonder where you got the idea for doing this from
.... ;)
[2] Malloc implementations I've seen require at least 16 bytes of overhead
per object, so you get 16 + 4 + 1 vs. 16 + 1

Yeah, and more importantly, people trying to mitigate buffer overflows
by allocating for the worst case will, of course, waste *far more* in
overhead on average.
 
Richard Heathfield

Paul said:
Richard Heathfield said:
Kevin said:
As to being fast -- that's impossible unless the functions are
absolutely
trivial. The C-library basically imposes an additional minimum O(n)
on all non-trivial string manipulations.

Can you give an example of a nontrivial string manipulation that
doesn't
already have O(n) time? I strongly suspect that anything you can come
up with could be done with the C string library with no additional
overhead.

Concatenation of a string of length m with a string of length n is
O(n+m) using strcat, but O(m) if you use a string type that has its
length explicitly stored, rather than indicated by a sentinel.

Using no additional overhead [1], remember

Remember?!?!? Remember where?

In a size_t object.
In one of your processor's 6 precious
registers?

<shrug> The number of registers my processors have is not something that
concerns me when I'm writing portable code. For all I know, the program
might be running on Peter Seebach.
You also have to *remember* how much memory you have
allocated and make sure you don't spill over as well, BTW.

Thanks for reminding me. It had quite slipped my mind.
Oh yes,
and if you are communicating with a library are you going to pass
these remembered quantities around along with the string data?

That would be wise, don't you agree?
Or
will you let it work it all out with strlen by itself?

That depends on the library, of course.
Of course it's
kind of hard to deduce the actual amount of memory from this
information so you either have to figure it all out from the caller
(thus duplicating some of the logic of the library) or you have to
pass it as a parameter (burning an additional register or stack.)

Yes. This is called "programming".
Or you could screw it and just buffer overflow like everyone else
does.

Can't be bothered.
Which my library (and others) is living proof of, of course. Of
course trying to do it all by hand yourself without a centralized
library ... well you read about the weekly buffer overflow attacks
that get reported to www.securityfocus.com or Risks Digest or
www.news.com to see what happens when you try to do that.

I've never seen any of my production programs reported there yet.
 
Dan Pop

In said:
thing). The question is why builtin C strings use a sentinel method
rather than a length/end-pointer method to indicate their extent[%] - are
there any downsides to the latter?

On the PDP11, strcpy is simpler and faster with null-terminated strings;
here's the complete implementation, assuming the arguments are passed in
registers (DST is the register receiving the first argument, SRC is the
register receiving the second argument and R0 contains the return value):

STRCPY: MOV DST, R0
LOOP: MOVB (SRC)+, (DST)+
BNE LOOP
RET
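
[Editorial sketch] For readers who do not know PDP-11 assembly, the loop above corresponds line for line to the classic C idiom below. MOVB sets the condition codes from the byte it moves, so the BNE needs no separate compare; in C the same test is the value of the assignment expression. The name copy_string is hypothetical, chosen only to avoid clashing with the library's strcpy.

```c
/* C rendering of the four-instruction PDP-11 routine. */
static char *copy_string(char *dst, const char *src)
{
    char *ret = dst;                    /* MOV  DST, R0                 */

    while ((*dst++ = *src++) != '\0')   /* MOVB (SRC)+, (DST)+ ; BNE    */
        ;                               /* loop ends after copying '\0' */

    return ret;                         /* RET                          */
}
```

A counted-string copy would need a length load and a counter in addition to the two pointers, which is the extra cost the post alludes to.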

But the real reason must be searched elsewhere. Languages using counted
strings provide a higher level API for string manipulation, i.e. they
take care of allocation issues in a transparent fashion and the character
count specifies not only the string length but also the size of the
space allocated to the string. If you copy a string, space for the
destination string will be automatically allocated, if you shrink a
string, the additional bytes will be automatically reclaimed by the
run time system. OTOH, such languages don't have pointers that can
point in the middle of a string and be effectively used as substrings.

The last sentence above also hints at the advantage of C strings:
flexibility with minimum overhead:

char *path = "/foo/bar/baz.c";
char *file = strrchr(path, '/');
if (file == NULL) file = path;
else file++;

With counted strings, the above is impossible: a new string has to
be created to hold the file name.
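
[Editorial sketch] Wrapped as a function, the snippet above looks like this. file_part is a hypothetical name. The returned pointer aliases the caller's string, so it costs nothing but is only valid as long as the original path is.

```c
#include <string.h>

/*
 * Return a pointer to the final path component inside path itself.
 * Nothing is copied or allocated. A path ending in '/' yields "".
 */
static const char *file_part(const char *path)
{
    const char *slash = strrchr(path, '/');

    return slash != NULL ? slash + 1 : path;
}
```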

C strings are well suited to a language like C, the only glitch is the
return value of strcpy and strcat: a pointer to the null character in the
destination string would be a lot more useful when concatenating together
many short strings.
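
[Editorial sketch] A copy function with the return value wished for above might be sketched as follows. copy_end is a hypothetical name; POSIX later standardized a function called stpcpy with this contract, but it is not part of ISO C as discussed in this thread.

```c
#include <string.h>

/*
 * Like strcpy, but returns a pointer to the terminating '\0' written
 * in dst, so the next piece can be appended without rescanning.
 */
static char *copy_end(char *dst, const char *src)
{
    size_t len = strlen(src);

    memcpy(dst, src, len + 1);   /* len + 1 copies the terminator too */
    return dst + len;
}
```

Chained calls such as `p = copy_end(p, a); p = copy_end(p, b);` leave the buffer terminated after every step, and each call touches only the bytes it copies. The destination must still be large enough; the function does no bounds checking.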

Dan
 
Kevin Easton

Dan Pop said:
But the real reason must be searched elsewhere. Languages using counted
strings provide a higher level API for string manipulation, i.e. they
take care of allocation issues in a transparent fashion and the character
count specifies not only the string length but also the size of the
space allocated to the string. If you copy a string, space for the
destination string will be automatically allocated, if you shrink a
string, the additional bytes will be automatically reclaimed by the
run time system. OTOH, such languages don't have pointers that can
point in the middle of a string and be effectively used as substrings.

The last sentence above also hints at the advantage of C strings:
flexibility with minimum overhead:

char *path = "/foo/bar/baz.c";
char *file = strrchr(path, '/');
if (file == NULL) file = path;
else file++;

With counted strings, the above is impossible: a new string has to
be created to hold the file name.

I was thinking about something more like a struct-that-isn't (similar to
_Complex, in some ways?) - where

_String path = "/foo/bar/baz.c";

creates path with a pointer to the start of the string literal and a
length of 14 - when you do:

_String file = strrchr(path, '/');

strrchr would return a _String with an internal pointer to the last / of
the string literal, and a length of 6 (so both _String objects reference
the same memory - more like augmented pointers than fully encapsulated
strings).
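
[Editorial sketch] Without any language change, the "augmented pointer" idea can be approximated with an ordinary struct. The names strview, sv_from and sv_rchr are hypothetical, and this is a library-level stand-in for the proposed built-in _String type, not the proposal itself. It reproduces the numbers in the example: a length of 14 for the path and 6 for the part starting at the last '/'.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for _String: a pointer plus an explicit length. */
typedef struct {
    const char *ptr;
    size_t      len;
} strview;

/* Build a view over a NUL-terminated string (one strlen, done once). */
static strview sv_from(const char *s)
{
    strview v = { s, strlen(s) };
    return v;
}

/*
 * Analogue of strrchr: a view from the last occurrence of c to the end
 * of s. The result shares memory with s; ptr is NULL if c is absent.
 */
static strview sv_rchr(strview s, char c)
{
    strview out = { NULL, 0 };
    size_t i = s.len;

    while (i-- > 0) {
        if (s.ptr[i] == c) {
            out.ptr = s.ptr + i;
            out.len = s.len - i;
            break;
        }
    }
    return out;
}
```

Because a view carries no terminator guarantee, passing `v.ptr` to the str* functions is only safe when the view runs to the end of the underlying string, as it does here.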
C strings are well suited to a language like C, the only glitch is the
return value of strcpy and strcat: a pointer to the null character in the
destination string would be a lot more useful when concatenating together
many short strings.

It would - it would also have been nice to have the limit-pointer
versions like strlcat().

- Kevin.
 
Dan Pop

In said:
I was thinking about something more like a struct-that-isn't (similar to
_Complex, in some ways?) - where

_String path = "/foo/bar/baz.c";

creates path with a pointer to the start of the string literal and a
length of 14 - when you do:

_String file = strrchr(path, '/');

strrchr would return a _String with an internal pointer to the last / of
the string literal, and a length of 6 (so both _String objects reference
the same memory - more like augmented pointers than fully encapsulated
strings).

If you think about it deeper, you'll realise that it would take too much
complexity hidden behind a single language feature. You have to support
all the pointer operations on the _String type, but also provide special
operations for manipulating the pointer component and the length component
separately (e.g. you need to point your _String to some allocated memory
block or to truncate your _String). The semantics of == are also
"interesting". The more I think about it, the more I see the
complexities of C++ creeping into C ;-)

Dan
 
James Antill

I disagree (although of course that might just mean that I have less
experience of fighting malware than you do). I find strcpy to have

I meant that often you need to find the length and sanity check it
anyway, so you almost always have all the inputs you need for a call to
memcpy()
expressive power, which is why I prefer it to memcpy when strings are
involved.

This is nice, like using NULL instead of 0, the problem comes when you
have a length metadata variable that is implicitly part of the call (Ie.
things change if/when you alter it) ... but doesn't appear in the
arguments.
Um, yes, I've seen code like that too. My LART had memmove written on it (on
the bit just surrounding the sticky-out nail), in large letters. Once the
blood had stopped flowing out quite so freely.

*breaks into song* ... "If I had a LART, I'd LART all over this world."

Of course there's six string functions to add data (including s(n)printf())
and only one memmove().
 
Dan Pop

In said:
That's the only deficiency?
Maybe you meant that's the only deficiency in the example. Arbitrary
sized source, source with NIL characters, substituting data, removing
parts of the data or dynamically working out what size the destination
needs to be to hold all the data. These are all handled poorly or not at
all.

You're badly missing the point of C strings. They are not supposed to
provide a general solution to *any* text manipulation problem. If you
need Perl, you know where to find it.
1. A lot of people don't normally see sprintf()/snprintf() used like this,
and so it's much easier for them to understand something that looks like
strcpy()/strncpy() with the correct semantics.

Arguments based on people's incompetence are bogus. Especially in a case
like this, where it is trivial to figure out what happens, even if you
aren't familiar with the technique.
2. People who sometimes use sprintf()/snprintf() in this way screw it up
enough that I would recommend something easier to use.

See above. People can easily misuse each and every feature of the
language and its library.
3. The constant overhead for sprintf() is non-trivial, so you might as
well use the stpcpy() solution anyway ...

Only if, after profiling, you have determined that this is the performance
bottleneck of your application. Only fools microoptimise before
determining whether it is necessary or not. Unlike sprintf(), stpcpy()
is not a standard library function. Therefore, its usage reduces the
code readability, which is not acceptable without a *good* reason.
or think ahead and use something
better where other people have already written/tested the extra functions
for you.

Same comment as above: using extra functions reduces the code readability.
So, there must be a compelling reason for using them.

Dan
 
Dave Thompson

On Sun, 20 Jul 2003 16:52:40 GMT, (e-mail address removed) (Paul Hsieh) wrote:

[casting fread() as if it took void* rather than FILE*, or wrapping]
Yes, I am aware of this issue; I have made a note of this in my documentation.
I am blatantly recommending that people break the ANSI rules for this. BTW,
can you name me at least one platform where this actually ends up being an
issue? I would like to make a note of it in my documentation, but I don't seem
to have ever encountered a platform which implemented pointers for one type
different from pointers of another type. In fact I am suspicious as to whether
such a platform exists.
I'm going to treat your question as limited to pointers to different
data types; as Chris T notes downthread in C functions are in the type
system, and pointer to function is a kind of pointer, but that's not
relevant to this issue.

In addition to the usually-cited DG Eclipse and I'm pretty sure Nova,
I believe some Pr1me's had word pointers different from byte pointers.
But I doubt many are still in use, especially using C.

PDP-10 does have a different format, but the word-address bits
coincide, so writing a correctly-aligned byte pointer and using it as
a word pointer would actually work. (And there are still some clone
machines and a very few restored ones operating; the original C
compilers for this machine were long pre-Standard, but last year a gcc
port was reported to be in work!)

I do have experience of one platform that arguably still exists: the
original Tandem NonStop, aka T-16 or TNS1, had different byte and word
(16-bit) pointers. Tandem's next architecture step, the TNS2, added
byte (32-bit) "extended" pointers to everything, but still supported
"standard" 16-bit pointers. And the only-then-supported C compiler
used a word pointer for anystruct* -- so all structs were aligned and
padded at least to a two-byte boundary, even if their contents didn't
require it -- while of course void* had to be a byte pointer.

Then Tandem^WCompaq^WHP went to a RISC (MIPS) CPU, TNSR, which also
emulates TNS1 and TNS2, and the toolchain therefor is still supported
("nonnative" compiler, here in "small" model) -- and I know people who
are still compiling and running code written in the days when TNS2 was
new and extended addresses inefficient, containing 16-bit pointer
conversions that shift left and right one bit, and would fail horribly
if they didn't. They don't write new programs this way, but when they
need to make relatively small changes to (often quite large) existing
systems, they keep the existing architecture, and if they wanted to
add your code in such an environment it would fail.
[...] Perhaps you should include this wrapper function for fread in your
library, since reading from FILE * is likely to be quite common?

A possibility, but then it would force your program to link with file
manipulation functions (or at least fgetc and fread.)
Not, on most (reasonable) implementations, if you put it in a separate
source file/translation unit and hence a separate .o in the .a.

Alternatively, since you're distributing as source, just put it in a
separate file that the user can choose to compile or not.

Although, on many systems nowadays, the whole standard library (and
often quite a bit more) is always there even if you don't use it.
I know this is difficult to understand, but bstring is a *STRING LIBRARY*. It
is *NOT* a file library, and makes absolutely no impositions on the
implementation of file streams whatsoever, while still being able to use them.
This is a much more compelling argument to me. If it's out of your
intended scope, leave it out -- especially as it's trivial.

- David.Thompson1 at worldnet.att.net
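
[Editorial sketch] The wrapper approach mentioned above can be sketched as follows. The callback type read_fn and the consumer read_some are hypothetical stand-ins for a library interface that takes an opaque void * handle; they are not taken from the bstring sources. Casting fread itself to such a type and calling it is undefined behavior because the function types are incompatible, whereas converting a FILE * to void * and back, as the adapter does, is well defined on every implementation, including those with differing pointer representations.

```c
#include <stdio.h>

/* Hypothetical callback type: a reader taking an opaque void * handle. */
typedef size_t (*read_fn)(void *buf, size_t elsize, size_t nelem,
                          void *handle);

/* Adapter with exactly the callback's type; it converts the handle
 * back to FILE * before calling the real fread. */
static size_t file_reader(void *buf, size_t elsize, size_t nelem,
                          void *handle)
{
    return fread(buf, elsize, nelem, (FILE *)handle);
}

/* Hypothetical consumer: reads up to cap bytes through any reader. */
static size_t read_some(read_fn rd, void *handle, char *dst, size_t cap)
{
    return rd(dst, 1, cap, handle);
}
```

A caller passes `file_reader` and the FILE * itself, for example `read_some(file_reader, fp, buf, sizeof buf)`. Placing the adapter in its own source file keeps the string code free of any link-time dependency on stdio.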
 
Paul Hsieh

Dave Thompson said:
On Sun, 20 Jul 2003 16:52:40 GMT, (e-mail address removed) (Paul Hsieh) wrote:
[casting fread() as if it took void* rather than FILE*, or wrapping]
Yes, I am aware of this issue; I have made a note of this in my
documentation. I am blatantly recommending that people break the ANSI
rules for this. BTW, can you name me at least one platform where this
actually ends up being an issue? I would like to make a note of it in my
documentation, but I don't seem to have ever encountered a platform which
implemented pointers for one type different from pointers of another
type. In fact I am suspicious as to whether
such a platform exists.

I'm going to treat your question as limited to pointers to different
data types; as Chris T notes downthread in C functions are in the type
system, and pointer to function is a kind of pointer, but that's not
relevant to this issue.

In addition to the usually-cited DG Eclipse and I'm pretty sure Nova,
I believe some Pr1me's had word pointers different from byte pointers. [...]
PDP-10 does have a different format, but the word-address bits
coincide, so writing a correctly-aligned byte pointer and using it as
a word pointer would actually work. [...] I do have experience of one
platform that arguably still exists: the original Tandem NonStop, aka T-16
or TNS1, had different byte and word (16-bit) pointers. Tandem's next
architecture step, the TNS2, added byte (32-bit) "extended" pointers to
everything, but still supported "standard" 16-bit pointers. [...]

An interesting walk down the bowel of forgotten computer history ...
just out of curiosity, how many of these have C99 compliant C
compilers on them, and of those that don't how many are planned or
known to be in the works (that still suffer from these different
pointer representation problems)?
[...] Perhaps you should include this wrapper function for fread in your
library, since reading from FILE * is likely to be quite common?

A possibility, but then it would force your program to link with file
manipulation functions (or at least fgetc and fread.)

Not, on most (reasonable) implementations, if you put it in a separate
source file/translation unit and hence a separate .o in the .a.

Well, I prefer the one file approach for this library as I use shared
static functions that I neither want to maintain in duplicate nor
expose outside of the bstrlib module. Also, if it's implemented in too
many files, then it would motivate me to make scripts/makefiles for
building libraries which are platform specific.
Alternatively, since you're distributing as source, just put it in a
separate file that the user can choose to compile or not.

Well since it was brought up, I have now put it in the documentation.
People running Novas and Tandem NonStop computers who really have a
burning desire to do new development with the bstring library can
transcribe it from there (I hesitate to say "cut and paste", as that
concept might not exist in that universe ...)
 
Kevin Easton

Paul Hsieh said:
Well since it was brought up, I have now put it in the documentation.
People running Novas and Tandem NonStop computers who really have a
burning desire to do new development with the bstring library can
transcribe it from there (I hesitate to say "cut and paste", as that
concept might not exist in that universe ...)

Don't look now, but you're relying on a benign manifestation of
undefined behaviour - ie. if the C Standard was more strict about
constraint violations and required a run-time error to be thrown, as
some other languages do, then it wouldn't be doable :D.

- Kevin.
 
Arthur J. O'Dwyer

[re: Paul's use of undefined behavior in, I suppose, the bstring
library. He tried to call fread() with an incorrect prototype, or
something, which in my limited experience seems like a really dumb
and completely avoidable thing to do]
The ANSI C standard is an inadequate standard. It is important to
recognize when the standard has failed us and promote an alternative
defacto standard behavior when it is correct to do so.

Hee hee!

Why not just fix the bug, instead of attacking the standard?
We all know how you wish ISO had provided your precious ROL opcode
with its own operator, but as far as I can tell this bug has nothing
to do with defects (real or perceived) in the Standard; it's just
a bug introduced by a faulty cast. Fix the program and the bug will
go away.

Also, this whole "promote an alternative de facto standard behavior"
thing is silly. How are *you* going to promote the alternative behavior
(presumably, using the same representation for (void *) as (FILE *))?
*You* aren't the compiler writer. Might as well try to "promote" the
"alternative standard" of 2+2=5 by replacing all occurrences of '5'
in your code.

-Arthur
 
