Text-Based Windows Library


pete

CBFalconer said:
... snip ...

A bulletproof input routine using fscanf:

http://www.mindspring.com/~pfilandr/C/fscanf_input/fscanf_input.c

It would have been much simpler to publish it right here.
However, according to N869:

    s    Matches a sequence of non-white-space
         characters. (footnote 228)

The following is quoted from the reference and modified to reduce
whitespace:
#include <stdio.h>

#define LENGTH 40
#define str(x) # x
#define xstr(x) str(x)

int main(void)
{
    int rc;
    char array[LENGTH + 1];

    puts("The LENGTH macro is " xstr(LENGTH) ".");
    do {
        fputs("Enter any line of text to continue,\n"
              "or just hit the Enter key to quit:", stdout);
        fflush(stdout);
        rc = fscanf(stdin, "%" xstr(LENGTH) "[^\n]%*[^\n]", array);
        if (!feof(stdin)) getc(stdin);
        if (rc == 0) array[0] = '\0';
        if (rc == EOF) puts("rc equals EOF");
        else printf("rc is %d. Your string is:%s\n\n", rc, array);
    } while (rc == 1);
    return 0;
}

which doesn't seem to match the specification here. How come?

I think the circumflex with the newline causes
all of the characters except the newline to be in the scanset.

The net effect is the same as from get_line(),

http://www.mindspring.com/~pfilandr/C/get_line/get_line.c

except that with the fscanf_input way of doing it,
the strings are truncated to LENGTH characters.
 

Keith Thompson

user923005 wrote: [...]
A bulletproof input routine using fscanf:

http://www.mindspring.com/~pfilandr/C/fscanf_input/fscanf_input.c

Uhh ... this is bulletproof? You have to hand-synchronize the LENGTH
and remember the +1 in the array declaration ... using some global
macro namespace for str() ... and this thing doesn't exactly roll off
the tongue, does it? Here is the Bstrlib way of doing things:

bstring b = bgets ((bNgetc) getc, stdin, '\n');

Or:

bstring b = bSecureInput (LENGTH, '\n', (bNgetc) getc, stdin);

if the truncation semantics are that important to you. It's one line,
and there is no confusion, ambiguity or danger. It's also more
powerful, as you can easily implement this on top of sockets or other
interesting input streams.

getc is a function that takes a FILE* argument and returns an int.
Your sample code converts a pointer to the getc function to type
bNgetc, which happens to be a typedef:

typedef int (*bNgetc) (void *parm);

It's legal to convert any pointer-to-function type to any other
pointer-to-function type, but using the result without converting it
back to the original type invokes undefined behavior (C99 6.3.2.3p8).

It will probably happen to work in most implementations (assuming that
function pointers have the same representation, and that void* and
FILE* are passed as arguments in the same way), and perhaps that's
good enough for your purposes, but it's worth mentioning that this is
not strictly portable.

There's a similar issue for the "compar" arguments of the standard
bsearch() and qsort() functions. The usual solution is to use a
wrapper function. For example (untested code):

int getc_wrapper(void *parm)
{
    return getc((FILE *)parm);
}

...

bstring b = bgets (getc_wrapper, stdin, '\n');
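The qsort() case is handled the same way: rather than casting a
differently-typed comparison function, you pass a comparator that has
exactly the type qsort() expects and do the conversion back to the
element type inside it. A minimal sketch (invented names, untested in
the same spirit as above):

#include <stdio.h>
#include <stdlib.h>

/* Comparator with the exact type qsort() expects; the conversion
   back to the real element type happens inside the function. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int values[] = { 3, 1, 2 };
    qsort(values, sizeof values / sizeof values[0],
          sizeof values[0], compare_ints);
    printf("%d %d %d\n", values[0], values[1], values[2]);
    return 0;
}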
 

Keith Thompson

Ed Jensen said:
You are, of course, correct.

It's not clear that length delimited strings are faster than
terminated strings in all cases. It depends on what you do with them.
Using plain C strings, it's often possible to remember the length
rather than recomputing it.

A string is a data structure. Data structures don't have a speed; an
algorithm you apply to a data structure has a speed. The design of
the data structure, of course, can make it easy, difficult, or
impossible to implement fast algorithms that work with it, so this is
a fairly minor quibble, but it would be clearer to discuss the speed
of algorithms.

A commonly used example is strcat():

char big_string[BIG_ENOUGH];

strcpy(big_string, foo);
strcat(big_string, bar);

The call to strcat() has to re-scan the target string from the
beginning to find the terminating '\0' before it can begin copying
characters; if foo is big, the overhead can be significant. But if
you can remember the length of foo from a previous computation, you
can use that to avoid re-scanning:

strcpy(big_string, foo);
strcpy(big_string + foo_len, bar);

This isn't always possible. For example, foo might have been
initialized by some function that doesn't tell you either its length
or the location of the terminating '\0'. But a lot of the overhead
caused by using plain C strings naively can be avoided by using them
more cleverly.
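Spelled out as a complete (if contrived) toy program, with foo_len
computed once and reused, as in the snippet above:

#include <stdio.h>
#include <string.h>

#define BIG_ENOUGH 256

int main(void)
{
    const char *foo = "some big prefix";
    const char *bar = " and a suffix";
    char big_string[BIG_ENOUGH];
    size_t foo_len = strlen(foo);       /* computed once, remembered */

    strcpy(big_string, foo);
    strcpy(big_string + foo_len, bar);  /* no re-scan of big_string */
    puts(big_string);
    return 0;
}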
 

Kelsey Bjarnason

[snips]

Note my qualification of "as far as the compiler is concerned".

Sure. As far as the compiler is concerned, you can get away with hundreds
of different types of undefined behavior as well; the compiler is under
_no_ obligation to even _try_ to figure it out. This doesn't mean that
such code is properly built - just as passing the wrong thing to the str*
functions may get past the compiler, but that doesn't mean the code is
right.

If you're a code monkey, the only question is what _sort_ of code monkey
are you: the sort who takes the time to get things right, within
reasonable limits, or the sort who is liable to say "It compiles... ship
it"?

The former sort tend not to have this sort of problem as a rule. The
latter... well... fine, yes, they do have this problem. They also tend to
have 197 other sorts of problems, so focusing on this one in particular
seems a bit pointless.
 

websnarf

You are, of course, correct.

It's not clear that length delimited strings are faster than
terminated strings in all cases. [...]

Actually it's clear to anyone who has measured it.
[...] It depends on what you do with them.
Using plain C strings, it's often possible to remember the length
rather than recomputing it.

Remember it? Remember it where? In a variable somewhere? How about
just holding it in another field instead? Wait! That's exactly what
length delimited strings are, isn't it?
A string is a data structure. Data structures don't have a speed; an
algorithm you apply to a data structure has a speed. The design of
the data structure, of course, can make it easy, difficult, or
impossible to implement fast algorithms that work with it, so this is
a fairly minor quibble, but it would be clearer to discuss the speed
of algorithms.

A commonly used example is strcat():

char big_string[BIG_ENOUGH];

Of course this may lead to a load-time error. But that's never
discussed.
strcpy(big_string, foo);
strcat(big_string, bar);

And these of course are just buffer overflows waiting to happen
anyways.
The call to strcat() has to re-scan the target string from the
beginning to find the terminating '\0' before it can begin copying
characters; if foo is big, the overhead can be significant.

You are also ignoring the fact that each character from foo and bar
is checked against '\0' for no really good reason. Block copying
mechanisms from the underlying platform are not available. So *both*
the strcpy and strcat functions are potentially slower than they need
to be.
[...] But if
you can remember the length of foo from a previous computation, you
can use that to avoid re-scanning:

strcpy(big_string, foo);
strcpy(big_string + foo_len, bar);

This isn't always possible.

And it *ALWAYS* increases danger, because there are no language
semantics or assistance available to you to keep big_string and
foo_len in sync. It's horrible in this case because what happens if
you decide you want to insert a strcat(big_string, "|"); in between
those two lines? Writing code like this is just inherently
unmaintainable.

And of course, in a discussion about performance you completely miss
the real performance opportunity:

memcpy (big_string, foo, foo_len);
strcpy(big_string + foo_len, bar);

Which, of course, should generally go faster, but does nothing to
alleviate the danger of buffer overflows or maintenance issues.
Though there still isn't anything being done about the second
strcpy().
[...] For example, foo might have been
initialized by some function that doesn't tell you either its length
or the location of the terminating '\0'. But a lot of the overhead
caused by using plain C strings naively can be avoided by using them
more cleverly.

Besides arguing incorrectly, you argue by straw man:

1) With length delimited strings you can perform unrolled string
searching algorithms without testing each character for '\0'.
Furthermore if the two strings are of similar length, and they
completely mismatch, then you avoid any comparisons which would take
the tail of the search string beyond the tail of the string you are
searching in.

2) String comparison includes length checking which speeds up the
typical scenario of long prefix matching for different strings.

3) Operations such as insert and delete can be done in one pass.

4) And as long as we're talking about "clever tricks", any block-based
algorithms (such as are possible for things like "toupper" or "tolower"
or character scanning) become unavailable unless the length is known.
(A sketch follows below.)
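To illustrate point 4: when the length is tracked separately, a block
case-mapping loop needs no per-character '\0' test, so a compiler is
free to unroll or vectorize it. A minimal sketch, with an invented
function name:

#include <ctype.h>
#include <stddef.h>

/* Block case-mapping with a known length: no '\0' test per
   character, so the loop is trivial to unroll or vectorize. */
void upcase_counted(char *s, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}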
 

user923005

As long as programmers are able to code carefully (read: 'without bugs
in their code'), a better string type for C will not be helpful.
Keep in mind that a corporation-destroying bug is not dangerous,
because all you have to do is never make a mistake and it won't
happen.
On the other hand, for those of us who do have a bug once in a great
while, it is comforting to know that a mistake won't wipe out the
entire corporation.
 

¬a\\/b

Speaking not for the C language but for its library:

It is one thing to have a set of library functions that allows
20,000,000 different kinds of errors; it is quite another to have a
set that allows only 200.
 

Chris Hills

SM Ryan said:
# Where can I find a library to create text-based windows applications?

Possibly....on Unix the library is called curses. If you google
"curses library windows" it will show you some candidate libraries
you can investigate. Using curses will simplify porting to Unix
if you should ever want to.


There are several "curses" libraries for the PC. I should have one if
you can't find one.
 

Kelsey Bjarnason

[snips]

It's not clear that length delimited strings are faster than
terminated strings in all cases. [...]

Actually it's clear to anyone who has measured it.

Really. Explain how such strings will improve the speed of strchr.

Using the conventional approach, each operation involves the following:

Dereference the pointer
Compare value against \0
If comparison fails, compare value against "needle"
If comparison fails, increment pointer and loop

With an integer-stored-length string variant, the operation is more like:
Compare index to length
if comparison fails, dereference pointer+index
Compare value against "needle"
If comparison fails, increment index and loop

One might argue that comparing the index to the length might be faster
than comparing the characters, but if so it should be pointed out that the
base+index computation may well be slower than the pointer dereference; it
is not clear that either is guaranteed to be a performance win on all
possible implementations.
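For concreteness, the two shapes in C (illustrative sketches only; a
real strchr also matches the terminator itself, which these omit):

#include <stddef.h>

/* Terminator version: every step tests the character against '\0'. */
const char *find_terminated(const char *s, char needle)
{
    for (; *s != '\0'; s++)
        if (*s == needle)
            return s;
    return NULL;
}

/* Counted version: every step tests the index against the length. */
const char *find_counted(const char *s, size_t len, char needle)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (s[i] == needle)
            return s + i;
    return NULL;
}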

Pedantic, perhaps, but the statement involved "all cases", not just some.
[...] It depends on what you do with them. Using plain C strings, it's
often possible to remember the length rather than recomputing it.

Remember it? Remember it where? In a variable somewhere? How about
just holding it in another field instead? Wait! That's exactly what
length delimited strings are, isn't it?

Yes, but that involves additional computation with any string-modifying
operation, when such computation may only be necessary - or even desirable
- in a small number of cases.
A commonly used example is strcat():

char big_string[BIG_ENOUGH];

Of course this may lead to a load-time error. But that's never
discussed.

Depends how large BIG_ENOUGH is; if it's 256 bytes, it is unlikely to
cause problems. If it's 256K, there's no expectation the code will work
at all, except in a limited number of cases.
And these of course are just buffer overflows waiting to happen anyways.

Assuming one doesn't know how long foo and bar are. Generally, when I
write my code, I actually pay attention to such details.
You are also ignoring the fact that each character from foo and bar is
checked against '\0' for no really good reason. Block copying
mechanisms from the underlying platform are not available. So *both*
the strcpy and strcat functions are potentially slower than they need to
be.

Sure; length-specified strings can be faster in some cases. I think
you'll find it hard to prove they're faster in all cases, as you assert.
And it *ALWAYS* increases danger, because there are no language
semantics or assistance available to you to keep big_string and foo_len
in sync.

Umm... so? It doesn't take any particular genius to load a block of data
from a file, for example, then compare length of input + length of
existing data to length of buffer.
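That check is a one-liner in practice. A minimal sketch with invented
names (buf_size is the total storage, used_len the bytes already in it):

#include <stddef.h>
#include <string.h>

/* Append src_len bytes of src to buf, which already holds used_len
   bytes of string in buf_size bytes of storage. Refuses (returns -1)
   rather than overflowing; keeps the '\0' terminator in place. */
int append_checked(char *buf, size_t buf_size, size_t *used_len,
                   const char *src, size_t src_len)
{
    if (*used_len >= buf_size || src_len > buf_size - 1 - *used_len)
        return -1;                  /* would not fit */
    memcpy(buf + *used_len, src, src_len);
    *used_len += src_len;
    buf[*used_len] = '\0';
    return 0;
}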
It's horrible in this case because what happens if you decide
you want to insert a strcat(big_string, "|"); in between those two
lines?

Then you increment your size by one.
And of course, in a discussion about performance you completely miss
the real performance opportunity:

memcpy (big_string, foo, foo_len);
strcpy(big_string + foo_len, bar);

Which, of course, should generally go faster, but does nothing to
alleviate the danger of buffer overflows or maintenance issues.

Hmm. This would seem to require either that memcpy be replaced with a
function which has innate knowledge of the new length-managed string type,
or exposing the string data itself to the outside world - outside the
control of the built-in string functions. This means we can modify the
buffer directly - potentially making the length field completely invalid.

If the point here is to prevent bad programming practices - the sort that
lead to buffer overflows - this strikes me as a not overly good approach.
You need to keep the entire thing in an opaque type, but if you do, this
removes the option of doing the memcpy above.
Besides arguing incorrectly, you argue by straw man:

1) With length delimited strings you can perform unrolled string
searching algorithms without testing each character for '\0'.

You have to compare the index to the length, instead. Replacing one
integer comparison with another doesn't seem all that significant.
Furthermore if the two strings are of similar length, and they
completely mismatch, then you avoid any comparisons which would take the
tail of the search string beyond the tail of the string you are
searching in.

Not sure what that means. Example:

a_str = "abcde";
b_str = "fghijkl";

Similar strings, complete mismatches. Comparison of the strings involves
comparison of _one_ character: 'a' is not equivalent to 'f', so the
strings don't compare, so why compare further? Or perhaps you mean
strstr, rather than strcmp? In which case:

a_str = "xxx";
b_str = "abcdef";

Similar lengths, complete mismatch... but a maximum of four character
comparisons, plus determination of length, is required to determine this.
In the case of long strings, the gains could be more significant, but if
you're comparing long strings on a regular basis, I'd tend to think
Boyer-Moore or the like would be more appropriate, and these perform
quite well;
your only gain would be in calculating end-of-string, and that's assuming
the code doesn't already have such information available.
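For reference, a minimal Boyer-Moore-Horspool sketch (a common
simplification of Boyer-Moore); note that it wants both lengths up
front:

#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search: a minimal sketch.
   Returns a pointer to the first match, or NULL. */
const char *bmh_search(const char *hay, size_t hlen,
                       const char *needle, size_t nlen)
{
    size_t skip[256];
    size_t i, pos;

    if (nlen == 0 || hlen < nlen)
        return NULL;
    for (i = 0; i < 256; i++)
        skip[i] = nlen;                 /* default shift: whole needle */
    for (i = 0; i + 1 < nlen; i++)      /* all but the last character */
        skip[(unsigned char)needle[i]] = nlen - 1 - i;

    for (pos = 0; pos + nlen <= hlen;
         pos += skip[(unsigned char)hay[pos + nlen - 1]]) {
        if (memcmp(hay + pos, needle, nlen) == 0)
            return hay + pos;
    }
    return NULL;
}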
2) String comparison includes length checking which speeds up the
typical scenario of long prefix matching for different strings.

Assuming one doesn't already have such information available. If I'm
writing code to process thousands of strings, each potentially thousands
or even tens of thousands of characters long, I'm _already_ going to be
recording string lengths, and using modified algorithms which make use of
this information.
3) Operations such as insert and delete can be done in one pass.
4) And as long as we're talking about "clever tricks", any block-based
algorithms (such as is possible for things like "toupper" or "tolower"
or character scanning) become unavailable unless the length is known.

Yes, yes, again, some operations can be improved. Even most. However,
once again, the key concept was "all cases". It has not been demonstrated
that this mechanism would, in fact, improve all cases. The simple example
of strchr, for example, throws the notion into doubt.
 

Richard Harter

[snips]

It's not clear that length delimited strings are faster than
terminated strings in all cases. [...]

Actually it's clear to anyone who has measured it.

Really. Explain how such strings will improve the speed of strchr.

Using the conventional approach, each operation involves the following:

Dereference the pointer
Compare value against \0
If comparison fails, compare value against "needle"
If comparison fails, increment pointer and loop

With an integer-stored-length string variant, the operation is more like:
Compare index to length
if comparison fails, dereference pointer+index
Compare value against "needle"
If comparison fails, increment index and loop

One might argue that comparing the index to the length might be faster
than comparing the characters, but if so it should be pointed out that the
base+index computation may well be slower than the pointer dereference; it
is not clear that either is guaranteed to be a performance win on all
possible implementations.

This is a bad example; if you have a length count you can unroll the
loop and avoid most of the comparisons between length and index. This
is a standard optimization; many compilers will even do it for you. If
you are scanning a \0 terminated string loop unrolling is not available.
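A rough sketch of that unrolling (a factor of four, chosen
arbitrarily):

#include <stddef.h>

/* Counted scan unrolled by four: one index/length comparison
   covers four character tests, which is the optimization described. */
const char *find_counted_unrolled(const char *s, size_t len, char needle)
{
    size_t i = 0;

    while (len - i >= 4) {
        if (s[i] == needle)     return s + i;
        if (s[i + 1] == needle) return s + i + 1;
        if (s[i + 2] == needle) return s + i + 2;
        if (s[i + 3] == needle) return s + i + 3;
        i += 4;
    }
    while (i < len) {           /* leftover tail, at most three */
        if (s[i] == needle)
            return s + i;
        i++;
    }
    return NULL;
}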
 

Kelsey Bjarnason

[snips]

This is a bad example; if you have a length count you can unroll the
loop and avoid most of the comparisons between length and index.

You're still left with address and offset computation, or conversion to
pointer+incrementing. Any changes in speed one way or the other are
going to be at best minimal.
 

Chris Torek

... if you have a length count you can unroll the [strchr-or-equivalent]
loop and avoid most of the comparisons between length and index. This
is a standard optimization; many compilers will even do it for you. If
you are scanning a \0 terminated string loop unrolling is not available.

Well, yes, except that implementations can "cheat", and unroll
strchr() anyway, after first ensuring that the address is (say) 0
mod 4. On some architectures (e.g., the original Alpha or MIPS)
this is pretty much the only way to handle the loop.
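One common form of the "cheat", sketched below: once the pointer is
aligned, read a 32-bit word at a time and test all four bytes at once
with the well-known haszero bit trick (alignment and tail handling
omitted here):

#include <stdint.h>

/* Nonzero iff some byte of w is zero: a zero byte forces a borrow
   in (w - 0x01010101) whose high bit survives the & ~w mask; the
   classic folklore test, with no false positives. */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}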

The obvious way to create an artificial situation in which
counted-length-strings underperform zero-terminated-strings
is to create a lot of single-character strings in the first
version, and re-use the zero-terminator in the second:

loop {
    newstr = substring(original, pos, 1);
    if (compare_strings(newstr, looking_for) == match) ...
    release_string(newstr);
}

vs:

newstr[1] = '\0';
loop {
    newstr[0] = char_at(original, pos);
    if (compare_strings(newstr, looking_for) == match) ...
}

Of course, you can make the first one perform the same as the second
by doing character (instead of string) operations (on newstr[0]
instead of newstr) -- but in languages that have counted-length-strings
as built-in primitive types, one often finds programmers allocating
and releasing single-character "strings" inside inner loops. This
may be where some of the "anti-counted-length-strings" bias comes
from. (I believe another chunk of bias comes from implementations
that limit counted-length strings to 255 bytes maximum: clearly a
bad idea, yet it occurs over and over again.)
 

Richard Harter

... if you have a length count you can unroll the [strchr-or-equivalent]
loop and avoid most of the comparisons between length and index. This
is a standard optimization; many compilers will even do it for you. If
you are scanning a \0 terminated string loop unrolling is not available.

Well, yes, except that implementations can "cheat", and unroll
strchr() anyway, after first ensuring that the address is (say) 0
mod 4. On some architectures (e.g., the original Alpha or MIPS)
this is pretty much the only way to handle the loop.

Point conceded. To be fair though, once we are looking under the hood
at particular implementations, there are a lot of little tricks one can
use to juice performance in count delimited strings. The original
posting said in all instances; that has to be an overstatement.
The obvious way to create an artificial situation in which
counted-length-strings underperform zero-terminated-strings
is to create a lot of single-character strings in the first
version, and re-use the zero-terminator in the second:

loop {
    newstr = substring(original, pos, 1);
    if (compare_strings(newstr, looking_for) == match) ...
    release_string(newstr);
}

vs:

newstr[1] = '\0';
loop {
    newstr[0] = char_at(original, pos);
    if (compare_strings(newstr, looking_for) == match) ...
}

Of course, you can make the first one perform the same as the second
by doing character (instead of string) operations (on newstr[0]
instead of newstr) -- but in languages that have counted-length-strings
as built-in primitive types, one often finds programmers allocating
and releasing single-character "strings" inside inner loops.

How depressing.

It seems to me that for count-strings one wants a substring operation
that just points to a position in the original string. One
distinguishes between strings with modifiable content and those with
non-modifiable content. With that concept your first instance is simply

loop {
    newstr = substring(original, pos, 1);
    if (compare_strings(newstr, looking_for) == match) ...
}

and under the hood we have something like:

newstr.ptr = original.ptr + pos;
newstr.cnt = 1;

An advantage of count-strings is that you can refer to arbitrary
substrings of an original string. With terminated-strings you can only
refer to suffix substrings.
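A sketch of that view idea in C, reusing the ptr/cnt fields from the
fragment above (bounds checking omitted; the view is non-modifiable,
per the distinction mentioned):

#include <stddef.h>

/* A non-owning, non-modifiable view into an existing string:
   just a pointer into the original plus a count. */
struct strview {
    const char *ptr;
    size_t cnt;
};

/* O(1) substring: re-point, no copy, no allocation.
   Assumes pos + n <= s.cnt. */
struct strview substring(struct strview s, size_t pos, size_t n)
{
    struct strview v;

    v.ptr = s.ptr + pos;
    v.cnt = n;
    return v;
}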

This
may be where some of the "anti-counted-length-strings" bias comes
from. (I believe another chunk of bias comes from implementations
that limit counted-length strings to 255 bytes maximum: clearly a
bad idea, yet it occurs over and over again.)

There's a little issue involved. Commonly the count in a count-string
is packaged into the start of the string. When you do that, you're
stuck with a fixed format for the count (well, yes, you can wiggle
around it, but that has a cost). But if you don't package the count
with the string, then the count can be irretrievably lost. The
advantage of having a terminating character is that the length can't
be lost.
 

qed

Chris said:
... if you have a length count you can unroll the [strchr-or-equivalent]
loop and avoid most of the comparisons between length and index. This
is a standard optimization; many compilers will even do it for you. If
you are scanning a \0 terminated string loop unrolling is not available.

Well, yes, except that implementations can "cheat", and unroll
strchr() anyway, after first ensuring that the address is (say) 0
mod 4. On some architectures (e.g., the original Alpha or MIPS)
this is pretty much the only way to handle the loop.

Yes, but you still must perform an additional '\0' scan. It's still an
unroll, but not a very effective one.
The obvious way to create an artificial situation in which
counted-length-strings underperfom zero-terminated-strings
is to create a lot of single-character strings in the first
version, and re-use the zero-terminator in the second:

loop {
    newstr = substring(original, pos, 1);
    if (compare_strings(newstr, looking_for) == match) ...
    release_string(newstr);
}

vs:

newstr[1] = '\0';
loop {
    newstr[0] = char_at(original, pos);
    if (compare_strings(newstr, looking_for) == match) ...
}

Ok, but this is clearly nonsense. You obviously would write:

if (1 == looking_for->slen) {
    lffc = looking_for->data[0];
    do {
        if (original->data[pos] == lffc) ...
        ...
    } ...
}


Of course, you can make the first one perform the same as the second
by doing character (instead of string) operations (on newstr[0]
instead of newstr) -- but in languages that have counted-length-strings
as built-in primitive types, one often finds programmers allocating
and releasing single-character "strings" inside inner loops.

Ok, well this is comp.lang.c. You are conflating the fast single
character primitives with '\0' termination as if they only went with
each other.
[...] This
may be where some of the "anti-counted-length-strings" bias comes
from. (I believe another chunk of bias comes from implementations
that limit counted-length strings to 255 bytes maximum: clearly a
bad idea, yet it occurs over and over again.)

So the bias just comes from pure nonsense? This is a bias against other
languages, not against length delimited strings.
 

David Thompson

On Wed, 18 Apr 2007 22:23:52 +0800, Clem Clarke wrote:
And for those people that suggest that C does have a string type, I
would suggest that it doesn't, compared with Pascal, PL/I, COBOL or even
Assembler. It has an array of characters, which is quite different from
a string.
COBOL, FORTRAN, and PL/I have a true(?) string type, i.e., distinct
from array. APL and Pascal and Ada in the basic case do use 'array of
char', but they all have (unlike C) safe arrays (of all types), and
APL and Ada have additional features for arrays (of all types) some of
which are particularly useful for treating array of char as string.
Pascal, except for a few library functions, is rather a pain for doing
string handling, which was probably one of the reasons for the
underwhelmingness of its success. And none of them treat char(acter)
as just a small integer/number, as C does.

Most assembly languages (there are many that used the simple and
obvious name assembler) support string data (AFAIK always just as
bytes) and whatever instructions the underlying machine has for
operating on them -- which typically also just deal with bytes. As a
simple and basic example, S/360 MVC can as easily do the right size or
the wrong size,and IIRC BAL *defaults* to the declared (presumably
right) size but can be overridden.
For example, in PL/I, one could write:

Dcl String10 char(10) varying;
Dcl String20 char(20) varying init
('Long 20 Character XX');


String10-string20;

This will automatically truncate the result, rather than copy the 20
bytes and overwrite adjacent storage, as in C.
Assuming you meant string10 = string20, and compared to strcpy in C,
since plain assignment won't even compile, and some other kinds of
copy like strlcpy won't have the problem.
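For comparison, the C-side way to get the same truncate-rather-than-
overflow behavior; strlcpy is a BSD extension rather than ISO C, so
here is a hand-rolled stand-in (a sketch):

#include <stddef.h>
#include <string.h>

/* Copy src into dst, truncating to fit size bytes of storage and
   always '\0'-terminating (when size > 0). Returns strlen(src) so
   the caller can detect truncation, like BSD strlcpy. */
size_t my_strlcpy(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);

    if (size > 0) {
        size_t n = (srclen < size - 1) ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}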

And assuming STRINGSIZE is not enabled with an applicable handler that
forces something different -- but even then it would still be safe.

<snip rest>
- formerly david.thompson1 || achar(64) || worldnet.att.net
 
