Read only last line-


Jordan Abel

Jordan Abel wrote On 02/22/06 14:37,:

Keep in mind that we're speaking of text streams, where
the number of characters written to a stream need not be the
same as the number of bytes written to the file. A familiar
example is putc('\n', stream) on Windows, where one character
generates two bytes. There are also systems where writing a
newline produces no bytes in the file, systems where a file
contains both data bytes and metadata bytes, and systems that
use state-dependent encodings for extended character sets.
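
A minimal sketch of the character-versus-byte distinction, assuming a hosted implementation with ordinary file I/O. The helper name bytes_stored_for_lines and the choice of "x\n" as the line content are illustrative only. It writes through a text stream, then counts what actually landed in the file by reading it back through a binary stream.

```c
#include <stdio.h>

/* Writes `lines` copies of "x\n" to `path` through a text stream, then
   reopens the file as a binary stream and counts the bytes stored.
   Returns the byte count, or -1 on any I/O error. */
long bytes_stored_for_lines(const char *path, int lines)
{
    FILE *fp = fopen(path, "w");           /* text stream */
    if (fp == NULL)
        return -1;
    for (int i = 0; i < lines; i++) {
        if (fputs("x\n", fp) == EOF) {     /* two characters per line */
            fclose(fp);
            return -1;
        }
    }
    if (fclose(fp) == EOF)
        return -1;

    fp = fopen(path, "rb");                /* binary stream */
    if (fp == NULL)
        return -1;
    long count = 0;
    while (getc(fp) != EOF)
        count++;
    int failed = ferror(fp);
    fclose(fp);
    return failed ? -1 : count;
}
```

For three lines the program wrote six characters. A system that stores a newline as a single byte should report 6, and one that stores CR-LF should report 9.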

If you're dealing with something that might be a state-dependent
encoding, you should probably be using fgetpos and fsetpos
exclusively.
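
A short sketch of the fgetpos/fsetpos style, for illustration. The helper name peek_line is made up here. The position is treated as an opaque bookmark: it is saved, the stream is read, and the bookmark is restored, with no arithmetic on the position at any point.

```c
#include <stdio.h>

/* Reads the next line into buf without consuming it: the stream position
   is saved with fgetpos and restored with fsetpos.
   Returns 0 on success, -1 on failure or end of file. */
int peek_line(FILE *fp, char *buf, int size)
{
    fpos_t mark;                      /* opaque; may carry shift state */

    if (fgetpos(fp, &mark) != 0)
        return -1;
    if (fgets(buf, size, fp) == NULL)
        return -1;
    if (fsetpos(fp, &mark) != 0)
        return -1;
    return 0;
}
```

Because fpos_t need not be an arithmetic type, the only portable operations on mark are storing it and handing it back to fsetpos on the same stream.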
It's not so much a problem of U.B., but of failure that
doesn't produce a reliable indication. Seek to a position that
happens to be in the middle of a multi-byte character or in the
middle of a stretch of metadata, and the problem may be difficult
to detect: a byte in a file does not always stand alone, but may
require prior context (at an arbitrary separation) for proper
interpretation. Here's the stuff of a nightmare or two: Imagine
opening a stream for update, seeking to the middle of a stretch of
metadata, successfully writing "Hello, world!" there, and only
later discovering that the successful write has corrupted the file
structure and made the entire tail end unreadable ...

An implementation may silently force a file opened in update mode to
be a binary stream. An implementation that has such issues probably
should do so. (It would be nice if some way were provided for the
program to detect this, but unfortunately there does not seem to be one.)
It would be nice if one could do meaningful arithmetic on file
position indicators in text streams, but given the rich variety of
file encodings that exist in the world it is not always possible
to do so.

There is a difference between "not meaningful" and "undefined" - I
am entirely opposed to the dilution of the term "undefined behavior"
in this newsgroup.

I think that the implementation should detect all those issues and
treat them as "a request that cannot be satisfied", and return a
value indicating failure. I think there is a reading of the standard
which supports this view.
 

Eric Sosman

Jordan Abel wrote On 02/22/06 15:54,:
If you're dealing with something that might be a state-dependent
encoding, you should probably be using fgetpos and fsetpos
exclusively.

Right. And this means you can't do arithmetic on the
file positions of a text stream, because fpos_t need not
be an arithmetic type.
It's not so much a problem of U.B., [...]

There is a difference between "not meaningful" and "undefined" - I
am entirely opposed to the dilution of the term "undefined behavior"
in this newsgroup.

We seem to be in violent agreement.
I think that the implementation should detect all those issues and
treat them as "a request that cannot be satisfied", and return a
value indicating failure. I think there is a reading of the standard
which supports this view.

I don't see how the issues can be detected, not with
any pretense of efficiency. One could get reliable detection
by implementing fseek() as read-and-count, perhaps preceded
by rewind(), but the result would be horrible. True, the
Standard doesn't promise efficiency, and an fseek() that
behaved this way would satisfy the letter of the Standard's
law. Equally, an fseek() that returned -1 unconditionally
would meet the letter of the law; so would a malloc() that
always returned NULL, a time() that always returned -1, and
a rand() that always returned 42. "It's just a quality of
implementation concern," but it's folly to ignore QoI.
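
For concreteness, a sketch of the read-and-count approach described above. The name seek_by_counting is hypothetical, and this is only one way such a seek might be written, not how any real library implements fseek().

```c
#include <stdio.h>

/* Positions fp so that the next read delivers the character at offset
   `target`, counted in characters as the program sees them.
   Returns 0 on success, -1 if target is negative or past the end. */
int seek_by_counting(FILE *fp, long target)
{
    if (target < 0)
        return -1;
    rewind(fp);                        /* start from a known position */
    for (long i = 0; i < target; i++) {
        if (getc(fp) == EOF)           /* ran out of characters */
            return -1;
    }
    return 0;
}
```

Detection is reliable because every character is decoded by the stream itself, but each call costs time proportional to target, which is the efficiency problem in question.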
 

Mark McIntyre

Again, failure is not the same as UB. What is a specific case that you
think invokes UB?

You get a result saying you can write to byte 23456, but by the time
you try, the file no longer contains any bytes at that location. Or
some other thread has written to them already and locked them. In
such circumstances, Paul's variants on the standard functions are
better in that they probably avoid UB, but still not reliable.

Mark McIntyre
 

websnarf

Eric said:
Jordan Abel wrote On 02/22/06 14:37,:

Well in my proposal the error return is specifically -1. I hadn't
considered file streams like stdin and stdout where clearly you can't
fseek, but obviously they would just return with -1 -- certainly not
UB.
[...] It is arguable that this also applies to a call of fseek on a text
stream with a value that does not correspond to a position in the file
which ftell might have returned.

My proposal is for two new functions fseekB and ftellB which are not
ftell or fseek compatible.
Keep in mind that we're speaking of text streams, where
the number of characters written to a stream need not be the
same as the number of bytes written to the file. A familiar
example is putc('\n', stream) on Windows, where one character
generates two bytes.

So what? If you read that back on Windows, you also get just one
character. What does this mean? It means that it has to count as 1
character (so long as you read the file in text mode.) It doesn't
count *underlying byte representation*, it counts offset in the units
of "characters" or whatever it is that is being written to the file.
[...] There are also systems where writing a
newline produces no bytes in the file, systems where a file
contains both data bytes and metadata bytes, and systems that
use state-dependent encodings for extended character sets.

Underlying file system details do not affect what I have specified. If
you put the contents of a file into an array, then that specifies an
offset to data mapping. That's the mapping you have to support. It's
not impossible, and it's not even very hard, provided your system
supports faithful read-write turnaround and fgetpos/fsetpos.
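
One way such an offset-to-position mapping might be built on top of fgetpos/fsetpos is sketched below. Everything here (struct char_index, index_build, index_seek, the STRIDE value) is hypothetical and not part of the proposal as posted. The idea is to record an fpos_t checkpoint every STRIDE delivered characters in one pass, then seek by jumping to the nearest checkpoint and reading forward.

```c
#include <stdio.h>
#include <stdlib.h>

#define STRIDE 4096L   /* characters between checkpoints; arbitrary choice */

struct char_index {
    fpos_t *marks;     /* marks[k] is the position of character k * STRIDE */
    long    count;     /* number of checkpoints recorded */
    long    length;    /* total characters delivered by the stream */
};

/* One full pass over the stream, recording a checkpoint every STRIDE
   characters. Returns 0 on success, -1 on failure. */
int index_build(struct char_index *ix, FILE *fp)
{
    long cap = 0;

    ix->marks = NULL;
    ix->count = 0;
    ix->length = 0;
    rewind(fp);
    for (;;) {
        if (ix->length % STRIDE == 0) {
            if (ix->count == cap) {            /* grow the table */
                long ncap = cap ? cap * 2 : 16;
                fpos_t *p = realloc(ix->marks, (size_t)ncap * sizeof *p);
                if (p == NULL)
                    goto fail;
                ix->marks = p;
                cap = ncap;
            }
            if (fgetpos(fp, &ix->marks[ix->count]) != 0)
                goto fail;
            ix->count++;
        }
        if (getc(fp) == EOF)
            break;
        ix->length++;
    }
    if (ferror(fp))
        goto fail;
    return 0;
fail:
    free(ix->marks);
    ix->marks = NULL;
    ix->count = 0;
    return -1;
}

/* Seeks to character offset `target`: jump to the checkpoint at or
   before it, then read forward at most STRIDE - 1 characters. */
int index_seek(const struct char_index *ix, FILE *fp, long target)
{
    if (target < 0 || target > ix->length)
        return -1;
    long k = target / STRIDE;
    if (fsetpos(fp, &ix->marks[k]) != 0)
        return -1;
    for (long i = k * STRIDE; i < target; i++) {
        if (getc(fp) == EOF)
            return -1;
    }
    return 0;
}
```

The sketch assumes the file does not change after the index is built. Any write that alters the length of the encoded data would invalidate every later checkpoint, and building the index still costs a full read of the file.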
It's not so much a problem of U.B., but of failure that
doesn't produce a reliable indication. Seek to a position
that happens to be in the middle of a multi-byte character
or in the middle of a stretch of metadata,

How does that happen for a file opened in text mode?
[...] and the problem
may be difficult to detect: a byte in a file does not always
stand alone, but may require prior context (at an arbitrary
separation) for proper interpretation. Here's the stuff of
a nightmare or two: Imagine opening a stream for update,
seeking to the middle of a stretch of metadata, successfully
writing "Hello, world!" there, and only later discovering
that the successful write has corrupted the file structure
and made the entire tail end unreadable ...

Well explain to me how that happens -- remember I am mapping from
offsets of the original data, as if it were all coming from an array to
positions in the underlying file (that we know *exists* because of the
existence of fgetpos, fsetpos functions). So what bad thing is
supposed to happen?
It would be nice if one could do meaningful arithmetic on
file position indicators in text streams,

You mean it's nice to know that it is well defined and possible. (You
need a good definition of intmax_t, of course.)
[...] but given the rich
variety of file encodings that exist in the world it is not
always possible to do so.

It might be slow, but it's always possible.
[...] The C Standard recognizes this
difficulty, and so does not attempt to guarantee that seeking
to arbitrary positions in text files will work as desired.

Even though it presents an API that clearly implies that it does.
The Standard is cognizant of imperfections in reality, and
does not insist that reality rearrange itself for the Standard's
convenience.

If that were a true and complete description of the standard that would
at least be a defensible and credible stance. But it's not. If they
took this stance, ftell() and fseek() would be gone, since
fgetpos/fsetpos already gives you the weaker semantics.
 

Keith Thompson

Eric Sosman wrote: [...]
[...] It is arguable that this also applies to a call of fseek on a text
stream with a value that does not correspond to a position in the file
which ftell might have returned.

My proposal is for two new functions fseekB and ftellB which are not
ftell or fseek compatible. [...]
Keep in mind that we're speaking of text streams, where
the number of characters written to a stream need not be the
same as the number of bytes written to the file. A familiar
example is putc('\n', stream) on Windows, where one character
generates two bytes.

So what? If you read that back on Windows, you also get just one
character. What does this mean? It means that it has to count as 1
character (so long as you read the file in text mode.) It doesn't
count *underlying byte representation*, it counts offset in the units
of "characters" or whatever it is that is being written to the file.
[...]

So something like
fseekB(some_file, 100000, SEEK_SET);
would, on some systems, actually have to read 1 million characters
from the file to find the proper position. On Windows, where an
end-of-line is represented in a text file as a CR-LF pair, there would
be no other way to find the 1 millionth character of the file
(counting each CR-LF pair as one character). On Unix, on the other hand,
it would simply be equivalent to
fseek(some_file, 1000000, SEEK_SET);
and would be much faster.

This might be conceptually cleaner than the existing fseek/ftell
interface, but I'm not convinced that it would be useful.
 

Eric Sosman

Keith Thompson wrote On 02/27/06 14:44,:
[replacements for fseek/ftell that count "delivered
characters" instead of "recorded bytes"]

[...]

So something like
fseekB(some_file, 100000, SEEK_SET);

Missing a zero, I think.
would, on some systems, actually have to read 1 million characters
from the file to find the proper position. On Windows, where an
end-of-line is represented in a text file as a CR-LF pair, there would
be no other way to find the 1 millionth character of the file
(counting each CR-LF pair as one character). On Unix, on the other hand,
it would simply be equivalent to
fseek(some_file, 1000000, SEEK_SET);
and would be much faster.

Not even Unix can do this efficiently in the presence
of variable-length or state-dependent character encodings.
 

Keith Thompson

Eric Sosman said:
Keith Thompson wrote On 02/27/06 14:44,:
[replacements for fseek/ftell that count "delivered
characters" instead of "recorded bytes"]

[...]

So something like
fseekB(some_file, 100000, SEEK_SET);

Missing a zero, I think.
Yes.
would, on some systems, actually have to read 1 million characters
from the file to find the proper position. On Windows, where an
end-of-line is represented in a text file as a CR-LF pair, there would
be no other way to find the 1 millionth character of the file
(counting each CR-LF pair as one character). On Unix, on the other hand,
it would simply be equivalent to
fseek(some_file, 1000000, SEEK_SET);
and would be much faster.

Not even Unix can do this efficiently in the presence
of variable-length or state-dependent character encodings.

Ok, but it can do so in their absence. (I suppose it's
locale-dependent?)
 

Walter Roberson

Eric Sosman said:
True, the
Standard doesn't promise efficiency, and an fseek() that
behaved this way would satisfy the letter of the Standard's
law. Equally, an fseek() that returned -1 unconditionally
would meet the letter of the law; so would a malloc() that
always returned NULL, a time() that always returned -1, and
a rand() that always returned 42.

In an implementation that rand() always returned 42, then
RAND_MAX would be 42, but C89 requires RAND_MAX to be at
least 32767.

Now, if rand() always returned 32767, then -that- might be within
the letter of the standard ;-)

Let's see, how perverse could one get...? How about:
rand() returns 0 continually upon srand(0),
rand() returns RAND_MAX continually upon srand(RAND_MAX),
rand() returns 42 continually otherwise (including the
default case srand(1))
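
Spelled out as code, that perverse generator might look like the sketch below. The names perverse_rand and perverse_srand are used only to avoid redefining the library's own rand() and srand().

```c
#include <stdlib.h>   /* RAND_MAX */

/* Starts at 1 because rand() must behave as if srand(1) had been called. */
static unsigned int perverse_seed = 1;

void perverse_srand(unsigned int seed)
{
    perverse_seed = seed;
}

/* Returns the same value forever; which value depends only on the seed. */
int perverse_rand(void)
{
    if (perverse_seed == 0)
        return 0;
    if (perverse_seed == (unsigned int)RAND_MAX)
        return RAND_MAX;
    return 42;
}
```

The same seed always reproduces the same sequence, which is the one property of srand() that the Standard spells out.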
 

Jordan Abel

In an implementation that rand() always returned 42, then
RAND_MAX would be 42, but C89 requires RAND_MAX to be at
least 32767.

I don't think it's required that rand() ever return RAND_MAX.
 

Keith Thompson

Jordan Abel said:
I don't think it's required that rand() ever return RAND_MAX.

The statement in the standard is:

The rand function computes a sequence of pseudo-random integers in
the range 0 to RAND_MAX.

Whether a rand() implementation that never returns RAND_MAX would be
conforming is a question I'm not going to try to answer.
 

Eric Sosman

Keith Thompson wrote On 02/27/06 17:01,:
The statement in the standard is:

The rand function computes a sequence of pseudo-random integers in
the range 0 to RAND_MAX.

Whether a rand() implementation that never returns RAND_MAX would be
conforming is a question I'm not going to try to answer.

Could a conforming program prove that rand() is unable
to return RAND_MAX? The number of samples required is a
function of the number of bits in rand()'s internal state,
and the Standard does not document that number.
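
To make the difficulty concrete, here is a sketch of the only test a program can really run: draw samples and watch for RAND_MAX. The function name draws_until_rand_max and the idea of a caller-supplied limit are illustrative assumptions.

```c
#include <stdlib.h>

/* Calls rand() up to `limit` times. Returns the number of draws made
   when RAND_MAX first appears, or -1 if it was not seen within the limit. */
long draws_until_rand_max(long limit)
{
    for (long i = 1; i <= limit; i++) {
        if (rand() == RAND_MAX)
            return i;
    }
    return -1;
}
```

A positive result settles the question for that seed. A result of -1 proves nothing, because without knowing the size of the generator's internal state there is no limit large enough to rule out a later appearance.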
 

Keith Thompson

Eric Sosman said:
Keith Thompson wrote On 02/27/06 17:01,:

Could a conforming program prove that rand() is unable
to return RAND_MAX? The number of samples required is a
function of the number of bits in rand()'s internal state,
and the Standard does not document that number.

I'm guessing you meant "strictly conforming"; a "conforming program"
can do just about anything, since it's free to use extensions.

A rand() implementation that always repeatedly returns the same number
(perhaps a different number depending on the seed) could conceivably
be truly pseudo-random, but very unlucky. Even a truly random
sequence could contain the same number repeated an arbitrary number of
times, and there's no set number of repetitions that can prove that
it's non-random. It is possible to discuss the probability that a
given non-random-appearing sequence could have been generated, and
compare that to, say, the probability that the programmer who wrote
the rand() function forgot to update the seed.
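
To put a rough number on that comparison, assume (hypothetically) an ideal generator whose draws are independent and uniform over the RAND_MAX + 1 possible values. The first draw can be anything, and each later draw matches it with probability 1/(RAND_MAX + 1), so the chance that n consecutive draws are all equal is:

```latex
P(x_1 = x_2 = \dots = x_n) = \left( \frac{1}{\mathrm{RAND\_MAX} + 1} \right)^{n-1}
```

With the minimum permitted RAND_MAX of 32767 and n = 10, this is (2^{-15})^{9} = 2^{-135}: not zero, but far less likely than a forgotten seed update.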
 

CBFalconer

Walter said:
In an implementation that rand() always returned 42, then
RAND_MAX would be 42, but C89 requires RAND_MAX to be at
least 32767.

Now, if rand() always returned 32767, then -that- might be within
the letter of the standard ;-)

Let's see, how perverse could one get...? How about:
rand() returns 0 continually upon srand(0),
rand() returns RAND_MAX continually upon srand(RAND_MAX),
rand() returns 42 continually otherwise (including the
default case srand(1))

I propose:

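int rand(void) {
    static int rn;   /* starts at 0, so the first branch is never taken */

    if (rn) return rn = 32767;
    else return rn = 0;
}

void srand(unsigned int seed) {   /* the Standard's prototype takes unsigned int */
    if (seed) {
        if (rand()) rand();
        rand();
    }
    else if (rand()) rand();
}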

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>
 

Eric Sosman

Keith Thompson wrote On 02/27/06 18:10,:
I'm guessing you meant "strictly conforming"; a "conforming program"
can do just about anything, since it's free to use extensions.

On the other hand, a strictly conforming program cannot
permit its output to be influenced by "unspecified, undefined,
or implementation-defined behavior." For example, an S.C.
program cannot output a value obtained from rand(), nor the
number of samples it took before rand() returned RAND_MAX.

... which leads to an odd state of affairs: If every
rand() implementation is required to return RAND_MAX sooner
or later, an S.C. program can announce that its own rand()
eventually did so (of course, it must not exhibit the actual
value of RAND_MAX, which is implementation-defined). But
if it is permissible for a conforming implementation's rand()
to omit RAND_MAX, an S.C. program cannot output the result
of the test! For then its output ("Saw RAND_MAX" or "No
RAND_MAX") would depend on the particular rand(), which is
implementation-defined.

In other words, the test program can only be strictly
conforming if the test is unnecessary to begin with!
 

websnarf

Keith said:
Eric Sosman wrote: [...]
[...] It is arguable that this also applies to a call of fseek on a text
stream with a value that does not correspond to a position in the file
which ftell might have returned.

My proposal is for two new functions fseekB and ftellB which are not
ftell or fseek compatible. [...]
Keep in mind that we're speaking of text streams, where
the number of characters written to a stream need not be the
same as the number of bytes written to the file. A familiar
example is putc('\n', stream) on Windows, where one character
generates two bytes.

So what? If you read that back on Windows, you also get just one
character. What does this mean? It means that it has to count as 1
character (so long as you read the file in text mode.) It doesn't
count *underlying byte representation*, it counts offset in the units
of "characters" or whatever it is that is being written to the file.
[...]

So something like
fseekB(some_file, 100000, SEEK_SET);
would, on some systems, actually have to read 1 million characters
from the file to find the proper position.

On some systems fseek(some_file, 100000, SEEK_SET) already *DOES* have
to read 1 million characters.
[...] On Windows, where an
end-of-line is represented in a text file as a CR-LF pair, there would
be no other way to find the 1 millionth character of the file
(counting each CR-LF pair as one character).

That's not exactly true, but it is going to be slow. I know this
because WATCOM C/C++ on Windows 98 actually *DOES* suffer from this
performance hit.
[...] On Unix, on the other hand,
it would simply be equivalent to
fseek(some_file, 1000000, SEEK_SET);
and would be much faster.

This might be conceptually cleaner than the existing fseek/ftell
interface, but I'm not convinced that it would be useful.

Well I don't know what universe or era of software development you live
in. Look at a program like BitTorrent. BitTorrent downloads files in
blocks that appear in *necessarily* random order. They have no control
or choice over that. How does it work without a functioning fseek()
(with fseekB semantics)? It opens the file in binary mode, so these
issues make it somewhat moot, but you cannot say that this is not a
useful function.

Think of a text editor. Many text editors today actually show
offset positions for the characters you type. It seems to me that
there are many opportunities to use fseekB-like functionality to manage
file IO.

So I don't think it should be up to *you* whether or not it's useful.
And the fact is, without these additional semantics, fseek and ftell
as we know them today are *NOT* useful, because fgetpos/fsetpos
exist with all the semantics of those functions without the false
implication of ordinary long int semantics (i.e., the ability to add
to and subtract from them).
 

S.Tobias

Keith Thompson said:
The statement in the standard is:

The rand function computes a sequence of pseudo-random integers in
the range 0 to RAND_MAX.

Whether a rand() implementation that never returns RAND_MAX would be
conforming is a question I'm not going to try to answer.
This is a strange interpretation. I'd assume the Std means that
the integers are guaranteed to be within the [0;RAND_MAX] range,
not that the whole range is guaranteed to be _potentially_ covered.
Since the Std doesn't define the generator, I see no point in
guaranteeing the latter. (I have nothing to support this, just
common sense.)
 
