Text mode fseek/ftell


Kenneth Brody

I recently ran into an "issue" related to text files and ftell/fseek,
and I'd like to know if it's a bug, or simply an annoying, but still
conforming, implementation.

The platform is Windows, where text files use CR+LF (0x0d, 0x0a) to
mark end-of-line. The file in question, however, was in Unix format,
with only LF (0x0a) at the end of each line.

First, does the above situation already invoke "implementation defined"
or "undefined" behavior? Or is it still "defined"?

The problem comes in how ftell() reports the current position. (And,
subsequently fseek()ing back to the same position is wrong.)

Suppose that you have fread() the following 12 characters, starting at
the beginning of the file:

'1' '2' '3' '4' '5' 0x0a '1' '2' '3' '4' '5' 0x0a

(Remember, this file is in Unix format, with a single 0x0a for end-of-
line.)

While you are now at offset 12 within the file, ftell() will return 14,
because it assumes that those '\n' newlines are really CR+LF, and that
the CR was stripped off when read. (Had this file been in Windows format,
you really would be at offset 14 after reading those 12 characters.) For
each 0x0a returned by fread(), ftell() will assume you have advanced two
characters in the file.

The net result here is that a subsequent fseek() to the same position
will be wrong.


So, have I invoked undefined behavior by reading a Unix text file in a
Windows environment? Or is the compiler allowed to return the "wrong"
value as part of an "implementation defined" restriction? Or is this
a bug in the compiler's runtime library?

--
+-------------------------+--------------------+-----------------------------+
| Kenneth J. Brody | www.hvcomputer.com | |
| kenbrody/at\spamcop.net | www.fptech.com | #include <std_disclaimer.h> |
+-------------------------+--------------------+-----------------------------+
 

Ben Bacarisse

I recently ran into an "issue" related to text files and ftell/fseek,
and I'd like to know if it's a bug, or simply an annoying, but still
conforming, implementation.

The platform is Windows, where text files use CR+LF (0x0d, 0x0a) to
mark end-of-line. The file in question, however, was in Unix format,
with only LF (0x0a) at the end of each line.

First, does the above situation already invoke "implementation defined"
or "undefined" behavior? Or is it still "defined"?

No, you should be OK. You will be able to do more with fseek if you
open the file as a binary file, but opening it as text should also work if
you keep to the restrictions imposed by the standard (see later).

The problem comes in how ftell() reports the current position. (And,
subsequently fseek()ing back to the same position is wrong.)

Suppose that you have fread() the following 12 characters, starting at
the beginning of the file:

'1' '2' '3' '4' '5' 0x0a '1' '2' '3' '4' '5' 0x0a

(Remember, this file is in Unix format, with a single 0x0a for end-of-
line.)

While you are now at offset 12 within the file, ftell() will return 14,
because it assumes that those '\n' newlines are really CR+LF, and that
the CR was stripped off when read. (Had this file been in Windows
format, you really would be at offset 14 after reading those 12
characters.) For each 0x0a returned by fread(), ftell() will assume you
have advanced two characters in the file.

Actually, you can't say anything about the numbers. For a text file,
ftell does not give you the offset. It returns a code that can only be
used by fseek. You may be right about how your implementation is encoding
the data but you get a clearer understanding of the restrictions imposed
by the standard if you take it at face value -- ftell returns something
you can do nothing with except pass it to fseek.

The net result here is that a subsequent fseek() to the same position
will be wrong.

The standard allows one to fseek using *only* SEEK_SET and the result of a
previous call to ftell (or an offset of 0). If that is all you have done,
and you did not get back to where you expected, then it would seem that
you have a non-compliant library.

If you used your own idea of the stream position (not the result from
ftell) or you used SEEK_END or SEEK_CUR then all bets are off.

So, have I invoked undefined behavior by reading a Unix text file in a
Windows environment? Or is the compiler allowed to return the "wrong"
value as part of an "implementation defined" restriction? Or is this
a bug in the compiler's runtime library?

An example program with what you expect and what happens might make
everything clearer.
 

Jack Klein

I recently ran into an "issue" related to text files and ftell/fseek,
and I'd like to know if it's a bug, or simply an annoying, but still
conforming, implementation.

The platform is Windows, where text files use CR+LF (0x0d, 0x0a) to
mark end-of-line. The file in question, however, was in Unix format,
with only LF (0x0a) at the end of each line.

First, does the above situation already invoke "implementation defined"
or "undefined" behavior? Or is it still "defined"?

The problem comes in how ftell() reports the current position. (And,
subsequently fseek()ing back to the same position is wrong.)

Suppose that you have fread() the following 12 characters, starting at
the beginning of the file:

'1' '2' '3' '4' '5' 0x0a '1' '2' '3' '4' '5' 0x0a

(Remember, this file is in Unix format, with a single 0x0a for end-of-
line.)

While you are now at offset 12 within the file, ftell() will return 14,
because it assumes that those '\n' newlines are really CR+LF, and that
the CR was stripped off when read. (Had this file been in Windows format,
you really would be at offset 14 after reading those 12 characters.) For
each 0x0a returned by fread(), ftell() will assume you have advanced two
characters in the file.

The net result here is that a subsequent fseek() to the same position
will be wrong.


So, have I invoked undefined behavior by reading a Unix text file in a
Windows environment? Or is the compiler allowed to return the "wrong"
value as part of an "implementation defined" restriction? Or is this
a bug in the compiler's runtime library?

In addition to the issues Ben correctly pointed out about fseek() and
ftell() limitations, you left out one important piece of information:
did you open the file in text or binary mode?

If you open a file in text mode, and it does not actually contain the
format for text files on your platform, you are lying to your compiler
and its library functions. If you lie to your compiler, it will get
its revenge.
 

Ben Bacarisse

In addition to the issues Ben correctly pointed out about fseek() and
ftell() limitations, you left out one important piece of information:
did you open the file in text or binary mode?

If you open a file in text mode, and it does not actually contain the
format for text files on your platform, you are lying to your compiler
and its library functions. If you lie to your compiler, it will get
its revenge.

Ah. Had I known this I would have simplified my answer to "open it in
binary mode" since from the translations that the OP reports one can tell
that the file is opened as text. I had always assumed that the library
had to keep track of what it had been doing with line endings (both
native and foreign) in order to keep its ftell/fseek promise.

It seems to me that since you can only seek to somewhere you have been
before (by reading) that it would have been possible for the standard to
give the stronger guarantee: that foreign-format text files can be
manipulated just like native ones. If this was not done, why? Is it more
complicated than I imagine?
 

Jordan Abel

Ah. Had I known this I would have simplified my answer to "open it in
binary mode" since from the translations that the OP reports one can tell
that the file is opened as text. I had always assumed that the library
had to keep track of what it had been doing with line endings (both
native and foreign) in order to keep its ftell/fseek promise.

It seems to me that since you can only seek to somewhere you have been
before (by reading) that it would have been possible for the standard to
give the stronger guarantee: that foreign-format text files can be
manipulated just like native ones. If this was not done, why? Is it more
complicated than I imagine?

Because "different end-of-line type" is not the only difference between
real text formats that really exist.
 

Ben Bacarisse

Because "different end-of-line type" is not the only difference between
real text formats that really exist.

Yes, I get that. I was going to go on and ask how weird could the stream
to file mapping get so that the standard decided not to insist on its
limited ftell/fseek behaviour, but I can see that that would be an
unproductive speculation. I suspect the answer might be that, since it
could never work for certain wide-oriented streams, why bother adding any
further burden on implementations at all? The standard has
fgetpos/fsetpos that can cope with all streams.

While on this topic, I'd like to ask something else. Is fgetpos/fsetpos
a solution for the OP or is there always a problem if a file that is not
in (one of) the host system's text format(s) is opened as a text stream? I
can't find anything in the standard about what sorts of file can
legitimately be opened as text. I had always thought that any file could
be opened in text mode -- you might simply get a possibly inappropriate
file to stream mapping but fgetpos/fsetpos would work despite the "lie".
 

SM Ryan

# I recently ran into an "issue" related to text files and ftell/fseek,
# and I'd like to know if it's a bug, or simply an annoying, but still
# conforming, implementation.
#
# The platform is Windows, where text files use CR+LF (0x0d, 0x0a) to
# mark end-of-line. The file in question, however, was in Unix format,
# with only LF (0x0a) at the end of each line.
#
# First, does the above situation already invoke "implementation defined"
# or "undefined" behavior? Or is it still "defined"?

If you open the file in binary mode "b", it should deliver each
character in the file as is and ftell/fseek use character offsets.

If you open the file in text mode, it may reinterpret some characters
as line boundaries and return converted characters instead of the
characters actually in the file. ftell returns a magic cookie which
is only defined to be sensible to fseek; some systems define the
cookie (such as a character offset) so that you can also make sense
of it, but that is an implementation feature.

Note that on Unix, text and binary mode are the same. On other systems,
text mode usually emulates non-fseeking text mode on Unix, and binary
mode usually handles the system-specific format.
 

Kenneth Brody

Ben said:
On Fri, 31 Mar 2006 10:21:47 -0500, Kenneth Brody wrote:
[... Opening Unix-style text file as "text" under Windows ...]
No, you should be OK. You will be able to do more with fseek if you
open the file as a binary file, but opening it as text should also work if
you keep to the restrictions imposed by the standard (see later).

Okay.


Actually, you can't say anything about the numbers. For a text file,
ftell does not give you the offset. It returns a code that can only be
used by fseek. You may be right about how your implementation is encoding
the data but you get a clearer understanding of the restrictions imposed
by the standard if you take it at face value -- ftell returns something
you can do nothing with except pass it to fseek.

Well, the "code" it returns happens to be (what it thinks is) the offset.
(But this isn't really relevant to the issue at hand. It's just a way to
see the issue better.) However, see the next sentence of my original post.

The standard allows one to fseek using *only* SEEK_SET and the result of a
previous call to ftell (or an offset of 0). If that is all you have done,
and you did not get back to where you expected, then it would seem that
you have a non-compliant library.

If you used your own idea of the stream position (not the result from
ftell) or you used SEEK_END or SEEK_CUR then all bets are off.

I use ftell() to save the current position, and later return to it
via an fseek() with SEEK_SET. If the file is in Unix format (with
only 0x0a, rather than the Windows 0x0d/0x0a), then the fseek()
does not go to the same position as when the ftell() was called.

An example program with what you expect and what happens might make
everything clearer.

===============
#include <stdio.h>
#include <sys/types.h>

char mybuf[256];

int main(int argc,char *argv[])
{
    int i;
    FILE *f = fopen("file.txt","r");
    off_t savepos;

    if ( f == NULL )
    {
        perror("fopen");
        exit(1);
    }

    for ( i=1 ; i <= 5 ; i++ )
    {
        if ( fgets(mybuf,sizeof(mybuf),f) == NULL )
            exit(2);
        printf("Line %d: %s",i,mybuf);
        if ( i == 2 )
        {
            savepos = ftell(f);
            printf(" [position saved]\n");
        }
    }

    fseek(f,savepos,SEEK_SET);
    printf("\n[position restored]\n\n");
    for ( i=3 ; i <= 5 ; i++ )
    {
        if ( fgets(mybuf,sizeof(mybuf),f) == NULL )
            exit(2);
        printf("Line %d: %s",i,mybuf);
    }
}
=============== file.txt
This is line one
Line two
Three
This is four
Number five
Line six
=============== Output when file is in Windows format
Line 1: This is line one
Line 2: Line two
[position saved]
Line 3: Three
Line 4: This is four
Line 5: Number five

[position restored]

Line 3: Three
Line 4: This is four
Line 5: Number five
=============== Output when file is in Unix format
Line 1: This is line one
Line 2: Line two
[position saved]
Line 3: Three
Line 4: This is four
Line 5: Number five?

[position restored]

Line 3: two
Line 4: Three
Line 5: This is four
===============

Note how the fseek() returns to the saved position (ie: it starts reading
at the start of line three) when the file is in Windows format. However,
if the file is in Unix format, the fseek() is off by 4 characters.

 

Kenneth Brody

Jack Klein wrote:
[...]
In addition to the issues Ben correctly pointed out about fseek() and
ftell() limitations, you left out one important piece of information:
did you open the file in text or binary mode?

I opened it in text mode, because it is a text file. However, in this
case, it is a text file that was copied from Unix to Windows without EOL
conversion.

If you open a file in text mode, and it does not actually contain the
format for text files on your platform, you are lying to your compiler
and its library functions. If you lie to your compiler, it will get
its revenge.

Well, that's what I'm asking -- is the compiler conforming, because it's
allowed to do this because of my "lie", or is the compiler "broken"?

 

Joe Wright

Kenneth said:
Jack Klein wrote:
[...]
In addition to the issues Ben correctly pointed out about fseek() and
ftell() limitations, you left out one important piece of information:
did you open the file in text or binary mode?

I opened it in text mode, because it is a text file. However, in this
case, it is a text file that was copied from Unix to Windows without EOL
conversion.

If you open a file in text mode, and it does not actually contain the
format for text files on your platform, you are lying to your compiler
and its library functions. If you lie to your compiler, it will get
its revenge.

Well, that's what I'm asking -- is the compiler conforming, because it's
allowed to do this because of my "lie", or is the compiler "broken"?

There must be something else going on. I modified your program just a
little to get this:

#include <stdio.h>
#include <stdlib.h>

char mybuf[256];

int main(int argc, char *argv[])
{
    int i;
    FILE *f = fopen("file.txt", "r");
    long savepos = 0;
    if (f == NULL) {
        perror("fopen");
        exit(1);
    }
    for (i = 1; i <= 5; i++) {
        if (fgets(mybuf, sizeof(mybuf), f) == NULL)
            exit(2);
        printf("Line %d: %s", i, mybuf);
        if (i == 2) {
            savepos = ftell(f);
            printf(" [position saved]\n");
        }
    }

    fseek(f, savepos, SEEK_SET);
    printf("\n[position restored]\n\n");
    for (i = 3; i <= 5; i++) {
        if (fgets(mybuf, sizeof(mybuf), f) == NULL)
            exit(2);
        printf("Line %d: %s", i, mybuf);
    }
    fclose(f);
    return 0;
}

I #include <stdlib.h> and not <sys/types.h> because we use exit() and
because ftell() returns, and fseek() takes, a long rather than an off_t.
I also close the file and return 0. I run the above program and get:

Line 1: This is line one
Line 2: Line two
[position saved]
Line 3: Three
Line 4: This is four
Line 5: Number five

[position restored]

Line 3: Three
Line 4: This is four
Line 5: Number five

Then I remove all the '\r' characters from file.txt and run it again and
get, you guessed it..

Line 1: This is line one
Line 2: Line two
[position saved]
Line 3: Three
Line 4: This is four
Line 5: Number five

[position restored]

Line 3: Three
Line 4: This is four
Line 5: Number five

There's a little noise around here sometimes about broken compilers, but
it's just noise for the most part. The problem is in the code, not the
compiler. In any case I can't duplicate your problem. The program gives
the same output for both DOS and Unix versions of file.txt.
 

Ben Bacarisse

Ben said:
On Fri, 31 Mar 2006 10:21:47 -0500, Kenneth Brody wrote:
[... Opening Unix-style text file as "text" under Windows ...]
No, you should be OK. You will be able to do more with fseek if you
open the file as a binary file, but opening it as text should also work
if you keep to the restrictions imposed by the standard (see later).

Okay.

It would seem I was hasty in saying this. I seem to remember using a
C library that did not care about line endings in text mode and gave
the right behaviour for both native and *some* foreign text file formats
(subject to the standard's restrictions on how one seeks) but I have been
told elsethread that this is not guaranteed. I can't verify that from my
reading of the standard, but I am not an expert in it.

It does seem reasonable that text files from "outside" should have to be
opened as binary files -- after all, how odd a text file would you expect
your local C library to understand? An argument could be made, that if
the Unix format is alien enough to be not seekable, then fgets should read
the whole thing as one line. You may have been misled by the C runtime
supporting the line ending for reading, but not for seeking.

BTW. As far as I can tell fgetpos/fsetpos may well do the job. The
standard does not say that there is any restriction on the type of the
stream you may use them on (and they are designed to work with quite
complex multibyte stream encodings so the simple matter of a line ending
is unlikely to throw them off!). I have not had an answer to my question
about this in the thread, but I would be curious to know how your
example program behaves when changed to use them. (On my Linux system,
the C library does no line-end translation on DOS text files so I have
nothing to test -- the file is treated as an untranslated byte stream.)

An example program with what you expect and what happens might make
everything clearer.

===============
#include <stdio.h>
#include <sys/types.h>

char mybuf[256];

int main(int argc,char *argv[])
{
int i;
FILE *f = fopen("file.txt","r");
off_t savepos;

if ( f == NULL )
{
perror("fopen");
exit(1);
}

for ( i=1 ; i <= 5 ; i++ )
{
if ( fgets(mybuf,sizeof(mybuf),f) == NULL )
exit(2);
printf("Line %d: %s",i,mybuf);
if ( i == 2 )
{
savepos = ftell(f);
printf(" [position saved]\n");
}
}

fseek(f,savepos,SEEK_SET);
printf("\n[position restored]\n\n");
for ( i=3 ; i <= 5 ; i++ )
{
if ( fgets(mybuf,sizeof(mybuf),f) == NULL )
exit(2);
printf("Line %d: %s",i,mybuf);
}
}
=============== file.txt
This is line one
Line two
Three
This is four
Number five
Line six
=============== Output when file is in Windows format
Line 1: This is line one
Line 2: Line two
[position saved]
Line 3: Three
Line 4: This is four
Line 5: Number five

[position restored]

Line 3: Three
Line 4: This is four
Line 5: Number five
=============== Output when file is in Unix format
Line 1: This is line one
Line 2: Line two
[position saved]
Line 3: Three
Line 4: This is four
Line 5: Number five?

[position restored]

Line 3: two
Line 4: Three
Line 5: This is four
===============

Note how the fseek() returns to the saved position (ie: it starts
reading at the start of line three) when the file is in Windows format.
However, if the file is in Unix format, the fseek() is off by 4
characters.

The only odd thing here is that ftell and fseek work with long int not
off_t. There are a few other details (include stdlib for exit, return 0
at the end and the now redundant non-standard sys/types.h) but I can't see
they would have any effect.

Correct these and try again if you like but I suspect the real answer
(even if that works as reported elsethread) is that only a file whose format
conforms to the local implementation definition of a text file can be
reliably seeked.
 
