mixing for x in file: and file.readline

Russell E. Owen · Sep 11, 2003

At one time, mixing for x in file and readline was dangerous. For
example:

    for line in file:
        # read some lines from the file, then break
        break
    nextline = file.readline()  # bad

would not do what a naive user might expect because the file iterator
buffered data and readline did not read from that buffer. Hence the call
to readline might unexpectedly skip some lines.

I stumbled across this the hard way, but am wondering if it's still
present in Python 2.3. I thought I'd seen it documented recently, but
looking through the description of the file object in the Python Library
Reference, I didn't see it.

Anyone know if it's still an issue? If so, anyone have any idea how hard
it would be to fix? I'm willing to work on a patch, but would probably
need some help. And if experts have already determined it's too hard,
and are willing to explain, I'd love some idea of why that is.

-- Russell

John J. Lee · Sep 12, 2003

Russell E. Owen said:
At one time, mixing for x in file and readline was dangerous. For
example:

    for line in file:
        # read some lines from the file, then break
        break
    nextline = file.readline()  # bad

would not do what a naive user might expect because the file iterator
buffered data and readline did not read from that buffer. Hence the call
to readline might unexpectedly skip some lines.

I stumbled across this the hard way, but am wondering if it's still
present in Python 2.3. I thought I'd seen it documented recently, but
looking through the description of the file object in the Python Library
Reference, I didn't see it.

There was a thread-fragment about this a while back. See the message
from Steven Taschuk a few messages past this one:

http://www.google.com/groups?hl=en&...-1&[email protected]&lr=&num=30&hl=en

http://tinyurl.com/n2cc

Anyone know if it's still an issue? If so, anyone have any idea how hard

[...]

Was fixed in 2.3, maybe in 2.2.3 also (not sure).

John

Oren Tirosh · Sep 12, 2003

At one time, mixing for x in file and readline was dangerous. For
example:

    for line in file:
        # read some lines from the file, then break
        break
    nextline = file.readline()  # bad

would not do what a naive user might expect because the file iterator
buffered data and readline did not read from that buffer. Hence the call
to readline might unexpectedly skip some lines.

I stumbled across this the hard way, but am wondering if it's still
present in Python 2.3.

Yes.

After you start reading a file with 'for' or iter() the current file
position is undefined unless you continue to the end of the file. This
means that once you start you shouldn't use the read(), readline() or
tell() methods unless you first seek() to a well-defined position.

The readline() and read() methods use the buffered I/O operations supplied
by the underlying C library. You can safely intermix read() and readline()
as well as tell()ing and seek()ing around without encountering any
unexpected behavior. You can even mix read operations on the same file
from Python code and stdio calls from an extension module (after getting
the FILE* object using PyFile_AsFile).

File iteration uses its own buffering for performance. Guido has declared
that "for line in fileobj:" should always be the fastest way to read an
entire file line by line. You just can't do that with the crappy stdio
implementations out there without adding your own buffering layer. Once
you do that it is out of sync with the FILE* object's idea of the current
file position.
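
The desynchronisation described above can be reproduced with a toy model. This is only a sketch, not CPython's implementation: the class name ReadaheadLines and the chunk_size parameter are made up for illustration, io.BytesIO stands in for the FILE* stream, and the syntax is that of current Python (__next__ rather than next).

```python
import io


class ReadaheadLines:
    """Toy line iterator that keeps its own readahead buffer (hypothetical)."""

    def __init__(self, raw, chunk_size=8192):
        self.raw = raw
        self.chunk_size = chunk_size
        self.buffer = b""

    def __iter__(self):
        return self

    def __next__(self):
        # Refill until the buffer holds a complete line or the stream ends.
        while b"\n" not in self.buffer:
            chunk = self.raw.read(self.chunk_size)
            if not chunk:
                break
            self.buffer += chunk
        if not self.buffer:
            raise StopIteration
        line, sep, rest = self.buffer.partition(b"\n")
        self.buffer = rest
        return line + sep


raw = io.BytesIO(b"one\ntwo\nthree\n")
lines = ReadaheadLines(raw)

first = next(lines)        # the iterator hands out the first line
position = raw.tell()      # but the stream itself is already at the end
skipped = raw.readline()   # so a direct readline() finds nothing left
remaining = list(lines)    # the "missing" lines are in the iterator's buffer
```

Here the underlying stream has been read to the end after a single line was returned, so readline() on the stream skips "two" and "three", which only the iterator can still deliver.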

In Python 2.2 if you break in the middle of the loop the temporary
iterator object (xreadlines) is lost along with its readahead buffer,
leaving you at an unknown file position. The only things you can do are
to close the file or seek. In Python 2.3 the file object IS an iterator
(rather than HAS an iterator) so while the current file position is
undefined from a read/readline/tell point of view the iterator state is
still consistent so you can immediately use it in another for loop to
continue from the same position or even call its next() method directly.
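
As a small illustration of resuming iteration on the same object, here is a sketch written for current Python, where the built-in next(f) replaces the f.next() method mentioned above. io.StringIO is used as a stand-in for a real file, and the variable names are illustrative only.

```python
import io

f = io.StringIO("header\nbody 1\nbody 2\n")  # stand-in for an open text file

for line in f:
    header = line
    break  # leave the loop after the first line

# The iterator state survives the break, so both of these continue
# from the line after the header.
second = next(f)
rest = [line for line in f]
```

Both the direct next() call and the second loop pick up where the first loop stopped, because the file object and the iterator are the same object.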

Anyone know if it's still an issue? If so, anyone have any idea how hard
it would be to fix? I'm willing to work on a patch, but would probably
need some help. And if experts have already determined it's too hard,
and are willing to explain, I'd love some idea of why that is.

Really fixing it amounts to reimplementing the entire I/O layer of
Python with a different strategy and thoroughly testing on multiple
platforms.

It's possible to hide the problem in most cases by making read and
readline use the iteration readahead buffer if it's attached to the file
object and stdio if it isn't. I don't think it's a good idea. It will
require some hairy code and seems susceptible to subtle bugs and
corner cases.

Another alternative is to make read and readline fail noisily after
iteration starts (unless cleared by seek()).
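
The "fail noisily" idea could look roughly like the wrapper below. This is a sketch under assumptions, not a proposed patch: StrictFile is a hypothetical name, the choice of ValueError and its message are arbitrary, only readline() is guarded, and the code uses current Python syntax.

```python
import io


class StrictFile:
    """Hypothetical wrapper: readline() refuses to run after iteration starts."""

    def __init__(self, f):
        self._f = f
        self._iterating = False

    def __iter__(self):
        return self

    def __next__(self):
        self._iterating = True
        return next(self._f)

    def readline(self):
        if self._iterating:
            raise ValueError("readline() after iteration; seek() first")
        return self._f.readline()

    def seek(self, offset, whence=0):
        self._iterating = False  # the position is well defined again
        return self._f.seek(offset, whence)


f = StrictFile(io.StringIO("a\nb\n"))
first = next(f)

try:
    f.readline()
    raised = False
except ValueError:
    raised = True   # mixing is reported instead of silently skipping lines

f.seek(0)           # seek() clears the flag
after_seek = f.readline()
```

The point of the design is that the mistake shows up as an exception at the call site rather than as lines that quietly go missing.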

Oren

John J. Lee · Sep 12, 2003

Oren Tirosh said:
At one time, mixing for x in file and readline was dangerous. For
example:

[...]

Yes.

[...]
In Python 2.2 if you break in the middle of the loop the temporary
iterator object (xreadlines) is lost along with its readahead buffer,
leaving you at an unknown file position. The only things you can do are
to close the file or seek. In Python 2.3 the file object IS an iterator
(rather than HAS an iterator) so while the current file position is
undefined from a read/readline/tell point of view the iterator state is
still consistent so you can immediately use it in another for loop to
continue from the same position or even call its next() method directly.

[...]

Oh, sorry for the misinformation -- I thought the repeated-iteration
and mixing-iteration-with-readline issues were the same, but clearly
not.

John

Russell E. Owen · Sep 12, 2003

(Oren points out that it's still a problem in Python 2.3 and after some
interesting and gory detail goes on to say...)

Really fixing it amounts to reimplementing the entire I/O layer of
Python with a different strategy and thoroughly testing on multiple
platforms.

It's possible to hide the problem in most cases by making read and
readline use the iteration readahead buffer if it's attached to the file
object and stdio if it isn't. I don't think it's a good idea. It will
require some hairy code and seems susceptible to subtle bugs and
corner cases.

I agree that fixing read would probably be too messy to justify.

But it seems to me that a simple reimplementation of readline() would
work fine:

    def readline(self):
        try:
            return self.next()
        except StopIteration:
            return ""

That's basically the way I ended up working around the problem (but I
didn't try to modify any classes). I do see two issues with that fix:
- existing code (if any) that mixes readline and read would be harmed
- it may not be efficient enough (even implemented in C)

Another alternative is to make read and readline fail noisily after
iteration starts (unless cleared by seek()).

If readline cannot be fixed, this might be worth doing since I think
it's a common thing to want to mix readline and iteration. If read is
the only issue, I suspect adding a warning to the documentation for file
method "read" would suffice.

I'm wondering where the problem is discussed in the manual. I'm pretty
sure I saw it recently, but when I read about file methods I saw nothing
about it.

-- Russell

Russell E. Owen · Sep 12, 2003

The seek workaround turns out to be very challenging, unless I'm missing
something. seek(0, 1) doesn't do anything -- no surprise, but it was
worth a try. Apparently the right thing is
    seek(-n, 1)  where n = number of characters in the iterator's buffer
but I haven't found any way of querying that information.

(The thought of using absolute positioning is appalling -- one would
have to keep track of how many characters had been returned by the
iterator).
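
For reference, the absolute-positioning bookkeeping would look something like the sketch below. It assumes binary mode, so that the length of each line equals the number of bytes consumed; io.BytesIO stands in for the file, and the names start and consumed are illustrative.

```python
import io

f = io.BytesIO(b"alpha\nbeta\ngamma\n")  # stand-in for a file opened in binary mode
start = f.tell()
consumed = 0

for line in f:
    consumed += len(line)       # bytes handed out by the iterator so far
    if line.startswith(b"alpha"):
        break

f.seek(start + consumed)        # absolute seek to the end of the last line returned
nextline = f.readline()
```

The count has to be maintained on every pass through the loop, which is the overhead objected to above, and it would not carry over directly to text mode with CRLF translation.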

A possible fix for read is to have it automatically do the seek
mentioned above (if the iteration buffer is nonempty). That'd work for
readline as well, but I still prefer the idea of having it use the
iterator -- it seems a lot simpler.

Comments?

-- Russell

Oren Tirosh · Sep 14, 2003

I agree that fixing read would probably be too messy to justify.

But it seems to me that a simple reimplementation of readline() would
work fine:

    def readline(self):
        try:
            return self.next()
        except StopIteration:
            return ""

That's basically the way I ended up working around the problem (but I
didn't try to modify any classes). I do see two issues with that fix:
- existing code (if any) that mixes readline and read would be harmed
- it may not be efficient enough (even implemented in C)

It will be very efficient. In fact, it will be faster than the current
readline implementation because it will use the readahead buffer. But
the problem is more than just mixing readline() and read(). Mixing
readline() and tell() will also be broken. It is valid (and useful) to
read a file line by line, store a tell() offset and later seek() back to
the same line. It works even if the file is in text mode doing CRLF->LF
conversions.
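
A sketch of that readline()/tell()/seek() pattern, written for current Python. The temporary file, its contents and the offsets dictionary are all made up for the example.

```python
import os
import tempfile

# Create a small text file to read back (illustrative contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    path = tmp.name

with open(path) as f:
    offsets = {}
    while True:
        pos = f.tell()           # position before the line is read
        line = f.readline()
        if not line:
            break
        offsets[line.strip()] = pos

    f.seek(offsets["second"])    # jump back to a remembered line
    again = f.readline()

os.remove(path)
```

This only works because readline() leaves the file position well defined after every call, which is exactly what would be lost if readline() were served from the iterator's readahead buffer.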

If readline cannot be fixed, this might be worth doing since I think
it's a common thing to want to mix readline and iteration. If read is
the only issue, I suspect adding a warning to the documentation for file
method "read" would suffice.

The problem is that it will work on, say, Python 2.3.1 but fail silently
on earlier versions. Why not just use next() instead of readline()?
Because catching StopIteration takes a little more typing than checking
an empty string?
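
As a side note for readers on later Python versions (2.6 and up, so not the releases discussed here): the built-in next() accepts a default value, which gives readline-style end-of-file handling without catching StopIteration. A minimal sketch, with io.StringIO standing in for a file:

```python
import io

f = io.StringIO("only line\n")  # stand-in for an open text file

first = next(f, "")    # a line, as readline() would return it
second = next(f, "")   # the default "" at end of file, no exception raised
```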

I'm wondering where the problem is discussed in the manual. I'm pretty
sure I saw it recently, but when I read about file methods I saw nothing
about it.

I believe it's not documented clearly enough. Docpatch time?

Oren
