mixing for x in file: and file.readline

Russell E. Owen

At one time, mixing for x in file and readline was dangerous. For
example:

    for line in file:
        # read some lines from the file, then stop early
        break
    nextline = file.readline()  # bad: may skip lines

would not do what a naive user might expect because the file iterator
buffered data and readline did not read from that buffer. Hence the call
to readline might unexpectedly skip some lines.

I stumbled across this the hard way, but am wondering if it's still
present in Python 2.3. I thought I'd seen it documented recently, but
looking through the description of the file object in the Python Library
Reference, I didn't see it.

Anyone know if it's still an issue? If so, anyone have any idea how hard
it would be to fix? I'm willing to work on a patch, but would probably
need some help. And if experts have already determined it's too hard,
and are willing to explain, I'd love some idea of why that is.

-- Russell
 
John J. Lee

Russell E. Owen said:
At one time, mixing for x in file and readline was dangerous. For
example:

    for line in file:
        # read some lines from the file, then stop early
        break
    nextline = file.readline()  # bad: may skip lines

would not do what a naive user might expect because the file iterator
buffered data and readline did not read from that buffer. Hence the call
to readline might unexpectedly skip some lines.

I stumbled across this the hard way, but am wondering if it's still
present in Python 2.3. I thought I'd seen it documented recently, but
looking through the description of the file object in the Python Library
Reference, I didn't see it.

There was a thread-fragment about this a while back. See the message
from Steven Taschuk a few messages past this one:

http://www.google.com/groups?hl=en&...-1&[email protected]&lr=&num=30&hl=en

http://tinyurl.com/n2cc


Anyone know if it's still an issue? If so, anyone have any idea how hard
[...]

Was fixed in 2.3, maybe in 2.2.3 also (not sure).


John
 
Oren Tirosh

Russell E. Owen said:
At one time, mixing for x in file and readline was dangerous. For
example:

    for line in file:
        # read some lines from the file, then stop early
        break
    nextline = file.readline()  # bad: may skip lines

would not do what a naive user might expect because the file iterator
buffered data and readline did not read from that buffer. Hence the call
to readline might unexpectedly skip some lines.

I stumbled across this the hard way, but am wondering if it's still
present in Python 2.3.

Yes.

After you start reading a file with 'for' or iter() the current file
position is undefined unless you continue to the end of the file. This
means that once you start you shouldn't use the read(), readline() or
tell() methods unless you first seek() to a well-defined position.

The readline() and read() methods use the buffered I/O operations supplied
by the underlying C library. You can safely intermix read() and readline()
as well as tell()ing and seek()ing around without encountering any
unexpected behavior. You can even mix read operations on the same file
from Python code and stdio calls from an extension module (after getting
the FILE* object using PyFile_AsFile).

File iteration uses its own buffering for performance. Guido has declared
that "for line in fileobj:" should always be the fastest way to read an
entire file line by line. You just can't do that with the crappy stdio
implementations out there without adding your own buffering layer. Once
you do that it is out of sync with the FILE* object's idea of the current
file position.
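
The desynchronization described above can be illustrated with a toy model. This is only a sketch written for Python 3, not the actual file iterator code: the class name ReadAheadIterator and the chunk size of 16 bytes are assumptions made up for the illustration, and io.BytesIO stands in for the underlying file.

```python
import io


class ReadAheadIterator:
    """Toy line iterator with a private read-ahead buffer (hypothetical)."""

    def __init__(self, raw, chunk_size=16):
        self.raw = raw
        self.chunk_size = chunk_size
        self.buffer = b""

    def __iter__(self):
        return self

    def __next__(self):
        # Pull chunks from the underlying file until a full line is buffered.
        while b"\n" not in self.buffer:
            chunk = self.raw.read(self.chunk_size)
            if not chunk:
                break
            self.buffer += chunk
        if not self.buffer:
            raise StopIteration
        line, newline, rest = self.buffer.partition(b"\n")
        self.buffer = rest
        return line + newline


raw = io.BytesIO(b"one\ntwo\nthree\nfour\nfive\n")
lines = ReadAheadIterator(raw)

first = next(lines)      # the iterator hands back one line...
direct = raw.readline()  # ...but the underlying file is already 16 bytes in

print(first)         # b'one\n'
print(lines.buffer)  # b'two\nthree\nfo' is stranded in the private buffer
print(direct)        # b'ur\n', so "two", "three" and half of "four" were skipped
```

The underlying object only knows that 16 bytes were consumed; everything still sitting in the iterator's private buffer is invisible to readline().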

In Python 2.2 if you break in the middle of the loop the temporary
iterator object (xreadlines) is lost along with its readahead buffer,
leaving you at an unknown file position. The only things you can do are
to close the file or seek. In Python 2.3 the file object IS an iterator
(rather than HAS an iterator) so while the current file position is
undefined from a read/readline/tell point of view the iterator state is
still consistent so you can immediately use it in another for loop to
continue from the same position or even call its next() method directly.
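
A minimal sketch of resuming iteration after a break. It is written for Python 3 (an assumption; the thread concerns 2.2 and 2.3), where file objects are likewise their own iterators and the built-in next(f) plays the role of the f.next() method mentioned above. The file name demo.txt and its contents are made up for the example.

```python
import os
import tempfile

# Create a small throwaway file to iterate over.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("header\nbody 1\nbody 2\n")

with open(path) as f:
    for line in f:
        if line.startswith("header"):
            break  # leave the loop early, right after the header

    second = next(f)    # continues where the loop stopped
    remaining = list(f)  # a second pass picks up the rest

print(repr(second))  # 'body 1\n'
print(remaining)     # ['body 2\n']
```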

Russell E. Owen said:
Anyone know if it's still an issue? If so, anyone have any idea how hard
it would be to fix? I'm willing to work on a patch, but would probably
need some help. And if experts have already determined it's too hard,
and are willing to explain, I'd love some idea of why that is.

Really fixing it amounts to reimplementing the entire I/O layer of
Python with a different strategy and thoroughly testing on multiple
platforms.

It's possible to hide the problem in most cases by making read and
readline use the iteration readahead buffer if it's attached to the file
object and stdio if it isn't. I don't think it's a good idea. It will
require some hairy code and seems susceptible to subtle bugs and
corner cases.

Another alternative is to make read and readline fail noisily after
iteration starts (unless cleared by seek()).
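
The "fail noisily" idea could look roughly like the following wrapper. This is a sketch under stated assumptions, written for Python 3: the class name NoisyFile, the choice of ValueError, and the message text are all invented for the illustration and are not part of any real file API.

```python
import io


class NoisyFile:
    """Hypothetical wrapper: read()/readline() raise once iteration has begun."""

    def __init__(self, fileobj):
        self._file = fileobj
        self._iterating = False

    def __iter__(self):
        return self

    def __next__(self):
        self._iterating = True
        return next(self._file)

    def _require_defined_position(self, method):
        if self._iterating:
            raise ValueError(
                "%s() called after iteration started; call seek() first" % method
            )

    def read(self, size=-1):
        self._require_defined_position("read")
        return self._file.read(size)

    def readline(self):
        self._require_defined_position("readline")
        return self._file.readline()

    def seek(self, offset, whence=0):
        # A seek re-establishes a well-defined position, so clear the flag.
        self._iterating = False
        return self._file.seek(offset, whence)


f = NoisyFile(io.StringIO("a\nb\nc\n"))
first = next(f)
try:
    f.readline()
    blocked = False
except ValueError:
    blocked = True  # mixing is refused instead of silently skipping lines
f.seek(0)
after_seek = f.readline()  # allowed again after an explicit seek
```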

Oren
 
John J. Lee

Oren Tirosh said:
At one time, mixing for x in file and readline was dangerous. For
example:
[...]
[...]
In Python 2.2 if you break in the middle of the loop the temporary
iterator object (xreadlines) is lost along with its readahead buffer,
leaving you at an unknown file position. The only things you can do are
to close the file or seek. In Python 2.3 the file object IS an iterator
(rather than HAS an iterator) so while the current file position is
undefined from a read/readline/tell point of view the iterator state is
still consistent so you can immediately use it in another for loop to
continue from the same position or even call its next() method directly.
[...]

Oh, sorry for the misinformation -- I thought the repeated-iteration
and mixing-iteration-with-readline issues were the same, but clearly
not.


John
 
Russell E. Owen

(Oren points out that it's still a problem in Python 2.3 and after some
interesting and gory detail goes on to say...)
Really fixing it amounts to reimplementing the entire I/O layer of
Python with a different strategy and thoroughly testing on multiple
platforms.

It's possible to hide the problem in most cases by making read and
readline use the iteration readahead buffer if it's attached to the file
object and stdio if it isn't. I don't think it's a good idea. It will
require some hairy code and seems susceptible to subtle bugs and
corner cases.

I agree that fixing read would probably be too messy to justify.

But it seems to me that a simple reimplementation of readline() would
work fine:

    def readline(self):
        try:
            return self.next()
        except StopIteration:
            # at end of file, mimic readline() by returning an empty string
            return ""

That's basically the way I ended up working around the problem (but I
didn't try to modify any classes). I do see two issues with that fix:
- existing code (if any) that mixes readline and read would be harmed
- it may not be efficient enough (even implemented in C)

Oren Tirosh said:
Another alternative is to make read and readline fail noisily after
iteration starts (unless cleared by seek())

If readline cannot be fixed, this might be worth doing since I think
it's a common thing to want to mix readline and iteration. If read is
the only issue, I suspect adding a warning to the documentation for file
method "read" would suffice.

I'm wondering where the problem is discussed in the manual. I'm pretty
sure I saw it recently, but when I read about file methods I saw nothing
about it.

-- Russell
 
Russell E. Owen

The seek workaround turns out to be very challenging, unless I'm missing
something. seek(0, 1) doesn't do anything -- no surprise, but it was
worth a try. Apparently the right thing is
seek(-n, 1) where n = # of characters in the iterator's buffer
but I haven't found any way of querying that information.

(The thought of using absolute positioning is appalling -- one would
have to keep track of how many characters had been returned by the
iterator).
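
For what it's worth, the absolute-positioning bookkeeping can be sketched as below. This is a toy model written for Python 3, not the real file iterator: the generator read_ahead_lines and its 16-byte chunk size are assumptions invented to imitate a private read-ahead buffer, and io.BytesIO stands in for a file opened in binary mode. Counting only works this simply when one returned character is one byte on disk; in text mode with CRLF translation the count would not match the file offset.

```python
import io


def read_ahead_lines(raw, chunk_size=16):
    """Yield lines from raw through a private read-ahead buffer (toy model)."""
    buffer = b""
    while True:
        chunk = raw.read(chunk_size)
        buffer += chunk
        while b"\n" in buffer:
            line, _, buffer = buffer.partition(b"\n")
            yield line + b"\n"
        if not chunk:
            if buffer:
                yield buffer  # final line without a trailing newline
            return


raw = io.BytesIO(b"one\ntwo\nthree\nfour\nfive\n")
start = raw.tell()
consumed = 0  # bytes actually handed out by the iterator

for line in read_ahead_lines(raw):
    consumed += len(line)
    if line == b"two\n":
        break

position_before = raw.tell()  # 16: the read-ahead ran past the consumed lines
raw.seek(start + consumed)    # absolute seek back to the logical position
resynced = raw.readline()     # b'three\n'
```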

A possible fix for read is to have it automatically do the seek
mentioned above (if the iteration buffer is nonempty). That'd work for
readline as well, but I still prefer the idea of having it use the
iterator -- it seems a lot simpler.

Comments?

-- Russell
 
Oren Tirosh

Russell E. Owen said:
I agree that fixing read would probably be too messy to justify.

But it seems to me that a simple reimplementation of readline() would
work fine:

    def readline(self):
        try:
            return self.next()
        except StopIteration:
            # at end of file, mimic readline() by returning an empty string
            return ""

That's basically the way I ended up working around the problem (but I
didn't try to modify any classes). I do see two issues with that fix:
- existing code (if any) that mixes readline and read would be harmed
- it may not be efficient enough (even implemented in C)

It will be very efficient. In fact, it will be faster than the current
readline implementation because it will use the readahead buffer. But
the problem is more than just mixing readline() and read(). Mixing
readline() and tell() will also be broken. It is valid (and useful) to
read a file line by line, store a tell() offset and later seek() back to
the same line. It works even if the file is in text mode doing CRLF->LF
conversions.
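
The readline()/tell()/seek() pattern described above looks roughly like this. A minimal sketch for Python 3 (an assumption; the thread is about Python 2.x), using a made-up three-line file named lines.txt. The offsets are treated as opaque values that are only passed back to seek().

```python
import os
import tempfile

# Create a small throwaway text file.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

offsets = []
with open(path) as f:
    while True:
        position = f.tell()  # offset of the line about to be read
        line = f.readline()
        if not line:
            break
        offsets.append(position)

    f.seek(offsets[1])       # jump back to the start of the second line
    reread = f.readline()

print(repr(reread))  # 'beta\n'
```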

Russell E. Owen said:
If readline cannot be fixed, this might be worth doing since I think
it's a common thing to want to mix readline and iteration. If read is
the only issue, I suspect adding a warning to the documentation for file
method "read" would suffice.

The problem is that it will work on, say, Python 2.3.1 but fail silently
on earlier versions. Why not just use next() instead of readline()?
Because catching StopIteration takes a little more typing than checking
an empty string?
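
In later Python versions the extra typing largely goes away. The sketch below assumes Python 3 and uses the two-argument form of the built-in next(), which as far as I know arrived in Python 2.6 and so was not available in 2.3. io.StringIO stands in for a real file, and the sample contents are made up.

```python
import io

f = io.StringIO("header\nfirst\nsecond\n")

for line in f:
    if line.startswith("header"):
        break  # stop iterating after the header

# next() with a default returns "" at end of file, like readline() does,
# but reads through the same iterator as the for loop.
following = next(f, "")
rest = list(f)
at_end = next(f, "")

print(repr(following))  # 'first\n'
print(rest)             # ['second\n']
print(repr(at_end))     # ''
```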

Russell E. Owen said:
I'm wondering where the problem is discussed in the manual. I'm pretty
sure I saw it recently, but when I read about file methods I saw nothing
about it.

I believe it's not documented clearly enough. Docpatch time?

Oren
 
