Lazy "for line in f" ?


Alexandre Ferrieux

Hi,

I'm a total newbie in Python, but I did give the documentation a fair
try before coming here. Sorry if I missed the obvious.

The Tutorial says about the "for line in f" idiom that it is
"space-efficient".
In the absence of further explanation, I interpret this as "doesn't read
the whole file before spitting out lines".
In other words, I would say "lazy". Which would be a Good Thing, and a
much nicer idiom than the usual while loop calling readline()...

But when I use it on the standard input, be it the tty or a pipe, it
seems to wait for EOF before yielding the first line.

So, is it lazy or not? Is there some external condition that may
trigger one behavior or the other? If not, why is it called
"space-efficient"?

TIA,

-Alex
 

Christoph Haas

> I'm a total newbie in Python, but I did give the documentation a fair
> try before coming here. Sorry if I missed the obvious.
>
> The Tutorial says about the "for line in f" idiom that it is
> "space-efficient".
> In the absence of further explanation, I interpret this as "doesn't read
> the whole file before spitting out lines".

Correct. It reads one line at a time (as an "iterator") and returns it.
> In other words, I would say "lazy". Which would be a Good Thing, and a
> much nicer idiom than the usual while loop calling readline()...

The space efficiency is similar. The faux pas would rather be to read
the whole file into memory with readlines().
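
For illustration, a minimal sketch of the contrast (using a
hypothetical file 'big.log'):

# reads the whole file into a list of lines - memory grows with the file
lines = open('big.log').readlines()

# iterates lazily - only one buffered chunk in memory at a time
for line in open('big.log'):
    pass
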
> But when I use it on the standard input, be it the tty or a pipe, it
> seems to wait for EOF before yielding the first line.

Standard input is a weird thing in Python. Try sending two EOFs
(Ctrl-D): there is some internal magic with two loops checking for EOF.
It's been submitted as a bug report, but the developers declined to fix
it. Otherwise it's fine. In a pipe you shouldn't even notice.

Christoph
 

Miles

> The Tutorial says about the "for line in f" idiom that it is
> "space-efficient".
> In the absence of further explanation, I interpret this as "doesn't read
> the whole file before spitting out lines".
> In other words, I would say "lazy". Which would be a Good Thing, and a
> much nicer idiom than the usual while loop calling readline()...
>
> But when I use it on the standard input, be it the tty or a pipe, it
> seems to wait for EOF before yielding the first line.

It doesn't read the entire file, but it does use internal buffering
for performance. On my system, it waits until it gets about 8K of
input before it yields anything. If you need each line as it's
entered at a terminal, you're back to the while/readline (or
raw_input) loop.
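
For example, a minimal sketch of that fallback loop (handle_line is a
made-up placeholder for whatever you do with each line):

import sys

def handle_line(line):
    print line.rstrip()    # placeholder: do real work here

while True:
    line = sys.stdin.readline()   # returns as soon as a full line arrives
    if not line:                  # readline() returns '' only at EOF
        break
    handle_line(line)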

-Miles
 

Alexandre Ferrieux

> It doesn't read the entire file, but it does use internal buffering
> for performance. On my system, it waits until it gets about 8K of
> input before it yields anything. If you need each line as it's
> entered at a terminal, you're back to the while/readline (or
> raw_input) loop.

How frustrating! Such a nice syntax for such crippled semantics...

Of course, I guess it is trivial to write another iterator doing
exactly what I want (something like the sketch below).
But nonetheless, it is disappointing not to have it with the standard
file handles.
And speaking of optimization, I doubt that blocking on a full buffer
gains much.
For decades, libc's fgets() has been doing it properly (block-buffering
when data come swiftly, but yielding lines as soon as they are
complete)... Why is the Python library doing this?
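
Something like this, I suppose - a generator that defers to readline()
and therefore yields each line as soon as it is complete (lazy_lines is
a made-up name, not anything from the standard library):

def lazy_lines(f):
    # yield lines as soon as readline() returns them, no read-ahead
    while True:
        line = f.readline()
        if not line:
            return
        yield line

# usage: for line in lazy_lines(sys.stdin): ...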

-Alex
 

Steve Holden

Alexandre said:
> How frustrating! Such a nice syntax for such crippled semantics...
>
> Of course, I guess it is trivial to write another iterator doing
> exactly what I want.
> But nonetheless, it is disappointing not to have it with the standard
> file handles.
> And speaking of optimization, I doubt that blocking on a full buffer
> gains much.
> For decades, libc's fgets() has been doing it properly (block-buffering
> when data come swiftly, but yielding lines as soon as they are
> complete)... Why is the Python library doing this?
What makes you think Python doesn't use the platform fgets()? As a
matter of policy the Python library offers as thin a shim as possible
over the C standard library when this is practical - as it is with "for
line in f:". But in the case of file.next() (the file method called to
iterate over the contents) it will actually use getc_unlocked() on
platforms that offer it, though you can override that configuration
feature by setting USE_FGETS_IN_GETLINE.

It's probably more to do with the buffering. If whatever is driving the
file is using buffering itself, then it really doesn't matter what the
Python library does, it will still have to wait until the sending buffer
fills before it can get any data at all.

Try running stdin unbuffered (use python -u) and see if that makes any
difference. It should, in the shell-driven case, for example.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
 

Alexandre Ferrieux

> What makes you think Python doesn't use the platform fgets()?

The fact that it adds that extra layer of buffering. Stdio is already
buffered; duplicating this is useless.
> ... in the case of file.next() (the file method called to
> iterate over the contents) it will actually use getc_unlocked() on
> platforms that offer it, though you can override that configuration
> feature by setting USE_FGETS_IN_GETLINE.

Setting it does nothing. And anyway, stdio's getc() does not stubbornly
block on 8k either, so switching from getc to fgets seems orthogonal to
the problem.
> It's probably more to do with the buffering. If whatever is driving the
> file is using buffering itself, then it really doesn't matter what the
> Python library does, it will still have to wait until the sending buffer
> fills before it can get any data at all.

Nonsense. In all three cases - pipe, socket, terminal - I control the
writer and make sure that it writes in an unbuffered manner. To convince
you, here is an strace of the Python process while I type random lines
like "fdsfdsfds":

read(0, "sdfsdf\n", 8192) = 7
read(0, "sdfds\n", 7168) = 6

which proves that the Python process actually gets the lines one by
one, but buffers them internally... for much too long. Sigh.
> Try running stdin unbuffered (use python -u) and see if that makes any
> difference. It should, in the shell-driven case, for example.

No effect. As a matter of fact, -u is documented as affecting only
output (stdout and stderr).

So I'll reiterate the question: *why* does the Python library add that
extra layer of (hard-headed) buffering on top of stdio's?

-Alex
 

Paul Rubin

Alexandre Ferrieux said:
> So I'll reiterate the question: *why* does the Python library add that
> extra layer of (hard-headed) buffering on top of stdio's?

readline?
 

Duncan Booth

Alexandre Ferrieux said:
> I know readline() doesn't have this problem. I'm asking why the file
> iterator does.
Here's a program which can create a large file and either read it with
readline or iterate over the lines. Output from various runs should
answer your question.

The extra buffering means that iterating over a file is about 3 times
faster than repeatedly calling readline.

C:\Temp>test.py create 1000000
create file
Time taken=7.28 seconds

C:\Temp>test.py readline
readline
Time taken=1.03 seconds

C:\Temp>test.py iterate
iterate
Time taken=0.38 seconds

C:\Temp>test.py create 10000000
create file
Time taken=47.28 seconds

C:\Temp>test.py readline
readline
Time taken=10.39 seconds

C:\Temp>test.py iterate
iterate
Time taken=3.58 seconds


------- test.py ------------
import time, sys

NLINES = 10
def create():
    print "create file"
    f = open('testfile.txt', 'w')
    for i in range(NLINES):
        print >>f, "This is a test file with a lot of lines"
    f.close()

def readline():
    print "readline"
    f = open('testfile.txt', 'r')
    while 1:
        line = f.readline()
        if not line:
            break
    f.close()

def iterate():
    print "iterate"
    f = open('testfile.txt', 'r')
    for line in f:
        pass
    f.close()

def doit(fn):
    start = time.time()
    fn()
    end = time.time()
    print "Time taken=%0.2f seconds" % (end-start)

if __name__=='__main__':
    if len(sys.argv) >= 3:
        NLINES = int(sys.argv[2])

    if sys.argv[1]=='create':
        doit(create)
    elif sys.argv[1]=='readline':
        doit(readline)
    elif sys.argv[1]=='iterate':
        doit(iterate)

----------------------------
 

Alexandre Ferrieux

> The extra buffering means that iterating over a file is about 3 times
> faster than repeatedly calling readline.
>
>     while 1:
>         line = f.readline()
>         if not line:
>             break
>
>     for line in f:
>         pass

Surely you'll notice that the comparison is spoilt by the fact that
the readline version needs an interpreted test each time around the
loop. A more interesting test would be a C-implemented iterator that
just calls fgets() (the thin-shim policy) without the extra 8k blocking.

-Alex
 

Duncan Booth

Alexandre Ferrieux said:
> Surely you'll notice that the comparison is spoilt by the fact that
> the readline version needs an interpreted test each time around the
> loop. A more interesting test would be a C-implemented iterator that
> just calls fgets() (the thin-shim policy) without the extra 8k blocking.
No, I believe the comparison is perfectly fair. You need the extra test
for the readline version whatever you do, and you don't need it for the
iterator.

If you insist, you can add an identical 'if not line: break' into the
iterator version as well: it adds another 10% onto the iterator runtime
which is still nearly a factor of 3 faster than the readline version,
but then you aren't comparing equivalent code.
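
That is, presumably something like this - the test can never fire
during iteration, since the iterator never yields an empty string, so
it is pure overhead:

for line in f:
    if not line:
        break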

Alternatively you can knock a chunk off the time for the readline loop
by writing it as:

while f.readline():
    pass

or even:

read = f.readline
while read():
    pass

which gets it down from 10.3 to 9.0 seconds. It's 'fair' in your book
since it avoids all the extra interpreter overhead of attribute lookup
and a separate test, but it does make it a touch hard to do anything
useful with the actual data.

Whatever, the iterator makes the code both cleaner and faster. That
comes at the expense of not being suitable for interactive sessions, or
in some cases pipes, but for those situations you can continue to use
readline, and the extra runtime overhead will likely not be noticeable.
 

Alexandre Ferrieux

> Whatever, the iterator makes the code both cleaner and faster. That
> comes at the expense of not being suitable for interactive sessions, or
> in some cases pipes, but for those situations you can continue to use
> readline, and the extra runtime overhead will likely not be noticeable.

But *why* is it so? If Python calls fgets(), which already has internal
buffering, why does the extra buffering gain so much? Is it just doing
something like the sketch below and winning on per-line call overhead?
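
(A rough model of the read-ahead - my reconstruction, not the actual C
source:)

def chunked_lines(f, bufsize=8192):
    # one big read() per chunk, then split into lines in memory,
    # instead of one locked stdio call per line
    rest = ''
    while True:
        chunk = f.read(bufsize)
        if not chunk:
            if rest:
                yield rest       # last line, no trailing newline
            return
        rest += chunk
        complete = rest.split('\n')
        rest = complete.pop()    # keep the incomplete trailing part
        for line in complete:
            yield line + '\n'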

-Alex
 

Duncan Booth

Duncan Booth said:
> or even:
>
>     read = f.readline
>     while read():
>         pass

Oops, I forgot the other obvious variant on this, which has the benefit of
getting rid of the test I said was 'required' while still leaving the data
accessible:

for line in iter(f.readline, ''):
    pass

Takes 8.89 seconds (best of 3 runs) versus 3.56 (best of 3) for the
similar:

for line in f:
    pass

So readline takes about 2.5x as long at best, and only then if you
remember the obscure use of iter. :)
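
(For reference, since that use of iter really is obscure: the
two-argument form iter(callable, sentinel) builds an iterator that
calls callable() repeatedly until it returns sentinel. readline()
returns '' exactly at EOF, so iter(f.readline, '') stops the loop right
where an explicit 'if not line: break' would.)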
 
