looping over a big file

martian

Hi,

I've a couple of questions regarding the processing of a big text file
(16MB).

1) How does Python handle:

    for line in big_file:

Is big_file read into memory all at once, is one line read at a time, is a
buffer used, or ...?

2) Is it possible to advance lines within the loop? The following doesn't
work:

    for line in big_file:
        line_after = big_file.readline()

The readline function (file pointer) is "out of sync" with the loop (and
this suggests big_file is not read one line at a time in the loop).

Thanks,
Fernando Martins
 
Roy Smith

martian said:
1) how does python handle:

    for line in big_file:

is big_file all read into memory or one line is read at a time or a buffer
is used or ...?

The "right" way to do this is:

    for line in file("filename"):
        whatever

The file object returned by file() acts as an iterator. Each time through
the loop, another line is read and returned (I'm sure there is some
block-level buffering going on at a low level).
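
A minimal sketch of that lazy behaviour, written for current Python 3, where the file() builtin is gone and open() is used instead. The helper name first_lines and the throwaway temporary file are assumptions made for illustration only; the point is that the loop pulls lines on demand instead of loading the whole file first.

```python
import itertools
import os
import tempfile


def first_lines(path, count):
    """Return the first `count` lines of the file at `path`.

    The file object is consumed lazily, so only as much of the file as
    is needed (plus whatever the I/O layer buffers) is read.
    """
    with open(path) as handle:
        return list(itertools.islice(handle, count))


# Build a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines("line %d\n" % i for i in range(100000))

head = first_lines(tmp.name, 3)
os.remove(tmp.name)
print(head)
```

Only the first three lines are turned into Python strings here, however large the file is.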

2) is it possible to advance lines within the loop? The following doesn't
work:

    for line in big_file:
        line_after = big_file.readline()

You probably want something like:

    for line in file("filename"):
        if skipThisLine:
            continue
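
If the aim is to consume the following line inside the loop body, and not just skip the current one, one option is to pull it from the same iterator that drives the loop instead of calling readline(). A sketch in Python 3 syntax; pair_up is a hypothetical helper, and io.StringIO stands in for a real file.

```python
import io


def pair_up(stream):
    """Yield (line, following_line) pairs from a line-based stream.

    Both the for loop and next() draw from the same iterator, so they
    cannot get out of step with each other.
    """
    lines = iter(stream)
    for line in lines:
        # The default of None covers an odd number of lines.
        line_after = next(lines, None)
        yield line, line_after


sample = io.StringIO("a\nb\nc\n")
pairs = list(pair_up(sample))
print(pairs)
```

Each pass through the loop consumes two lines, and the last pair is padded with None when the line count is odd.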
 
Mike Meyer

Roy Smith said:
The "right" way to do this is:

    for line in file("filename"):
        whatever

The file object returned by file() acts as an iterator. Each time through
the loop, another line is read and returned (I'm sure there is some
block-level buffering going on at a low level).

I disagree. That's the *convenient* way to do it, and perfectly
acceptable in many situations. But not all Python interpreters will
close the file when the for loop ends. Likewise, if you get an exception
during the processing, the file may not get closed properly. Those
things may matter to you, in which case the "right" way is:

    data = open("filename")
    try:
        for line in data:
            whatever
    finally:
        # Runs whether or not the loop raised, so the file is always closed.
        data.close()
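
From Python 2.5 onward, the with statement gives the same guarantee more compactly: the file is closed when the block exits, whether normally or through an exception. A sketch under that assumption; count_file_lines and the temporary file are illustrative names, not part of the code above.

```python
import os
import tempfile


def count_file_lines(path):
    """Count lines in a file, closing it even if the loop raises."""
    total = 0
    with open(path) as data:
        for line in data:
            total += 1
    # Once the with block has exited, data.closed is True.
    return total, data.closed


with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("one\ntwo\nthree\n")

result = count_file_lines(tmp.name)
os.remove(tmp.name)
print(result)
```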

Guido has made a pronouncement on open vs. file. I think he prefers
open for opening files, and file for type testing, but I may well be
wrong. I don't think it's critical.

<mike
 
Michael Hoffman

Mike said:
I disagree. That's the *convenient* way to do it, and perfectly
acceptable in many situations. But not all Python interpreters will
close the file when for loop ends. Likewise, if you get an exception
during the processing, the file may not get closed properly. Those
things may matter to you, in which case the "right" way is:

    data = open("filename")
    try:
        for line in data:
            whatever
    finally:
        data.close()

Guido has made a pronouncement on open vs. file. I think he prefers
open for opening files, and file for type testing, but may well be
wrong. I don't think it's critical.

He has said that open() may be used for things other than files in the
future. So if you want to be sure you're opening a file, use file().

<wink>
 
Peter Hansen

Michael said:
He has said that open() may be used for things other than files in the
future. So if you want to be sure you're opening a file, use file().

Probably this is the same sort of thing as "if you want to be sure your
function is working with an integer, you have to test whether it is an
integer" (or use a statically typed language).

Which is advice that is generally rebutted around here with comments
about "duck typing" (as in, if it acts like an integer, then stop
worrying about what it actually is).

If open() can ever return things other than files, it seems likely it
will do so only under conditions that make it pretty much safe to assume
that existing code will continue to operate "as expected" (note: not
"always with a file").

I'm not going to try to picture just how this might happen, but I could
imagine, for example, some kind of support for protocol prefixes (à la
"http:" or "ftp:"), or perhaps some sort of support for encrypted or
compressed data. Or maybe it would require a prior call to some
function to enable the magic that lets open() return non-files.

If any of that is reasonable, then using open() is actually the better
approach to ensuring your code "does the right thing" in the future, and
"file" should still be used in the rare case where you actually want to
test whether something is a particular type of thing.

-Peter
 
Terry Hancock

Peter Hansen said:
If open() can ever return things other than files, it seems likely it
will do so only under conditions that make it pretty much safe to assume
that existing code will continue to operate "as expected" (note: not
"always with a file").

WHEN it returns things other than files. Like a StringIO object,
which can be quite handy. True, it won't be a "big file", but it'd
be nice if the same code would tolerate it. I've used this with
e.g. PIL quite a bit when working with Zope, because it isn't
really desirable to have to write the file out to disk and read
it back when you've already got it in memory.
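
That kind of tolerance can be sketched as a function that only iterates over its argument and never checks its type. The example uses Python 3's io.StringIO (the StringIO module of the 2.x era plays the same role); longest_line is a hypothetical helper, not something from PIL or Zope.

```python
import io


def longest_line(stream):
    """Return the longest line of any object that yields lines when iterated."""
    longest = ""
    for line in stream:
        stripped = line.rstrip("\n")
        if len(stripped) > len(longest):
            longest = stripped
    return longest


# An in-memory buffer works exactly as an open text file would.
in_memory = io.StringIO("short\na much longer line\nmid\n")
answer = longest_line(in_memory)
print(answer)
```

The same function accepts an open file, a StringIO buffer, or a plain list of strings, because all it asks of its argument is iteration.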

Quack! ;-)
Terry
 
Asun Friere

Jp said:
    fileIter = iter(big_file)
    for line in fileIter:
        line_after = fileIter.next()

Don't mix iterating with any other file methods, since it will confuse the buffering scheme.

Isn't a file an iterable already?

[GCC 3.3.3 20040412 (Red Hat Linux 3.3.3-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
True
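
The commands typed in that session did not survive the paste. As a separate illustration, and not a reconstruction of what was actually entered, the following Python 3 sketch checks that an open text file is its own iterator, so calling iter() on it hands back the same object. The temporary file and variable names are assumptions.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("first\nsecond\n")

with open(tmp.name) as big_file:
    # iter() on a file object returns the file object itself.
    same_object = iter(big_file) is big_file
    first = next(big_file)
    # A "new" iterator continues where the previous read stopped.
    second = next(iter(big_file))

os.remove(tmp.name)
print(same_object, repr(first), repr(second))
```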
 
Asun Friere

sorry lost the first line in pasting:
Python 2.4.1 (#1, Jun 21 2005, 12:38:55)
:/
 
