Paul Rubin
I just had to write some programs that crunched a lot of large files,
both text and binary. As I use iterators more I find myself wishing
for some maybe-obvious enhancements:
1. File iterator for blocks of chars:
    f = open('foo')
    for block in f.iterchars(n=1024): ...
iterates through 1024-character blocks from the file. The default iterator
which loops through lines is not always a good choice since each line can
use an unbounded amount of memory. Default n in the above should be 1 char.
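You can approximate this today with the two-argument form of iter(),
though a real file method would read better. A rough sketch (iterchars
here is a free function standing in for the proposed method):

    from functools import partial

    def iterchars(f, n=1):
        # call f.read(n) repeatedly until it returns '' at EOF
        return iter(partial(f.read, n), '')

    f = open('foo')
    for block in iterchars(f, 1024): ...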
2. wrapped file openers:
There should be functions (either in itertools, builtins, the sys
module, or wherever) that open a file, expose one of the above
iterators, then close the file, i.e.
    def file_lines(filename):
        with open(filename) as f:
            for line in f:
                yield line
so you can say
    for line in file_lines(filename):
        crunch(line)
The current bogus idiom is to say "for line in open(filename)" but
that does not promise to close the file once the file is exhausted
(part of the motivation of the new "with" statement). There should
similarly be "file_chars" which uses the n-chars iterator instead of
the line iterator.
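A sketch of what file_chars might look like, written directly against
read() so it stands alone:

    def file_chars(filename, n=1):
        # yield n-char blocks, closing the file when it's exhausted
        with open(filename) as f:
            while True:
                block = f.read(n)
                if not block:
                    break
                yield block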
3. itertools.ichain:
yields the contents of each of a sequence of iterators, i.e.:
    def ichain(seq):
        for s in seq:
            for t in s:
                yield t
this is different from itertools.chain because it lazy-evaluates its
input sequence. Example application:
    all_filenames = ['file1', 'file2', 'file3']
    # loop through all the files, crunching all lines in each one
    for line in ichain(file_lines(x) for x in all_filenames):
        crunch(line)
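The laziness matters here: itertools.chain(*(file_lines(x) for x in
all_filenames)) would run the generator expression to completion,
building every iterator before yielding the first line, while ichain
builds each one only as the loop reaches it. The same function can also
be written as a nested generator expression:

    def ichain(seq):
        # flatten one level, consuming seq itself lazily
        return (t for s in seq for t in s)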
4. functools enhancements (Haskell-inspired):
Let f be a function with 2 inputs, and assume "from functools import
partial". Then:
    a) def flip(f): return lambda x, y: f(y, x)
    b) def lsect(x, f): return partial(f, x)
    c) def rsect(f, x): return partial(flip(f), x)
lsect and rsect allow making what Haskell calls "sections". Example:
    from itertools import count, takewhile
    from operator import lt
    # sequence of all squares less than 100
    s100 = takewhile(rsect(lt, 100), (x*x for x in count()))
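A quick check that the definitions compose the way the Haskell analogy
suggests (definitions repeated so this runs standalone):

    from functools import partial
    from operator import sub

    def flip(f): return lambda x, y: f(y, x)
    def lsect(x, f): return partial(f, x)
    def rsect(f, x): return partial(flip(f), x)

    minus1 = rsect(sub, 1)    # like Haskell's (subtract 1)
    from1 = lsect(1, sub)     # like Haskell's (1 -)
    assert minus1(5) == 4     # 5 - 1
    assert from1(5) == -4     # 1 - 5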