itertools, functools, file enhancement ideas

Paul Rubin

I just had to write some programs that crunched a lot of large files,
both text and binary. As I use iterators more I find myself wishing
for some maybe-obvious enhancements:

1. File iterator for blocks of chars:

f = open('foo')
for block in f.iterchars(n=1024): ...

iterates through 1024-character blocks from the file. The default iterator
which loops through lines is not always a good choice since each line can
use an unbounded amount of memory. Default n in the above should be 1 char.

2. wrapped file openers:
There should be functions (either in itertools, builtins, the sys
module, or wherever) that open a file, expose one of the above
iterators, then close the file, i.e.
def file_lines(filename):
    with open(filename) as f:
        for line in f:
            yield line
so you can say

for line in file_lines(filename):
    crunch(line)

The current bogus idiom is to say "for line in open(filename)" but
that does not promise to close the file once the file is exhausted
(part of the motivation of the new "with" statement). There should
similarly be "file_chars" which uses the n-chars iterator instead of
the line iterator.
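
A sketch of what file_chars might look like (the name and signature are
the proposal's; the body is one plausible rendering of the n-character
block behavior from item 1):

def file_chars(filename, n=1):
    # open the file, yield n-character blocks, close when exhausted
    with open(filename) as f:
        while True:
            block = f.read(n)
            if not block:
                break
            yield block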

3. itertools.ichain:
yields the contents of each of a sequence of iterators, i.e.:
def ichain(seq):
    for s in seq:
        for t in s:
            yield t
this is different from itertools.chain because it lazy-evaluates its
input sequence. Example application:

all_filenames = ['file1', 'file2', 'file3']
# loop through all the files crunching all lines in each one
for line in (ichain(file_lines(x) for x in all_filenames)):
    crunch(x)

4. functools enhancements (Haskell-inspired):
Let f be a function with 2 inputs, and partial be functools.partial. Then:
a) def flip(f): return lambda x,y: f(y,x)
b) def lsect(x,f): return partial(f,x)
c) def rsect(f,x): return partial(flip(f), x)

lsect and rsect allow making what Haskell calls "sections". Example:
# sequence of all squares less than 100
from itertools import count, takewhile
from operator import lt
s100 = takewhile(rsect(lt, 100), (x*x for x in count()))
 
Paul Rubin

Paul Rubin said:
# loop through all the files crunching all lines in each one
for line in (ichain(file_lines(x) for x in all_filenames)):
    crunch(x)

supposed to say crunch(line) of course.
 
Alex Martelli

Paul Rubin said:
I just had to write some programs that crunched a lot of large files,
both text and binary. As I use iterators more I find myself wishing
for some maybe-obvious enhancements:

1. File iterator for blocks of chars:

f = open('foo')
for block in f.iterchars(n=1024): ...

iterates through 1024-character blocks from the file. The default iterator
which loops through lines is not always a good choice since each line can
use an unbounded amount of memory. Default n in the above should be 1 char.

the simple way (letting the file object deal w/buffering issues):

def iterchars(f, n=1):
    while True:
        x = f.read(n)
        if not x: break
        yield x

the fancy way (doing your own buffering) is left as an exercise for the
reader. I do agree it would be nice to have in some module.
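
One plausible rendering of the fancy way (an assumption about the
intent: read large chunks, then slice fixed-size blocks out of a local
buffer; the name is hypothetical):

def iterchars_buffered(f, n=1, bufsize=1 << 16):
    # one big read amortizes the cost of many small blocks
    buf = ''
    while True:
        data = f.read(bufsize)
        if not data:
            break
        buf += data
        while len(buf) >= n:
            yield buf[:n]
            buf = buf[n:]
    if buf:
        yield buf  # final short block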

2. wrapped file openers:
There should be functions (either in itertools, builtins, the sys
module, or wherever) that open a file, expose one of the above
iterators, then close the file, i.e.
def file_lines(filename):
    with open(filename) as f:
        for line in f:
            yield line
so you can say

for line in file_lines(filename):
    crunch(line)

The current bogus idiom is to say "for line in open(filename)" but
that does not promise to close the file once the file is exhausted
(part of the motivation of the new "with" statement). There should
similarly be "file_chars" which uses the n-chars iterator instead of
the line iterator.

I'm +/-0 on this one vs the idioms:

with open(filename) as f:
    for line in f: crunch(line)

with open(filename, 'rb') as f:
    for block in iterchars(f): crunch(block)

Making two lines into one is a weak use case for a stdlib function.

3. itertools.ichain:
yields the contents of each of a sequence of iterators, i.e.:
def ichain(seq):
    for s in seq:
        for t in s:
            yield t
this is different from itertools.chain because it lazy-evaluates its
input sequence. Example application:

all_filenames = ['file1', 'file2', 'file3']
# loop through all the files crunching all lines in each one
for line in (ichain(file_lines(x) for x in all_filenames)):
    crunch(x)

Yes, subtle but important distinction.
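
(For reference, itertools later grew exactly this lazy behavior as
chain.from_iterable, added in Python 2.6:)

from itertools import chain

# lazily consumes the outer generator, like the proposed ichain
for line in chain.from_iterable(file_lines(x) for x in all_filenames):
    crunch(line)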

4. functools enhancements (Haskell-inspired):
Let f be a function with 2 inputs. Then:
a) def flip(f): return lambda x,y: f(y,x)
b) def lsect(x,f): return partial(f,x)
c) def rsect(f,x): return partial(flip(f), x)

lsect and rsect allow making what Haskell calls "sections". Example:
# sequence of all squares less than 100
from operator import lt
s100 = takewhile(rsect(lt, 100), (x*x for x in count()))

Looks like they'd be useful, but I'm not sure about limiting them to
working with 2-argument functions only.


Alex
 
Paul Rubin

Alex Martelli said:
I'm +/-0 on this one vs the idioms:

with open(filename) as f:
    for line in f: crunch(line)

Making two lines into one is a weak use case for a stdlib function.

Well, the inspiration is being able to use the iterator in another
genexp:

for line in (ichain(file_lines(x) for x in all_filenames)):
    crunch(line)

so it's making more than two lines into one, and just flows more
naturally.

Alex Martelli said:
Looks like they'd be useful, but I'm not sure about limiting them to
working with 2-argument functions only.

I'm not sure how to generalize them but if there's an obvious correct
way to do it, that sounds great ;).
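
One possible generalization (hypothetical: fix trailing positional
arguments rather than assuming exactly two):

from itertools import count, takewhile
from operator import lt

def rsect(f, *fixed):
    # append the fixed arguments after whatever comes in later,
    # so rsect(lt, 100) behaves as "< 100"
    return lambda *args: f(*(args + fixed))

list(takewhile(rsect(lt, 100), (x*x for x in count())))
# -> [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]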

Also forgot to include the obvious:

def compose(f, g):
    return lambda *args, **kw: f(g(*args, **kw))
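
A quick usage check of compose as defined above:

clean = compose(str.upper, str.strip)
clean('  hello\n')  # -> 'HELLO'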
 
Alexander Schmolck

Alex Martelli said:
Looks like they'd be useful, but I'm not sure about limiting them to
working with 2-argument functions only.

How's

from mysterymodule import rsect
from operator import lt
takewhile(rsect(lt, 100), (x*x for x in count()))

better than

takewhile(lambda x:x<100, (x*x for x in count()))

Apart from boiler-plate creation and code-obfuscation purposes?

'as
 
rdhettinger

[Paul Rubin]
1. File iterator for blocks of chars:

f = open('foo')
for block in f.iterchars(n=1024): ...

for block in iter(partial(f.read, 1024), ''): ...
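
Spelled out with its import (the two-argument form of iter calls the
callable repeatedly until it returns the sentinel; crunch is a
stand-in):

from functools import partial

with open('foo') as f:
    for block in iter(partial(f.read, 1024), ''):
        crunch(block)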


[Paul Rubin]
a) def flip(f): return lambda x,y: f(y,x)

Curious resemblance to:

itemgetter(1,0)


Raymond
 
Paul Rubin

rdhettinger said:
for block in iter(partial(f.read, 1024), ''): ...

Hmm, nice. I keep forgetting about that feature of iter. It also came
up in a response to my queue example from another post.

Curious resemblance to:
itemgetter(1,0)

Not sure I understand that.
 
Klaas

Paul Rubin writes:

Not sure I understand that.

I think he read it as lambda (x, y): (y, x)
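
Concretely, itemgetter(1, 0) swaps a pair at the data level the way
flip swaps at the argument level:

from operator import itemgetter

swap = itemgetter(1, 0)
swap(('a', 'b'))  # -> ('b', 'a')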

More interesting would be functools.rshift/lshift, which would rotate
the positional arguments (with wrapping):

def f(a, b, c, d, e):
    ...

rshift(f, 3) --> g, where g(c, d, e, a, b) == f(a, b, c, d, e)

Still don't see much advantage over writing a lambda (except perhaps
speed).

-Mike
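
A minimal sketch of such an rshift (hypothetical helper, rotating with
wrapping as described):

def rshift(f, k):
    # rotate positional arguments right by k before calling f
    def g(*args):
        k2 = k % len(args)
        return f(*(args[k2:] + args[:k2]))
    return g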
 
Paul Rubin

Klaas said:
Still don't see much advantage over writing a lambda (except perhaps
speed).

Well, it's partly a matter of avoiding boilerplate, especially with
the lambdaphobia that many Python users seem to have.
 
