itertools.groupby

Jason Friedman

I have a file such as:

$ cat my_data
Starting a new group
a
b
c
Starting a new group
1
2
3
4
Starting a new group
X
Y
Z
Starting a new group

I am wanting a list of lists:
['a', 'b', 'c']
['1', '2', '3', '4']
['X', 'Y', 'Z']
[]

I wrote this:
------------------------------------
#!/usr/bin/python3
from itertools import groupby

def get_lines_from_file(file_name):
    with open(file_name) as reader:
        for line in reader.readlines():
            yield line.strip()

counter = 0
def key_func(x):
    if x.startswith("Starting a new group"):
        global counter
        counter += 1
    return counter

for key, group in groupby(get_lines_from_file("my_data"), key_func):
    print(list(group)[1:])
------------------------------------

I get the output I desire, but I'm wondering if there is a solution
without the global counter.
 
Steven D'Aprano

> I have a file such as:
> [...]
>
> I am wanting a list of lists:
> ['a', 'b', 'c']
> ['1', '2', '3', '4']
> ['X', 'Y', 'Z']
> []
>
> I wrote this: [...]
>
> I get the output I desire, but I'm wondering if there is a solution
> without the global counter.


I wouldn't use groupby. It's a hammer, not every grouping job is a nail.

Instead, use a simple accumulator:


def group(lines):
    accum = []
    for line in lines:
        line = line.strip()
        if line == 'Starting a new group':
            if accum:  # Don't bother if there are no accumulated lines.
                yield accum
                accum = []
        else:
            accum.append(line)
    # Don't forget the last group of lines.
    if accum: yield accum
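
For example, a quick check against your sample file (note that, as
written, this skips the final empty group):

with open('my_data') as f:
    for g in group(f):
        print(g)

which prints:

['a', 'b', 'c']
['1', '2', '3', '4']
['X', 'Y', 'Z']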
 
Joshua Landau

On 21 April 2013 01:13, Steven D'Aprano wrote:
> I wouldn't use groupby. It's a hammer, not every grouping job is a nail.
>
> Instead, use a simple accumulator:
>
> def group(lines): [...]

Whilst yours is the simplest bar Dennis Lee Bieber's, and nicer in that
it yields, neither of yours handles empty groups properly.

I recommend the simple change:

def group(lines):
    accum = None
    for line in lines:
        line = line.strip()
        if line == 'Starting a new group':
            if accum is not None:  # Don't bother if there are no accumulated lines.
                yield accum
            accum = []
        else:
            accum.append(line)
    # Don't forget the last group of lines.
    yield accum

But I will recommend my own small twist (because I think it is clever):

def group(lines):
    lines = (line.strip() for line in lines)

    if next(lines) != "Starting a new group":
        raise ValueError("First line must be 'Starting a new group'")

    while True:
        acum = []

        for line in lines:
            if line == "Starting a new group":
                break

            acum.append(line)

        else:
            # The for-loop ran out of lines without hitting a header,
            # so this is the last group: yield it and stop.
            yield acum
            break

        yield acum
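
Either version keeps the trailing empty group. A quick check, assuming
the sample file from the original post:

with open('my_data') as f:
    for g in group(f):
        print(g)

['a', 'b', 'c']
['1', '2', '3', '4']
['X', 'Y', 'Z']
[]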
 
Neil Cerutti

> I have a file such as:
> [...]
>
> I am wanting a list of lists:
> ['a', 'b', 'c']
> ['1', '2', '3', '4']
> ['X', 'Y', 'Z']
> []

Hrmmm, hoomm. Nobody cares for slicing any more.

def headered_groups(lst, header):
    b = lst.index(header) + 1
    while True:
        try:
            e = lst.index(header, b)
        except ValueError:
            yield lst[b:]
            break
        yield lst[b:e]
        b = e + 1

for group in headered_groups([line.strip() for line in open('data.txt')],
                             "Starting a new group"):
    print(group)
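
With the sample data from the original post saved as data.txt, that prints:

['a', 'b', 'c']
['1', '2', '3', '4']
['X', 'Y', 'Z']
[]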
 
Oscar Benjamin

> Hrmmm, hoomm. Nobody cares for slicing any more.
>
> def headered_groups(lst, header): [...]

This requires the whole file to be read into memory. Iterators are
typically preferred over list slicing for sequential text file access
since you can avoid loading the whole file at once. This means that
you can process a large file while only using a constant amount of
memory.
> for group in headered_groups([line.strip() for line in open('data.txt')],
>                              "Starting a new group"):
>     print(group)

The list comprehension above loads the entire file into memory.
Assuming that .strip() is just being used to remove the newline at the
end, it would be better to use read().splitlines(), since that loads
everything into memory and removes the newlines in one go (readlines()
keeps them). To remove them without reading everything you can use map
(or itertools.imap in Python 2):

with open('data.txt') as inputfile:
    for group in headered_groups(map(str.strip, inputfile),
                                 "Starting a new group"):
        print(group)
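
One caveat: headered_groups as written relies on list.index, which a
map object doesn't have, so you'd either need to wrap the map in
list(...) or switch to an iterator-friendly variant. A rough sketch
(essentially the accumulator approach from elsewhere in the thread;
the name is made up):

def iter_headered_groups(lines, header):
    # Accumulate lines between header lines; works on any iterable
    # and never needs the whole file in memory.
    group = None
    for line in lines:
        if line == header:
            if group is not None:
                yield group
            group = []
        elif group is not None:
            # Anything before the first header is ignored.
            group.append(line)
    if group is not None:
        yield group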


Oscar
 
Neil Cerutti

>> Hrmmm, hoomm. Nobody cares for slicing any more.
>>
>> def headered_groups(lst, header): [...]

> This requires the whole file to be read into memory. Iterators
> are typically preferred over list slicing for sequential text
> file access since you can avoid loading the whole file at once.
> This means that you can process a large file while only using a
> constant amount of memory.

I agree, but this application processes unknown-sized slices;
you have to build lists anyhow. I find slicing much more
convenient than accumulating in this case, but it's possibly a
tradeoff.
> with open('data.txt') as inputfile:
>     for group in headered_groups(map(str.strip, inputfile),
>                                  "Starting a new group"):
>         print(group)

Thanks, that's a nice improvement.
 
Chris Angelico

> Iterators are typically preferred over list slicing for sequential
> text file access since you can avoid loading the whole file at once.
> This means that you can process a large file while only using a
> constant amount of memory.

And, perhaps even more importantly, it allows you to pipe text in and
out. Obviously some operations (e.g. grep) lend themselves better to
this than others (e.g. sort), but with this approach it ought at least
to output each group as it comes.
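
For example, a sketch that reads from stdin, reusing the accumulator
shape from earlier in the thread (the script name below is made up):

import sys

def group(lines):
    # Accumulate lines between headers; emit each group as soon as
    # the next header (or end of input) arrives.
    accum = None
    for line in lines:
        line = line.strip()
        if line == 'Starting a new group':
            if accum is not None:
                yield accum
            accum = []
        elif accum is not None:
            accum.append(line)
    if accum is not None:
        yield accum

for g in group(sys.stdin):
    print(g)

so something like "cat my_data | python3 grouper.py" streams its
output group by group.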

ChrisA
 
