Good use for itertools.dropwhile and itertools.takewhile

Chris Angelico

takewhile mines for gold at the start of a sequence; dropwhile drops the dross at the start of a sequence.
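
As a quick illustration of that description, here is a minimal sketch run over one of the product lines discussed later in the thread. The `is_all_caps` predicate is borrowed from MRAB's post further down; applying it here is an assumption made only for the example.

```python
from itertools import dropwhile, takewhile

def is_all_caps(word):
    # Illustrative predicate: true when upper-casing changes nothing.
    return word == word.upper()

words = "CAPSICUM RED fresh from QLD".split()

# takewhile keeps leading items for as long as the predicate holds.
head = list(takewhile(is_all_caps, words))
# dropwhile discards those same leading items and yields everything after.
tail = list(dropwhile(is_all_caps, words))

print(head)
print(tail)
```

Note that QLD stays in the tail even though it is all caps: dropwhile stops testing once the predicate has failed for the first time.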

When you're using both over the same sequence and with the same
condition, it seems odd that you need to iterate over it twice.
Perhaps a partitioning iterator would be cleaner - something like
this:

def partitionwhile(predicate, iterable):
    iterable = iter(iterable)
    while True:
        val = next(iterable)
        if not predicate(val): break
        yield val
    raise StopIteration # Signal the end of Phase 1
    for val in iterable: yield val # or just "yield from iterable", I think

Only the cold hard boot of reality just stomped out the spark of an
idea. Once StopIteration has been raised, that's it, there's no
"resuming" the iterator. Is there a way around that? Is there a clean
way to say "Done for now, but next time you ask, there'll be more"?

I tested it on Python 3.2 (yeah, time I upgraded, I know).

ChrisA
 
Neil Cerutti

When you're using both over the same sequence and with the same
condition, it seems odd that you need to iterate over it twice.
Perhaps a partitioning iterator would be cleaner - something
like this:

def partitionwhile(predicate, iterable):
    iterable = iter(iterable)
    while True:
        val = next(iterable)
        if not predicate(val): break
        yield val
    raise StopIteration # Signal the end of Phase 1
    for val in iterable: yield val # or just "yield from iterable", I think

Only the cold hard boot of reality just stomped out the spark
of an idea. Once StopIteration has been raised, that's it,
there's no "resuming" the iterator. Is there a way around that?
Is there a clean way to say "Done for now, but next time you
ask, there'll be more"?

I tested it on Python 3.2 (yeah, time I upgraded, I know).

Well, shoot! Then this is a job for groupby, not takewhile.

def prod_desc(s):
    """split s into product name and product description.
    ['CAR FIFTY TWO', 'Chrysler LeBaron.']
    ['MR. JONESEY', "Saskatchewan's finest"]
    ['', 'no product name?']
    ['NO DESCRIPTION', '']
    """
    prod = ''
    desc = ''
    for k, g in itertools.groupby(s.split(),
            key=lambda w: any(c.islower() for c in w)):
        a = ' '.join(g)
        if k:
            desc = a
        else:
            prod = a
    return [prod, desc]

This has no way to preserve odd white space which could break
evil product name differences.
 
Mark Lawrence

I tested it on Python 3.2 (yeah, time I upgraded, I know).

Bad move, fancy wanting to go to the completely useless version of
Python that simply can't handle unicode properly :)
 
Ian Kelly

Well, shoot! Then this is a job for groupby, not takewhile.

The problem with groupby is that you can't just limit it to two groups.
['QLD', 'fresh from']

Once you've got a false key from the groupby, you would need to
pretend that any subsequent groups are part of the false group and
tack them on.
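
To make the problem concrete, here is a small sketch of what groupby produces for a line whose description ends in an all-caps word. The key function mirrors the lambda in Neil's post; the name `has_lower` and the sample line are chosen here purely for illustration.

```python
import itertools

def has_lower(word):
    # Same test as the lambda in prod_desc: any lowercase character.
    return any(c.islower() for c in word)

words = "CAPSICUM RED fresh from QLD".split()

# Materialise each group so the three runs can be inspected.
groups = [(key, list(group))
          for key, group in itertools.groupby(words, key=has_lower)]

for key, group in groups:
    print(key, group)
```

Three groups come back rather than two, so a loop that simply assigns the latest false group to the product ends up overwriting the real product name with the trailing QLD.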
 
Ian Kelly

When you're using both over the same sequence and with the same
condition, it seems odd that you need to iterate over it twice.
Perhaps a partitioning iterator would be cleaner - something like
this:

def partitionwhile(predicate, iterable):
    iterable = iter(iterable)
    while True:
        val = next(iterable)
        if not predicate(val): break
        yield val
    raise StopIteration # Signal the end of Phase 1
    for val in iterable: yield val # or just "yield from iterable", I think

Only the cold hard boot of reality just stomped out the spark of an
idea. Once StopIteration has been raised, that's it, there's no
"resuming" the iterator. Is there a way around that? Is there a clean
way to say "Done for now, but next time you ask, there'll be more"?

Return two separate iterators, with the contract that the second
iterator can't be used until the first has completed. Combined with
Neil's groupby suggestion, we end up with something like this:

def partitionwhile(predicate, iterable):
    it = itertools.groupby(iterable, lambda x: bool(predicate(x)))
    pushback = missing = object()
    def first():
        nonlocal pushback
        # The default guards against an empty iterable, where a bare
        # next(it) would raise StopIteration inside this generator.
        pred, subit = next(it, (False, ()))
        if pred:
            yield from subit
            pushback = None
        else:
            pushback = subit
    def second():
        if pushback is missing:
            raise TypeError("can't yield from second iterator before "
                            "first iterator completes")
        elif pushback is not None:
            yield from pushback
        yield from itertools.chain.from_iterable(subit for key, subit in it)
    return first(), second()
['CAPSICUM RED', 'fresh from QLD']
 
Nick Mellor

Hi Neil,

Here's some sample data. The live data is about 300 minor variations on the sample data, about 20,000 lines.

Nick

Notes:

1. Whitespace is only used for word boundaries. Surplus whitespace is not significant and can be stripped

2. Retain punctuation and parentheses

3. Product is zero or more words in all caps at start of line

4. Description is zero or more words beginning with first word that is not all caps. Description continues to the end of the line

5. Return tuple of strings (product, description)


Sample data
---

BEANS hand picked
BEETROOT certified organic
BOK CHOY (bunch)
BROCCOLI Mornington Peninsula
BRUSSEL SPROUTS
CABBAGE green
CABBAGE Red
CAPSICUM RED
CARROTS
CARROTS loose
CARROTS juicing, certified organic
CARROTS Trentham, large seconds, certified organic
CARROTS Trentham, firsts, certified organic
CAULIFLOWER
CELERY Mornington Peninsula IPM grower
CELERY Mornington Peninsula IPM grower
CUCUMBER
EGGPLANT
FENNEL
GARLIC (from Argentina)
GINGER fresh uncured
KALE (bunch)
KOHL RABI certified organic
LEEKS
LETTUCE iceberg
MUSHROOM cup or flat
MUSHROOM Swiss brown
ONION brown
ONION red
ONION spring (bunch)
PARSNIP, certified organic
POTATOES certified organic
POTATOES Sebago
POTATOES Desiree
POTATOES Bullarto chemical free
POTATOES Dutch Cream
POTATOES Nicola
POTATOES Pontiac
POTATOES Otway Red
POTATOES teardrop
PUMPKIN certified organic
SCHALLOTS brown
SNOW PEAS
SPINACH I'll try to get certified organic (bunch)
SWEET POTATO gold certified organic
SWEET POTATO red small
SWEDE certified organic
TOMATOES Qld
TURMERIC fresh certified organic
ZUCCHINI
APPLES Harcourt Pink Lady, Fuji, Granny Smith
APPLES Harcourt 2 kg bags, Pink Lady or Fuji (bag)
AVOCADOS
AVOCADOS certified organic, seconds
BANANAS Qld, organic
GRAPEFRUIT
GRAPES crimson seedless
KIWI FRUIT Qld certified organic
LEMONS
LIMES
MANDARINS
ORANGES Navel
PEARS Beurre Bosc Harcourt new season
PEARS Packham, Harcourt new season
SULTANAS 350g pre-packed bags
EGGS Melita free range, Barker's Creek
BASIL (bunch)
CORIANDER (bunch)
DILL (bunch)
MINT (bunch)
PARSLEY (bunch)
 
MRAB

When you're using both over the same sequence and with the same
condition, it seems odd that you need to iterate over it twice.
Perhaps a partitioning iterator would be cleaner - something like
this:

def partitionwhile(predicate, iterable):
    iterable = iter(iterable)
    while True:
        val = next(iterable)
        if not predicate(val): break
        yield val
    raise StopIteration # Signal the end of Phase 1
    for val in iterable: yield val # or just "yield from iterable", I think

Only the cold hard boot of reality just stomped out the spark of an
idea. Once StopIteration has been raised, that's it, there's no
"resuming" the iterator. Is there a way around that? Is there a clean
way to say "Done for now, but next time you ask, there'll be more"?
Perhaps you could have some kind of partitioner object:

class Partitioner:
    _SENTINEL = object()

    def __init__(self, iterable):
        self._iterable = iter(iterable)
        self._unused_item = self._SENTINEL

    def takewhile(self, condition):
        if self._unused_item is not self._SENTINEL:
            if not condition(self._unused_item):
                # A plain return ends the generator; since PEP 479
                # (Python 3.7) "raise StopIteration" here would turn
                # into a RuntimeError.
                return

            yield self._unused_item
            self._unused_item = self._SENTINEL

        for item in self._iterable:
            if not condition(item):
                # Remember the item that failed so remainder() can yield it.
                self._unused_item = item
                break

            yield item

    def remainder(self):
        if self._unused_item is not self._SENTINEL:
            yield self._unused_item
            self._unused_item = self._SENTINEL

        for item in self._iterable:
            yield item

def is_all_caps(word):
    return word == word.upper()

part = Partitioner("CAPSICUM RED fresh from QLD".split())
product = " ".join(part.takewhile(is_all_caps))
description = " ".join(part.remainder())
print([product, description])
 
Neil Cerutti

Hi Neil,

Here's some sample data. The live data is about 300 minor
variations on the sample data, about 20,000 lines.

Thanks, Nick.

This slight variation on my first groupby try seems to work for
the test data.

def prod_desc(s):
    prod = []
    desc = []
    for k, g in itertools.groupby(s.split(),
            key=lambda w: any(c.islower() for c in w)):
        if prod or k:
            desc.extend(g)
        else:
            prod.extend(g)
    return [' '.join(prod), ' '.join(desc)]
 
Nick Mellor

Neil,

Further down the data, found another edge case:

"Spring ONION from QLD"

Following the spec, the whole line should be description (description starts at first word that is not all caps.) This case breaks the latest groupby.

N
 
Neil Cerutti

Neil,

Further down the data, found another edge case:

"Spring ONION from QLD"

Following the spec, the whole line should be description
(description starts at first word that is not all caps.) This
case breaks the latest groupby.

A-ha! I did check your samples for the case of an empty product
name and, not finding any, started to think it couldn't happen.

Change

if prod or k:

to

if desc or prod or k:

If this data file gets any weirder, let me know. ;)
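
For reference, with that one-line change applied, the groupby version would read roughly as follows. This is a sketch assembled from the two posts above, not a separately posted listing; the sample calls use lines already quoted in the thread.

```python
import itertools

def prod_desc(s):
    prod = []
    desc = []
    for k, g in itertools.groupby(
            s.split(), key=lambda w: any(c.islower() for c in w)):
        # Once anything has gone into the description (or the product is
        # already filled), every later group belongs to the description.
        if desc or prod or k:
            desc.extend(g)
        else:
            prod.extend(g)
    return [' '.join(prod), ' '.join(desc)]

print(prod_desc("Spring ONION from QLD"))
print(prod_desc("CAPSICUM RED fresh from QLD"))
print(prod_desc("BRUSSEL SPROUTS"))
```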
 
Vlastimil Brom

2012/12/5 Nick Mellor said:
Neil,

Further down the data, found another edge case:

"Spring ONION from QLD"

Following the spec, the whole line should be description (description starts at first word that is not all caps.) This case breaks the latest groupby.

N

Hi,
Just for completeness..., it (likely) can be done using regex (given
the current specification), but if the data are even more complex and
varying, tools like pyparsing or dedicated parsing functions might
be more appropriate;

hth,
vbr:

>>> test_product_data = """BEANS hand picked
.... BEETROOT certified organic
.... BOK CHOY (bunch)
.... BROCCOLI Mornington Peninsula
.... BRUSSEL SPROUTS
.... CABBAGE green
.... CABBAGE Red
.... CAPSICUM RED
.... CARROTS
.... CARROTS loose
.... CARROTS juicing, certified organic
.... CARROTS Trentham, large seconds, certified organic
.... CARROTS Trentham, firsts, certified organic
.... CAULIFLOWER
.... CELERY Mornington Peninsula IPM grower
.... CELERY Mornington Peninsula IPM grower
.... CUCUMBER
.... EGGPLANT
.... FENNEL
.... GARLIC (from Argentina)
.... GINGER fresh uncured
.... KALE (bunch)
.... KOHL RABI certified organic
.... LEEKS
.... LETTUCE iceberg
.... MUSHROOM cup or flat
.... MUSHROOM Swiss brown
.... ONION brown
.... ONION red
.... ONION spring (bunch)
.... PARSNIP, certified organic
.... POTATOES certified organic
.... POTATOES Sebago
.... POTATOES Desiree
.... POTATOES Bullarto chemical free
.... POTATOES Dutch Cream
.... POTATOES Nicola
.... POTATOES Pontiac
.... POTATOES Otway Red
.... POTATOES teardrop
.... PUMPKIN certified organic
.... SCHALLOTS brown
.... SNOW PEAS
.... SPINACH I'll try to get certified organic (bunch)
.... SWEET POTATO gold certified organic
.... SWEET POTATO red small
.... SWEDE certified organic
.... TOMATOES Qld
.... TURMERIC fresh certified organic
.... ZUCCHINI
.... APPLES Harcourt Pink Lady, Fuji, Granny Smith
.... APPLES Harcourt 2 kg bags, Pink Lady or Fuji (bag)
.... AVOCADOS
.... AVOCADOS certified organic, seconds
.... BANANAS Qld, organic
.... GRAPEFRUIT
.... GRAPES crimson seedless
.... KIWI FRUIT Qld certified organic
.... LEMONS
.... LIMES
.... MANDARINS
.... ORANGES Navel
.... PEARS Beurre Bosc Harcourt new season
.... PEARS Packham, Harcourt new season
.... SULTANAS 350g pre-packed bags
.... EGGS Melita free range, Barker's Creek
.... BASIL (bunch)
.... CORIANDER (bunch)
.... DILL (bunch)
.... MINT (bunch)
.... PARSLEY (bunch)
.... Spring ONION from QLD"""
>>> len(test_product_data.splitlines())
72

>>> for prod_item in re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?<! )(?=[\s,]|$)))?(?: *(.*))?$", test_product_data): print(prod_item)
....
('BEANS', 'hand picked')
('BEETROOT', 'certified organic')
('BOK CHOY', '(bunch)')
('BROCCOLI', 'Mornington Peninsula')
('BRUSSEL SPROUTS', '')
('CABBAGE', 'green')
('CABBAGE', 'Red')
('CAPSICUM RED', '')
('CARROTS', '')
('CARROTS', 'loose')
('CARROTS', 'juicing, certified organic')
('CARROTS', 'Trentham, large seconds, certified organic')
('CARROTS', 'Trentham, firsts, certified organic')
('CAULIFLOWER', '')
('CELERY', 'Mornington Peninsula IPM grower')
('CELERY', 'Mornington Peninsula IPM grower')
('CUCUMBER', '')
('EGGPLANT', '')
('FENNEL', '')
('GARLIC', '(from Argentina)')
('GINGER', 'fresh uncured')
('KALE', '(bunch)')
('KOHL RABI', 'certified organic')
('LEEKS', '')
('LETTUCE', 'iceberg')
('MUSHROOM', 'cup or flat')
('MUSHROOM', 'Swiss brown')
('ONION', 'brown')
('ONION', 'red')
('ONION', 'spring (bunch)')
('PARSNIP', ', certified organic')
('POTATOES', 'certified organic')
('POTATOES', 'Sebago')
('POTATOES', 'Desiree')
('POTATOES', 'Bullarto chemical free')
('POTATOES', 'Dutch Cream')
('POTATOES', 'Nicola')
('POTATOES', 'Pontiac')
('POTATOES', 'Otway Red')
('POTATOES', 'teardrop')
('PUMPKIN', 'certified organic')
('SCHALLOTS', 'brown')
('SNOW PEAS', '')
('SPINACH', "I'll try to get certified organic (bunch)")
('SWEET POTATO', 'gold certified organic')
('SWEET POTATO', 'red small')
('SWEDE', 'certified organic')
('TOMATOES', 'Qld')
('TURMERIC', 'fresh certified organic')
('ZUCCHINI', '')
('APPLES', 'Harcourt Pink Lady, Fuji, Granny Smith')
('APPLES', 'Harcourt 2 kg bags, Pink Lady or Fuji (bag)')
('AVOCADOS', '')
('AVOCADOS', 'certified organic, seconds')
('BANANAS', 'Qld, organic')
('GRAPEFRUIT', '')
('GRAPES', 'crimson seedless')
('KIWI FRUIT', 'Qld certified organic')
('LEMONS', '')
('LIMES', '')
('MANDARINS', '')
('ORANGES', 'Navel')
('PEARS', 'Beurre Bosc Harcourt new season')
('PEARS', 'Packham, Harcourt new season')
('SULTANAS', '350g pre-packed bags')
('EGGS', "Melita free range, Barker's Creek")
('BASIL', '(bunch)')
('CORIANDER', '(bunch)')
('DILL', '(bunch)')
('MINT', '(bunch)')
('PARSLEY', '(bunch)')
('', 'Spring ONION from QLD')
>>> len(re.findall(r"(?m)(?=^.+$)^ *(?:([A-Z ]+\b(?<! )(?=[\s,]|$)))?(?: *(.*))?$", test_product_data))
72
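
Since the one-line pattern is dense, here is the same regex restated in verbose mode with a comment per piece. It is intended to be equivalent to the pattern above, but that equivalence has only been checked on the handful of lines shown; the name `PRODUCT_LINE` and the short sample are assumptions for the example.

```python
import re

PRODUCT_LINE = re.compile(r"""
    (?=^.+$)              # only consider non-empty lines
    ^[ ]*                 # optional leading spaces
    (?:
        (                 # group 1: the product name
            [A-Z ]+       #   capital letters and spaces
            \b(?<![ ])    #   ending at a word boundary, not on a space
            (?=[\s,]|$)   #   followed by whitespace, a comma or end of line
        )
    )?
    (?:[ ]*(.*))?         # group 2: the rest of the line is the description
    $
""", re.MULTILINE | re.VERBOSE)

sample = "\n".join([
    "BEANS hand picked",
    "CABBAGE Red",
    "CAPSICUM RED",
    "PARSNIP, certified organic",
    "Spring ONION from QLD",
])

rows = PRODUCT_LINE.findall(sample)
for row in rows:
    print(row)
```

The look-ahead for a comma is what keeps the comma out of the PARSNIP product name, and the word-boundary check is what stops the capital R of "Red" from being pulled into the product.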
 
Alexander Blinne

On 05.12.2012 18:04, Nick Mellor wrote:
Sample data

Well let's see what

def split_product(p):
    p = p.strip()
    w = p.split(" ")
    try:
        j = next(i for i,v in enumerate(w) if v.upper() != v)
    except StopIteration:
        return p, ''
    return " ".join(w[:j]), " ".join(w[j:])

(which I still find a very elegant solution) has to say about those
sample data:
.... print(split_product(line))
('BEANS', 'hand picked')
('BEETROOT', 'certified organic')
('BOK CHOY', '(bunch)')
('BROCCOLI', 'Mornington Peninsula')
('BRUSSEL SPROUTS', '')
('CABBAGE', 'green')
('CABBAGE', 'Red')
('CAPSICUM RED', '')
('CARROTS', '')
('CARROTS', 'loose')
('CARROTS', 'juicing, certified organic')
('CARROTS', 'Trentham, large seconds, certified organic')
('CARROTS', 'Trentham, firsts, certified organic')
('CAULIFLOWER', '')
('CELERY', 'Mornington Peninsula IPM grower')
('CELERY', 'Mornington Peninsula IPM grower')
('CUCUMBER', '')
('EGGPLANT', '')
('FENNEL', '')
('GARLIC', '(from Argentina)')
('GINGER', 'fresh uncured')
('KALE', '(bunch)')
('KOHL RABI', 'certified organic')
('LEEKS', '')
('LETTUCE', 'iceberg')
('MUSHROOM', 'cup or flat')
('MUSHROOM', 'Swiss brown')
('ONION', 'brown')
('ONION', 'red')
('ONION', 'spring (bunch)')
('PARSNIP,', 'certified organic')
('POTATOES', 'certified organic')
('POTATOES', 'Sebago')
('POTATOES', 'Desiree')
('POTATOES', 'Bullarto chemical free')
('POTATOES', 'Dutch Cream')
('POTATOES', 'Nicola')
('POTATOES', 'Pontiac')
('POTATOES', 'Otway Red')
('POTATOES', 'teardrop')
('PUMPKIN', 'certified organic')
('SCHALLOTS', 'brown')
('SNOW PEAS', '')
('SPINACH', "I'll try to get certified organic (bunch)")
('SWEET POTATO', 'gold certified organic')
('SWEET POTATO', 'red small')
('SWEDE', 'certified organic')
('TOMATOES ', 'Qld')
('TURMERIC', 'fresh certified organic')
('ZUCCHINI', '')
('APPLES', 'Harcourt Pink Lady, Fuji, Granny Smith')
('APPLES', 'Harcourt 2 kg bags, Pink Lady or Fuji (bag)')
('AVOCADOS', '')
('AVOCADOS', 'certified organic, seconds')
('BANANAS', 'Qld, organic')
('GRAPEFRUIT', '')
('GRAPES', 'crimson seedless')
('KIWI FRUIT', 'Qld certified organic')
('LEMONS', '')
('LIMES', '')
('MANDARINS', '')
('ORANGES', 'Navel')
('PEARS', 'Beurre Bosc Harcourt new season')
('PEARS', 'Packham, Harcourt new season')
('SULTANAS', '350g pre-packed bags')
('EGGS', "Melita free range, Barker's Creek")
('BASIL', '(bunch)')
('CORIANDER', '(bunch)')
('DILL', '(bunch)')
('MINT', '(bunch)')
('PARSLEY', '(bunch)')
('', 'Spring ONION from QLD')

I think the only thing one is left to think about is the
('PARSNIP,', 'certified organic')
case. What about that extra comma? Perhaps it could even be considered
an "error" in the original data? I don't see a good general way to deal
with those which does not have to handle trailing punctuation on the
product name explicitly as a special case.

Greetings
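
One more detail in the output above is the trailing space in ('TOMATOES ', 'Qld'). That would be consistent with a doubled space in that input line, which is an assumption, since the raw line is not shown here. Splitting on a literal " " keeps an empty string for each extra space, whereas a bare split() collapses runs of whitespace, in line with note 1 of the spec. A sketch of that variant:

```python
def split_product(p):
    # split() with no argument drops leading, trailing and repeated
    # whitespace, so no empty "words" survive.
    w = p.split()
    # Index of the first word that is not all caps; len(w) if there is none.
    j = next((i for i, v in enumerate(w) if v.upper() != v), len(w))
    return " ".join(w[:j]), " ".join(w[j:])

print(split_product("TOMATOES  Qld"))
print(split_product("  BRUSSEL SPROUTS "))
print(split_product("Spring ONION from QLD"))
```

Passing a default to next() also removes the need for the try/except around StopIteration.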
 
Vlastimil Brom

2012/12/6 Neil Cerutti said:
I'm not sure on this one.

Well, I wasn't either when I noticed this item, but given the specification
"2. Retain punctuation and parentheses"
in one of the OP's previous messages, I figured the punctuation had
better be part of the description rather than the name in this case.

regards,
vbr
 
Paul Rubin

Nick Mellor said:
I came across itertools.dropwhile only today, then shortly afterwards
found Raymond Hettinger wondering, in 2007, whether to drop [sic]
dropwhile and takewhile from the itertools module....
Almost nobody else of the 18 respondents seemed to be using them.

What? I'm amazed by that. I didn't bother reading the old thread, but
I use those functions fairly frequently. I just used takewhile the
other day, processing a timestamped log file where I wanted to look at
certain clusters of events. I won't post the actual code here, but
takewhile was a handy way to pull out intervals of interest after an
event was seen.
 
