negative "counts" in collections.Counter?

V

Vlastimil Brom

Hi all,
I'd like to ask about the possibility of negative "counts" in
collections.Counter (using Python 3.1);
I believe, my usecase is rather trivial, basically I have the word
frequencies of two texts and I want do compare them (e.g. to see what
was added and removed between different versions of a text).

This is simple enough to do with own code, but I thought, this would
be exactly the case for Counter...
However, as the Counter only returns positive counts, one has to get
the difference in both directions and combine them afterwards, maybe
something like:

It seems to me, that with negative counts allowed in Counter, this
would simply be the matter of a single difference:
changed_c2 = c2 - c1

Is there a possibility to make the Counter work this way (other than
replacing its methods in a subclass, which might be comparable to
writing the naive counting class itself)?
Are there maybe some reasons I missed to disable negative counts here?
(As I could roughly understand, the Counter isn't quite limited to the
mathematical notion of multiset; it seems to accept negative counts,
but its methods only output the positive part).
Is this kind of task - a comparison in both directions - an unusual
one, or is it simply not the case for Counter?

Thanks in advance,
vbr
 
A

Arnaud Delobelle

Vlastimil Brom said:
Hi all,
I'd like to ask about the possibility of negative "counts" in
collections.Counter (using Python 3.1);
I believe, my usecase is rather trivial, basically I have the word
frequencies of two texts and I want do compare them (e.g. to see what
was added and removed between different versions of a text).

This is simple enough to do with own code, but I thought, this would
be exactly the case for Counter...
However, as the Counter only returns positive counts, one has to get
the difference in both directions and combine them afterwards, maybe
something like:


It seems to me, that with negative counts allowed in Counter, this
would simply be the matter of a single difference:
changed_c2 = c2 - c1

Is there a possibility to make the Counter work this way (other than
replacing its methods in a subclass, which might be comparable to
writing the naive counting class itself)?
Are there maybe some reasons I missed to disable negative counts here?
(As I could roughly understand, the Counter isn't quite limited to the
mathematical notion of multiset; it seems to accept negative counts,
but its methods only output the positive part).
Is this kind of task - a comparison in both directions - an unusual
one, or is it simply not the case for Counter?

Every time I have needed something like collections.Counter, I have
wanted the behaviour you require too. As a result, I have never used
collections.Counter. Instead I have used plain dictionaries or my own
class.

I don't understand why the Counter's + and - operators behave as they
do. Here is an example from the docs:
c = Counter(a=3, b=1)
d = Counter(a=1, b=2)
c + d # add two counters together: c[x] + d[x] Counter({'a': 4, 'b': 3})
c - d # subtract (keeping only positive counts) Counter({'a': 2})
c & d # intersection: min(c[x], d[x]) Counter({'a': 1, 'b': 1})
c | d # union: max(c[x], d[x])
Counter({'a': 3, 'b': 2})

If + and - just added or subtracted the multiplicities of elements,
keeping negative multiplicites, we would get:
Counter({'a':2, 'b':-1})

Which I think is useful in many cases. But we could still get the
result of current c - d very simply:
Counter({'a':2})

Altogether more versatile and coherent IMHO.
 
S

Steven D'Aprano

Vlastimil Brom said:
Hi all,
I'd like to ask about the possibility of negative "counts" in
collections.Counter (using Python 3.1); I believe, my usecase is rather
trivial, basically I have the word frequencies of two texts and I want
do compare them (e.g. to see what was added and removed between
different versions of a text).
[...]

Every time I have needed something like collections.Counter, I have
wanted the behaviour you require too. As a result, I have never used
collections.Counter. Instead I have used plain dictionaries or my own
class.

Vlastimil and Arnaud,

It looks like Counter already has negative counts in Python 3.1:
import collections
c = collections.Counter({'red': 2, 'green': 0, 'blue': -5})
c['blue'] -= 1
c Counter({'red': 2, 'green': 0, 'blue': -6})
c['blue'] += 1
c
Counter({'red': 2, 'green': 0, 'blue': -5})

But the + and - operators destroy negative and zero counts:
Counter({'red': 2})

I can't imagine what the use-case for that behaviour is.

Given that Counter supports negative counts, it looks to me that the
behaviour of __add__ and __sub__ is fundamentally flawed. You should
raise a bug report (feature enhancement) on the bug tracker.

http://bugs.python.org/

Given that this will change the documented behaviour, it will help if you
give a short, simple idiom for removing zero and negative elements,
Arnaud's trick with | Counter().

When you raise the report, please post an update here.
 
R

Raymond Hettinger

Given that Counter supports negative counts, it looks to me that the
behaviour of __add__ and __sub__ is fundamentally flawed. You should
raise a bug report (feature enhancement) on the bug tracker.

It isn't a bug. I designed it that way.
There were several possible design choices,
each benefitting different use cases.

FWIW, here is the reasoning behind the design.

The basic approach to Counter() is to be a dict subclass
that supplies zero for missing values. This approach
places almost no restrictions on what can be stored in it.
You can store floats, decimals, fractions, etc.
Numbers can be positive, negative, or zero.

This design leaves users with a good deal of flexibility,
but more importantly it keeps the API very close to
regular dictionaries (thus lowering the learning curve).
It also makes basic counter operations very fast.

The update() method differs from regular dictionaries
because dict.update() isn't very helpful in a counter context.
Instead, it allows one counter to update another. Like
the other basic counter operations, it places no restrictions
on type (i.e. it works on with ints, floats, decimals, fractions,
etc).
or on sign (positive, negative, or zero). The only requirement
is that the values support addition.

The basic API also adds some convenience methods.
The most_common() method tries to not be restrictive.
It only requires the count values be orderable.

For obvious reasons, the elements() method *does* have
restrictions and won't work with float values and won't
emit entries with non-positive counts.

Beyond the basic API listed above which is straight-forward,
the area where there were more interesting design choices
were the Counter-to-Counter operations (__add__, __sub__,
__or__, and __and__).

One possible choice (the one preferred by the OP) was to
has addition and subtraction be straight adds and subtracts
without respect to sign and to not support __and__ and __or__.
Straight addition was already supported via the update() method.
But no direct support was provided for straight subtractions
that leave negative values. Sorry about that.

Instead the choice was to implement the four methods as
multiset operations. As such, they need to correspond
to regular set operations. For example, with sets:

set('abc') - set('cde') # gives set('ab')

Notice how subtracting 'e' did not produce a negative result?
Now with multisets, we want the same result:

Counter({'a':1, 'b':1, 'c':1'}) - Counter({'c':1, 'd':1, 'e':1})

So the design decision was to support multiset operations
that produces only results with positive counts.

These multiset-style mathematical operations are discussed in:
Knuth's TAOCP Volume II section 4.6.3 exercise 19
and at http://en.wikipedia.org/wiki/Multiset .
Also C++ has multisets that support only positive counts.

So, there you have it. There design tries to be as
unrestrictive as possible. When a choice had to be
made, it favored multisets over behaviors supporting
negative values.

Fortunately, it is trivially easy to extend the class
to add any behavior you want.

class MyCounter(Counter):
def __sub__(self, other):
result = self.copy()
for elem, cnt in other.items():
result[elem] -= cnt

Hopes this gives you some insight into the design choices.


Raymond Hettinger
 
V

Vlastimil Brom

2010/3/8 Raymond Hettinger said:
It isn't a bug.  I designed it that way.
There were several possible design choices,
each benefitting different use cases.
...
One possible choice (the one preferred by the OP) was to
has addition and subtraction be straight adds and subtracts
without respect to sign and to not support __and__ and __or__.
Straight addition was already supported via the update() method.
But no direct support was provided for straight subtractions
that leave negative values.  Sorry about that.

Instead the choice was to implement the four methods as
multiset operations.  As such, they need to correspond
to regular set operations.  For example, with sets:

...
Hopes this gives you some insight into the design choices.


Raymond Hettinger
Thank you very much for the exhaustive explanation Raymond!
I very much suspected, there would be some exact reasoning behind it,
as these negative counts are explicitly handled in the code of these
methods.
I just happened to expect the straightforward addition and subtraction
and possibly the zero truncation or an exception in incompatible cases
(like elements() with floats, currently). This way also the negative
part would be available and the truncation possible.
I am by far not able to follow all of the mathematical background, but
even for zero-truncating multiset, I would expect the truncation on
input rather than on output of some operations. E.g. the following
seems surprising to me in context of the discarded negatives by
addition or subtraction:

Probably a kind of negative_update() or some better named method will
be handy, like the one you supplied or simply the current module code
without the newcount > 0: ... condition. Or would it be an option to
have a keyword argument like zero_truncate=False which would influence
this behaviour? Or is some other method than elements() incompatible
with negative counts?
(I actually can imagine comparing negative counts on both sides. e.g.
in a secondary comparison of two wordlists with specific removed items
- comparing to the "master" list.)
Additionally, were issubset and issuperset considered for this
interface (not sure whether symmetric_difference would be applicable)?

Anyway, I recognise, that I can easily use a custom class for these
tasks, if these usecases are rare or non-standard for this general
collection object.
Thanks again,
vbr
 
S

Steven D'Aprano

It isn't a bug. I designed it that way. There were several possible
design choices, each benefitting different use cases.

Thanks for the explanation Raymond. A few comments follow:

FWIW, here is the reasoning behind the design.

The basic approach to Counter() is to be a dict subclass that supplies
zero for missing values. This approach places almost no restrictions
on what can be stored in it. You can store floats, decimals, fractions,
etc. Numbers can be positive, negative, or zero.

Another way of using default values in a dict. That's five that I know
of: dict.get, dict.setdefault, dict.pop, collections.defaultdict, and
collections.Counter. And the Perl people criticise Python for having
"only one way to do it" *wink*

(That's not meant as a criticism, merely an observation.)


[...]
One possible choice (the one preferred by the OP) was to has addition
and subtraction be straight adds and subtracts without respect to sign
and to not support __and__ and __or__. Straight addition was already
supported via the update() method. But no direct support was provided
for straight subtractions that leave negative values. Sorry about that.

Would you consider a feature enhancement adding an additional method,
analogous to update(), to perform subtractions? I recognise that it's
easy to subclass and do it yourself, but there does seem to be some
demand for it, and it is an obvious feature given that Counter does
support negative counts.

Instead the choice was to implement the four methods as multiset
operations. As such, they need to correspond to regular set operations.

Personally, I think the behaviour of + and - would be far less surprising
if the class was called Multiset. Intuitively, one would expect counters
to be limited to ints, and to support negative counts when adding and
subtracting. In hindsight, do you think that Multiset would have been a
better name?
 
R

Raymond Hettinger

[Steven D'Aprano]
Thanks for the explanation Raymond. A few comments follow:

You're welcome :)

Would you consider a feature enhancement adding an additional method,
analogous to update(), to perform subtractions? I recognise that it's
easy to subclass and do it yourself, but there does seem to be some
demand for it, and it is an obvious feature given that Counter does
support negative counts.

Will continue to mull it over.

Instinct says that conflating two models can be worse for usability
than just picking one of the models and excluding the other.

If I had it to do over, there is a reasonable case that elementwise
vector methods (__add__, __sub__, and __mul__) may have been a more
useful choice than multiset methods (__add__, __sub__, __and__,
__or__).

That being said, the multiset approach was the one that was chosen.
It was indicated for people who have experience with bags or
multisets in other languages. It was also consistent with the naming
of the class as tool for counting things (i.e. it handles counting
numbers right out of the box). No explicit support is provided
for negative values, but it isn't actively hindered either.

For applications needing elementwise vector operations and signed
arithmetic, arguably they should be using a more powerful toolset,
perhaps supporting a full-range of elementwise binary and unary
operations and a dotproduct() method. Someone should write that
class and post it to the ASPN Cookbook to see if there is any uptake.

Personally, I think the behaviour of + and - would be far less surprising
if the class was called Multiset. Intuitively, one would expect counters
to be limited to ints, and to support negative counts when adding and
subtracting. In hindsight, do you think that Multiset would have been a
better name?

The primary use case for Counter() is to count things (using the
counting numbers).
The term Multiset is more obscure and only applies
to the four operations that eliminate non-positive results.
So, I'm somewhat happy with the current name.

FWIW, the notion of "what is surprising" often depends on the
observer's
background and on the problem they are currently trying to solve.
If you need negative counts, then Counter.__sub__() is surprising.
If your app has no notion of a negative count, then it isn't.
The docs, examples, and docstrings are very clear about the behavior,
so the "surprise" is really about wanting it to do something other
than what it currently does ;-)


Raymond
 
R

Raymond Hettinger

[Vlastimil Brom]
Thank you very much for the exhaustive explanation Raymond!

You're welcome.

I am by far not able to follow all of the mathematical background, but
even for zero-truncating multiset, I would expect the truncation on
input rather than on output of some operations.

I debated about this and opted for be-loose-in-receiving-and-strict-on-
output.
One thought is that use cases for multisets would have real multisets
as inputs (no negative counts) and as outputs. The user controls
the inputs, and the method only has a say in what its outputs are.

Also, truncating input would complicate the mathematical definition
of
what is happening. Compare:

r = a[x] - b[x]
if r > 0:
emit(r)

vs.

r = max(0, a[x]) - max(0, b[x])
if r > 0:
emit(r)

Also, the design parallels what is done in the decimal module
where rounding is applied only to the results of operations,
not to the inputs.

Probably a kind of negative_update()  or some better named method will
be handy, like the one you supplied or simply the current module code
without the newcount > 0: ... condition.

See my other post on this subject. There is no doubt that
such a method would be handy for signed arithmetic.
The question is whether conflating two different models hurts
the API more than it helps. Right now, the Counter() class
has no explicit support for negative values. It is
designed around natural numbers and counting numbers.

Or would it be an option to
have a keyword argument like zero_truncate=False which would influence
this behaviour?

Guido's thoughts on behavior flags is that they are usually a signal
that you need two different classes. That is why itertools has
ifilter() and ifilterfalse() or izip() and izip_longest() instead
of having behavior flags.

In this case, we have an indication that what you really want is
a separate class supporting elementwise binary and unary operations
on vectors (where the vector fields are accessed by a dictionary
key instead of a positional value).

Additionally, were issubset and issuperset considered for this
interface (not sure whether symmetric_difference would be applicable)?

If the need arises, these could be included. Right now, you
can get the same result with: "if a - b: ..."

FWIW, I never liked those two method names. Can't remember whether
a.issubset(b) means "a is a subset of b" or "b issubset of a'.


Raymond
 
G

Gregory Ewing

Raymond said:
Instead the choice was to implement the four methods as
multiset operations. As such, they need to correspond
to regular set operations.

Seems to me you're trying to make one data type do the
work of two, and ending up with something inconsistent.

I think you should be providing two types: one is a
multiset, which disallows negative counts altogether;
the other behaves like a sparse vector with appropriate
arithmetic operations.
 
V

Vlastimil Brom

2010/3/8 Raymond Hettinger <[email protected]>:
....
[snip detailed explanations]
...
In this case, we have an indication that what you really want is
a separate class supporting elementwise binary and unary operations
on vectors (where the vector fields are accessed by a dictionary
key instead of a positional value).



If the need arises, these could be included.  Right now, you
can get the same result with:  "if a - b: ..."

FWIW, I never liked those two method names.  Can't remember whether
a.issubset(b) means "a is a subset of b" or "b issubset of a'.


Raymond
Thanks for the further remarks Raymond,
initially I thought while investigating new features of python 3, this
would be a case for replacing the "home made" solutions with the
standard module functionality.
Now I can see, it probably wouldn't be an appropriate decision in this
case, as the expected usage of Counter with its native methods is
different.

As for the issubset, issuperset method names, I am glad, a far more
skilled person has the same problem like me :) In this case the
operators appear to be clearer than the method names...

regards,
vbr
 
R

Raymond Hettinger

[Gregory Ewing]
I think you should be providing two types: one is a
multiset, which disallows negative counts altogether;
the other behaves like a sparse vector with appropriate
arithmetic operations.


That is pretty close to what I said above:

"""
In this case, we have an indication that what you really want is a
separate class supporting elementwise binary and unary operations on
vectors (where the vector fields are accessed by a dictionary key
instead of a positional value).
"""

The first and foremost goal of the class was to provide a simple
counting capability along the lines of:

class Counter(dict):
def __init__(self, iterable=()):
for elem in iterable:
self[elem] += 1
def __missing__(self, key):
return 0

Note, there is no "disallowing negatives" or even any insistence
that the "counts" are actually numbers. In that respect, it was
a consenting adults class from the start. It is short, simple,
and fast.

The other methods were mostly just "extras" to make the tool
more useful. Each adds its own minimal restrictions:

elements() requires integer counts
most_common() requires that counts be orderable
multiset operations are defined for non-negative values

We could have kept this same design and wrapped every dictionary
method so that it would raise an exception for any count that was not
a positive integer. That would have slowed down the class, increased
the code volume, and precluded some uses that are currently possible.
IMO, that would not have been a win and it would not have helped the
OP who was seeking a more vector-like tool for signed-number
operations.

Anyway, it is what it is. The tool is released and most
users seem to be happy with what we have.


Raymond
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,153
Members
46,701
Latest member
XavierQ83

Latest Threads

Top