Counting number of each item in a list.

sophie_newbie · Mar 19, 2006

I have a list a little something like this:

StringA
StringC
StringB
StringA
StringC
StringD
StringA
....
etc.

Basically I was wondering if there was an easy way to return how many
of each string are in the list, something like this:

StringA - 3
StringB - 1
StringC - 2
StringD - 1

I suppose that the easiest way to do that is to convert it to a 2
dimensional array? Is there any easy way?

Thanks.

Paul Rubin · Mar 19, 2006

sophie_newbie said:
I suppose that the easiest way to do that is to convert it to a 2
dimensional array? Is there any easy way?

Use a hash table. Untested:

h = 0
for x in your_list:
h[x] = h.get(x, 0) + 1

You can then use something like sorted(h.items()) to get a sorted list
of those counts.

sophie_newbie · Mar 19, 2006

Hey Bruno,

I got an invalid syntax error when i tried using your "str_counts =
dict((s, str_list.count(s) for s in set(str_list))" bit of code? Maybe
there is a missing bracket or comma? Or maybe I need to import
something.

Thanks so much for your help.

sophie_newbie · Mar 19, 2006

Hi Paul,

Ur bit of code works, thanks so much, just for anyone reading this use
'h = {}' instead of h = 0.

Thanks

Kent Johnson · Mar 19, 2006

sophie_newbie said:
Hey Bruno,

I got an invalid syntax error when i tried using your "str_counts =
dict((s, str_list.count(s) for s in set(str_list))" bit of code? Maybe
there is a missing bracket or comma? Or maybe I need to import
something.

It should be
str_counts = dict((s, str_list.count(s)) for s in set(str_list))

or for Python < 2.4
str_counts = dict([(s, str_list.count(s)) for s in set(str_list)])

Note that this solution iterates str_list once for each element of
str_list - the call to count traverses the entire list to create the
count. I expect Paul Rubin's solution will be dramatically faster for
large lists as it only iterates str_list once.

Kent

Bruno Desthuilliers · Mar 19, 2006

sophie_newbie a écrit :

I have a list a little something like this:

StringA
StringC
StringB
StringA
StringC
StringD
StringA
...
etc.

Basically I was wondering if there was an easy way to return how many
of each string are in the list, something like this:

StringA - 3
StringB - 1
StringC - 2
StringD - 1

There is.

str_list = ['StringA', 'StringC', 'StringB',
'StringA', 'StringC', 'StringD',
'StringA' ]

str_counts = dict((s, str_list.count(s) for s in set(str_list))

bearophileHUGS · Mar 19, 2006

With CPython this is probably the faster version, the version with
list.count() has to be used only if you are sure to always have a short
input list, because it's O(n^2):

str_list = ['StringA', 'StringC', 'StringB',
'StringA', 'StringC', 'StringD',
'StringA' ]

str_counts = {}
for s in str_list:
if s in str_counts:
str_counts += 1
else:
str_counts = 1

Bye,
bearophile

Paul Rubin · Mar 19, 2006

Bruno Desthuilliers said:
And of course, I was right. My solution seems to be faster than Paul's
one (but slower than bearophile's), be it on small, medium or large
lists.

nb: A is mine, B is Paul's and C is bearophile's, and the number after
is the size of the list...

Interesting. I wonder if you could try it with a much larger number
of distinct values in the list.

Justin Azoff · Mar 19, 2006

Bruno said:
And of course, I was right. My solution seems to be faster than Paul's
one (but slower than bearophile's), be it on small, medium or large lists.

Your version is only fast on lists with a very small number of unique
elements.

changing mklist to have
items = range(64) instead of the 9 item list and re-timing you will get
"better" results:

A100 (10000 times): 7.63829684258
B100 (10000 times): 1.34028482437
C100 (10000 times): 0.812223911285

A10000 (100 times): 9.78499102592
B10000 (100 times): 1.26520299911
C10000 (100 times): 0.857560873032

A1000000 (10 times): 87.6713900566
B1000000 (10 times): 12.7302949429
C1000000 (10 times): 8.35931396484

Bruno Desthuilliers · Mar 19, 2006

sophie_newbie a écrit :

Hey Bruno,

I got an invalid syntax error when i tried using your "str_counts =
dict((s, str_list.count(s) for s in set(str_list))" bit of code? Maybe
there is a missing bracket or comma?

Yes, sorry, see Kent's post for the correction.

Or maybe I need to import
something.

Nope, unless you're running python < 2.4

Bruno Desthuilliers · Mar 19, 2006

Kent Johnson a écrit :

It should be
str_counts = dict((s, str_list.count(s)) for s in set(str_list))

Of course, my bad

or for Python < 2.4

from sets import Set as set

str_counts = dict([(s, str_list.count(s)) for s in set(str_list)])

Note that this solution iterates str_list once for each element of
str_list

once for each *distinct* element or str_list (it iterates over the set
created from str_list).

- the call to count traverses the entire list to create the
count. I expect Paul Rubin's solution will be dramatically faster for
large lists as it only iterates str_list once.

Yeps. But I would benchmark it before choosing one or the other solution.

Bruno Desthuilliers · Mar 19, 2006

Bruno Desthuilliers a écrit :

Kent Johnson a écrit :

It should be
str_counts = dict((s, str_list.count(s)) for s in set(str_list))

Click to expand...

Of course, my bad

or for Python < 2.4

Click to expand...

from sets import Set as set

str_counts = dict([(s, str_list.count(s)) for s in set(str_list)])

Note that this solution iterates str_list once for each element of
str_list

Click to expand...

once for each *distinct* element or str_list (it iterates over the set
created from str_list).

- the call to count traverses the entire list to create the count. I
expect Paul Rubin's solution will be dramatically faster for large
lists as it only iterates str_list once.

Click to expand...

Yeps. But I would benchmark it before choosing one or the other solution.

And of course, I was right. My solution seems to be faster than Paul's
one (but slower than bearophile's), be it on small, medium or large lists.

nb: A is mine, B is Paul's and C is bearophile's, and the number after
is the size of the list...

A100 (10000 times): 1.5801050663
B100 (10000 times): 1.87287902832
C100 (10000 times): 0.991976976395
A10000 (100 times): 1.083589077
B10000 (100 times): 1.30713891983
C10000 (100 times): 0.988032817841
A1000000 (10 times): 10.5345788002
B1000000 (10 times): 13.094493866
C1000000 (10 times): 9.67438292503

source:

def countA(lst):
return dict((s, lst.count(s)) for s in set(lst))

def countB(lst):
counts = {}
for s in lst:
counts = counts.get(s, 0)+1
return counts

def countC(lst):
counts = {}
for s in lst:
if s in counts:
counts += 1
else:
counts = 1
return counts

def mklst(ln):
from random import choice
items = ['abc', 'def', 'ghi', 'jkl', 'mno',
'pqr', 'stu', 'vwx', 'yz']
return [choice(items) for i in range(ln)]

lst100 = mklst(100)
lst10000 = mklst(10000)
lst1000000 = mklst(1000000)

def run():
from timeit import Timer

timers = [
('A100',
Timer('countA(lst100)',
'from __main__ import countA, lst100'),
10000),
('B100',
Timer('countB(lst100)',
'from __main__ import countB, lst100'),
10000),
('C100',
Timer('countC(lst100)',
'from __main__ import countC, lst100'),
10000),
('A10000',
Timer('countA(lst10000)',
'from __main__ import countA, lst10000'),
100),
('B10000',
Timer('countB(lst10000)',
'from __main__ import countB, lst10000'),
100),
('C10000',
Timer('countC(lst10000)',
'from __main__ import countC, lst10000'),
100),
('A1000000',
Timer('countA(lst1000000)',
'from __main__ import countA, lst1000000'),
10),
('B1000000',
Timer('countB(lst1000000)',
'from __main__ import countB, lst1000000'),
10),
('C1000000',
Timer('countC(lst1000000)',
'from __main__ import countC, lst1000000'),
10),
]

for name, timer, repeat in timers:
print "%s (%s times): %s" % (name,
repeat,
timer.timeit(repeat))

Bruno Desthuilliers · Mar 19, 2006

Paul Rubin a écrit :

Interesting. I wonder if you could try it with a much larger number
of distinct values in the list.

Please do, this could be interesting...

Bruno Desthuilliers · Mar 19, 2006

Justin Azoff a écrit :

Your version is only fast on lists with a very small number of unique
elements.

changing mklist to have
items = range(64) instead of the 9 item list and re-timing you will get
"better" results:

A100 (10000 times): 7.63829684258
B100 (10000 times): 1.34028482437
C100 (10000 times): 0.812223911285

A10000 (100 times): 9.78499102592
B10000 (100 times): 1.26520299911
C10000 (100 times): 0.857560873032

A1000000 (10 times): 87.6713900566
B1000000 (10 times): 12.7302949429
C1000000 (10 times): 8.35931396484

Lol !-)

So much for my benchmarking skills...

Bruno Desthuilliers · Mar 19, 2006

Justin Azoff a écrit :

Bruno Desthuilliers wrote:

Your version is only fast on lists with a very small number of unique
elements.

changing mklist to have
items = range(64) instead of the 9 item list and re-timing you will get
"better" results:

A100 (10000 times): 7.63829684258
B100 (10000 times): 1.34028482437
C100 (10000 times): 0.812223911285

A10000 (100 times): 9.78499102592
B10000 (100 times): 1.26520299911
C10000 (100 times): 0.857560873032

A1000000 (10 times): 87.6713900566
B1000000 (10 times): 12.7302949429
C1000000 (10 times): 8.35931396484

Lol ! So much for my benchmarking skills :-/

per9000 · Mar 20, 2006

Dear counters,

I have also encountered the above problem, and I frequently use the two
blocks of code below.

The first one is commented already - except that this version does not
start from an empty dict.

The second "inverts" the first one. Meaning that a dict in word2hits
style:

my_dict['the'] = 120
my_dict['word'] = 4
....
my_dict['gene'] = 4

becomes a dict in hits2words style:

new_dict[4] = ['word','gene', ...]
new_dict[120] = ['the', ...]

I used these to count genes in in abstracts from pubmed the when I
"invented" them (but it took some time to remove them from the code and
use them as functions).

def dict_add(indict,inlist,init=1):
for item in inlist:
if indict.has_key(item):
indict[item] += 1
else:
indict[item] = init
return indict
# end def

def dict_invert(indict):
new_dict = {}
for key in indict.keys():
if new_dict.has_key(indict[key]):
new_dict[indict[key]].append(key)
else:
new_dict[indict[key]] = [key]
return new_dict
#end def

A good idea could be to change the header of the first one (if you want
the option to start counting from zero) into:
def dict_add(inlist,indict={},init=1):

/P9K

Peter Otten · Mar 20, 2006

per9000 said:
A good idea could be to change the header of the first one (if you want
the option to start counting from zero) into:
def dict_add(inlist,indict={},init=1):

make that

def dict_add(inlist, indict=None, init=1):
if indict is None:
indict = {}

See "5. Mutable default arguments" at
http://zephyrfalcon.org/labs/python_pitfalls.html for an explanation.

Peter

Add a list of videos each one in a different button in a web page	1	Dec 10, 2022
Is there a way to input a unique number for each array output?	4	Aug 31, 2022
I made a blockchain and want to make a cryptocurrency, but my code doesn't verify hash of each block	2	Jun 2, 2024
Decoding no of ways and printing each decode message	2	Jun 1, 2021
Average of MultiMode of a list of a list	1	Oct 28, 2022
Insert item before each element of a list	22	Oct 8, 2012
Count each unique element in list of lists	1	Nov 8, 2013
Neatest way to do a case insensitive "in"?	11	Mar 13, 2009

Counting number of each item in a list.

sophie_newbie

Paul Rubin

sophie_newbie

sophie_newbie

Kent Johnson

Bruno Desthuilliers

bearophileHUGS

Paul Rubin

Justin Azoff

Bruno Desthuilliers

Bruno Desthuilliers

Bruno Desthuilliers

Bruno Desthuilliers

Bruno Desthuilliers

Bruno Desthuilliers

per9000

Peter Otten

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads