How to populate all possible hierarchical clusterings from a set of elements?

J

justin

The title sounds too complex, but my question is actually simple.

Suppose I have [1,2,3,4,5], then there are many ways of making
clustering.
Among them, I want to pair up terminals until there is only one left
at the end.
For example, ((((1,2),3),4),5), (1,(2,(3,(4,5)))), or (((1,2),(3,4)),
5) would be legitimate ones.

How do you think can I, using the modules of Python such as itertools
as much as possible, make all possible such clusterings?

Thanks in advance,
Justin.
 
P

Peter Otten

justin said:
The title sounds too complex, but my question is actually simple.

Suppose I have [1,2,3,4,5], then there are many ways of making
clustering.
Among them, I want to pair up terminals until there is only one left
at the end.
For example, ((((1,2),3),4),5), (1,(2,(3,(4,5)))), or (((1,2),(3,4)),
5) would be legitimate ones.

How do you think can I, using the modules of Python such as itertools
as much as possible, make all possible such clusterings?

Here's my first attempt:

def cluster(items):
if len(items) == 2:
yield items
return

for i in range(len(items)-1):
for c in cluster(items[:i] + (items[i:i+2],) + items[i+2:]):
yield c

def unique(items):
seen = set()
for item in items:
if item not in seen:
seen.add(item)
yield item

if __name__ == "__main__":
for item in unique(cluster(tuple("abcd"))):
print item

Unfortunately I get a lot of duplicates :(

You could define a kind of operator precedence using
itertools.combinations(), but I think it suffers from the same problem as

a 3 b 1 c 2 d

and

a 2 b 1 c 3 d

would both result in ((a, b), (c, d)).
 
D

DevPlayer

lst = [1, 2, 3, 4, 5]

def maketup(lst):
cur_item = lst[-1]
lst = lst[:-1]
if len(lst):
return maketup(lst), cur_item
else:
return cur_item

print maketup(lst)

((((1, 2), 3), 4), 5)

But I'm confused as to what you mean by :
Among them, I want to pair up terminals until there is only one left
at the end.

One what? one pair?, one terminal meaning one number?
 
D

DevPlayer

def maketup(lst):

if len(lst) == 1:
return lst[0]

elif len(lst) == 2:
return (lst[0],lst[1])

elif len(lst) > 2:
return ( (maketup(lst[:-2]), lst[-2]), lst[-1])

maketup(lst)

((((1, 2), 3), 4), 5)
 
A

Alain Ketterlin

justin said:
Suppose I have [1,2,3,4,5], then there are many ways of making
clustering.
Among them, I want to pair up terminals until there is only one left
at the end.

Are you trying "ascending hierarchical clustering" by any chance? In
that case you're supposed to use some kind of distance to select the
(unique) pair of elements to merge at each step.
For example, ((((1,2),3),4),5), (1,(2,(3,(4,5)))), or (((1,2),(3,4)),
5) would be legitimate ones.

How do you think can I, using the modules of Python such as itertools
as much as possible, make all possible such clusterings?

I don't know about itertools, but the basic idea is:

def clusterings(l):
if len(l) == 1:
print repr(l)
else:
n = len(l)
for i in xrange(n):
for j in xrange(i+1,n):
clusterings(l[:i]+l[i+1:j]+l[j+1:]+[[l,l[j]]])

Test this with:

import sys
clusterings([i for i in xrange(int(sys.argv[1]))])

Do you realize there are *many* such clusterings? (the exact number
should be (n!)*((n-1)!)/2^(n-1) given the code above, if I'm not
mistaken.)

-- Alain.
 
A

Alain Ketterlin

DevPlayer said:
def maketup(lst):

if len(lst) == 1:
return lst[0]

elif len(lst) == 2:
return (lst[0],lst[1])

elif len(lst) > 2:
return ( (maketup(lst[:-2]), lst[-2]), lst[-1])

The OP wants all binary trees over the elements, not just one.

-- Alain.
 
D

DevPlayer

tuple([ (tuple(lst[x-1:x+1]) if len(tuple(lst[x-1:x+1]))==2 else
lst[x-1]) for x in lst[::2]])

((1, 2), (3, 4), 5)


# or
x = ((tuple(lst[x-1:x+1]) if len(tuple(lst[x-1:x+1]))==2 else
lst[x-1]) for x in lst[::2])
x.next()
(1, 2)
x.next()
(3, 4)
x.next()
5
 
R

Richard Thomas

justin said:
Suppose I have [1,2,3,4,5], then there are many ways of making
clustering.
Among them, I want to pair up terminals until there is only one left
at the end.

Are you trying "ascending hierarchical clustering" by any chance? In
that case you're supposed to use some kind of distance to select the
(unique) pair of elements to merge at each step.
For example, ((((1,2),3),4),5), (1,(2,(3,(4,5)))), or (((1,2),(3,4)),
5) would be legitimate ones.
How do you think can I, using the modules of Python such as itertools
as much as possible, make all possible such clusterings?

I don't know about itertools, but the basic idea is:

def clusterings(l):
    if len(l) == 1:
        print repr(l)
    else:
        n = len(l)
        for i in xrange(n):
            for j in xrange(i+1,n):
                clusterings(l[:i]+l[i+1:j]+l[j+1:]+[[l,l[j]]])

Test this with:

import sys
clusterings([i for i in xrange(int(sys.argv[1]))])

Do you realize there are *many* such clusterings? (the exact number
should be (n!)*((n-1)!)/2^(n-1) given the code above, if I'm not
mistaken.)

-- Alain.


Actually the number of such "clusterings" is the (n-1)th Catalan
number.

http://en.wikipedia.org/wiki/Catalan_numbers

Chard.
 
A

Alain Ketterlin

Richard Thomas said:
On Jan 13, 10:02 am, Alain Ketterlin <[email protected]>
def clusterings(l):
    if len(l) == 1:
        print repr(l)
    else:
        n = len(l)
        for i in xrange(n):
            for j in xrange(i+1,n):
                clusterings(l[:i]+l[i+1:j]+l[j+1:]+[[l,l[j]]])
[...] there are *many* such clusterings? (the exact number should be
(n!)*((n-1)!)/2^(n-1) given the code above, if I'm not mistaken.)

Actually the number of such "clusterings" is the (n-1)th Catalan
number.

http://en.wikipedia.org/wiki/Catalan_numbers

I don't think Catalan numbers exactly captures this number. As far as I
remember (and wikipedia seems to confirm this), Cn is the number of ways
you can repeatedly apply a binary operator to a sequence of objects,
where sequence means that the objects are ordered, which is not the case
here. To use wikipedia's example, C3 is 5 because you can do:

((ab)c)d (a(bc))d (ab)(cd) a((bc)d) a(b(cd))

If we list clusterings we can also have:

((ac)b)d ((ac)d)b (ac)(bd) ...

Actually, for each of the 5 "catalan expressions" above, you have 4!
valid permutations of the objects (leading to a complete count of
n!*C(n-1)). But this leads to many "duplicates", because (ab)(cd) and
(cd)(ab) are considered the same.

I just realize that the code I've given above also produces duplicates
(in particular, the example I've just used). At least, my counting was
correct w.r.t. the code :) The plot thickens...

-- Alain.
 
R

Richard Thomas

Richard Thomas said:
def clusterings(l):
    if len(l) == 1:
        print repr(l)
    else:
        n = len(l)
        for i in xrange(n):
            for j in xrange(i+1,n):
                clusterings(l[:i]+l[i+1:j]+l[j+1:]+[[l,l[j]]])
[...] there are *many* such clusterings? (the exact number should be
(n!)*((n-1)!)/2^(n-1) given the code above, if I'm not mistaken.)

Actually the number of such "clusterings" is the (n-1)th Catalan
number.


I don't think Catalan numbers exactly captures this number. As far as I
remember (and wikipedia seems to confirm this), Cn is the number of ways
you can repeatedly apply a binary operator to a sequence of objects,
where sequence means that the objects are ordered, which is not the case
here. To use wikipedia's example, C3 is 5 because you can do:

((ab)c)d (a(bc))d (ab)(cd) a((bc)d) a(b(cd))

If we list clusterings we can also have:

((ac)b)d ((ac)d)b (ac)(bd) ...

Actually, for each of the 5 "catalan expressions" above, you have 4!
valid permutations of the objects (leading to a complete count of
n!*C(n-1)). But this leads to many "duplicates", because (ab)(cd) and
(cd)(ab) are considered the same.

I just realize that the code I've given above also produces duplicates
(in particular, the example I've just used). At least, my counting was
correct w.r.t. the code :) The plot thickens...

-- Alain.


Okay, I misunderstood the problem, sorry about that. This makes it
rather hard to define a nice recurrence relation for the number of
such clusterings:

C(1) = 1
C(2n+1) = Sigma(1; n; choose(2n+1, r) * C(r) * C(2n+1-r))
C(2n) = Sigma(1; n-1; choose(2n, r) * C(r) * C(2n-r))
+ choose(2n, n) * C(n) * C(n) / 2

See, very ugly. I can't reduce it to anything workable so I just
computed it. Clearly its more than exponential. Some values:

In [1]: [cluster(n) for n in xrange(1, 21)]
Out[1]:
[1,
1,
3,
15,
105,
945,
10395,
135135,
2027025,
34459425,
654729075,
13749310575L,
316234143225L,
7905853580625L,
213458046676875L,
6190283353629375L,
191898783962510625L,
6332659870762850625L,
221643095476699771875L,
8200794532637891559375L]

Anyway, I'm done counting things for now.

Chard.
 
A

Arnaud Delobelle

Peter Otten said:
justin said:
The title sounds too complex, but my question is actually simple.

Suppose I have [1,2,3,4,5], then there are many ways of making
clustering.
Among them, I want to pair up terminals until there is only one left
at the end.
For example, ((((1,2),3),4),5), (1,(2,(3,(4,5)))), or (((1,2),(3,4)),
5) would be legitimate ones.

How do you think can I, using the modules of Python such as itertools
as much as possible, make all possible such clusterings?

Here's my first attempt:

def cluster(items):
if len(items) == 2:
yield items
return

for i in range(len(items)-1):
for c in cluster(items[:i] + (items[i:i+2],) + items[i+2:]):
yield c

def unique(items):
seen = set()
for item in items:
if item not in seen:
seen.add(item)
yield item

if __name__ == "__main__":
for item in unique(cluster(tuple("abcd"))):
print item

more simply:

def clusters(l):
if len(l) == 1:
yield l[0]
return
for i in range(1, len(l)):
for left in clusters(l[:i]):
for right in clusters(l[i:]):
yield (left, right)

That would give all solutions without duplicates. In fact, this is
simply finding all full binary trees which order l.

However, somewhere else in the thread someone claims that no ordering is
implied on the initial list of items (which is not at all obvious from
the OP).
 
P

Peter Otten

Arnaud said:
more simply:

def clusters(l):
if len(l) == 1:
yield l[0]
return
for i in range(1, len(l)):
for left in clusters(l[:i]):
for right in clusters(l[i:]):
yield (left, right)

That would give all solutions without duplicates. In fact, this is
simply finding all full binary trees which order l.

Easy, now that I see it ;)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,994
Messages
2,570,223
Members
46,813
Latest member
lawrwtwinkle111

Latest Threads

Top