ANN: equivalence 0.1

G

George Sakkis

Equivalence is a class that can be used to maintain a partition of
objects into equivalence sets, making sure that the equivalence
properties (reflexivity, symmetry, transitivity) are preserved. Two
objects x and y are considered equivalent either implicitly (through a
key function) or explicitly by calling merge(x,y).

Get it from pypi: http://pypi.python.org/pypi/equivalence/

Example
=======
Say that you are given a bunch of URLs you want to download and
eventually process somehow. These urls may contain duplicates, either
exact or leading to a page with the same content (e.g. redirects,
plagiarized pages, etc.). What you'd like is identify duplicates in
advance so that you can process only unique pages. More formally, you
want to partition the given URLs into equivalence sets and pick a
single representative from each set.

Getting rid of identical URLs is trivial. A more general case of URLs
that can be easily identified as duplicates can be based on some
simple regular expression based heuristics, so that for instance
'http://python.org/doc/' and 'www.python.org/doc/index.html' are
deemed equivalent. For this case you may have a normalize(url)
function that reduces a URL into its "stem" (e.g. 'python.org/doc')
and use this as a key for deciding equivalence.

This is fine but it still leaves quite a few URLs that cannot be
recognized as duplicates with simple heuristics. For these harder
cases you may have one or more "oracles" (an external database, a page
comparison program, or ultimately a human) that decides whether pages
x and y are equivalent. You can integrate such oracles by explicitly
declaring objects as equivalent using Equivalence.merge(x,y).

Both implicit (key-based) and explicit information are combined to
maintain the equivalence sets. For instance:
'http://python.org/doc/
index.html')
You can find more about the API in the included docs and the unittest
file.

Regards,
George
 
G

Giuseppe Ottaviano

Equivalence is a class that can be used to maintain a partition of
objects into equivalence sets, making sure that the equivalence
properties (reflexivity, symmetry, transitivity) are preserved. Two
objects x and y are considered equivalent either implicitly (through a
key function) or explicitly by calling merge(x,y).

I think this library would be very useful, and I like the interface
(in particular the idea of the projecting function), but I think the
algorithm is less than optimal. It can show quadratic behavior in some
cases:

$ python -m timeit -s 'import equivalence' -s 'eq =
equivalence.Equivalence()' 'for i in xrange(10000): eq.merge(i, i+1)'
10 loops, best of 3: 57.6 msec per loop

$ python -m timeit -s 'import equivalence' -s 'eq =
equivalence.Equivalence()' 'for i in xrange(10000): eq.merge(i+1, i)'
10 loops, best of 3: 2.59 sec per loop

Have you considered using the Union-Find algorithm, which would be
(almost) linear in all cases?

-- Giuseppe
 
G

George Sakkis

I think this library would be very useful, and I like the interface (in
particular the idea of the projecting function),

Thanks for trying it out!
but I think the algorithm
is less than optimal. It can show quadratic behavior in some cases:

$ python -m timeit -s 'import equivalence' -s 'eq =
equivalence.Equivalence()' 'for i in xrange(10000): eq.merge(i, i+1)'
10 loops, best of 3: 57.6 msec per loop

$ python -m timeit -s 'import equivalence' -s 'eq =
equivalence.Equivalence()' 'for i in xrange(10000): eq.merge(i+1, i)'
10 loops, best of 3: 2.59 sec per loop

Interesting.. it took me a while to figure out why the second case is
so much slower and you're right, it is indeed quadratic. I don't know
how likely would such pathological cases be in practice, given that
the preferred way to merge a batch of objects is
eq.merge(*xrange(10001)), which is more than 3 times faster than the
non-pathologic first case (and which I optimized even further to avoid
attribute lookups within the loop so it's more like 5 times faster
now). Also the batch version in this case remains linear even if you
merge backwards, eq.merge(*xrange(10000,-1,-1)), or in any order for
that matter.
Have you considered using the Union-Find algorithm, which would be (almost)
linear in all cases?

I am familiar with it and I will certainly consider it for the next
version; for now I was primarily interested in functionality (API) and
correctness.

Thanks,
George
 
G

Giuseppe Ottaviano

Interesting.. it took me a while to figure out why the second case is
so much slower and you're right, it is indeed quadratic. I don't know
how likely would such pathological cases be in practice, given that
the preferred way to merge a batch of objects is
eq.merge(*xrange(10001)), which is more than 3 times faster than the
non-pathologic first case (and which I optimized even further to avoid
attribute lookups within the loop so it's more like 5 times faster
now). Also the batch version in this case remains linear even if you
merge backwards, eq.merge(*xrange(10000,-1,-1)), or in any order for
that matter.

The example just showed what could happen if the merges are done in
pathological order, it is not about batch merging. I think that
pathological cases like this indeed show up in real cases: many
algorithms of near duplicate elimination and clustering reduce to
finding connected components of a graph whose edges are given as a
stream, so you can't control their order.
With this implementation, every time a component sized N is given a
second (or following) argument to merge, you pay Omega(N).
I am familiar with it and I will certainly consider it for the next
version; for now I was primarily interested in functionality (API) and
correctness.

Good :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,819
Latest member
masterdaster

Latest Threads

Top