Set operations on arrays of non-trivial objects

J

Jon Egil Stand

# MAIN QUESTION

# Is there a nice and fast way to do set operations on arrays of
# non-trivial objects.


# BACKGROUND

# Set operations are beatiful:

a = [1,2,3,4,5]
b = [2,4,6,8,10]

(a & b)
# => [2,4]


# When comparing lists, which is more or less what administration of
# pension schemes is all about, this is very, very nice.

# I very often use stuff like
(a - (a&b))

# => [1,3,5]

# The array library is blazingly fast, I'm really happy with it.


# The challenge arises when my lists don't comprise of fixnums, but instead of
# more specified objects. To simplify:

class Person
def initialize(name, ssid)
@name = name
@ssid = ssid
end

attr_reader :name, :ssid
end

list = []
list << Person.new('Peter Zapffe', 1)
list << Person.new('Peter Pan', 2)
list << Person.new('Saint Peter', 3)

# Let's say I want to UNION that to (b = [2,4,6,8,10]) from above, using ssid
# as key.It should in that case return the 'Peter Pan'-person, since he is the
# only one with an ssid included i the list.

(list & b)
# => []

# I can do stuff like:

list_ssid = list.collect{|p| p.ssid}
# => [1,2,3]

union = list_ssid & b
# => [2]

list.select{|p| union.include? p.ssid}
# => [#<Person:0x2b8b300 @name="Peter Pan", @ssid=2>]


# This works, and correctness is always nice, but it doesn't scale very nice,
# basicly because union.include? scans the union-list from scratch every time.

# I have implemented this before, by sorting both lists and stepping through
# them one at a time. That's still correct and much faster, but I have
a feeling
# it's possible to do it in a much more elegant way, using some sort of
# set-operations.

# I would appreciate any hints and or pointers.

# Thank's for reading.
# JE
 
R

Robert Klemme

# MAIN QUESTION

# Is there a nice and fast way to do set operations on arrays of
# non-trivial objects.


# BACKGROUND

# Set operations are beatiful:

a = [1,2,3,4,5]
b = [2,4,6,8,10]

(a & b)
# => [2,4]


# When comparing lists, which is more or less what administration of
# pension schemes is all about, this is very, very nice.

# I very often use stuff like
(a - (a&b))

# => [1,3,5]

# The array library is blazingly fast, I'm really happy with it.


# The challenge arises when my lists don't comprise of fixnums, but
instead of
# more specified objects. To simplify:

class Person
def initialize(name, ssid)
@name = name
@ssid = ssid
end

attr_reader :name, :ssid
end

list = []
list << Person.new('Peter Zapffe', 1)
list << Person.new('Peter Pan', 2)
list << Person.new('Saint Peter', 3)

# Let's say I want to UNION that to (b = [2,4,6,8,10]) from above, using
ssid
# as key.It should in that case return the 'Peter Pan'-person, since he
is the
# only one with an ssid included i the list.

(list & b)
# => []

# I can do stuff like:

list_ssid = list.collect{|p| p.ssid}
# => [1,2,3]

union = list_ssid & b
# => [2]

list.select{|p| union.include? p.ssid}
# => [#<Person:0x2b8b300 @name="Peter Pan", @ssid=2>]

You can do this simpler:

list.select {|p| b.include? p.ssid}

Note, make b a Set for more efficiency (see below).
# This works, and correctness is always nice, but it doesn't scale very
nice,
# basicly because union.include? scans the union-list from scratch
every time.

# I have implemented this before, by sorting both lists and stepping
through
# them one at a time. That's still correct and much faster, but I have
a feeling
# it's possible to do it in a much more elegant way, using some sort of
# set-operations.

# I would appreciate any hints and or pointers.

# Thank's for reading.
# JE

First, when you work with sets you should probably use Set - if your
sets get large you will notice the speed difference.

irb(main):002:0> a = [1,2,3,4,5].to_set
=> #<Set: {5, 1, 2, 3, 4}>
irb(main):003:0> b = [2,4,6,8,10].to_set
=> #<Set: {6, 2, 8, 4, 10}>
irb(main):004:0> a & b
=> #<Set: {2, 4}>

Secondly, it seems you're basically creating subsets based on some
criteria. If you just had a single key, you could define your class's
equivalence (#hash, #eql?, #==) to reflect that and use the approach you
tried, i.e. using set unions.

The ad hoc solution is to actually use #select as you did. But if the
criteria change, you rather need a structure that resembles an RDBMS
table which can have any number of indexes. If you have to do this
multiple times probably also with varying criteria you might want to
create a wrapper around a Set that can support arbitrary number of
indexes (Hashes) and keeps those consistent much like an RDBMS does.

You might find some more information in the archives of this list.

Kind regards

robert
 
J

Jon Egil Stand

Robert

Thank you, I will heed your advice. Set is a nice library, and it's
use of Hash for blazingly fast lookup scales very well.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,226
Members
46,815
Latest member
treekmostly22

Latest Threads

Top