How to get non-unique elements from an array?

M

Mark Van Holstyn

------=_Part_70355_30315321.1129531111646
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

Zach, if an element repeats more that once, itll double up in the output
array. need to add a .uniq on the end

irb(main):165:0> a =3D [1,2,3,4,5,3,5,3,5]
=3D> [1, 2, 3, 4, 5, 3, 5, 3, 5]
irb(main):166:0> t=3D[]; a.delete_if{ |e| r=3D(not t.include? e); t.push(e)=
; r }
=3D> [3, 5, 3, 5]

That aside, this one is a little shorter, and seems to benchmark a touch
faster as well.

irb(main):180:0> a=3D[1,2,3,4,5,3,5,3,5]
=3D> [1, 2, 3, 4, 5, 3, 5, 3, 5]
irb(main):181:0> t=3D[]; a.select{ |e| r=3Dt.include?e; t<<e; r }.uniq
=3D> [3, 5]
irb(main):182:0> a
=3D> [1, 2, 3, 4, 5, 3, 5, 3, 5]

Mark

------=_Part_70355_30315321.1129531111646--
 
M

Martin DeMello

Sam Kong said:
Hello!

I need to get non-unique elements from an array.
The best I came up with was using a hash as a counter for each unique
elements.

a = [0,1,2,3,4,5,2,3]

#What I want to get is [2,3] as 2,3 are non-unique elements.

Without counting:

h1 = {}
h2 = {}
a.each {|i|
h2 = true if h1
h1 = true
}

p h2.keys

martin
 
R

Relm

I need to get non-unique elements from an array.
The best I came up with was using a hash as a counter for each unique
elements.

a = [0,1,2,3,4,5,2,3]

#What I want to get is [2,3] as 2,3 are non-unique elements.

Two variants using regular expressions:

eval(a.inspect.gsub(/\b(\d+),(?!.*\b\1\b)/,'')).uniq

a.sort.inspect.scan(/\b(\d+)(, \1\b)+/).map{|x|x[0].to_i}
 
P

pauldacus

Young grasshoppers... let us never forget the power of the REGEX! Let
us not bother with counters or blocks... let us GSUB this bad boy!

a = [0,1,2,3,4,5,2,3]

a.sort.to_s.gsub(/(.)(\1)+/, '').split(//)

=>["0", "1", "4", "5"]

Sure... it returns the elements as strings, but a quick "to_i" call in
the inevitable iterator that is coming solves that. This thing is also
subject to a Schwartzian Transform of the "map-sort-map" type.

Paul Dacus
 
S

Simon Strandgaard

Young grasshoppers... let us never forget the power of the REGEX! Let
us not bother with counters or blocks... let us GSUB this bad boy!

a =3D [0,1,2,3,4,5,2,3]

a.sort.to_s.gsub(/(.)(\1)+/, '').split(//)

=3D>["0", "1", "4", "5"]


no need to sort.. :)

a =3D [0, 1, 2, 3, 4, 5, 2, 3]
a.to_s.delete(a.to_s.gsub(/(.)(?!.*\1)/,'')).split('').map{|i|i.to_i}
#=3D> [0, 1, 4, 5]
 
P

pauldacus

Well, until someone catches the horrible gaff on my last try by solving
the EXACT OPPOSITE PROBLEM WHILE BEING CONDESCENDING (my wife is used
to this), I'll keep trying:

Of course, the following is what works, finding duplicates ELEMENTS:

a.sort.to_s.scan(/(.)(\1)+/).flatten.uniq

Yeah... I guess it's clear this finds individual duplicate ELEMENTS,
not numbers. Works on single digits/letters/etc, blows up on all else.
How's that for useless?

What would do the trick is a 'non-greedy' array subtraction:

Normal(greedy): [0,1,2,3,4,5,2,3] - [0,1,2,3,4,5] = []
Non-greedy: [0,1,2,3,4,5,2,3] - [0,1,2,3,4,5] = [2,3]

Seems like there has got to be a way of solving this w/o counters or
blocks...
 
R

Ryan Leavengood

Young grasshoppers... let us never forget the power of the REGEX! Let
us not bother with counters or blocks... let us GSUB this bad boy!

a =3D [0,1,2,3,4,5,2,3]

a.sort.to_s.gsub(/(.)(\1)+/, '').split(//)

=3D>["0", "1", "4", "5"]


no need to sort.. :)

a =3D [0, 1, 2, 3, 4, 5, 2, 3]
a.to_s.delete(a.to_s.gsub(/(.)(?!.*\1)/,'')).split('').map{|i|i.to_i}
#=3D> [0, 1, 4, 5]

Both are interesting solutions, but you are returning the unique
elements of the array, not the non-unique ones, which was the original
problem. I say this not to be pedantic, but to see how you would solve
it.

Ryan
 
S

Simon Strandgaard

Young grasshoppers... let us never forget the power of the REGEX! Le= t
us not bother with counters or blocks... let us GSUB this bad boy!

a =3D [0,1,2,3,4,5,2,3]

a.sort.to_s.gsub(/(.)(\1)+/, '').split(//)

=3D>["0", "1", "4", "5"]


no need to sort.. :)

a =3D [0, 1, 2, 3, 4, 5, 2, 3]
a.to_s.delete(a.to_s.gsub(/(.)(?!.*\1)/,'')).split('').map{|i|i.to_i}
#=3D> [0, 1, 4, 5]

Both are interesting solutions, but you are returning the unique
elements of the array, not the non-unique ones, which was the original
problem. I say this not to be pedantic, but to see how you would solve
it.

Sorry should be without the delete..

a =3D [0, 1, 2, 3, 4, 5, 2, 3]
a.to_s.gsub(/(.)(?!.*\1)/,'').split('').map{|i|i.to_i}
#=3D> [2, 3]
 
J

Jeff Wood

How 'bout

a =3D [ 0,1,2,3,4,5,2,3 ]
a.uniq.inject(a.dup) { |b,i| b.delete_at(b.index(i)) ; b }

Young grasshoppers... let us never forget the power of the REGEX! = Let
us not bother with counters or blocks... let us GSUB this bad boy!

a =3D [0,1,2,3,4,5,2,3]

a.sort.to_s.gsub(/(.)(\1)+/, '').split(//)

=3D>["0", "1", "4", "5"]


no need to sort.. :)

a =3D [0, 1, 2, 3, 4, 5, 2, 3]
a.to_s.delete(a.to_s.gsub(/(.)(?!.*\1)/,'')).split('').map{|i|i.to_i}
#=3D> [0, 1, 4, 5]

Both are interesting solutions, but you are returning the unique
elements of the array, not the non-unique ones, which was the original
problem. I say this not to be pedantic, but to see how you would solve
it.

Sorry should be without the delete..

a =3D [0, 1, 2, 3, 4, 5, 2, 3]
a.to_s.gsub(/(.)(?!.*\1)/,'').split('').map{|i|i.to_i}
#=3D> [2, 3]
 
P

pauldacus

That '?!' zero width lookahead regex is nice, Simon.

A workable alternative, with nary a block or hash counter to offend the
eye:

a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq

This works for all numbers, not just single digits. Returns a string,
which you will 'to_i' in an iterator. Replace the \d with \w for
something to find dup strings... I think.
 
S

Sam Kong

Hi!

I'm the OP.
I thank those who replied to my message.
I learned a lot and seeing all different approaches was fun.

Somebody emailed me with his own solution which I like very much.
I want to share it with the group.

a = [0,1,2,3,4,5,2,3]
a.select{|e| a.index(e) != a.rindex(e)}.uniq

It is very brief and still very easy to understand.
Don't you think so?

I wonder if my slightly modified version works.

a.select{|e| a.grep(e).length > 1}.uniq

grep is for pattern but seems to work fine for integers.
I'm not sure though. (Simple tests show it works.)


In my opinion, using regex is too much for this problem (maybe
slower?).
Array#inject seems to be good but I get confused whenever I come across
it (maybe I'm not smart enough).

What do you think?

Thanks all!

Sam
 
R

Ryan Leavengood

I wonder if my slightly modified version works.

It works, but is considerably slower (I've included a few other recent
submissions as well):

require 'benchmark'
include Benchmark

a =3D [0,1,2,3,4,5,2,3]

BMN =3D 50000
bm(5) do |bx|
bx.report('index ') { BMN.times { a.select{|e| a.index(e) !=3D
a.rindex(e)}.uniq } }
bx.report('grep ') { BMN.times { a.select{|e| a.grep(e).length > 1}.uniq=
} }
bx.report('scan ') { BMN.times { a.sort.join('
').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq } }
bx.report('inject') { BMN.times { a.uniq.inject(a.dup) { |b,i|
b.delete_at(b.index(i)) ; b } } }
bx.report('gsub ') { BMN.times {
a.to_s.gsub(/(.)(?!.*\1)/,'').split('').map{|i|i.to_i} } }
end

Results on my laptop:

user system total real
index 1.281000 0.000000 1.281000 ( 1.281000)
grep 3.813000 0.015000 3.828000 ( 3.827000)
scan 3.515000 0.000000 3.515000 ( 3.531000)
inject 1.594000 0.000000 1.594000 ( 1.593000)
gsub 3.063000 0.016000 3.079000 ( 3.078000)

Whoever created that index and rindex based solution created a very
fast one. I'll leave it as an exercise for the reading to benchmark
all the other solutions.

Ryan
 
P

pauldacus

That is a good one.

Personally, I was curious though when you said you thought regex-ing it
was slower or 'too much', if it really would be on huge arrays.

And so a test:

(REGEX) a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq
vs.
(INDEX) a.select{|e| a.index(e) != a.rindex(e)}.uniq

I ran these on an array with 11,000 elements, 2,000 of which were
duplicates, as follows:

(1..10000).to_a + (2000..3000).to_a

The scan (REGEX) script finished in 0.21 seconds
The (INDEX) script ran in 28.77 seconds

The regex was over 100X faster!

And I actually was trying to run this with 110,000 elements but the
INDEX script seem to just lock up. The regex was done in 2.63 seconds,
and the INDEX script was still going 10 minutes later....

Have no doubt... a scan is darn fast!
 
M

Mark Van Holstyn

------=_Part_86147_19898330.1129581637841
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

Hey Sam,

I too tried a solution similar to...
a.select{|e| a.index(e) !=3D a.rindex(e)}.uniq

Then i tried the one which i wrote...
t=3D[]; a.select{ |e| r=3Dt.include?e; t<<e; r }.uniq

The second runs MUCH faster. The first one must search the array twice for
each element in in, and therefor runs something on the order of n^3 worst
case. The second only searches once for each and runs n^2 worst case. As fo=
r
the version with grep... Also, a.grep seems to run somewhere on the order o=
f
n^4.


Adding mine to ryan's bench mark i come out with...

a =3D [0,1,2,3,4,5,2,3]

BMN =3D 50000
bm(5) do |bx|
bx.report('index ') { BMN.times { a.select{|e| a.index(e) !=3D a.rindex(e)}=
uniq
} }
bx.report('grep ') { BMN.times { a.select{|e| a.grep(e).length > 1}.uniq } =
}
bx.report('scan ') { BMN.times { a.sort.join('
').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq } }
bx.report('inject ') { BMN.times { a.uniq.inject(a.dup) { |b,i| b.delete_at=
(
b.index(i)) ; b } } }
bx.report('gsub ') { BMN.times {
a.to_s.gsub(/(.)(?!.*\1)/,'').split('').map{|i|i.to_i}
} }
bx.report('include') { BMN.times { t=3D[]; a.select{ |e| r=3Dt.include?e; t=
<<e;
r }.uniq } }
end

user system total real
index 1.312000 0.000000 1.312000 ( 1.328000)
grep 3.594000 0.000000 3.594000 ( 3.625000)
scan 3.094000 0.000000 3.094000 ( 3.109000)
inject 1.406000 0.000000 1.406000 ( 1.407000)
gsub 2.844000 0.000000 2.844000 ( 2.843000)
include 1.109000 0.000000 1.109000 ( 1.110000)



Mark

Hi!

I'm the OP.
I thank those who replied to my message.
I learned a lot and seeing all different approaches was fun.

Somebody emailed me with his own solution which I like very much.
I want to share it with the group.

a =3D [0,1,2,3,4,5,2,3]
a.select{|e| a.index(e) !=3D a.rindex(e)}.uniq

It is very brief and still very easy to understand.
Don't you think so?

I wonder if my slightly modified version works.

a.select{|e| a.grep(e).length > 1}.uniq

grep is for pattern but seems to work fine for integers.
I'm not sure though. (Simple tests show it works.)


In my opinion, using regex is too much for this problem (maybe
slower?).
Array#inject seems to be good but I get confused whenever I come across
it (maybe I'm not smart enough).

What do you think?

Thanks all!

Sam


--
Mark Van Holstyn
(e-mail address removed)
http://lotswholetime.com

------=_Part_86147_19898330.1129581637841--
 
R

Ryan Leavengood

And so a test:

(REGEX) a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq
vs.
(INDEX) a.select{|e| a.index(e) !=3D a.rindex(e)}.uniq

I ran these on an array with 11,000 elements, 2,000 of which were
duplicates, as follows:

(1..10000).to_a + (2000..3000).to_a

The scan (REGEX) script finished in 0.21 seconds
The (INDEX) script ran in 28.77 seconds

The regex was over 100X faster!

Thanks for this. I was wondering if my method of benchmarking might be
flawed, and it sort of is in this case. But it makes sense here
because the scan just has to make one run through the string, whereas
each iteration of the select has multiple calls to an index which
might potentially traverse the entire array. As my benchmark shows
though, a lot of searches in a smaller array is faster using the
index.

Also another problem with the string scanning is that it depends on
the array elements being numbers (at least in the case above.)

I decided to benchmark again using your method and adding some more
implementations. Here we go (beware of wrapped lines):

require 'benchmark'
include Benchmark

[[[0,1,2,3,4,5,2,3], 50000],[(1..5000).to_a + (2000..2500).to_a,
1]].each do |a, iterations|
bm(12) do |bx|
bx.report('simonk') { iterations.times { a.select{|e| a.index(e)
!=3D a.rindex(e)}.uniq } }
bx.report('samk ') { iterations.times { a.select{|e|
a.grep(e).length > 1}.uniq } }
bx.report('paul ') { iterations.times { a.sort.join('
').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq } }
bx.report('jeff ') { iterations.times { a.uniq.inject(a.dup) {
|b,i| b.delete_at(b.index(i)) ; b } } }
bx.report('paolo ') { iterations.times { h =3D {};u =3D a.inject([])
{|res, x| h[x] ? res + [x] : h[x] =3D res}.uniq } }
bx.report('jegII ') { iterations.times { seen =3D
Hash.new(0);a.select { |e| (seen[e] +=3D 1) > 1 }.uniq } }
bx.report('simons') { iterations.times { uary =3D a.uniq;
a.map.delete_if{|i| x=3Duary.member?(i); uary.delete(i); x} } }
bx.report('robert') { iterations.times { a.inject(Hash.new(0))
{|h,e| h[e]+=3D1;h}.select {|k,v|v>1}.map {|x,y|x} } }
bx.report('david ') { iterations.times { a.uniq.find_all {|x|
a.find_all {|y| y =3D=3D x }.size > 1 } } }
bx.report('zach ') { iterations.times { t=3D[]; a.delete_if{ |e|
r=3D(not t.include? e); t.push(e); r }.uniq } }
bx.report('markvh') { iterations.times { t=3D[]; a.select{ |e|
r=3Dt.include?e; t<<e; r }.uniq } }
bx.report('martin') { iterations.times { h1 =3D {};h2 =3D {};a.each
{|i| h2 =3D true if h1;h1 =3D true };h2.keys } }
end
end

The results:

Small array, lots of iterations:
user system total real
simonk 1.344000 0.000000 1.344000 ( 1.344000)
samk 3.859000 0.000000 3.859000 ( 3.876000)
paul 3.532000 0.000000 3.532000 ( 3.547000)
jeff 1.640000 0.000000 1.640000 ( 1.641000)
paolo 2.656000 0.000000 2.656000 ( 2.672000)
jegII 1.704000 0.016000 1.720000 ( 1.719000)
simons 2.109000 0.000000 2.109000 ( 2.110000)
robert 3.172000 0.015000 3.187000 ( 3.203000)
david 5.078000 0.000000 5.078000 ( 5.079000)
zach 0.328000 0.000000 0.328000 ( 0.328000)
markvh 0.391000 0.000000 0.391000 ( 0.391000)
martin 0.390000 0.000000 0.390000 ( 0.484000)

Big array, one iteration:
user system total real
simonk 4.125000 0.000000 4.125000 ( 4.453000)
samk 17.125000 0.000000 17.125000 ( 17.283000)
paul 0.063000 0.000000 0.063000 ( 0.062000)
jeff 0.109000 0.000000 0.109000 ( 0.110000)
paolo 0.031000 0.000000 0.031000 ( 0.031000)
jegII 0.016000 0.000000 0.016000 ( 0.016000)
simons 2.219000 0.000000 2.219000 ( 2.265000)
robert 0.031000 0.000000 0.031000 ( 0.032000)
david 34.188000 0.016000 34.204000 ( 35.127000)
zach 2.156000 0.000000 2.156000 ( 2.203000)
markvh 0.015000 0.000000 0.015000 ( 0.016000)
martin 0.000000 0.000000 0.000000 ( 0.000000)

Overall Martin's was the best. As he suspected, David's was quite
slow. Sam's as well.

But it is interesting to see the varying results for each algorithm
based on how they are used.

Ryan
 
R

Ryan Leavengood

The results:

Small array, lots of iterations:
user system total real
markvh 0.391000 0.000000 0.391000 ( 0.391000)
martin 0.390000 0.000000 0.390000 ( 0.484000)

Big array, one iteration:
user system total real
markvh 0.015000 0.000000 0.015000 ( 0.016000)
martin 0.000000 0.000000 0.000000 ( 0.000000)

Overall Martin's was the best. As he suspected, David's was quite
slow. Sam's as well.

On second thought, Mark's may have been the best overall. Either way,
both Mark's and Martin's are good algorithms. Many of the others were
good as well.

Ryan
 
P

pauldacus

Awesome! What seemed like a fairly simple question... not so! I'm
going to look over all the answers again. The benchmarking is awesome.
 
L

Lyndon Samson

------=_Part_3515_23621305.1129590967768
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

user system total real
index 1.344000 0.000000 1.344000 ( 1.437000)
grep 3.875000 0.000000 3.875000 ( 4.266000)
scan 2.703000 0.000000 2.703000 ( 3.188000)
inject 1.625000 0.000000 1.625000 ( 1.812000)
gsub 3.078000 0.000000 3.078000 ( 3.547000)
bits 0.969000 0.000000 0.969000 ( 1.062000)
include 1.250000 0.000000 1.250000 ( 1.469000)

BMN =3D 50000
bm(5) do |bx|
bx.report('index ') { BMN.times { a.select{|e| a.index(e) !=3Da.rindex(e)}.=
uniq
} }
bx.report('grep ') { BMN.times { a.select{|e| a.grep(e).length > 1}.uniq } =
}
bx.report('scan ') { BMN.times {
a.sort.join('').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq
} }
bx.report('inject') { BMN.times { a.uniq.inject(a.dup) { |b,i|b.delete_at(
b.index(i)) ; b } } }
bx.report('gsub ') { BMN.times
{a.to_s.gsub(/(.)(?!.*\1)/,'').split('').map{|i|i.to_i}
} }
bx.report('bits ') { BMN.times {
bn =3D bn2 =3D 0
a.each {|el|(bn2 |=3D (1<<el) if bn[el] =3D=3D 1) | bn |=3D (1<<el) }
#0.upto(a.length) { |bitNdx| puts bitNdx if bn2[bitNdx] =3D=3D 1 }
} }
bx.report('include') { BMN.times { t=3D[]; a.select{ |e| r=3Dt.include?e; t=
<<e;r
}.uniq } }

end

Nyah nyah, unlimited length bitstrings win :) No output array of course bu=
t
the results are encoded in Fixnum/Bignum bits.

------=_Part_3515_23621305.1129590967768--
 
Z

Zach Dennis

Ryan said:
On second thought, Mark's may have been the best overall. Either way,
both Mark's and Martin's are good algorithms. Many of the others were
good as well.

Mine isn't the shortest, but it's the fastest I've seen thus far! Twice
as fast as the next closest! Why does everyone leave me out of benchmark, ;)

---my code
t=[]; a.delete_if{ |e| r=(not t.include? e); t.push(e); r }
---end my code
"c:\ruby\bin\ruby.exe" find_non-uniques.rb -d
user system total real
delete 0.218000 0.000000 0.218000 ( 0.218000) <---- MINE!!!
index 0.500000 0.000000 0.500000 ( 0.547000)
grep 0.500000 0.000000 0.500000 ( 0.516000)
scan 1.078000 0.000000 1.078000 ( 1.172000)
inject 0.656000 0.000000 0.656000 ( 0.672000)
gsub 1.250000 0.000000 1.250000 ( 1.250000)
include 0.672000 0.000000 0.672000 ( 0.687000)

Zach
 
Z

Zach Dennis

Oh wait... I just saw Ryan's other email... he didn't forget and I'm not
necessarily the fastest depending on the route you go. =)

Zach
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,181
Messages
2,570,970
Members
47,536
Latest member
VeldaYoung

Latest Threads

Top