Improving performance of hash math

dblock

I am trying to improve performance of Euclidean distance between two
hash maps. The code is pretty straightforward:

def self.euclidian_distance(lhs, rhs)
  @s = lhs.inject(0.0) do |s, (k,v)|
    rhs.has_key?(k) ? (s + (v - rhs[k])**2) : (s + v**2)
  end
  @s = rhs.inject(@s) do |s, (k,v)|
    lhs.has_key?(k) ? s : (s + v**2)
  end
  @s
end

Any suggestions?
 
Dido Sevilla

Try not to use the exponentiation operator to square arguments. Direct
exponentiation is very inefficient except for large values. It is
almost always better to use straightforward multiplication when your
powers are reasonably small. Other than this I can't think of any way
to improve performance short of dropping to C, and even there I think
the performance improvement you'll get would be marginal.

Have you tried profiling your code? The profiler can tell you a lot
about where it actually spends time doing stuff and can be very
enlightening about optimization.
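
A minimal Benchmark sketch of that comparison (illustrative only; the
value and iteration count are invented):

require 'benchmark'

v = 3.14159
n = 1_000_000

Benchmark.bm(8) do |bm|
  bm.report("v ** 2") { n.times { v ** 2 } }   # exponentiation operator
  bm.report("v * v")  { n.times { v * v } }    # plain multiplication
end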
--
http://stormwyrm.blogspot.com
 
Gary Wright

I am trying to improve performance of Euclidean distance between two
hash maps. The code is pretty straightforward:

def self.euclidian_distance(lhs, rhs)
  @s = lhs.inject(0.0) do |s, (k,v)|
    rhs.has_key?(k) ? (s + (v - rhs[k])**2) : (s + v**2)
  end
  @s = rhs.inject(@s) do |s, (k,v)|
    lhs.has_key?(k) ? s : (s + v**2)
  end
  @s
end

Any suggestions?

Switch to using #each with an explicit accumulator variable instead of
#inject.

Gary Wright
 
dblock

Switch to using #each with an explicit accumulator variable instead of #inject.

Looks about twice as slow. Either way, about 36% of the time is spent
in Hash.each.
 
dblock

Here're some actual numbers. I am doing the above code 10K times with
a data distribution that's somewhat non-random (some overlap in keys).

 %self     total      self      wait     child     calls  name
 37.61      1.49      0.99      0.00      0.50     61446  Hash#each
 10.00      0.26      0.26      0.00      0.00    419946  Float#+
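
For reference, a flat profile like this can be produced with the
ruby-prof gem; a minimal sketch with invented sample hashes, assuming
euclidian_distance is callable as shown:

require 'ruby-prof'   # assumes the ruby-prof gem is installed

lhs = { :a => 1.0, :b => 2.0 }   # invented sample data
rhs = { :b => 3.0, :c => 4.0 }

result = RubyProf.profile do
  10_000.times { euclidian_distance(lhs, rhs) }
end

# Print a flat profile (%self, total, self, wait, child, calls per method)
RubyProf::FlatPrinter.new(result).print(STDOUT)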
 
Robert Klemme

I am trying to improve performance of Euclidean distance between two
hash maps. The code is pretty straightforward:

def self.euclidian_distance(lhs, rhs)
  @s = lhs.inject(0.0) do |s, (k,v)|
    rhs.has_key?(k) ? (s + (v - rhs[k])**2) : (s + v**2)
  end
  @s = rhs.inject(@s) do |s, (k,v)|
    lhs.has_key?(k) ? s : (s + v**2)
  end
  @s
end

Any suggestions?

Use Hash's default value mechanism or use "|| 0" and kick out
#has_key? calls. I'd also rather use a proper Matrix or Vector class
instead of a Hash. NArray also seems like a good and fast option.
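
For illustration, a sketch of the default-value idea with invented data
(the hash contents and the loops below are not from the original post):

# Hash.new(0) returns 0 for missing keys, so the first loop
# needs no has_key? checks.
lhs = Hash.new(0).merge!(:a => 1.0, :b => 2.0)
rhs = Hash.new(0).merge!(:b => 3.0, :c => 4.0)

s = 0.0
lhs.each { |k, v| s += (v - rhs[k]) ** 2 }            # keys in lhs (and in both)
rhs.each { |k, v| s += v ** 2 unless lhs.key?(k) }    # keys only in rhs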

Btw, why are you using an instance variable here? That does not seem
to make any sense.

Also, it seems that your calculation is flawed because you calculate
the distance for keys which are present in both Hashes twice. I'd
rather

require 'set'

def euclidian_distance(lhs, rhs)
  s = 0

  (lhs.keys.to_set.merge(rhs.keys)).each do |k|
    s += ( (rhs[k] || 0) - (lhs[k] || 0) ) ** 2
  end

  s
end


Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
dblock

Use Hash's default value mechanism or use "|| 0" and kick out
#has_key? calls.  I'd also rather use a proper Matrix or Vector class
instead of a Hash.  NArray also seems like a good and fast option.

- has_key has almost no cost, made no difference
- my data is sparse, there're a lot of zero values - using an array
forces you to do N calculations (N is the max width of the array), an
Array was 3x slower, especially with a fast growing data set (haven't
tried NArray)

Btw, why are you using an instance variable here? That does not seem
to make any sense.

- fixed

Also, it seems that your calculation is flawed because you calculate
the distance for keys which are present in both Hashes twice.  I'd
rather

- re-read the code, you're wrong, the second half only adds values
that are not present in the first hash

require 'set'

def euclidian_distance(lhs, rhs)
  s = 0

  (lhs.keys.to_set.merge(rhs.keys)).each do |k|
    s += ( (rhs[k] || 0) - (lhs[k] || 0) ) ** 2
  end

  s
end

- this is 3x slower because of the cost of to_set.merge (the code is
nicer though :)
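
For comparison, a sketch that unions the key arrays directly with
Array#| instead of going through a Set (illustrative, not benchmarked
in this thread):

def euclidian_distance(lhs, rhs)
  s = 0.0
  # Array#| builds the union of the two key lists without creating a Set
  (lhs.keys | rhs.keys).each do |k|
    s += ((rhs[k] || 0) - (lhs[k] || 0)) ** 2
  end
  s
end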
 
Gary Wright

Looks about twice as slow. Either way, about 36% of the time is spent
in Hash.each.

Surprising since most of the 'inject solutions' that show up on this
list are slower than using #each.

Can you post some sample data?

Gary Wright
 
Ryan Davis

Looks about twice as slow. Either way, about 36% of the time is spent
in Hash.each.

I don't agree with Gary's suggestion, because in this case it is a
proper application of inject, but if you think it is 2x slower, then
your benchmarking is probably at fault.

My measurements show each to be roughly equivalent on 1.9 and quite a
bit faster on 1.8. That could be explained by my modification to your
treatment of rhs or it could also be explained by my fake data.

% ./q.rb 10000
# of iterations = 10000
                          user     system      total        real
null_time             0.000000   0.000000   0.000000 (  0.001118)
euclidian_distance1  18.960000   0.010000  18.970000 ( 19.011762)
euclidian_distance2  13.050000   0.010000  13.060000 ( 13.069581)
euclidian_distance3  13.090000   0.010000  13.100000 ( 13.109778)

against:

def euclidian_distance1(lhs, rhs)
  @s = lhs.inject(0.0) do |s, (k,v)|
    rhs.has_key?(k) ? (s + (v - rhs[k])**2) : (s + v**2)
  end
  @s = rhs.inject(@s) do |s, (k,v)|
    lhs.has_key?(k) ? s : (s + v**2)
  end
  @s
end

def euclidian_distance2(lhs, rhs)
  s = 0.0
  lhs.each do |k,v|
    s += rhs.has_key?(k) ? ((v - rhs[k])**2) : (v**2)
  end
  rhs.each do |k,v|
    s += v**2 if lhs.has_key?(k)
  end
  s
end

def euclidian_distance3(lhs, rhs)
  s = lhs.inject(0.0) do |n, (k,v)|
    n + (rhs.has_key?(k) ? v - rhs[k] : v)**2
  end
  rhs.each do |k,v|
    s += v**2 if lhs.has_key? k
  end
  s
end
 
dblock

In case someone reads this for the algorithms (not that it influences
the benchmarks): the last two have bugs; the second sum needs to do
! lhs.has_key?(k). You want to do (right - left) ** 2 for the left
hash (items in the left hash only and in both the left and the right
hash), then add right ** 2 for all items that are in the right hash
only.
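
As an illustrative sketch (not from the original posts), the #each
variant with that fix applied would look like:

def euclidian_distance2_fixed(lhs, rhs)
  s = 0.0
  lhs.each do |k, v|
    # keys in lhs only, or in both hashes
    s += rhs.has_key?(k) ? (v - rhs[k]) ** 2 : v ** 2
  end
  rhs.each do |k, v|
    # keys in rhs only -- note the negation that was missing above
    s += v ** 2 unless lhs.has_key?(k)
  end
  s
end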

Back to the original question: does anyone think we can do this for
millions of items?
 
Gary Wright

I don't agree with Gary's suggestion, because in this case it is a
proper application of inject, but if you think it is 2x slower, then
your benchmarking is probably at fault.

I'm OK with being wrong. I didn't actually try my own suggestion so I'm
not going to defend it too hard but I did notice that your
euclidian_distance2 version that uses #each was 30% faster than
euclidian_distance1 that used #inject. Your euclidian_distance3 version
uses both #inject and #each but would seem to be sensitive to the size
of the lhs argument (i.e. the number of entries in the lhs hash).

I'd like to see the sample data to better understand what is going on.

Gary Wright

 
Robert Klemme

- has_key has almost no cost, made no difference

Still the hash needs to be looked up twice. I prefer to do the work
only once and then decide based on the result.

- my data is sparse, there're a lot of zero values - using an array
forces you to do N calculations (N is the max width of the array), an
Array was 3x slower, especially with a fast growing data set (haven't
tried NArray)

Can you post your test data in some way (either raw or a generator
which roughly creates what you have) so others can play around and
benchmark?

- re-read the code, you're wrong, the second half only adds values
that are not present in the first hash

Right you are! Sorry, my fault.

require 'set'

def euclidian_distance(lhs, rhs)
  s = 0

  (lhs.keys.to_set.merge(rhs.keys)).each do |k|
    s += ( (rhs[k] || 0) - (lhs[k] || 0) ) ** 2
  end

  s
end

- this is 3x slower because of the cost of to_set.merge (the code is
nicer though :)

:)

How about

def euclidian_distance(lhs, rhs)
  s = 0

  lhs.each do |k,v|
    s += ( (rhs[k] || 0 ) - v ) ** 2
  end

  rhs.each do |k,v|
    s += v * v unless lhs[k]
  end

  s
end
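
A quick sanity check with invented values (not from the original post):

euclidian_distance({ :a => 1 }, { :a => 3, :b => 2 })   # => 8   ((3 - 1)**2 + 2*2)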

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
