Sane #hash implementation?

  • Thread starter Shot (Piotr Szotkowski)
  • Start date
S

Shot (Piotr Szotkowski)

--OmL7C/BU0IhhC9Of
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

I have a Set subclass, Block. I need two blocks to be considered the
same by Set iff they have the same elements. From what I gathered, Set
treats two objects as equal when they are eql? and have the same hash.

Implementing Block#eql? seems to be trivial, as Set#=3D=3D is there to help:

class Block < Set
def eql? other
self =3D=3D other
end
end

I=E2=80=99m stuck when it comes to Block#hash, though; I need these to be t=
rue:
Block.new.hash =3D=3D Block.new.hash
Block.new([1,2]).hash =3D=3D Block.new([1,2]).hash

What=E2=80=99s the proper way to go at it? Adding the elements=E2=80=99 has=
hes together
is obviously wrong, as four different elements might produce the same
sum when paired; also, the docs say, =E2=80=98any hash value that exceeds t=
he
capacity of a Fixnum will be truncated before being used,=E2=80=99 so hash =
can=E2=80=99t
exceed the (word - 1 bit) size.

-- Shot
--=20
Hell hath no fury like a bureaucrat scorned. -- Milton Friedman

--OmL7C/BU0IhhC9Of
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFFvnd2i/mCfdEo8UoRAk2BAJ46G3spixNAe/Qr+K/T4X8FquSB5ACghgNL
TMZqsdmMz1FTrJqYdeOOQvI=
=Gt/7
-----END PGP SIGNATURE-----

--OmL7C/BU0IhhC9Of--
 
K

Ken Bloom

I have a Set subclass, Block. I need two blocks to be considered the
same by Set iff they have the same elements. From what I gathered, Set
treats two objects as equal when they are eql? and have the same hash.

Implementing Block#eql? seems to be trivial, as Set#== is there to help:

class Block < Set
def eql? other
self == other
end
end

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

What’s the proper way to go at it? Adding the elements’ hashes together
is obviously wrong, as four different elements might produce the same
sum when paired; also, the docs say, ‘any hash value that exceeds the
capacity of a Fixnum will be truncated before being used,’ so hash can’t
exceed the (word - 1 bit) size.

-- Shot

There's nothing about hash codes that requires that
a.hash != b.hash if a != b
You want to avoid collisions as much as possible, but
that's an issue of performance, not correctness, so you'd be perfectly
correct in adding up the hash codes of the members to create the has code.

Nor is there anything that requires you to keep your hash code within the
capacity of a Fixnum when you generate it. That's why it's automatically
truncated for you. The only reason you need to know is so that your hash
function doesn't hash things in such a way that the truncated number has
lots of collisions. (Once again a performance issue).

You may want to look at http://www.partow.net/programming/hashfunctions/
which discusses some relatively important hash functions. You may have to
experiment to see what works best. Wikipedia may also have good
information.

--Ken
 
E

Eric Hodel

I have a Set subclass, Block. I need two blocks to be considered the
same by Set iff they have the same elements. =46rom what I gathered, = Set
treats two objects as equal when they are eql? and have the same hash.

I=92m stuck when it comes to Block#hash, though; I need these to be =20=
true:
Block.new.hash =3D=3D Block.new.hash
Block.new([1,2]).hash =3D=3D Block.new([1,2]).hash

Try:

class Block
def hash
to_a.hash
end
end=
 
S

Shot (Piotr Szotkowski)

--o99acAvKqrTZeiCU
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Ken Bloom:
There's nothing about hash codes that requires that
a.hash !=3D b.hash if a !=3D b
You want to avoid collisions as much as possible, but that's an issue
of performance, not correctness, so you'd be perfectly correct in
adding up the hash codes of the members to create the has code.

True, but I do want to be able to key Hashes with my Blocks, so I do
want to avoid collisions as much as possible. I should=E2=80=99ve tested the
add-up-the-members=E2=80=99-hashes approach before/instead of crossing it o=
ut,
though.
You may want to look at
http://www.partow.net/programming/hashfunctions/
which discusses some relatively important hash functions.
You may have to experiment to see what works best. Wikipedia
may also have good information.

Thanks for the pointers! I knew about hash functions before, I just
wasn=E2=80=99t sure whether there=E2=80=99s an idiomatic/popular/proper Rub=
y approach
to writing #hash.

-- Shot
--=20
I am a computer. I am dumber than any human and smarter than any administra=
tor.

--o99acAvKqrTZeiCU
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFFvvJui/mCfdEo8UoRAjggAJ9NilXow3gxAgv0/DhzdxEzHFsUcQCfSU1K
jjoYw6pf3C8ElJiMRsybwR8=
=FMI+
-----END PGP SIGNATURE-----

--o99acAvKqrTZeiCU--
 
S

Shot (Piotr Szotkowski)

--QRtLtq+kfJNLc57H
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Eric Hodel:
On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:
I=E2=80=99m stuck when it comes to Block#hash, though; I need these to b= e true:
Block.new.hash =3D=3D Block.new.hash
Block.new([1,2]).hash =3D=3D Block.new([1,2]).hash

class Block
def hash
to_a.hash
end
end

Thanks a lot, Eric! This is the =E2=80=98d=E2=80=99oh!=E2=80=99 solution I =
was looking for. :)

-- Shot
--=20
Of course I can see iso-8859-1 characters, I'm French. -- Christian Maril=
lat

--QRtLtq+kfJNLc57H
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFFvvM3i/mCfdEo8UoRAtQlAJ9wtMQSFBxo/zH6cEwTFJ5sh4LznACghTX/
BQdDTFcH9TwSFHNQ1i/LEeU=
=GaVm
-----END PGP SIGNATURE-----

--QRtLtq+kfJNLc57H--
 
M

Mauricio Fernandez

Eric Hodel:
On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:
I$B!G(Bm stuck when it comes to Block#hash, though; I need these to be
true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

class Block
def hash
to_a.hash
end
end

Thanks a lot, Eric! This is the $B!F(Bd$B!G(Boh!$B!G(B solution I was looking for.
:)

You may want to define #eql? in terms of #to_a as well...

That would fail for the same reason as your Block#hash.
 
P

Pit Capitain

Shot said:
Eric Hodel:
On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:
I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

class Block
def hash
to_a.hash
end
end

Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :)

Shot, you should be aware that this works for your example given above,
but not for the following:

Block.new([1,12]).hash # => 57
Block.new([12,1]).hash # => 23

If this is a problem, you have to change the implementation to

class Block
def hash
sort.hash
end
end

Regards,
Pit
 
M

Mauricio Fernandez

[I'm resending this since the ML server is dropping some of my messages to
this thread for some reason.]

Eric Hodel:
On Jan 29, 2007, at 14:38, Shot (Piotr Szotkowski) wrote:
I'm stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

class Block
def hash
to_a.hash
end
end

Thanks a lot, Eric! This is the d'oh! solution I was looking for. :)

It's not a correct one though.

require 'set'
a = Set.new(0..1000)
b = Set.new((0..1000).sort_by{rand})
a == b # => true
a.to_a.hash # => 31141761
b.to_a.hash # => 374826672

(The order of the elements in Hash#to_a can change.)


Try this, it's O(n ln n) but actually works

require 'set'
class Block < Set
def hash
map{|x| x.hash}.sort.hash
end
end
a = Block.new(0..1000)
b = Block.new((0..1000).sort_by{rand})
a == b # => true
a.hash # => 944434529
b.hash # => 944434529
 
S

Shot (Piotr Szotkowski)

--72btQdUC6twB1rwh
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Eric Hodel:
You may want to define #eql? in terms of #to_a as well...

Thanks, but (considering Mauricio Fernandez=E2=80=99s fix) what would be the
benefit of defining #eql? in terms of Set#sort instead of Set#=3D=3D?

Te docs say on Set#=3D=3D, =E2=80=98Returns true if two sets are equal. The=
equality
of each couple of elements is defined according to Object#eql?.=E2=80=99

-- Shot
--=20
One of the advantages of being disorderly is that one is
constantly making exciting discoveries. -- A. A. Milne

--72btQdUC6twB1rwh
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFFwGNki/mCfdEo8UoRAjBeAJ4zYnyfZcXzZFt9QWtRm8HetgPe1QCfWu8/
JNt5EMV5PyD1nDHdjCSpAqE=
=VaRq
-----END PGP SIGNATURE-----

--72btQdUC6twB1rwh--
 
S

Shot (Piotr Szotkowski)

--dZHW955j1vPFHE0Q
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Mauricio Fernandez:
(The order of the elements in Hash#to_a can change.)

Thanks for pointing this out, I fixed my specs and my code.
Try this, it's O(n ln n) but actually works

Ok, I guess I have a question on what does exactly Benchmark results
mean. Doesn=E2=80=99t the below mean hash_map_sort is actually a bit slower=
than
hash_sort?

Also, to ask two basic, follow-up questions: which Benchmark time is
the most significant one and is it more-or-less independent from other
system load? (I guess the two runs wouldn=E2=80=99t differ as much any of t=
he
times was other-load-independent=E2=80=A6)



$ cat block_hash.rb=20
require 'benchmark'
require 'set'

class Block < Set
def hash_sort
sort.hash
end
def hash_map_sort
map {|a| a.hash}.sort.hash
end
end

srand 1979
array =3D (0..1_000_000).sort_by {rand}

Benchmark.bmbm do |b|
b.report 'hash_sort' do
Block.new(array).hash_sort
end
b.report 'hash_map_sort' do
Block.new(array).hash_map_sort
end
end



$ ruby block_hash.rb=20
Rehearsal -------------------------------------------------
hash_sort 8.000000 1.366667 9.366667 ( 6.208513)
hash_map_sort 9.750000 1.500000 11.250000 ( 7.179749)
--------------------------------------- total: 20.616667sec

user system total real
hash_sort 9.666667 1.100000 10.766667 ( 6.677047)
hash_map_sort 9.566667 1.550000 11.116667 ( 6.809388)



$ ruby block_hash.rb=20
Rehearsal -------------------------------------------------
hash_sort 8.133333 1.316667 9.450000 ( 5.746571)
hash_map_sort 9.650000 1.766667 11.416667 ( 7.065570)
--------------------------------------- total: 20.866667sec

user system total real
hash_sort 9.316667 1.100000 10.416667 ( 6.684088)
hash_map_sort 9.450000 1.566667 11.016667 ( 6.796299)



-- Shot
--=20
It is a good morning exercise for a research scientist
to discard a pet hypothesis every day before breakfast.
It keeps him young. -- Konrad Lorenz

--dZHW955j1vPFHE0Q
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFFwGy3i/mCfdEo8UoRAvnLAJ0QiS+/ytBcbnTdit+y9expErHT2gCgpkD8
7IMhLLOL0b39wshrBpMvqD0=
=th1S
-----END PGP SIGNATURE-----

--dZHW955j1vPFHE0Q--
 
M

Mauricio Fernandez

Eric Hodel:


Thanks, but (considering Mauricio Fernandez’s fix) what would be the
benefit of defining #eql? in terms of Set#sort instead of Set#==?

"None". You cannot just
def eql?(o); sort == o.sort end
either, since there's no guarantee there's a total order associated to the
set (it seems my reply to Pit where I pointed this out was dropped too):

require 'set'
Set.new(['a', 1]).sort # =>
# ~> -:2:in `sort': comparison of String with 1 failed (ArgumentError)
# ~> from -:2

This is why the hash implementation I gave you does
map{|x| x.hash}.sort.hash # sorting hash values, i.e. Integers
instead of
sort.map{|x| x.hash}.hash

Even when the elements can be ordered, sort would be O(n ln n) vs. #=='s O(n)

Of course, the former is a "fast NlnN", being written in C, so it would seem
that it can keep the speed advantage up to large values of N.

However, #sort will process the whole array before it can start comparing
values, and #== can begin right away. If there are many differing elements,
the probability of finding one of them in the first comparisons is very high.

Somewhat surprisingly, #== can beat #sort even in the most favorable case for
the latter (only one differing element, the first in the sorted array but not
in Hash#to_a, small sets):

require 'benchmark'
require 'set'
size = 64
elm = -10
elm2 = -2
s1 = Set.new((0..size).to_a << -1)
s2 = Set.new((0..size).to_a << elm)

GC.disable # don't let this interfere
def bm(s1, s2, elm, iterations = 5000)
puts "#== will stop after #{1 + s2.to_a.index(elm)} .include? tests"
Benchmark.bmbm(10) do |bm|
bm.report("Set#sort") { iterations.times{ s1.sort == s2.sort } }
bm.report("Set#==") { iterations.times{ s1 == s2 } }
end
end

bm(s1, s2, elm)
puts "-" * 40
bm(s1, Set.new((0..size).to_a << elm2), elm2)

# >> #== will stop after 43 .include? tests
# >> Rehearsal ---------------------------------------------
# >> Set#sort 0.700000 0.050000 0.750000 ( 0.768818)
# >> Set#== 0.370000 0.010000 0.380000 ( 0.381117)
# >> ------------------------------------ total: 1.130000sec
# >>
# >> user system total real
# >> Set#sort 0.680000 0.060000 0.740000 ( 0.736052)
# >> Set#== 0.380000 0.010000 0.390000 ( 0.380995)
# >> ----------------------------------------
# >> #== will stop after 7 .include? tests
# >> Rehearsal ---------------------------------------------
# >> Set#sort 0.620000 0.010000 0.630000 ( 0.637358)
# >> Set#== 0.170000 0.000000 0.170000 ( 0.175028)
# >> ------------------------------------ total: 0.800000sec
# >>
# >> user system total real
# >> Set#sort 0.750000 0.050000 0.800000 ( 0.807318)
# >> Set#== 0.170000 0.010000 0.180000 ( 0.176633)
 
P

Pit Capitain

Shot said:
Ok, I guess I have a question on what does exactly Benchmark results
mean. Doesn’t the below mean hash_map_sort is actually a bit slower than
hash_sort?

This is what I would expect. The advantage of Mauricio's solution is
that you can use it even if the Set elements are not comparable to each
other. My proposal was meant for your use case with integers as Set
elements.
srand 1979
array = (0..1_000_000).sort_by {rand}

Benchmark.bmbm do |b|
b.report 'hash_sort' do
Block.new(array).hash_sort
end
b.report 'hash_map_sort' do
Block.new(array).hash_map_sort
end
end

With this benchmark, you also measure the time needed to construct the
Block instances. If you want to compare the two #hash methods, you
should change your benchmark to something like:

block = Block.new(array)

Benchmark.bmbm do |b|
b.report 'hash_sort' do
block.hash_sort
end
b.report 'hash_map_sort' do
block.hash_map_sort
end
end

From your benchmark, you can see that it takes a lot of time to compute
the hash values. Will your sets contain 1_000_000 elements? How often
will they change? How are they created? Maybe you need to cache the hash
values or implement a different hash algorithm.

Regards,
Pit
 
S

Shot (Piotr Szotkowski)

--jE+K4o1MICf3EEet
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Mauricio Fernandez:
"None". You cannot just =20
def eql?(o); sort =3D=3D o.sort end
either

But I can

class Block < Set
def eql? other
self =3D=3D other
end
end

can=E2=80=99t I? Set#=3D=3D says that =E2=80=98the equality of each couple =
of elements is
defined according to Object#eql?=E2=80=99 and that, coupled with Block#hash,
is enough for me to create a set of blocks acting the way I need it.

Note: I need a set of blocks to consider two blocks the same if the
blocks contain the same integers, i.e., I need two such blocks to be
eql? and have the same hashes; the above seems to do it for me quite
nicely=E2=80=A6

(Thaks a lot for your detailed explanation snipped here,
I appreciate it very much and learned a lot from it!)
GC.disable # don't let this interfere

I=E2=80=99m learning new things every day, this one will come in very handy.

-- Shot
--=20
My favourite was some time ago, and involved a female customer thanking
"Mr. Daemon" for his effort trying to deliver her mail, and offering
him a "good time" if he ever visited Sydney. -- Matt McLeod

--jE+K4o1MICf3EEet
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFFwLHmi/mCfdEo8UoRAk0sAJ9VDMolMeIBZGyPTyhq5XkaQ3b9twCeNJrC
dJVOd4KFdwDBGzdaVmJ4WdA=
=xkNB
-----END PGP SIGNATURE-----

--jE+K4o1MICf3EEet--
 
S

Shot (Piotr Szotkowski)

--y1+kC1Q2udlVMFpc
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Pit Capitain:
The advantage of Mauricio's solution is that you can use it even if
the Set elements are not comparable to each other. My proposal was
meant for your use case with integers as Set elements.

Ah, right. Still, blocks will contain integers only, so
Array#sort-based Block#hash and Set#=3D=3D-based Block#eql?
should suffice.
With this benchmark, you also measure the time needed to construct
the Block instances. If you want to compare the two #hash methods,
you should change your benchmark to something like:

D=E2=80=99oh! Why did I think moving the array creation
out was the most I can get to unbias the benchmark?
From your benchmark, you can see that it takes a lot of time to
compute the hash values. Will your sets contain 1_000_000 elements?
How often will they change? How are they created? Maybe you need to
cache the hash values or implement a different hash algorithm.

I don=E2=80=99t know, yet. I was just curious whether the elements=E2=80=99=
-hash-based
hash algorithm would be any faster than the just-sort-based one (given
that from a quick glance it seemed to incorporate additional work); now
I understand its advantage lies elsewhere.

Thanks a lot for your input!

-- Shot
--=20
In ObjectSpace, nobody can hear you quack. -- Luc Heinrich, ruby-talk

--y1+kC1Q2udlVMFpc
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFFwLVDi/mCfdEo8UoRAsuVAKDMVd+6RmEPSHHQsCOrXXqBakE20wCfXvB1
JLXtquzRt2fRlEkJqjWMIOI=
=e9W6
-----END PGP SIGNATURE-----

--y1+kC1Q2udlVMFpc--
 
T

Thomas Hafner

Hello,

is my news reader (Gnus v5.9.0) broken? I'm just trying to explain why
I see things like that:

Shot (Piotr Szotkowski) said:
--jE+K4o1MICf3EEet
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[..]
in terms of Set#sort instead of Set#=3D=3D?
^^^^^
"None". You cannot just =20
^^^
Regards
Thomas
 
S

Shot (Piotr Szotkowski)

--vtBqmokgY1evbO3C
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Thomas Hafner:
is my news reader (Gnus v5.9.0) broken? I'm just
trying to explain why I see things like that:
^^^

It looks like Gnus 5.9 doesn=E2=80=99t handle quoted-printable.

A quick googling turned up
http://www.splode.com/~friedman/software/emacs-lisp/src/qpdecode.el

-- Shot
--=20
A funny symbol that I can't read has just been input. Continue,
and I'll forget that it ever happened. -- TeX

--vtBqmokgY1evbO3C
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFFxR+Xi/mCfdEo8UoRApICAJ4sJGSnYqKGpHKL8KQVA0/4uS/csQCfcnHX
fgEGgtHejlwyzTexG6UuLDk=
=cbjy
-----END PGP SIGNATURE-----

--vtBqmokgY1evbO3C--
 
T

Thomas Hafner

Shot (Piotr Szotkowski) said:
It looks like Gnus 5.9 doesn=E2=80=99t handle quoted-printable.

Strange that it does handle it most of the time, but not always. I
never observed the problem with e-mails (Gnus is a mail client, too),
e.g.:

Header =
...
Content-Type: multipart/mixed; boundary="=-=-="

First part =
--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
...

=> Ok.

Only some news articles have disturbed my Gnus so far.

Thanks
Thomas
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,982
Messages
2,570,190
Members
46,736
Latest member
zacharyharris

Latest Threads

Top