Sets, uniqueness not unique.

R

Robert Klemme

Hugh said:
Hugh said:
While my brain is behaving like cottage cheese, it's probably not
the time to ask how one might guarantee that you don't stomp on the
hashes of other ojects in the system. If you have an even number of
elements, all the same Fixnum, like [1,1,1,1] then they would hash
to 0, as would [2,2], I "think".
irb(main):004:0> [1,1].inject(0) { |a,b| a ^= b.hash}
=> 0
irb(main):005:0> [2,1,1,2].inject(0) { |a,b| a ^= b.hash}
=> 0

Btw, the assignment is superfluous. The result of a^b.hash is the
next iteration's a.

Yes, good point, the result of the block....
Yep.

------------------------------------------------------
Enumerable#inject enum.inject(initial) {| memo, obj | block }
=> obj enum.inject {| memo, obj | block } => obj
------------------------------------------------------------------------
Combines the elements of _enum_ by applying the block to an
accumulator value (_memo_) and each element in turn. At each
step, _memo_ is set to the value returned by the block. The
first form ================================================
[...]
irb(main):006:0>

Yes. The algorithm can certainly be improved on. Typically you
rather do something similar to

(a.hash ^ (b.hash << 3) ^ (c.hash << 7)) & MAX_HASH

09:53:59 [~]: irbs
[2,1,1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}
=> 2781
lo>> [1,2, 1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

Ah, so that's what MAX_HASH is, I couldn't remember how big Fixnums
were.

No, that's not really the limit - I just choose 32 bit as it's an often
used size. :)

13:41:59 [~]: ruby -e 'i=1;i<<=1 while Fixnum === i;i>>=1;puts i.to_s(16),
i.class'
20000000
Fixnum
I was thinking about something like a linear congruential
random number generator like:
brains hgs 22 %> irb
irb(main):001:0> [2,1,1,2].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151936
irb(main):002:0> [2,2,1,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 153856
irb(main):003:0> [2,1,2,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151996
irb(main):004:0>

like
http://www.cs.bell-labs.com/cm/cs/pearls/markovhash.c

from "Programming Pearls".

That code does only one modulus on the NHASH. Your code does it for each
iteration. But otherwise 31 seems a good number - Java containers use 31,
too. :)

From AbstractList:

public int hashCode() {
int hashCode = 1;
Iterator i = iterator();
while (i.hasNext()) {
Object obj = i.next();
hashCode = 31*hashCode + (obj==null ? 0 : obj.hashCode());
}
return hashCode;
}

with a largish prime grabbed from
http://primes.utm.edu/lists/small/small.html#10

being the biggest I could see less than 0XFFFF_FFFF (4294967295)


------------------------------------------------ Class: Fixnum <
Integer A +Fixnum+ holds +Integer+ values that can be
represented in a native machine word (minus 1 bit). If any
operation on a +Fixnum+

Should that be 0x7FFF_FFFF? (2147483647)
According to
http://www.rsok.com/~jrm/printprimes.html
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be slightly more
efficient.

Kind regards

robert
 
H

Hugh Sasse

Hugh said:
On Fri, 16 Sep 2005, Robert Klemme wrote:

lo>> [1,2, 1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}

Ah, so that's what MAX_HASH is, I couldn't remember how big Fixnums
were.

No, that's not really the limit - I just choose 32 bit as it's an often
used size. :)

13:41:59 [~]: ruby -e 'i=1;i<<=1 while Fixnum === i;i>>=1;puts i.to_s(16),
i.class'
20000000
Fixnum
I was thinking about something like a linear congruential
random number generator like:
brains hgs 22 %> irb
irb(main):001:0> [2,1,1,2].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151936
irb(main):002:0> [2,2,1,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 153856
irb(main):003:0> [2,1,2,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151996
irb(main):004:0>

like
http://www.cs.bell-labs.com/cm/cs/pearls/markovhash.c

from "Programming Pearls".

That code does only one modulus on the NHASH. Your code does it for each

I wasn't following closely enough. :)
iteration. But otherwise 31 seems a good number - Java containers use 31,
too. :)
From AbstractList: [...]

------------------------------------------------ Class: Fixnum <
Integer A +Fixnum+ holds +Integer+ values that can be
represented in a native machine word (minus 1 bit). If any
operation on a +Fixnum+

Should that be 0x7FFF_FFFF? (2147483647)
According to
http://www.rsok.com/~jrm/printprimes.html
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be slightly more
efficient.

Agreed. Is the RCR yours? Maybe an RCRCR :) is appropriate, "change
the hashing algorithm in that code to use LCG".
Kind regards

robert

Hugh
 
R

Robert Klemme

Hugh said:
Hugh said:
On Fri, 16 Sep 2005, Robert Klemme wrote:

lo>> [1,2, 1,2].inject(0) { |a,b| ((a << 3) ^ b.hash) & 0xFFFF_FFFF}
=> 1885

Ah, so that's what MAX_HASH is, I couldn't remember how big Fixnums
were.

No, that's not really the limit - I just choose 32 bit as it's an
often used size. :)

13:41:59 [~]: ruby -e 'i=1;i<<=1 while Fixnum === i;i>>=1;puts
i.to_s(16), i.class'
20000000
Fixnum
I was thinking about something like a linear congruential
random number generator like:
brains hgs 22 %> irb
irb(main):001:0> [2,1,1,2].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151936
irb(main):002:0> [2,2,1,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 153856
irb(main):003:0> [2,1,2,1].inject(0) {|a,b| ((a * 31)+b.hash) %
4093082899} => 151996
irb(main):004:0>

like
http://www.cs.bell-labs.com/cm/cs/pearls/markovhash.c

from "Programming Pearls".

That code does only one modulus on the NHASH. Your code does it for
each

I wasn't following closely enough. :)

Someone talked about his brain behaving like cottage cheese today... :))
iteration. But otherwise 31 seems a good number - Java containers
use 31, too. :)
From AbstractList: [...]

------------------------------------------------ Class: Fixnum <
Integer A +Fixnum+ holds +Integer+ values that can be
represented in a native machine word (minus 1 bit). If any
operation on a +Fixnum+

Should that be 0x7FFF_FFFF? (2147483647)
According to
http://www.rsok.com/~jrm/printprimes.html
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be slightly
more efficient.

Agreed. Is the RCR yours?
Yes.

Maybe an RCRCR :) is appropriate, "change
the hashing algorithm in that code to use LCG".

Done.

Thanks for the inspiring discussion!

robert
 
H

Hugh Sasse

Someone talked about his brain behaving like cottage cheese today... :))

There is that. :)

I thought about this over lunch. I don't think you can use bit AND.

Suppose we have a 4 bit machine[!] and MAX_HASH is 3

6 % 3 == 0, but 6 & 3 == 2
7 % 3 == 1, but 7 & 3 == 3

etc. The theory behind linear *congruential* random number
generators, (which is how we're stirring the hashes) is based on
primality in modulo arithmetic. I think if we destroy that
relationship things will be less random and we'll get more
collisions.

Testing for randomness is tricky enough, so I'd not like to have to
prove this. But I have a suspicion that given the amount of work by
Knuth and others on the topic, if & could be used instead of % it
would be the recommended way to do it, because it is cheaper/faster.
Done.

Thanks for the inspiring discussion!

robert

Thank you,
Hugh
 
R

Robert Klemme

Hugh said:
Someone talked about his brain behaving like cottage cheese today...
:))

There is that. :)
[...]
Should that be 0x7FFF_FFFF? (2147483647)
According to
http://www.rsok.com/~jrm/printprimes.html
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be
slightly more efficient.

I thought about this over lunch. I don't think you can use bit AND.

Suppose we have a 4 bit machine[!] and MAX_HASH is 3

6 % 3 == 0, but 6 & 3 == 2
7 % 3 == 1, but 7 & 3 == 3

Erm, if you want to use bit AND instead of modulo you have to use
different values (i.e. the modulo value us bit and value + 1):
0 0
1 1
2 2
3 3
0 0
1 1
2 2
3 3

This in turn means that the modulo always must be a power of two and thus
is extremely unlikely to be prime. I would even go so far as to claim
that the likelyhood is smaller than every nonzero positive epsilon in the
real numbers that you can think of. :)
etc. The theory behind linear *congruential* random number
generators, (which is how we're stirring the hashes) is based on
primality in modulo arithmetic. I think if we destroy that
relationship things will be less random and we'll get more
collisions.

Well, sometimes you'll have to do tradeoffs. The Java implementation
doesn't even use and or modulo to limit the value (not needed with Java
int, they simply wrap). I'll trust the Sun guys to have ensured that the
way they do it is a reasonable tradeoff between randomness and speed.
Testing for randomness is tricky enough, so I'd not like to have to
prove this. But I have a suspicion that given the amount of work by
Knuth and others on the topic, if & could be used instead of % it
would be the recommended way to do it, because it is cheaper/faster.

Yes, if you want to use a prime you definitely have to use modulo (see
above). I was more on the Sun side, i.e. just limiting the value in order
to keep it a Fixnum.

Kind regards

robert
 
H

Hugh Sasse

Looks like it still is then...

There is that. :)
[...]
Should that be 0x7FFF_FFFF? (2147483647)
According to
http://www.rsok.com/~jrm/printprimes.html
this would seem to be a prime number, so could be used as the
modulus anyway.

You can even use bit and on this which I would expect to be
slightly more efficient.

I thought about this over lunch. I don't think you can use bit AND.

Suppose we have a 4 bit machine[!] and MAX_HASH is 3

6 % 3 == 0, but 6 & 3 == 2
7 % 3 == 1, but 7 & 3 == 3

Erm, if you want to use bit AND instead of modulo you have to use
different values (i.e. the modulo value us bit and value + 1):
0 0
1 1
2 2
3 3
0 0
1 1
2 2
3 3

When I did that in my head I munged it then! (hence my comment
above).
This in turn means that the modulo always must be a power of two and thus
is extremely unlikely to be prime. I would even go so far as to claim
that the likelyhood is smaller than every nonzero positive epsilon in the
real numbers that you can think of. :)

But in this case (2 ** n) - 1 is prime, which is what we want. So I
think it should be 0x8000_0000 (2147483648) for the & version,
equivalent to 0x7FFF_FFFF for the % version.
Well, sometimes you'll have to do tradeoffs. The Java implementation
doesn't even use and or modulo to limit the value (not needed with Java
int, they simply wrap).
Ouch.

I'll trust the Sun guys to have ensured that the
way they do it is a reasonable tradeoff between randomness and speed.

Java was intended for small systems originally, then backwards
compatibility, ....
Yes, if you want to use a prime you definitely have to use modulo (see
above). I was more on the Sun side, i.e. just limiting the value in order
to keep it a Fixnum.

I suppose I may be aiming too much for perfection, but I wonder how
the & 0x8000_0000 version stacks up.
 
H

Hugh Sasse

When I did that in my head I munged it then! (hence my comment
above).

Which, (having tested it) means my previous reply was complete
rubbish! I was confused again.

I don't think modulo need be that expensive, because it can be

def modulo (mod)
if self > mod
modulo(self - mod)
else
self
end
end

when the number and modulus are positive. That may cost too much
chough.
Hugh
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,183
Messages
2,570,968
Members
47,516
Latest member
ChrisHibbs

Latest Threads

Top