Hash Code


j1mb0jay

I am currently trying to improve upon my own implementation of a hash
table, using Java 1.6.

At the moment I am setting the size of the hash table to be the next
prime number above double the expected number of items to be added
(using the Sieve of Atkin to test primality).

Each item (Object) that is added has a unique String as a form of ID. I
call String.hashCode() on this ID and then mod the resulting 32-bit
integer by the table size, so that I can use the result as an index into
the array of linked lists that represents the hash table. (The array of
linked lists is so the hash table can handle collisions.)
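
In code, the indexing step looks roughly like this (a simplified sketch with made-up names, not my real class):

    // Simplified sketch of the chaining scheme described above.
    import java.util.LinkedList;

    public class ChainedHashTable<V> {
        private final LinkedList<Entry<V>>[] buckets;

        @SuppressWarnings("unchecked")
        public ChainedHashTable(int capacity) {
            buckets = new LinkedList[capacity];
        }

        private int indexFor(String id) {
            int h = id.hashCode();                 // 32-bit hash of the ID string
            return Math.abs(h % buckets.length);   // mod down to a bucket index
        }

        public void put(String id, V value) {
            int i = indexFor(id);
            if (buckets[i] == null) {
                buckets[i] = new LinkedList<Entry<V>>();
            }
            buckets[i].add(new Entry<V>(id, value)); // colliding items share the list
        }

        private static class Entry<V> {
            final String id;
            final V value;
            Entry(String id, V value) { this.id = id; this.value = value; }
        }
    }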

I was wondering what would happen if, instead of using the
String.hashCode() method, I used a secure hash like SHA-512 or MD5, then
converted this 512-bit / 128-bit number to base 10 so that I could pass
it as a String to the constructor of java.math.BigInteger, then modded
the number by the size of the hash table so it could be used as an index
into the array.

Would this mean there would be no collisions, thus keeping the time
complexity of the search method down to O(1)? In turn, the hash table
could then be an array of key/value pairs rather than an array of linked
lists.
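
In other words, something along these lines (a rough sketch of the idea, error handling omitted; the BigInteger(byte[]) constructor would actually avoid the base-10 String round trip):

    // Rough sketch of the SHA-512 -> BigInteger -> bucket-index idea.
    import java.math.BigInteger;
    import java.security.MessageDigest;

    public class SecureHashIndex {
        public static int indexFor(String id, int tableSize) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            byte[] digest = md.digest(id.getBytes("UTF-8"));    // 64 bytes = 512 bits
            BigInteger big = new BigInteger(1, digest);         // interpret as a positive integer
            return big.mod(BigInteger.valueOf(tableSize)).intValue(); // reduce to an array index
        }
    }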

Regards j1mb0jay
 

Roedy Green

I was wondering what would happen if, instead of using the
String.hashCode() method, I used a secure hash like SHA-512 or MD5, then
converted this 512-bit / 128-bit number to base 10 so that I could pass
it as a String to the constructor of java.math.BigInteger, then modded
the number by the size of the hash table so it could be used as an index
into the array.

Those digests are very slow calculations. I would hazard a guess that
they will cost you many times what you get back in more even
distribution. I'd think you would be better off spending that RAM on a
longer lookup vector.

The standard Hashtable leaves it up to the user to provide the
hashCode function.

I remember seeing a C algorithm years ago for guaranteeing zero
collisions if you knew the keys in advance. It could construct a
hashcode function for you.

You might find my essays useful background information:

http://mindprod.com/jgloss/hashcode.html
http://mindprod.com/jgloss/hashtable.html
http://mindprod.com/jgloss/hashmap.html
http://mindprod.com/jgloss/primes.html

To save time, you might just embed a table of primes, say the ones
just under the powers of two.
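
For example, a sketch of such a table (the values are intended to be the largest primes below each power of two, taken from the usual published list; worth double-checking before relying on them):

    // Sketch: largest prime below each power of two (2^4 .. 2^24), for table sizes.
    static final int[] PRIMES_UNDER_POWERS_OF_TWO = {
        13, 31, 61, 127, 251, 509, 1021, 2039, 4093, 8191,     // 2^4  .. 2^13
        16381, 32749, 65521, 131071, 262139, 524287, 1048573,  // 2^14 .. 2^20
        2097143, 4194301, 8388593, 16777213                    // 2^21 .. 2^24
    };

    // Pick the first prime at least twice the expected item count.
    static int tableSizeFor(int expectedItems) {
        for (int p : PRIMES_UNDER_POWERS_OF_TWO) {
            if (p >= 2 * expectedItems) {
                return p;
            }
        }
        return PRIMES_UNDER_POWERS_OF_TWO[PRIMES_UNDER_POWERS_OF_TWO.length - 1];
    }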
 

j1mb0jay

No. There could still be collisions. In addition, the time it would
take to calculate a hash would be significant, meaning the performance
of the algorithm would degrade.

Just to be clear, if you are trying to map something with
2,000,000,000,000 possible values to a hash table with 200 entries,
there is simply no way to avoid collisions. Sometimes the data will
simply collide.

I understand that this would be the case, but I have to say that I was
told as part of the module that re-hashing the whole table into a new,
larger table is the best option. This is of course expensive in terms of
CPU time. I think the Java HashMap does that at a load factor of 0.75
(when the table is 75% full) by default.

Yes, I understand that the hash itself would also be intensive if using
SHA-512, but this number is 16 times bigger than the output of the
String.hashCode() function.


Please can you give some more information on why this would not be the
case?

Regards James.
 

Tom Anderson

At the moment I am setting the size of the hash table to be the next
prime number above double the expected number of items to be added
(using the Sieve of Atkin to test primality).

If you have a good hash function, you don't need to use a prime number
for the table size; any size will do.
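
For instance, with a power-of-two table you can reduce the hash with a bit mask instead of a modulus (a sketch; java.util.HashMap does essentially this, after some extra bit-mixing of the hashCode):

    // Sketch: with a power-of-two table size, a mask replaces the (slower) modulus.
    // This relies on the hash function mixing its bits well.
    static int indexFor(int hash, int powerOfTwoLength) {
        return hash & (powerOfTwoLength - 1);
    }
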
Each item (Object) that is added has a unique String as a form of ID. I
call String.hashCode() on this ID and then mod the resulting 32-bit
integer by the table size, so that I can use the result as an index into
the array of linked lists that represents the hash table. (The array of
linked lists is so the hash table can handle collisions.)

I was wondering what would happen if, instead of using the
String.hashCode() method, I used a secure hash like SHA-512 or MD5, then
converted this 512-bit / 128-bit number to base 10 so that I could pass
it as a String to the constructor of java.math.BigInteger, then modded
the number by the size of the hash table so it could be used as an index
into the array.

Would this mean there would be no collisions

No. Completely avoiding collisions is impossible - if there are more
things in the input domain than in the table, some *must* map to the same
slot.

It would also be very, very slow - cryptographic hashes are engineered to
have properties which are completely unnecessary for general-purpose
hashing, and they pay for those properties with speed.

String's hashCode is a pretty good one - it's a variant of the Bernstein
hash, which is fairly standard. If you want a better one, though,
implement this:

http://burtleburtle.net/bob/hash/doobs.html

There's a catalogue of other hashes at the bottom.
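
For reference, String.hashCode is specified as the 31-multiplier polynomial hash, i.e. roughly:

    // Roughly what String.hashCode() computes: a Bernstein-style polynomial hash
    // with multiplier 31 over the string's characters.
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);   // int overflow just wraps, which is fine here
        }
        return h;
    }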

thus keeping the time complexity of the search method down to O(1)? In
turn, the hash table could then be an array of key/value pairs rather
than an array of linked lists.

There are three ways to avoid collision lists. The first is to use
reprobing instead; you can look that up if you don't know what it is - I
suggest using quadratic reprobing, as it's simple and fast. The second
is to use a cuckoo hash - again, look it up.
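
To give a flavour of the first option, here is a minimal sketch of a lookup with quadratic reprobing (illustrative only; a real implementation also needs insertion, deletion markers and resizing):

    // Minimal sketch of quadratic reprobing (open addressing): on a collision,
    // probe at offsets 1, 3, 6, 10, ... (triangular numbers) from the home slot.
    // With a power-of-two table size this sequence visits every slot.
    static Object get(String[] keys, Object[] values, String key) {
        int m = keys.length;                        // assumed to be a power of two
        int i = key.hashCode() & (m - 1);
        for (int step = 1; step <= m; step++) {
            if (keys[i] == null) {
                return null;                        // empty slot: key is not present
            }
            if (keys[i].equals(key)) {
                return values[i];
            }
            i = (i + step) & (m - 1);               // next probe
        }
        return null;                                // table full and key not found
    }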

The third, well, I lied when I said avoiding collisions was impossible.
There is kind of a way to do it, but with an important drawback: you have
to choose the hashing function and table size specifically for the data
you want to hash. You have to know the keys before you can build the
table. Also, building the table is typically quite slow. However, once
you've done it, lookups are very fast. The techniques for doing this are
known as perfect hashing - you can look them up too.

tom
 

j1mb0jay

j1mb0jay said:
On Tue, 20 May 2008 18:48:15 +0100, j1mb0jay wrote:

Would this mean there would be no collisions, thus keeping the time
complexity of the search method down to O(1)? In turn, the hash table
could then be an array of key/value pairs rather than an array of linked
lists.

No. There could still be collisions. In addition, the time it would
take to calculate a hash would be significant, meaning the performance
of the algorithm would degrade.

Just to be clear, if you are trying to map something with
2,000,000,000,000 possible values to a hash table with 200 entries,
there is simply no way to avoid collisions. Sometimes the data will
simply collide.

I understand that this would be the case, but I have to say that I was
told as part of the module that re-hashing the whole table into a new,
larger table is the best option. [...]

It depends on how you define "best." Think of it as an
ordinary purchase-and-sale transaction: Expanding the table is an
investment of CPU time today in hopes of recouping it and more tomorrow.
For the same table with the same sequence of insertions, deletions, and
searches, the investment could be "good" or "bad" depending on other
circumstances.

Even the above is an oversimplification. For example, it
might be important to guarantee that no search takes more than 7
milliquavers, even at the cost of blowing 1000000 mq on every
seventy-third insertion.

Finally, rehash-or-not needn't be a binary decision. Look
up "incremental hashing" techniques that expand and contract a hash
table gradually instead of in one great cathartic spasm.

Yes, I understand that the hash itself would also be intensive if using
SHA-512, but this number is 16 times bigger than the output of the
String.hashCode() function.

Fine, but what do you do with that great big number? You
can't use it as an array index until you've derived a much smaller
number from it -- you can't even use String's hashCode() until you've
reduced its range. And once you've reduced the range to 0-198 there are
only 199 places the items can go. If you've got 15 items to scatter
among 199 places, there's about a 53% chance you'll have at least one
collision. With 200 or more items, obviously, the probability grows to
100%.

Let's see: If you start with a bunch of empty slots and
start filling them as long as there are no collisions, and then when a
collision occurs you ~double the table size and re-hash, and keep doing
so until there are no more collisions ... Well, that's more math than I
want to attempt right at the moment, but it's clear that you'll use an
ENORMOUS amount of space. Note that the ~doubling might not eliminate
the collision (and might even produce collisions among the existing
items that didn't collide in the smaller table), so ~quadrupling and so
on might be necessary ...

I understand.

Here's a suggestion: Try implementing your scheme as you've
outlined it, perhaps cutting a few corners here and there for
simplicity's sake. Load your implementation with detailed logging of
what's going on: Each insertion, each deletion, each lookup, each table
expansion or contraction, and so on. Then run a few test cases and study
the logs.



Would it make sense to only use the secure hash when the table size
exceeds the maximum value of an integer (2^32), and at that point, rather
than using an array, use a specially implemented doubly linked list?
Although after a point you would have to start using B-trees, as it
would not all fit in memory at any one time.

Would this allow fast search access for huge amounts of data, or would it
be slow and nasty because of the expense of the hash and the greater
access time for each element in the list (compared to the array)?

I currently have the hash table implemented and will add the logging as
requested, although I do already know that after inserting 6 million
unique words into a hash table with a size of 10 million, the longest
linked list contained 13 items. This means a search examines at most 13
entries out of 6 million items.

I understand that for hash tables with a size of less than 2^32 the
normal 32-bit hash value is more than enough; I was just wondering about
hash tables that are larger.
 

j1mb0jay

Why didn't I think of that before? A checksum is perfect!

Thanks

Regards j1mb0jay
 

Tom Anderson

I understand that this would be the case, but I have to say that I was
told as part of the module that re-hashing the whole table into a new,
larger table is the best option. This is of course expensive in terms of
CPU time. I think the Java HashMap does that at a load factor of 0.75
(when the table is 75% full) by default.

Yes, I understand that the hash itself would also be intensive if using
SHA-512, but this number is 16 times bigger than the output of the
String.hashCode() function.

Yes, but what are you going to do with that huge number?

tom
 

Tom Anderson

Would it make sense to only use the secure hash when the table size
exceeds the maximum value of an integer (2^32), and at that point, rather
than using an array, use a specially implemented doubly linked list?

LOL. If you're dealing with >2^32 items, a linked list would be a really,
really, *really* bad way to do it.

And a doubly linked list would be truly pessimal.

Although after a point you would have to start using B-trees, as it
would not all fit in memory at any one time.

Would this allow fast search access for huge amounts of data, or would
it be slow and nasty because of the expense of the hash

With that many elements, and the possible need to do IO to fetch things,
the time taken by the hash becomes insignificant. So you're alright there.

and the greater access time for each element in the list (compared to
the array)?

Yes, that pretty much ruins you. Finding a specific index in the linked
list would involve *billions* of cycles, and reading every location in
memory!

Unless it was very, very specially implemented.

I currently have the hash table implemented and will add the logging as
requested, although I do already know that after inserting 6 million
unique words into a hash table with a size of 10 million, the longest
linked list contained 13 items. This means a search examines at most 13
entries out of 6 million items.

Which is not so bad!

I understand that for hash tables with a size of less than 2^32 the
normal 32-bit hash value is more than enough; I was just wondering about
hash tables that are larger.

When you have a hash table with >2^32 objects, choosing a hash function is
probably not your major problem.

If you do need a 64-bit hash, it's trivial to implement Bernstein's hash
with long instead of int, and it'll work just as well. If you need a
hashtable that holds >2^64 items, well, get back to us when you do.
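
For instance, a 64-bit variant might look like this (a sketch of the classic djb2-style loop, just widened to long):

    // Sketch: Bernstein-style (djb2) hash widened to 64 bits.
    static long hash64(String s) {
        long h = 5381L;
        for (int i = 0; i < s.length(); i++) {
            h = h * 33 + s.charAt(i);   // same multiply-and-add scheme, just in a long
        }
        return h;
    }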

tom
 

j1mb0jay

Thank you all very much for the replies to this post; I didn't expect it
to be so popular.

OK, so I have learnt that secure hashes are so slow that they are
completely pointless for the exercise I require. I am now going to
implement versions of the "Bernstein hash". Do people agree that this is
the hash to use? Other people commented on some other hashes; why would
they be better than this one?

I thought I remembered someone saying that for smaller hash tables
(< 2^16) there is no point in setting the table size to a prime number.
I thought this should be done because when I mod the hash down to size,
a prime number would give fewer collisions. Am I wrong?

Does anyone know of a faster prime check than the Sieve of Atkin method
(found on Wikipedia)?

Can anyone explain why a CRC32 hash function is poor for a hash table? I
thought this would be perfect, but a few of you think otherwise. I would
like to keep the hash function a one-at-a-time function, rather than
collecting a list of items that need adding, hashing them all at once,
and then adding them.
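
For what it's worth, this is the kind of one-at-a-time use I mean (a sketch using java.util.zip.CRC32 per key):

    // Sketch: using java.util.zip.CRC32 as a per-key hash function.
    import java.util.zip.CRC32;

    public class CrcIndex {
        static int indexFor(String key, int tableSize) {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes());                  // feed in the key's bytes
            return (int) (crc.getValue() % tableSize);   // getValue() is an unsigned 32-bit value in a long
        }
    }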

Regards j1mb0jay
 

j1mb0jay

Tom said:

There are also two new entrants I've run across lately doing similar
research: SuperFastHash and MurmurHash. They each claim to be faster
than previous algorithms[*] and (at least) MurmurHash claims better
distribution too.

[*] I seem to recall one or both is tuned for X86. I.e. the speed may
only be "superfast" on Intel-type architectures. But this is all from
(imperfect) memory.

And of course for all these hashes the reference implementation is in C
or C++, so would require some programming and possibly tuning for Java.

B.O.

I will make sure I have a good read through these two hash functions;
this gives me three different kinds to implement and compare now. I will
post back to you when I have the details of my implementation of these
two.

I personally like the idea of writing the Java application and then
calling the C code when required from inside Java. Writing a hash
function in C will of course be quick, but not as quick as Intel
assembly!

Regards j1mb0jay
 

Tom Anderson

There are also two new entrants I've run across lately doing similar
research: SuperFastHash and MurmurHash.

SuperFastHash is the one I link to above; people other than its author
seem to refer to it as Hsieh's hash. I came across MurmurHash, but ignored
it for some reason. It does look like a very good hash. Although some guys
see one or two problems:

http://groups.google.co.uk/group/sci.crypt/browse_thread/thread/56aaa4c236c7b095

I should also mention Fowler-Noll-Vo, aka FNV, which is basically the same
as Bernstein, but with a different multiplier and xor instead of addition.
It's probably no better than Bernstein, and not as good as Hsieh or
murmur, but it's widely-used.
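
For comparison, the 32-bit FNV-1a loop looks like this (a sketch; the offset basis and prime are the standard published constants):

    // Sketch: 32-bit FNV-1a. XOR each byte in, then multiply by the FNV prime.
    static int fnv1a32(byte[] data) {
        int h = 0x811C9DC5;            // FNV offset basis (2166136261)
        for (byte b : data) {
            h ^= (b & 0xFF);           // xor instead of addition, as noted above
            h *= 0x01000193;           // FNV prime (16777619)
        }
        return h;
    }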

tom
 

Tom Anderson

Is the hash at the link at the top a personal favourite, or proven to be
good?

The author claims it's good, and supplies some evidence for that claim.

tom
 

Tris Orendorff

Tom Anderson said:
CRC32, at least, is a terrible hash function for hashtables.

Here's a good one:

http://www.azillionmonkeys.com/qed/hash.html

You are all guilty of premature optimization. I have never encountered a problem with a slow hash function or an unbalanced table. The default implementation and behaviour have always been okay and have never become a bottleneck.
 

Roedy Green

No. There could still be collisions. In addition, the time it would
take to calculate a hash would be significant, meaning the performance of
the algorithm would degrade.

I wondered how well this algorithm would perform.

You compute an array of hashcodes for the elements you want to have
fast lookup for in your Hashtable.

If two elements hash to the same value, tough; we are going to have a
collision.

Try dividing all the hashcodes by two.

See if there are collisions. If none, note that factor of 2 down
somewhere, and continue with the reduced hashcodes.

Repeat dividing by 2 until you get collisions.

Now try 3.

Now try 5, etc., working your way up.

You now have a list of prime factors. Multiply them together and
divide your hashcode by that. Use that number to index a lookup
array. If that does not reduce the largest hashcode sufficiently,
redo the work, allowing 2 collisions per stage or something similar.
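
Something along these lines (a rough sketch; only worth doing when the key set is fixed in advance):

    // Rough sketch of the factor-reduction idea above: keep dividing the hashcodes
    // by small primes for as long as the reduced values stay collision-free.
    import java.util.HashSet;
    import java.util.Set;

    public class HashcodeCompressor {

        static long findDivisor(int[] hashcodes) {
            if (hashcodes.length < 2) {
                return 1;                                  // nothing to compress
            }
            long[] current = new long[hashcodes.length];
            for (int i = 0; i < hashcodes.length; i++) {
                current[i] = hashcodes[i] & 0xFFFFFFFFL;   // treat as non-negative values
            }
            long divisor = 1;
            for (int p : new int[] {2, 3, 5, 7, 11, 13, 17, 19}) {
                while (true) {
                    long[] reduced = new long[current.length];
                    for (int i = 0; i < current.length; i++) {
                        reduced[i] = current[i] / p;
                    }
                    if (hasCollision(reduced)) {
                        break;                             // this prime has gone as far as it can
                    }
                    current = reduced;
                    divisor *= p;
                }
            }
            return divisor;   // divide each hashcode by this before indexing the lookup array
        }

        static boolean hasCollision(long[] values) {
            Set<Long> seen = new HashSet<Long>();
            for (long v : values) {
                if (!seen.add(v)) {
                    return true;
                }
            }
            return false;
        }
    }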
 

j1mb0jay

Note that this trick applies only to Hashtable, not
to the newer HashMap and HashSet. (In fairness to the trick's author,
HashMap and HashSet didn't yet exist when the paper was written.)

If you're writing your own new hash implementation
from scratch and if you want to stick to prime-number sizes, just build
in a precalculated array of two dozen or so primes in an approximately
doubling sequence.

Good idea, thank you.

j1mb0jay
 
