R
Roedy Green
I have been benchmarking some hashing algorithms. The fastest and
fewest collisions are near the top. I used real world data.
The big surprise for me is how good String.hashCode is.
Collecting 20000 filenames from drive E...
The best have the lowest elapsed times and the fewest collisions.
Benchmarking String.hashCode...
Computing 100000 String.hashCode hashes...
Time to compute 100000 iterations of String.hashCode was 12.200
seconds.
Checking 20000 String.hashCode for collisions...
There were 0 collisions.
------
Benchmarking SunCRC32...
Computing 100000 SunCRC32 hashes...
Time to compute 100000 iterations of SunCRC32 was 66.515 seconds.
Checking 20000 SunCRC32 for collisions...
There were 0 collisions.
------
Benchmarking FNV1a32...
Computing 100000 FNV1a32 hashes...
Time to compute 100000 iterations of FNV1a32 was 48.146 seconds.
Checking 20000 FNV1a32 for collisions...
There were 0 collisions.
------
Benchmarking Adler...
Computing 100000 Adler hashes...
Time to compute 100000 iterations of Adler was 45.458 seconds.
Checking 20000 Adler for collisions...
There were 39 collisions.
------
Benchmarking SunCRC16...
Computing 100000 SunCRC16 hashes...
Time to compute 100000 iterations of SunCRC16 was 215.462 seconds.
Checking 20000 SunCRC16 for collisions...
There were 2768 collisions.
------
I will be adding some more. If you want be to test your favourite,
just implement this interface:
/**
* interface that an algorithm must provide to be benchmarked
*/
interface HashAlgorithm
{
/**
* identifying info
*/
String getAlgorithmName();
/**
* compute hash, must be converted to a String for collision
checking.
* Slow version, used for collision checking.
*/
String computeHashString( String s );
/**
* It need not include code to convert to string.
* Fast version, used for timing
*/
void computeHash( String s );
}
If you are inventing a hashing algorithm that do not involve every
byte, keep in mind that in the real world the Strings you are hashing
often begin and end with the same strings.
e.g. E:\mindprod .... .html
You might be able to cook up some special algorithm just for absolute
filenames.
fewest collisions are near the top. I used real world data.
The big surprise for me is how good String.hashCode is.
Collecting 20000 filenames from drive E...
The best have the lowest elapsed times and the fewest collisions.
Benchmarking String.hashCode...
Computing 100000 String.hashCode hashes...
Time to compute 100000 iterations of String.hashCode was 12.200
seconds.
Checking 20000 String.hashCode for collisions...
There were 0 collisions.
------
Benchmarking SunCRC32...
Computing 100000 SunCRC32 hashes...
Time to compute 100000 iterations of SunCRC32 was 66.515 seconds.
Checking 20000 SunCRC32 for collisions...
There were 0 collisions.
------
Benchmarking FNV1a32...
Computing 100000 FNV1a32 hashes...
Time to compute 100000 iterations of FNV1a32 was 48.146 seconds.
Checking 20000 FNV1a32 for collisions...
There were 0 collisions.
------
Benchmarking Adler...
Computing 100000 Adler hashes...
Time to compute 100000 iterations of Adler was 45.458 seconds.
Checking 20000 Adler for collisions...
There were 39 collisions.
------
Benchmarking SunCRC16...
Computing 100000 SunCRC16 hashes...
Time to compute 100000 iterations of SunCRC16 was 215.462 seconds.
Checking 20000 SunCRC16 for collisions...
There were 2768 collisions.
------
I will be adding some more. If you want be to test your favourite,
just implement this interface:
/**
* interface that an algorithm must provide to be benchmarked
*/
interface HashAlgorithm
{
/**
* identifying info
*/
String getAlgorithmName();
/**
* compute hash, must be converted to a String for collision
checking.
* Slow version, used for collision checking.
*/
String computeHashString( String s );
/**
* It need not include code to convert to string.
* Fast version, used for timing
*/
void computeHash( String s );
}
If you are inventing a hashing algorithm that do not involve every
byte, keep in mind that in the real world the Strings you are hashing
often begin and end with the same strings.
e.g. E:\mindprod .... .html
You might be able to cook up some special algorithm just for absolute
filenames.