Ben said:
> Basically I want to generate an md5 hash from considerably large files
> to determine if they are exactly the same. Is there a better way to do
> this besides comparing md5 hashes?
> Thanks for your help.

I ran a few tests to compare the performance of different comparison
methods: plain string comparison, the zlib library's crc32 checksum,
and the Digest::MD5 hash. Each file is read in chunks; the 1K, 10K,
and 100K labels refer to the chunk size, and "Whole" reads the entire
file in one go.
The test files were identical Ogg Vorbis audio files just below 8MB in
size (identical files should give worst-case performance, since no
comparison can exit early). Times are for 100 repetitions.
Rehearsal -------------------------------------------------------
...... removed for brevity
-------------------------------------------- total: 214.900000sec
                   user     system      total        real
String 1K     13.400000   4.250000  17.650000 ( 10.612437)
String 10K     7.633333   4.716667  12.350000 (  7.420777)
String 100K    7.616667   4.166667  11.783333 (  7.071255)
String Whole   7.300000   6.433333  13.733333 (  8.260925)
CRC32 1K      16.700000   4.466667  21.166667 ( 12.774677)
CRC32 10K      9.833333   4.600000  14.433333 (  8.769574)
CRC32 100K     9.383333   4.166667  13.550000 (  8.129907)
CRC32 Whole    9.016667   6.333333  15.350000 (  9.221654)
MD5 1K        26.833333   4.833333  31.666667 ( 19.087961)
MD5 10K       16.133333   4.333333  20.466667 ( 12.327322)
MD5 100K      15.216667   4.083333  19.300000 ( 11.703880)
MD5 Whole     14.633333   6.333333  20.966667 ( 12.634441)
Notice that MD5 is significantly slower than plain string comparison:
roughly 1.5 to 1.8 times the wall-clock time at every chunk size. The
results also show little performance gain between 10KB and 100KB
buffers, indicating that somewhere in the 10K range would be a good
buffer size for the memory/performance tradeoff.
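
If a digest is still needed, for example because the two files live on
different machines and only the hashes can be exchanged, the hash can
be built up chunk by chunk so that memory use stays bounded. The
following is a minimal sketch, not part of the benchmark; the helper
name md5_of_file and its 10K default block size are assumptions made
for illustration.

```ruby
require 'digest/md5'

# Hypothetical helper: compute the MD5 hex digest of a file by feeding it
# to the digest object one block at a time instead of reading it whole.
def md5_of_file(path, block_size = 10 * 1024)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |file|
    # IO#read with a positive length returns nil at end of file,
    # which terminates the loop.
    while (chunk = file.read(block_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end
```

Two files can then be compared with md5_of_file(a) == md5_of_file(b).
The standard library's Digest::MD5.file(path).hexdigest should give the
same result without a hand-written loop.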
Of course if you really need speed you may want to code in C and
improve these times further, but a comparison rate of almost 100MB per
second isn't too shabby.
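
The rate quoted above can be reproduced from the "real" column with a
little arithmetic. This is only a rough sketch: the file size is
rounded up to 8MB (the files were "just below" that), and the rate is
counted per file, although each test reads two files.

```ruby
# Approximate throughput in MB/s for one file of the pair.
# file_mb is an assumption (rounded up to 8 MB); real_seconds comes from
# the "real" column of the benchmark output.
def throughput_mb_per_s(file_mb, repetitions, real_seconds)
  file_mb * repetitions / real_seconds
end

timings = {
  'String 100K'  => 7.071255,
  'String Whole' => 8.260925,
  'MD5 100K'     => 11.703880
}

timings.each do |label, real_seconds|
  rate = throughput_mb_per_s(8.0, 100, real_seconds)
  puts format('%-12s %6.1f MB/s', label, rate)
end
```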
Here's the test code for those interested:
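
require 'zlib'
require 'digest/md5'
require 'benchmark'

# Read both files in lockstep, yielding one pair of blocks at a time.
# A nil block_size makes read return the rest of the file in one call.
def step_blocks(file_a, file_b, block_size)
  until file_a.eof?
    a = file_a.read(block_size)
    b = file_b.read(block_size)
    yield a, b
  end
end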

def test_string_equality(file_a, file_b, block_size)
  step_blocks(file_a, file_b, block_size) do |a, b|
    return false unless a == b
  end
  true
end

def test_crc32_equality(file_a, file_b, block_size)
  step_blocks(file_a, file_b, block_size) do |a, b|
    return false unless Zlib.crc32(a) == Zlib.crc32(b)
  end
  true
end

def test_md5_equality(file_a, file_b, block_size)
  step_blocks(file_a, file_b, block_size) do |a, b|
    return false unless Digest::MD5.digest(a) == Digest::MD5.digest(b)
  end
  true
end
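
def test_files(filename_a, filename_b, test_method, block_size)
  # File.exists? was removed in Ruby 3.2; File.exist? is the supported name.
  raise ArgumentError unless File.exist?(filename_a) && File.exist?(filename_b)
  return false unless File.size(filename_a) == File.size(filename_b)

  # Binary mode avoids newline translation, and the block form closes both
  # files even if the comparison raises.
  File.open(filename_a, 'rb') do |file_a|
    File.open(filename_b, 'rb') do |file_b|
      # Pass block_size directly: on Ruby 1.9+ a splatted nil expands to no
      # arguments, which would break the whole-file case.
      send(test_method, file_a, file_b, block_size)
    end
  end
end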

FILE1 = "a.ogg"
FILE2 = "b.ogg"
REPEATS = 100

# Label prefix => comparison method, in the order reported above.
TESTS = {
  "String" => :test_string_equality,
  "CRC32"  => :test_crc32_equality,
  "MD5"    => :test_md5_equality
}

# Label suffix => block size in bytes; nil reads the whole file at once.
BLOCK_SIZES = { "1K" => 1024, "10K" => 10240, "100K" => 102400, "Whole" => nil }

if $0 == __FILE__
  Benchmark.bmbm(20) do |x|
    TESTS.each do |name, test_method|
      BLOCK_SIZES.each do |size_label, block_size|
        x.report("#{name} #{size_label}") do
          REPEATS.times { test_files(FILE1, FILE2, test_method, block_size) }
        end
      end
    end
  end
end
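
Outside of a benchmark, the plain byte-for-byte comparison that came
out fastest above does not have to be written by hand: Ruby's standard
library ships FileUtils.compare_file, which returns true when two files
have identical contents. A minimal sketch follows; the wrapper name
same_contents? is hypothetical, and the size check in front of it is
only there to skip reading files that cannot possibly match.

```ruby
require 'fileutils'

# Hypothetical wrapper: cheap size check first, then a byte-for-byte
# comparison of the contents via the standard library.
def same_contents?(path_a, path_b)
  return false unless File.size(path_a) == File.size(path_b)
  FileUtils.compare_file(path_a, path_b)
end
```

Usage would look like same_contents?("a.ogg", "b.ogg"), with the file
names taken from the test code above.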