Best/better way of md5summing a really large file in Ruby?

Kyle Schmitt

I've got a script that goes through data and, in some cases,
generates md5s of the files. Normally this isn't a problem, but I've
got a few largish (~2 GB) files in there, and my script is dying on them.
I ran it in a screen session, so I'm not sure of the exact error it threw,
but I'm re-running just that part now to find out. In the meantime, any
suggestions?

This is how I'm generating the md5sum right now:
Digest::MD5.hexdigest(File.read(fn))

--Kyle
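
The one-liner above pulls the whole file into memory with File.read before hashing it, which is a plausible reason for trouble at ~2 GB (the actual error is not shown in the thread). Here is a minimal sketch of a streaming alternative, assuming only Ruby's standard digest library. The helper name md5_of_file and the 1 MiB chunk size are illustrative choices, not something taken from the thread or from the linked post.

```ruby
require 'digest/md5'

# Hypothetical helper: computes the MD5 hex digest of a file by feeding
# it to the digest in fixed-size chunks, so memory use stays roughly
# constant regardless of file size.
def md5_of_file(path, chunk_size = 1024 * 1024)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |io|
    buffer = String.new
    # IO#read(length, outbuf) refills the buffer and returns nil at EOF.
    while io.read(chunk_size, buffer)
      digest.update(buffer)
    end
  end
  digest.hexdigest
end
```

The standard library also offers Digest::MD5.file(fn).hexdigest, which reads the file incrementally as well and should return the same value as the helper above.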
 
Yun Huang Yong

Kyle said:
I've got a script that goes through data and, in some cases,
generates md5s of the files. Normally this isn't a problem, but I've
got a few largish (~2 GB) files in there, and my script is dying on them.
I ran it in a screen session, so I'm not sure of the exact error it threw,
but I'm re-running just that part now to find out. In the meantime, any
suggestions?

I googled for 'md5 large files' and ended up here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/184834

yun
 
Reid Thompson

I googled for 'md5 large files' and ended up here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/184834

yun
rthompso@raker /cpartition/hold $ ls -rlt dummyfile
-rw-r--r-- 1 rthompso staff 2147483648 2009-04-22 10:27 dummyfile
rthompso@raker /cpartition/hold $ irb
irb(main):001:0> result = %x[md5sum dummyfile]
=> "a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
irb(main):002:0> p result
"a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
=> nil
irb(main):003:0> def timeit
irb(main):004:1> tstart = Time.now
irb(main):005:1> result = %x[md5sum dummyfile]
irb(main):006:1> tend = Time.now
irb(main):007:1> elapsed = tend - tstart
irb(main):008:1> puts elapsed.to_s
irb(main):009:1> end
=> nil
irb(main):011:0> timeit
10.633416
=> nil
 
Reid Thompson

A more realistic test, using random data:
rthompso@raker /cpartition/hold $ dd if=/dev/urandom of=dummyfile count=4M
4194304+0 records in
4194304+0 records out
2147483648 bytes (2.1 GB) copied, 529.518 s, 4.1 MB/s
rthompso@raker /cpartition/hold $ irb
irb(main):001:0> def timeit
irb(main):002:1> tstart = Time.now
irb(main):003:1> result = %x[md5sum dummyfile]
irb(main):004:1> tend = Time.now
irb(main):005:1> elapsed = tend - tstart
irb(main):006:1> puts elapsed.to_s
irb(main):007:1> end
=> nil
irb(main):008:0> timeit
49.366641
=> nil
irb(main):009:0> timeit
48.416673
=> nil
irb(main):010:0>
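
The irb session above can be condensed into a reusable script. This is only a sketch under a few assumptions: the md5sum tool is on the PATH, and the helper names md5sum_via_shell and time_it are hypothetical, introduced here for illustration. It swaps the hand-rolled Time.now arithmetic for Benchmark.realtime from the standard library and passes the command as an array instead of using %x[].

```ruby
require 'benchmark'

# Hypothetical helper: runs the external md5sum tool on one file and
# returns only the hex digest. Passing the command as an array skips the
# shell, so spaces or quotes in the file name need no shell escaping.
def md5sum_via_shell(path)
  output = IO.popen(['md5sum', path], &:read)
  raise "md5sum failed for #{path}" unless $?.success?
  # md5sum prints "<digest>  <file name>"; keep the first field.
  output.split.first
end

# Times any block and returns [block_result, elapsed_seconds].
def time_it
  result = nil
  elapsed = Benchmark.realtime { result = yield }
  [result, elapsed]
end
```

Called as time_it { md5sum_via_shell('dummyfile') }, this returns the digest together with the elapsed wall-clock seconds, so the same wrapper can time a pure-Ruby digest for comparison.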
 
Kyle Schmitt

Thanks, both of you. I'd rather not shell out using %x[], but I may end
up doing that. I tried the modified MD5, and it actually ran in close
to the same time on my work machine; I'll have to see how it does on
my home one.

--Kyle
 
