Efficient MD5 (or similar) hashes

  • Thread starter Kamus of Kadizhar

Kamus of Kadizhar

Another newbie question:

I have large files I'm dealing with, some 600 MB to 1.2 GB in size, over a
slow network. Transferring one of these files can take 40 minutes to an
hour.

I want to check the integrity of the files after transfer. I can check
the obvious - date, file size - quickly, but what if I want an MD5 hash?

From reading the Python docs, md5 reads the entire file as a string.
That's not practical on a 1 GB file that's network mounted.

The only thing I can think of is to set up an inetd daemon on the server
that will spit out the md5 hash if given the file path/name.

Any other ideas?

-Kamus
 

Erik Max Francis

Kamus said:
I want to check the integrity of the files after transfer. I can check
the obvious - date, file size - quickly, but what if I want an MD5 hash?

From reading the Python docs, md5 reads the entire file as a string.
That's not practical on a 1 GB file that's network mounted.

Python's md5 module accepts incremental updates with strings; the driving
code certainly doesn't have to read the file in all at once. Just read it
a chunk at a time:

import md5

CHUNK_SIZE = 64 * 1024  # any reasonable buffer size

hasher = md5.new()
while True:
    chunk = theFile.read(CHUNK_SIZE)
    if not chunk:
        break
    hasher.update(chunk)
theHash = hasher.hexdigest()
 

Bengt Richter

Kamus said:
Another newbie question:

I have large files I'm dealing with, some 600 MB to 1.2 GB in size, over a
slow network. Transferring one of these files can take 40 minutes to an
hour.

I want to check the integrity of the files after transfer. I can check
the obvious - date, file size - quickly, but what if I want an MD5 hash?

From reading the Python docs, md5 reads the entire file as a string.

I don't know what docs you're reading, but if you read the docs on the md5
module, you'll see you don't have to do that. You can also interactively
type help('md5'), or import md5 followed by help(md5).

Kamus said:
That's not practical on a 1 GB file that's network mounted.

Well, whatever calculates the md5 will have to read all the bytes from the
source you want to check. If you have downloaded a file to another machine,
then the fastest thing will be to run the md5 calculation there, but if you
have a gigabit LAN connection and things aren't busy, I would think it
wouldn't make much difference if you read it that way.

If you have a C/C++ executable utility that calculates md5, it will probably
be fastest to run that directly on the file. You can run it from Python via
popen, if that's the context you want to control it from.
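
For instance, something along these lines (this assumes a GNU md5sum binary
is on the PATH, and the quoting is naive; treat it as a sketch):

import os

def md5_via_md5sum(path):
    """Shell out to md5sum and return the hex digest it reports."""
    # md5sum prints "<digest>  <filename>"; the digest is the first field.
    pipe = os.popen('md5sum "%s"' % path)
    output = pipe.read()
    status = pipe.close()  # None means exit status 0
    if status is not None:
        raise OSError('md5sum failed with status %r' % status)
    return output.split()[0]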

I think there are ways to use RPC to accomplish the same thing remotely, but I haven't played with that.
Kamus said:
The only thing I can think of is to set up an inetd daemon on the server
that will spit out the md5 hash if given the file path/name.

Any other ideas?

Describe your setup in a little more detail. Someone has probably done it before.
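
As a rough illustration of the hash-it-on-the-server idea, a minimal XML-RPC
sketch might look like this (the port number and function name are made up,
and there's no error handling or access control):

import md5
import SimpleXMLRPCServer

def remote_md5(path):
    """Hash the named file locally on the server, return the hex digest."""
    hasher = md5.new()
    f = open(path, 'rb')
    while True:
        chunk = f.read(64 * 1024)
        if not chunk:
            break
        hasher.update(chunk)
    f.close()
    return hasher.hexdigest()

server = SimpleXMLRPCServer.SimpleXMLRPCServer(('', 8000))
server.register_function(remote_md5)
server.serve_forever()

The client side would then be something like
xmlrpclib.ServerProxy('http://theserver:8000').remote_md5('/path/to/file'),
and you'd compare that digest against the one computed locally.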

Regards,
Bengt Richter
 

Bengt Richter

Erik Max Francis said:
Python's md5 module accepts incremental updates with strings; the driving
code certainly doesn't have to read the file in all at once. Just read it
a chunk at a time:

hasher = md5.new()
while True:
    chunk = theFile.read(CHUNK_SIZE)
    if not chunk:
        break
    hasher.update(chunk)
theHash = hasher.hexdigest()

PMJI (pardon my jumping in), but don't forget to open the file in binary
mode, e.g., theFile = file(thePath, 'rb'), if you're on Windows.
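
Put together, a self-contained sketch might look something like this
(md5_file, the 64 KB default chunk size, and the example path are just
illustrative choices):

import md5

def md5_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    hasher = md5.new()
    f = open(path, 'rb')  # binary mode, so Windows doesn't mangle the bytes
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            hasher.update(chunk)
    finally:
        f.close()
    return hasher.hexdigest()

print md5_file('/path/to/bigfile')  # example path only

That keeps memory use down to one chunk at a time, no matter how large the
file is.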

Regards,
Bengt Richter
 
