Question regarding checksumming of a file

Andrew Robert · May 14, 2006

Good evening,

I need to generate checksums of a file, store the value in a variable,
and pass it along for later comparison.

The MD5 module would seem to do the trick but I'm sketchy on implementation.

The nearest I can see would be

import md5

m=md5.new()
contents = open(self.file_name,"rb").read()
check=md5.update(contents)

However this does not appear to be actually returning the checksum.

Does anyone have insight into where I am going wrong?

Any help you can provide would be greatly appreciated.

Thanks

Edward Elliott · May 14, 2006

Andrew said:
m=md5.new()
contents = open(self.file_name,"rb").read()
check=md5.update(contents)

However this does not appear to be actually returning the checksum.

the docs are your friend, use them. hint: first you eat, then you...
http://docs.python.org/lib/module-md5.html

Andrew Robert · May 14, 2006

Actually, I think I got it but would like to confirm this looks right.

import md5
checksum = md5.new()
mfn = open(self.file_name, 'r')
for line in mfn.readlines():
    checksum.update(line)
mfn.close()
cs = checksum.hexdigest()
print cs

The value cs should contain the MD5 checksum or did I miss something?

Any help you can provide would be greatly appreciated.

Thanks

Roy Smith · May 14, 2006

Andrew Robert said:
Good evening,

I need to generate checksums of a file, store the value in a variable,
and pass it along for later comparison.

The MD5 module would seem to do the trick but I'm sketchy on implementation.

The nearest I can see would be

import md5

m=md5.new()
contents = open(self.file_name,"rb").read()
check=md5.update(contents)

However this does not appear to be actually returning the checksum.

Does anyone have insight into where I am going wrong?

After calling update(), you need to call digest(). Update() only updates
the internal state of the md5 state machine; digest() returns the hash.
Also, for the code above, it's m.update(), not md5.update(). Update() is a
method of an md5 instance object, not the md5 module itself.
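
A minimal sketch of the distinction described above, for illustration only. It assumes a Python that ships `hashlib` (2.5 and later), which superseded the standalone `md5` module used in this thread; `checksum_of` is a hypothetical helper name, not something from the original posts.

```python
import hashlib

def checksum_of(data):
    """Return the MD5 checksum of a bytes object as a hex string."""
    m = hashlib.md5()      # a hash *object*; update() is called on it, not on the module
    m.update(data)         # feeds data into the internal state and returns None
    return m.hexdigest()   # only digest()/hexdigest() hand back the checksum

check = checksum_of(b"example contents")
```

The value stored in `check` is an ordinary string, so it can be kept in a variable and compared later.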

Lastly, the md5 algorithm is known to be weak. If you're doing md5 to
maintain compatibility with some pre-existing implementation, that's one
thing. But, if you're starting something new from scratch, I would suggest
using SHA-1 instead (see the sha module). SHA-1 is much stronger
cryptographically than md5. The Python API is virtually identical, so it's
no added work to switch to the stronger algorithm.
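
To illustrate how little changes when switching algorithms, here is a hedged sketch. It assumes `hashlib` (the later replacement for the separate `md5` and `sha` modules mentioned above); `hex_checksum` and its `algorithm` parameter are hypothetical names chosen for this example.

```python
import hashlib

def hex_checksum(data, algorithm="sha1"):
    """Hash a bytes object with the named algorithm and return the hex digest."""
    h = hashlib.new(algorithm)   # "md5" and "sha1" expose the same update()/hexdigest() API
    h.update(data)
    return h.hexdigest()

md5_value = hex_checksum(b"example contents", "md5")
sha1_value = hex_checksum(b"example contents", "sha1")
```

Only the algorithm name differs between the two calls; the MD5 digest is 32 hex characters long and the SHA-1 digest is 40.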

Andrew Robert · May 14, 2006

Roy said:
After calling update(), you need to call digest(). Update() only updates
the internal state of the md5 state machine; digest() returns the hash.
Also, for the code above, it's m.update(), not md5.update(). Update() is a
method of an md5 instance object, not the md5 module itself.

Lastly, the md5 algorithm is known to be weak. If you're doing md5 to
maintain compatibility with some pre-existing implementation, that's one
thing. But, if you're starting something new from scratch, I would suggest
using SHA-1 instead (see the sha module). SHA-1 is much stronger
cryptographically than md5. The Python API is virtually identical, so it's
no added work to switch to the stronger algorithm.

Hi Roy,

This is strictly for checking if a file was corrupted during transit
over an MQSeries channel.

The check is not intended to be used for crypto purposes.
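
For that kind of corruption check, the comparison step might look like the sketch below. It is only an illustration under assumptions: it uses `hashlib` rather than the older `md5` module, it hashes in-memory bytes rather than a real transferred file, and `md5_hex` and `is_intact` are hypothetical helper names. Nothing here touches MQSeries itself.

```python
import hashlib

def md5_hex(data):
    """Return the MD5 hex digest of a bytes object."""
    return hashlib.md5(data).hexdigest()

def is_intact(received_data, expected_checksum):
    """True if the received bytes hash to the checksum computed by the sender."""
    return md5_hex(received_data) == expected_checksum

original = b"payload sent over the channel"
sent_checksum = md5_hex(original)            # computed before sending, passed along

ok = is_intact(original, sent_checksum)                                 # unchanged data
corrupted = is_intact(b"payload sent over the channeL", sent_checksum)  # one byte differs
```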

Ant · May 14, 2006

A script I use for comparing files by MD5 sum uses the following
function, which you may find helps:

def getSum(self):
    md5Sum = md5.new()

    f = open(self.filename, 'rb')

    for line in f:
        md5Sum.update(line)

    f.close()

    return md5Sum.hexdigest()

Paul Rubin · May 14, 2006

Ant said:
def getSum(self):
    md5Sum = md5.new()
    f = open(self.filename, 'rb')
    for line in f:
        md5Sum.update(line)
    f.close()
    return md5Sum.hexdigest()

This should work, but there is one hazard if the file is very large
and is not a text file. You're trying to read one line at a time from
it, which means a contiguous string of characters up to a newline.
Depending on the file contents, that could mean gigabytes which get
read into memory. So it's best to read a fixed size amount in each
operation, e.g. (untested):

def getblocks(f, blocksize=1024):
    while True:
        s = f.read(blocksize)
        if not s: return
        yield s

then change "for line in f" to "for line in f.getblocks()".

I actually think an iterator like the above should be added to the
stdlib, since the "for line in f" idiom is widely used and sometimes
inadvisable, like the fixed sized buffers in those old C programs
that led to buffer overflow bugs.
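
A small sketch of the hazard described above, using an in-memory stream in place of a real file (an assumption made so the example is self-contained). The 10000-byte size and the variable names are arbitrary choices for illustration.

```python
import io

data = b"\x00" * 10000   # binary data containing no newline bytes at all

# Line iteration: with no newline to stop at, the whole stream is one "line".
lines = list(io.BytesIO(data))
largest_line = max(len(line) for line in lines)

# Fixed-size reads: no single read returns more than the block size.
blocks = []
f = io.BytesIO(data)
while True:
    s = f.read(1024)
    if not s:
        break
    blocks.append(s)
largest_block = max(len(b) for b in blocks)
```

Here the single "line" is as large as the entire input, while the largest block never exceeds 1024 bytes, which is why block reads keep memory use bounded regardless of file contents.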

Andrew Robert · May 14, 2006

When I run the script, I get an error that the file object does not have
the attribute getblocks.

Did you mean this instead?

def getblocks(f, blocksize=1024):
    while True:
        s = f.read(blocksize)
        if not s: return
        yield s

def getsum(self):
    md5sum = md5.new()
    f = open(self.file_name, 'rb')
    for line in getblocks(f):
        md5sum.update(line)
    f.close()
    return md5sum.hexdigest()
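
For reference, a self-contained variant of the same idea that can be run as is. This is a sketch under assumptions rather than the posters' exact code: it uses `hashlib` in place of the `md5` module, takes the file name as a plain argument instead of `self.file_name`, and checksums a throwaway temporary file created only for the demonstration.

```python
import hashlib
import os
import tempfile

def getblocks(f, blocksize=1024):
    """Yield successive blocks of at most blocksize bytes from an open file."""
    while True:
        s = f.read(blocksize)
        if not s:
            return
        yield s

def getsum(file_name):
    """Return the MD5 hex digest of a file, reading it block by block."""
    md5sum = hashlib.md5()
    with open(file_name, "rb") as f:   # binary mode; the file is closed automatically
        for block in getblocks(f):     # getblocks is a plain function, called with f
            md5sum.update(block)
    return md5sum.hexdigest()

# Usage: write some random bytes to a temporary file and checksum it.
payload = os.urandom(5000)
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(payload)
    file_sum = getsum(path)
finally:
    os.remove(path)
```

Feeding the file to `update()` in blocks produces the same digest as hashing the whole contents in one call.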

Heiko Wundram · May 14, 2006

On Sunday, 14 May 2006 at 20:51, Andrew Robert wrote:

def getblocks(f, blocksize=1024):
    while True:
        s = f.read(blocksize)
        if not s: return
        yield s

This won't work. The following will:

def getblocks(f, blocksize=1024):
    while True:
        s = f.read(blocksize)
        if not s: break
        yield s

--- Heiko.

Paul Rubin · May 14, 2006

Andrew Robert said:
When I run the script, I get an error that the file object does not have
the attribute getblocks.

Woops, yes, you have to call getblocks(f). Also, Heiko says you can't
use "return" to break out of the generator; I thought you could but
maybe I got confused.

Heiko Wundram · May 14, 2006

On Sunday, 14 May 2006 at 22:29, Paul Rubin wrote:

Woops, yes, you have to call getblocks(f). Also, Heiko says you can't
use "return" to break out of the generator; I thought you could but
maybe I got confused.

Yeah, you can. You can't return <arg> in a generator (of course, this raises a
SyntaxError), but you can use return to generate a raise StopIteration. So,
it wasn't you who was confused... ;-)

--- Heiko.
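
A quick sketch confirming that the two variants behave the same: a bare `return` and a `break` out of the loop both end the generator, and the caller sees `StopIteration`. The function names, the in-memory stream, and the small block size are assumptions made for this illustration.

```python
import io

def blocks_with_return(f, blocksize=4):
    while True:
        s = f.read(blocksize)
        if not s:
            return          # a bare return ends the generator
        yield s

def blocks_with_break(f, blocksize=4):
    while True:
        s = f.read(blocksize)
        if not s:
            break           # falling off the end of the function ends it too
        yield s

data = b"abcdefghij"
via_return = list(blocks_with_return(io.BytesIO(data)))
via_break = list(blocks_with_break(io.BytesIO(data)))

# An exhausted generator signals its end to the caller with StopIteration.
gen = blocks_with_return(io.BytesIO(b""))
try:
    next(gen)
    stopped = False
except StopIteration:
    stopped = True
```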
