Question regarding checksumming of a file


Andrew Robert

Good evening,

I need to generate checksums of a file, store the value in a variable,
and pass it along for later comparison.

The MD5 module would seem to do the trick but I'm sketchy on implementation.


The nearest I can see would be

import md5

m=md5.new()
contents = open(self.file_name,"rb").read()
check=md5.update(contents)

However this does not appear to be actually returning the checksum.

Does anyone have insight into where I am going wrong?

Any help you can provide would be greatly appreciated.

Thanks
 

Andrew Robert

Actually, I think I got it but would like to confirm this looks right.

import md5
checksum = md5.new()
mfn = open(self.file_name, 'r')
for line in mfn.readlines():
    checksum.update(line)
mfn.close()
cs = checksum.hexdigest()
print cs

The value cs should contain the MD5 checksum or did I miss something?

Any help you can provide would be greatly appreciated.

Thanks
 

Roy Smith

Andrew Robert said:
Good evening,

I need to generate checksums of a file, store the value in a variable,
and pass it along for later comparison.

The MD5 module would seem to do the trick but I'm sketchy on implementation.


The nearest I can see would be

import md5

m=md5.new()
contents = open(self.file_name,"rb").read()
check=md5.update(contents)

However this does not appear to be actually returning the checksum.

Does anyone have insight into where I am going wrong?

After calling update(), you need to call digest(). update() only updates
the internal state of the md5 state machine; digest() returns the hash.
Also, for the code above, it's m.update(), not md5.update(): update() is a
method of an md5 instance object, not of the md5 module itself.

Lastly, the md5 algorithm is known to be weak. If you're doing md5 to
maintain compatibility with some pre-existing implementation, that's one
thing. But, if you're starting something new from scratch, I would suggest
using SHA-1 instead (see the sha module). SHA-1 is much stronger
cryptographically than md5. The Python API is virtually identical, so it's
no added work to switch to the stronger algorithm.
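
To illustrate the update()/digest() distinction, here is a minimal sketch. It assumes a current Python, where the hashlib module takes the place of the old md5 and sha modules; the helper name hex_digest is made up for this example and is not from the thread.

```python
import hashlib

def hex_digest(data, algorithm="md5"):
    """Return the hex digest of a bytes object (hypothetical helper)."""
    h = hashlib.new(algorithm)   # a hash object, comparable to md5.new() above
    result = h.update(data)      # update() only feeds data in; it returns None
    assert result is None
    return h.hexdigest()         # digest()/hexdigest() return the actual hash

print(hex_digest(b"hello"))           # MD5
print(hex_digest(b"hello", "sha1"))   # SHA-1: same calls, different algorithm name
```

Switching algorithms only changes the name passed in, which is the sense in which the two APIs are virtually identical.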
 

Andrew Robert

Roy said:
After calling update(), you need to call digest(). update() only updates
the internal state of the md5 state machine; digest() returns the hash.
Also, for the code above, it's m.update(), not md5.update(): update() is a
method of an md5 instance object, not of the md5 module itself.

Lastly, the md5 algorithm is known to be weak. If you're doing md5 to
maintain compatibility with some pre-existing implementation, that's one
thing. But, if you're starting something new from scratch, I would suggest
using SHA-1 instead (see the sha module). SHA-1 is much stronger
cryptographically than md5. The Python API is virtually identical, so it's
no added work to switch to the stronger algorithm.

Hi Roy,

This is strictly for checking if a file was corrupted during transit
over an MQSeries channel.

The check is not intended to be used for crypto purposes.
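
A rough sketch of that kind of transit check, assuming a current Python with hashlib. The function names and the sample payload are invented for illustration, and nothing here touches MQSeries itself: the sender computes a checksum, the receiver recomputes it and compares.

```python
import hashlib

def checksum(payload):
    """Hex MD5 of a bytes payload (hypothetical helper)."""
    return hashlib.md5(payload).hexdigest()

def arrived_intact(payload, expected_hex):
    """True if the received payload still matches the sender's checksum."""
    return checksum(payload) == expected_hex

original = b"message body sent over the channel"
sent_sum = checksum(original)                 # computed before sending

corrupted = original[:-1] + b"?"              # simulate one damaged byte
print(arrived_intact(original, sent_sum))     # unchanged payload
print(arrived_intact(corrupted, sent_sum))    # damaged payload
```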
 

Ant

A script I use for comparing files by MD5 sum uses the following
function, which you may find helps:

def getSum(self):
    md5Sum = md5.new()

    f = open(self.filename, 'rb')

    for line in f:
        md5Sum.update(line)

    f.close()

    return md5Sum.hexdigest()
 

Paul Rubin

Ant said:
def getSum(self):
    md5Sum = md5.new()
    f = open(self.filename, 'rb')
    for line in f:
        md5Sum.update(line)
    f.close()
    return md5Sum.hexdigest()

This should work, but there is one hazard if the file is very large
and is not a text file. You're trying to read one line at a time from
it, which means a contiguous string of characters up to a newline.
Depending on the file contents, that could mean gigabytes which get
read into memory. So it's best to read a fixed size amount in each
operation, e.g. (untested):

def getblocks(f, blocksize=1024):
    while True:
        s = f.read(blocksize)
        if not s: return
        yield s

then change "for line in f" to "for line in f.getblocks()".

I actually think an iterator like the above should be added to the
stdlib, since the "for line in f" idiom is widely used and sometimes
inadvisable, like the fixed sized buffers in those old C programs
that led to buffer overflow bugs.
 

Andrew Robert

When I run the script, I get an error that the file object does not have
the attribute getblocks.

Did you mean this instead?

def getblocks(f, blocksize=1024):
    while True:
        s = f.read(blocksize)
        if not s: return
        yield s

def getsum(self):
    md5sum = md5.new()
    f = open(self.file_name, 'rb')
    for line in getblocks(f):
        md5sum.update(line)
    f.close()
    return md5sum.hexdigest()
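
For reference, a sketch of the same block-wise approach in current Python, assuming hashlib in place of the md5 module. The name md5_of_stream is hypothetical, and an in-memory io.BytesIO stands in for a real binary file so the example runs without touching disk.

```python
import hashlib
import io

def getblocks(f, blocksize=1024):
    """Yield successive chunks of at most blocksize bytes from file object f."""
    while True:
        s = f.read(blocksize)
        if not s:        # an empty read means end of file
            return
        yield s

def md5_of_stream(f, blocksize=1024):
    """Hex MD5 of everything readable from f, holding one block in memory at a time."""
    md5sum = hashlib.md5()
    for block in getblocks(f, blocksize):
        md5sum.update(block)
    return md5sum.hexdigest()

# 5000 bytes with no newline anywhere: "for line in f" would read it all at once.
data = b"\x00" * 5000
print([len(b) for b in getblocks(io.BytesIO(data))])
print(md5_of_stream(io.BytesIO(data)) == hashlib.md5(data).hexdigest())
```

The block size does not affect the result; it only bounds how much is read per call.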
 

Heiko Wundram

On Sunday 14 May 2006 20:51, Andrew Robert wrote:
def getblocks(f, blocksize=1024):
    while True:
        s = f.read(blocksize)
        if not s: return
        yield s

This won't work. The following will:

def getblocks(f, blocksize=1024):
    while True:
        s = f.read(blocksize)
        if not s: break
        yield s

--- Heiko.
 

Paul Rubin

Andrew Robert said:
When I run the script, I get an error that the file object does not have
the attribute getblocks.

Woops, yes, you have to call getblocks(f). Also, Heiko says you can't
use "return" to break out of the generator; I thought you could but
maybe I got confused.
 

Heiko Wundram

On Sunday 14 May 2006 22:29, Paul Rubin wrote:
Woops, yes, you have to call getblocks(f). Also, Heiko says you can't
use "return" to break out of the generator; I thought you could but
maybe I got confused.

Yeah, you can. You can't use return <arg> in a generator (of course, this
raises a SyntaxError), but a bare return is fine: it ends the generator, which
the caller sees as StopIteration. So, it wasn't you who was confused... ;-)

--- Heiko.
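
A small sketch confirming the point, written for current Python; the two function names are made up for the comparison. A bare return and a break out of the loop end the generator in the same way. (In Python 3.3 and later, return with a value is also accepted inside a generator, so the SyntaxError mentioned above applies to the Python 2 of this thread.)

```python
import io

def blocks_with_return(f, blocksize=4):
    while True:
        s = f.read(blocksize)
        if not s:
            return       # bare return: the generator simply finishes
        yield s

def blocks_with_break(f, blocksize=4):
    while True:
        s = f.read(blocksize)
        if not s:
            break        # falls off the end of the function: same effect
        yield s

data = b"abcdefghij"
print(list(blocks_with_return(io.BytesIO(data))))
print(list(blocks_with_break(io.BytesIO(data))))
```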
 
