binary file compare...


SpreadTooThin

I want to compare two binary files and see if they are the same.
I see the filecmp.cmp function but I don't get a warm fuzzy feeling
that it is doing a byte-by-byte comparison of two files to see if they
are the same.

What should I be using if not filecmp.cmp?
 

SpreadTooThin

Perhaps I'm being dim, but how else are you going to decide if
two files are the same unless you compare the bytes in the
files?

You could hash them and compare the hashes, but that's a lot
more work than just comparing the two byte streams.


I don't understand what you've got against comparing the files
when you stated that what you wanted to do was compare the files.

I think it's just the way the documentation was worded
http://www.python.org/doc/2.5.2/lib/module-filecmp.html

Unless shallow is given and is false, files with identical os.stat()
signatures are taken to be equal.
Files that were compared using this function will not be compared
again unless their os.stat() signature changes.

So to do a comparison:
filecmp.cmp(filea, fileb, False)
?
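
A self-contained sketch of that call (the file names and temp directory here are made up for illustration):

```python
import filecmp
import os
import tempfile

# Sketch only: create two files with identical contents, then compare
# with shallow=False so filecmp must look at the actual bytes rather
# than trusting matching os.stat() signatures.
d = tempfile.mkdtemp()
file_a = os.path.join(d, "filea")
file_b = os.path.join(d, "fileb")
for path in (file_a, file_b):
    with open(path, "wb") as f:
        f.write(b"identical bytes")

result = filecmp.cmp(file_a, file_b, shallow=False)
print(result)  # True: the contents match byte for byte
```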
 

SpreadTooThin

Doh!  I misread your post and thought you weren't getting a
warm fuzzy feeling _because_ it was doing a byte-by-byte
compare. Now I'm a bit confused.  Are you under the impression
it's _not_ doing a byte-by-byte compare?  Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me.  Note that when
this function is called the file lengths have already been
compared and found to be equal.

--
Grant Edwards                   grante             Yow! Alright, you!!
                                  at               Imitate a WOUNDED SEAL
                               visi.com            pleading for a PARKING
                                                   SPACE!!

I am indeed under the impression that it is not always doing a byte by
byte comparison...
As well, the documentation states:
Compare the files named f1 and f2, returning True if they seem equal,
False otherwise.

That word... Seeeeem... makes me wonder.

Thanks for the code! :)
 

Peter Otten

Grant said:
Doh! I misread your post and thought you weren't getting a
warm fuzzy feeling _because_ it was doing a byte-by-byte
compare. Now I'm a bit confused. Are you under the impression
it's _not_ doing a byte-by-byte compare? Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me. Note that when
this function is called the file lengths have already been
compared and found to be equal.

But there's a cache. A change of file contents may go undetected as long as
the file stats don't change:

$ cat fool_filecmp.py
import filecmp, shutil, sys

for fn in "adb":
    with open(fn, "w") as f:
        f.write("yadda")

shutil.copystat("d", "a")
filecmp.cmp("a", "b", False)

with open("a", "w") as f:
    f.write("*****")
shutil.copystat("d", "a")

if "--clear" in sys.argv:
    print "clearing cache"
    filecmp._cache.clear()

if filecmp.cmp("a", "b", False):
    print "file a and b are equal"
else:
    print "file a and b differ"
print "a's contents:", open("a").read()
print "b's contents:", open("b").read()

$ python2.6 fool_filecmp.py
file a and b are equal
a's contents: *****
b's contents: yadda

Oops. If you are paranoid you have to clear the cache before doing the
comparison:

$ python2.6 fool_filecmp.py --clear
clearing cache
file a and b differ
a's contents: *****
b's contents: yadda
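
If the cache worries you, a direct chunked comparison with no caching layer at all is only a few lines. This is a sketch in the spirit of `_do_cmp` above, not stdlib code (`files_equal` is a made-up name):

```python
def files_equal(path1, path2, bufsize=8192):
    """Compare two files chunk by chunk; no stat cache involved."""
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            b1 = f1.read(bufsize)
            b2 = f2.read(bufsize)
            if b1 != b2:
                return False
            if not b1:  # both streams exhausted at the same point
                return True
```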

Peter
 

Steven D'Aprano

Perhaps I'm being dim, but how else are you going to decide if two files
are the same unless you compare the bytes in the files?

If you start with an image in one format (e.g. PNG), and convert it to
another format (e.g. JPEG), you might want the two files to compare equal
even though their byte contents are completely different, because their
content (the image itself) is visually identical.

Or you might want a heuristic as a short cut for comparing large files,
and decide that if two files have the same size and modification dates,
and the first (say) 100KB are equal, that you will assume the rest are
probably equal too.

Neither of these are what the OP wants, I'm just mentioning them to
answer your rhetorical question :)
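
That second heuristic might be sketched like so (`probably_equal` is a made-up name; note it can happily return True for files that differ beyond the prefix):

```python
import os

def probably_equal(path1, path2, prefix=100 * 1024):
    """Heuristic sketch only: equal size, equal mtime, and equal
    first `prefix` bytes are taken to mean 'probably the same'."""
    s1, s2 = os.stat(path1), os.stat(path2)
    if s1.st_size != s2.st_size or s1.st_mtime != s2.st_mtime:
        return False
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        return f1.read(prefix) == f2.read(prefix)
```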
 

Dave Angel

SpreadTooThin said:
I am indeed under the impression that it is not always doing a byte by
byte comparison...
as well the documentation states:
Compare the files named f1 and f2, returning True if they seem equal,
False otherwise.

That word... Seeeeem... makes me wonder.

Thanks for the code! :)
Some of this discussion depends on the version of Python, but didn't say
so. In version 2.6.1, the code is different (and more complex) than
what's listed above. The docs are different too. In this version, at
least, you'll want to explicitly pass the shallow=False parameter. It
defaults to 1, by which they must mean True. I think it's a bad
default, but it's still a useful function. Just be careful to include
that parameter in your call.

Further, you'll want to check the version included with your Python. The
file filecmp.py is in the Lib directory, so it's no trouble to check it.
 

Adam Olsen

Good point.  You can fool it if you force the stats to their
old values after you modify a file and you don't clear the
cache.

The timestamps stored on the filesystem (for ext3 and most other
filesystems) are fairly coarse, so it's quite possible for a check/
update/check sequence to have the same timestamp at the beginning and
end.
 

Martin

Hi,

Perhaps I'm being dim, but how else are you going to decide if
two files are the same unless you compare the bytes in the
files?

I'd say checksums; just about every download relies on checksums to
verify you do indeed have the same file.
You could hash them and compare the hashes, but that's a lot
more work than just comparing the two byte streams.

Hashing is not exactly much work; in its simplest form it's 2 lines per file.

$ dd if=/dev/urandom of=testfile.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.4491 s, 3.6 MB/s
$ dd if=/dev/urandom of=testfile2.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.92479 s, 2.7 MB/s
$ cp testfile.data testfile3.data
$ python
Python 2.5.4 (r254:67916, Feb 17 2009, 20:16:45)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
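
(The interactive session above is truncated; the hash-and-compare idea might look like the following sketch, where `file_md5` is a made-up helper. The commented usage refers to the files created by dd above.)

```python
import hashlib

def file_md5(path, chunksize=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunksize), b""):
            h.update(chunk)
    return h.hexdigest()

# Equal digests strongly suggest, but do not strictly prove,
# equal contents:
# file_md5("testfile.data") == file_md5("testfile3.data")
```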


--
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
 

Steven D'Aprano

I'd say checksums; just about every download relies on checksums to
verify you do indeed have the same file.

The checksum does look at every byte in each file. Checksumming isn't a
way to avoid looking at each byte of the two files, it is a way of
mapping all the bytes to a single number.


Hashing is not exactly much work; in its simplest form it's 2 lines per
file.

Hashing is a *lot* more work than just comparing two bytes. The MD5
checksum has been specifically designed to be fast and compact, and the
algorithm is still complicated:

http://en.wikipedia.org/wiki/MD5#Pseudocode

The reference implementation is here:

http://www.fastsum.com/rfc1321.php#APPENDIXA

SHA-1 is more complicated still:

http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1_pseudocode


Just because *calling* some checksum function is easy doesn't make the
checksum function itself simple. They do a LOT more work than just a
simple comparison between bytes, and that's totally unnecessary work if
you are making a one-off comparison of two local files.
 

Martin

The checksum does look at every byte in each file. Checksumming isn't a
way to avoid looking at each byte of the two files, it is a way of
mapping all the bytes to a single number.

My understanding of the original question was a way to determine
whether 2 files are equal or not. Creating a checksum of 1-n files and
comparing those checksums IMHO is a valid way to do that. I know it's
a (one-way) mapping between a (possibly) longer byte sequence and
another one, but how does checksumming not take each byte in the
original sequence into account?

I'd still say rather burn CPU cycles than development hours (if I got
the question right). If not, then with binary files you will have to
find some way of representing the differences between the 2 files in a
readable manner anyway.
Hashing is a *lot* more work than just comparing two bytes. The MD5
checksum has been specifically designed to be fast and compact, and the
algorithm is still complicated:

I know that the various checksum algorithms aren't exactly cheap, but
I do think that just to know whether 2 files are different, a solution
which takes 5 minutes to implement beats a lengthy discussion which
optimizes too early, hands down.

regards,
martin

 

Nigel Rantor

Martin said:
My understanding of the original question was a way to determine
whether 2 files are equal or not. Creating a checksum of 1-n files and
comparing those checksums IMHO is a valid way to do that. I know it's
a (one-way) mapping between a (possibly) longer byte sequence and
another one, but how does checksumming not take each byte in the
original sequence into account?

The fact that two md5 hashes are equal does not mean that the sources
they were generated from are equal. To do that you must still perform a
byte-by-byte comparison which is much less work for the processor than
generating an md5 or sha hash.

If you insist on using a hashing algorithm to determine the equivalence
of two files you will eventually realise that it is a flawed plan
because you will eventually find two files with different contents that
nonetheless hash to the same value.

The more files you test with the quicker you will find out this basic truth.

This is not complex, it's a simple fact about how hashing algorithms work.

n
 

Nigel Rantor

Grant said:
We all rail against premature optimization, but using a
checksum instead of a direct comparison is premature
unoptimization. ;)

And more than that, will provide false positives for some inputs.

So, basically it's a worse-than-useless approach for determining if two
files are the same.

n
 

SpreadTooThin

That's slower than a byte-by-byte compare.



I meant a lot more CPU time/cycles.

I'd like to add my 2 cents here... (That's 1.8 cents US.)
All I was trying to get was a clarification of the documentation of
the cmp method.
It isn't clear.

A byte-by-byte comparison is good enough for me, as long as there are
no cache issues.
A simple checksum is no good because it makes 1 + 2 + 3 and 3 + 2 + 1
look the same.
A CRC of any sort is more work than a byte-by-byte comparison and
doesn't give you any more information.
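
That ordering point is easy to demonstrate: a naive additive checksum is blind to byte order (which is exactly why real checksums such as CRCs mix position into the result):

```python
# A plain sum cannot tell reordered data apart:
print(sum(b"123") == sum(b"321"))  # True, yet the bytes differ
print(b"123" == b"321")            # False
```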
 

Adam Olsen

The fact that two md5 hashes are equal does not mean that the sources
they were generated from are equal. To do that you must still perform a
byte-by-byte comparison which is much less work for the processor than
generating an md5 or sha hash.

If you insist on using a hashing algorithm to determine the equivalence
of two files you will eventually realise that it is a flawed plan
because you will eventually find two files with different contents that
nonetheless hash to the same value.

The more files you test with the quicker you will find out this basic truth.

This is not complex, it's a simple fact about how hashing algorithms work.

The only flaw in a cryptographic hash is the increasing number of
attacks that are found against it. You need to pick a trusted one when
you start and consider replacing it every few years.

The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.
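
To put a rough number on "extraordinarily rare": for an ideal n-bit hash, the birthday bound puts the chance of any accidental collision among k files at about k(k-1)/2^(n+1). A quick back-of-the-envelope check (`collision_prob` is a made-up helper for this estimate):

```python
def collision_prob(k, n_bits):
    """Approximate birthday-bound probability of any collision among
    k items hashed with an ideal n_bits-bit hash."""
    return k * (k - 1) / 2.0 ** (n_bits + 1)

# Even a billion files under a 128-bit hash (ignoring deliberate attacks):
print(collision_prob(10**9, 128))  # roughly 1.5e-21
```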
 

Nigel Rantor

Adam said:
The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.

Not when you're using them to compare lots of files.

Trust me. Been there, done that, got the t-shirt.

Using hash functions to tell whether or not files are identical is an
error waiting to happen.

But please, do so if it makes you feel happy, you'll just eventually get
an incorrect result and not know it.

n
 

Lawrence D'Oliveiro

Nigel said:
Not when you're using them to compare lots of files.

Trust me. Been there, done that, got the t-shirt.

Not with any cryptographically-strong hash, you haven't.
 
