How to compare files?

Case

What is a good (readable) way of comparing two
files? I just need to know if they match, or not.

Thanks,

Case
 
Martin Kissner

Case wrote :
What is a good (readable) way of comparing two
files? I just need to know if they match, or not.

Since you seem to be on Linux, you might use "diff":
diff file1 file2
diff -y file1 file2
if you want to use the side-by-side output format.

HTH
 
Michele Dondi

What is a good (readable) way of comparing two
files? I just need to know if they match, or not.

If with "if they match" you mean "if they are _exactly_ the same",
then I'd just take a checksum (e.g. MD5) of both and compare them.


Michele
 
Josef Moellers

Case said:
What is a good (readable) way of comparing two
files? I just need to know if they match, or not.

It depends on the level of confidence you need in the result.
You might be satisfied if the md5sums of both files are equal. In that
case have a look at Digest::MD5. If you must do this regularly, you can
save the md5sum in some file and retrieve it, saving some work the next
time.
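
A minimal shell sketch of the checksum approach, assuming GNU coreutils' md5sum is installed (Digest::MD5 would be the route from inside a Perl script). The helper names md5_of and same_md5 and the a.md5 cache file are made up for illustration. Equal digests make equality very likely, not certain.

```shell
#!/bin/sh
# Hypothetical helpers, not from the thread. Assumes GNU md5sum.

# Print only the hex digest of a file.
md5_of() {
    md5sum < "$1" | cut -d ' ' -f 1
}

# Succeed (exit 0) when both files have the same MD5 digest.
same_md5() {
    [ "$(md5_of "$1")" = "$(md5_of "$2")" ]
}

# Demo on throwaway files.
dir=$(mktemp -d)
printf 'some data\n' > "$dir/a"
cp "$dir/a" "$dir/b"

if same_md5 "$dir/a" "$dir/b"; then
    echo "a and b have the same digest"
fi

# Save the digest once, then reuse it instead of re-reading file a.
md5_of "$dir/a" > "$dir/a.md5"
if [ "$(cat "$dir/a.md5")" = "$(md5_of "$dir/b")" ]; then
    echo "b matches the saved digest of a"
fi
rm -rf "$dir"
```

The saved digest is only useful as long as the file it was taken from has not changed since.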
 
Josef Moellers

Michele said:
If with "if they match" you mean "if they are _exactly_ the same",
then I'd just take a checksum (e.g. MD5) of both and compare them.

Since an MD5 checksum is often shorter than the file it is taken of (it
would be pointless to use if it weren't), the statement "two files are
exactly the same iff the MD5 checksums are equal" is wrong. After all,
there are 2^8388608 different 1MB files (1048576 bytes are 8388608 bits;
do we still use these small files? B-{) but only 2^128 different MD5
checksums, so, on average, 2^8388480 1MB-files share the same MD5 checksum.
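
The counting argument can be written for any file length. In this sketch the symbol n (the file length in bits) is introduced for illustration and does not appear in the thread:

```latex
% n = file length in bits (symbol introduced here for illustration)
\[
  \frac{\#\{\text{files of } n \text{ bits}\}}{\#\{\text{MD5 values}\}}
  = \frac{2^{n}}{2^{128}} = 2^{\,n-128}
\]
```

So for any n greater than 128, at least two distinct files of that length must share a checksum.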

If the checksums are equal, there is a high chance that two files are
_exactly_ the same when both files have some additional restrictions as
to their contents, e.g. both are valid JPEG files, but there still exist
pathological situations where the MD5 checksums are equal even if the
files differ.
 
Arndt Jonasson

Martin Kissner said:
Case wrote :

Since you seem to be on Linux, you might use "diff":
diff file1 file2
diff -y file1 file2
if you want to use the side-by-side output format.

'cmp' is better if all you want to know is whether they differ or
not, rather than to see the differences. We don't know whether they are
text files.
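
A minimal sketch of using 'cmp' this way from a shell script. The helper name files_match is made up for illustration. With -s, cmp prints nothing and reports the result only through its exit status, and it compares bytes, so it does not matter whether the files are text.

```shell
#!/bin/sh
# files_match is a hypothetical helper name, not from the thread.
# cmp -s is silent; its exit status is 0 when the files are
# byte-for-byte identical, 1 when they differ, >1 on error.
files_match() {
    cmp -s "$1" "$2"
}

# Demo on throwaway files (names are made up for illustration).
dir=$(mktemp -d)
printf 'hello\n' > "$dir/a"
printf 'hello\n' > "$dir/b"
printf 'world\n' > "$dir/c"

if files_match "$dir/a" "$dir/b"; then
    echo "a and b match"
fi
if ! files_match "$dir/a" "$dir/c"; then
    echo "a and c do not match"
fi
rm -rf "$dir"
```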

But what Case means by "readable", I don't know.
 
Case

Chad said:
use File::Compare;
Is probably better for what you want.

I have two text files (simple XML actually). I use
what you suggest and it works fine. The code is only
one statement and 'compare' is clear, hence it is
quite readable as well.

Thanks,

Kees
 
Martin Kissner

Arndt Jonasson wrote :
'cmp' is better if all you want to know is whether they differ or
not, rather than to see the differences. We don't know whether they are
text files.

But what Case means by "readable", I don't know.

Neither do I.
I don't even know if he wants to diff two files in the shell or with a
Perl script.

In the shell, however, you could use 'diff -q' if you only want to know
if the files differ. I don't know if 'cmp' or 'diff' is better.
 
Arndt Jonasson

Martin Kissner said:
Arndt Jonasson wrote :

Neither do I.
I don't even know if he wants to diff two files in the shell or with a
Perl script.

In the shell, however, you could use 'diff -q' if you only want to know
if the files differ. I don't know if 'cmp' or 'diff' is better.

"diff -q" does not seem to be available on all Unix versions.
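
Where -q is missing, one possible workaround is to discard diff's output and look only at its exit status. This is a sketch; the helper name files_differ is made up for illustration.

```shell
#!/bin/sh
# files_differ is a hypothetical helper name, not from the thread.
# diff exits 0 when the files are identical, 1 when they differ and
# >1 on trouble; discarding the output leaves just that status.
# Note: on trouble (e.g. a missing file) this helper returns false.
files_differ() {
    diff "$1" "$2" > /dev/null 2>&1
    [ $? -eq 1 ]
}

# Demo on throwaway files (names are made up for illustration).
dir=$(mktemp -d)
printf 'one\n' > "$dir/a"
printf 'two\n' > "$dir/b"

if files_differ "$dir/a" "$dir/b"; then
    echo "a and b differ"
fi
rm -rf "$dir"
```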
 
Michele Dondi

Since an MD5 checksum is often shorter than the file it is taken of (it
would be pointless to use if it weren't), the statement "two files are
exactly the same iff the MD5 checksums are equal" is wrong. After all,

You're perfectly right. Re-reading what I wrote I realize that it
seems to suggest that equality of MD5 sums is a necessary condition
for equality of files, which indeed is _not_ the case[*]. I apologize
to the OP for the inexactness of my claim.

I meant (and still mean), and should have written qq|If with "if they
match" you mean "if they are _exactly_ the same" _and_ it's enough for
you to be _fairly confident_ (as opposed to "absolutely certain") that
they are the same then...|.

I stressed the "_exactly_ the same" point because the OP talked about
"match", which may have meant something different, for example (and
just to make an example), if he was referring to XML files (yes, I
know that this is doubtful) and with "match" he meant "containing the
same data".


[*] As, "simply", _difference_ of the sums is a necessary condition
for the _difference_ of the files.


Michele
 
Michele Dondi

You're perfectly right. Re-reading what I wrote I realize that it
seems to suggest that equality of MD5 sums is a necessary condition
for equality of files, which indeed is _not_ the case[*]. I apologize
to the OP for the inexactness of my claim.

Oh, but it *is* a *necessary* condition. What it isn't is a *sufficient*
condition. You're the first person I've seen who's reversed the sense of
those particular terms, though confusing the underlying logical
propositions is extremely common and leads to some nasty fallacies
(affirming the consequent and denying the antecedent).

When I first read your post my gut reaction was along the
lines of "what's this idiot saying?" but on second thought I
realized that _I_ am the idiot...

Well, FWIW I assure you that I'm perfectly familiar with the concepts
and AFAICT I've never reversed them. Probably I was simply too tired
when I posted this, and I thank you for correcting my gross mistake.

I have already apologized for an error included in an apologetic
explanation of another error. Hopefully this time I've kept a low
enough profile that I haven't added any more, so that I won't
need to apologize once again...
;-)


Michele
 
