Deciding whether two files are the same

Michal Nazarewicz · Jan 26, 2008

I'd say the only way is to read both files byte-by-byte and compare them.

Michael Rohan said:
Given the example in the original post, I believe the "realpath"
function on Unix should be of use here. It will resolve symlinks,
etc, giving the real path to the file. Two file paths resolve to the
same file if the real paths are the same.

It won't resolve hard links.
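
A minimal sketch of the canonical-path comparison, for illustration only. It assumes a C++17 compiler and uses std::filesystem::canonical in place of realpath (the thread predates C++17); the function name same_canonical_path is made up for this example. As noted above, two hard links to one file canonicalize to two different paths, so this check reports them as different.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: true only when both paths exist and canonicalize to
// the same absolute path. Symlinks, "." and ".." are resolved; hard links
// are not, so two hard-linked names for one file compare as different.
bool same_canonical_path(const fs::path& a, const fs::path& b)
{
    std::error_code ec_a, ec_b;
    const fs::path ca = fs::canonical(a, ec_a);
    const fs::path cb = fs::canonical(b, ec_b);
    if (ec_a || ec_b) {
        return false;  // a path is missing or unreadable: nothing proven
    }
    return ca == cb;
}
```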

Michal Nazarewicz · Jan 26, 2008

SzH said:
No, I wasn't aware of that ... at least not in the case of ofstream
itself.

As an example, there are plenty of "files" in Unix that you can write to
but that have no permanent storage. Most of them live in the /dev
directory. /dev/null is one example: you can write anything to it, and
when you read from it you get EOF.

Michal Nazarewicz · Jan 26, 2008

Michal Nazarewicz said:
I'd say the only way is to read both files byte-by-byte and compare
them.

Oh.. you mean "the same" not "equal". Then the above approach won't
work since it would give false positives.
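
To make the distinction concrete, here is a rough sketch of the byte-by-byte comparison. The name same_contents is hypothetical. It answers "equal", not "the same": two separate files that happen to hold identical bytes compare as true, which is the false positive mentioned above.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: compares contents only. Returns true when both files
// can be opened and hold exactly the same bytes.
bool same_contents(const std::string& path_a, const std::string& path_b)
{
    std::ifstream a(path_a, std::ios::binary);
    std::ifstream b(path_b, std::ios::binary);
    if (!a || !b) {
        return false;  // at least one file could not be opened
    }
    std::istreambuf_iterator<char> a_it(a), b_it(b), end;
    // The four-iterator std::equal (C++14) also requires equal lengths.
    return std::equal(a_it, end, b_it, end);
}
```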

Pavel · Jan 26, 2008

SzH said:
I meant the following definition of "same": Two files are the same if
when one is changed (e.g. written to), the other changes too. I did
not mean two files with the same content, but the same physical file
on the hard disk.

Suppose that we have a utility that reads from one file and writes to
the other. If by accident the same output file is used for both input
and output, then the contents of the file might get deleted, and the
data might be lost ... A simple check is to compare the file names,
but as I explained in the original message, this is not reliable.

On file systems complying with UNIX conventions (which is where
hard links are mostly found), you could compare the file system and inode.
Now, a perfect comparison of file systems is a challenge in itself but
often you can reasonably know the files belong to the same file system
(if they are in the same directory, for example, and not symlinks).
Again, on UNIX, not only "disk" files but also others (including
/dev/null mentioned by someone) have a unique inode (within the file
system). Non-UNIX systems must have their own specific convention for
swufid (system-wide unique file id); Windows is an interesting hybrid
in this regard.

All this said, even within UNIX conventions you may disguise the
"sameness" of the files (one easy way is to mount the same NFS file
system more than once, addressing it by IPs of different NICs of the
same file server). Difficult to believe, but I have seen things like this
happen in a production IT environment.

So, to summarize, the problem is most probably unsolvable in general,
even for a single platform, unless the computer system management
follows some rules; if they do, you have to write your own C++ "file
identity comparator" based on those specific rules. Which might be good
for a C++ developer's job security, as s/he will always have something to
do.

James Kanze · Jan 27, 2008

No such mechanism exists in the language. Besides...

std::fstream is a stream. It's just associated with an
abstract "file" that can be identified by another abstract -
the "name". It does not really have to be a permanent set of
storage units on some external storage device, as you are
probably aware. The definition you gave "if you change one,
the other one changes as well" has really no meaning to
'ofstream'. You can't detect a change in the output. So
we're likely talking about "files" that not only can be
written to, but also can be read from.

Again, not all systems have that. That's why you need to ask
in the newsgroup dedicated to your system to see if any
system-specific mechanisms are available, and then roll your
own, which for every target platform will be implemented in
terms of each system's special way of determining the equality
of two files.

All hosted systems do have something you can write to; writing
to it has some meaning, even if the standard is (intentionally)
very vague about what it means. And it does make some sense,
most of the time, to talk about whether the actual sink is
identical or not. (One could argue about cases like /dev/null
or /dev/tty, I suppose. In the first case, even on different
machines, the data ends up in the same place. On the second,
despite having a single inode, and major and minor device numbers
under Unix, writing to it from two different processes may end
up in different places.) And the same way C++ supports the
concept of "opening" a file, it could support some sort of function
which specifies whether two "names" refer to the same sink or
not.

It doesn't of course, so we're back where we started from.
Unless someone knows of a good, portable library which supports
it. (Note, however, that a 100% accurate answer isn't
necessarily possible under all systems. Such functionality
can't be implemented under Unix or under Windows, for example.)
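
For readers on newer toolchains: C++17 later added std::filesystem::equivalent, which is one candidate for such a portable library call. The sketch below assumes C++17; known_same_file and copy_guarded are hypothetical names chosen for this example. On POSIX systems the call is specified in terms of equal st_dev and st_ino values, so the caveat still applies: a false result does not prove the files are distinct, since the same file reached through two mounts can go undetected.

```cpp
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: true when the library can tell that both names reach
// one file. A missing file or any error means "not known to be the same".
bool known_same_file(const fs::path& a, const fs::path& b)
{
    std::error_code ec;
    const bool same = fs::equivalent(a, b, ec);
    return !ec && same;
}

// Hypothetical copy routine: refuses to run when input and output are known
// to be the same file, so the input is never truncated by accident.
bool copy_guarded(const fs::path& in, const fs::path& out)
{
    if (known_same_file(in, out)) {
        return false;  // must be checked before the output is opened
    }
    std::ifstream src(in, std::ios::binary);
    if (!src) {
        return false;
    }
    std::ofstream dst(out, std::ios::binary | std::ios::trunc);
    if (!dst) {
        return false;
    }
    char buf[4096];
    // read() fails on the last, short chunk; gcount() says how much arrived
    while (src.read(buf, sizeof buf) || src.gcount() > 0) {
        dst.write(buf, src.gcount());
    }
    dst.close();
    return !dst.fail();
}
```

The check and the open are two separate steps, so a file renamed or relinked in between can still slip through; the guard only catches the accidental case described earlier in the thread.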

James Kanze · Jan 27, 2008

SzH wrote:

[...]

On file systems complying with UNIX conventions (which is
where hard links are mostly found), you could compare the file
system and inode.

The fact that the file systems are different doesn't mean that
you don't have the same file.

[...]

All this said, even within UNIX conventions you may disguise the
"sameness" of the files (one easy way is to mount the same NFS file
system more than once, addressing it by IPs of different NICs of the
same file server). Difficult to believe, but I have seen things like this
happen in a production IT environment.

I've rarely seen production environments where it wasn't the
case. (In one case, a colleague, waiting for a compile, decided
to "clean up" some, and deleted all of the files belonging to
him in /tmp. For reasons related to the way remote backups were
handled, his home directory was also mounted in /tmp. The
results were not very pleasant.)

Michal Nazarewicz · Jan 27, 2008

Pavel said:
On file systems complying with UNIX conventions (which is where
hard links are mostly found), you could compare the file system and
inode. Now, a perfect comparison of file systems is a challenge in
itself but often you can reasonably know the files belong to the same
file system (if they are in the same directory, for example, and not
symlinks).

It's fairly simple. All you have to do is stat(2) the files and compare
the st_dev and st_ino fields of the stat structures returned by those calls.
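
A minimal sketch of that check, assuming a POSIX system. The function name same_dev_and_inode is invented for this example. Because stat follows symlinks, a symlink and its target compare as the same here, as do hard links.

```cpp
#include <sys/stat.h>

// Hypothetical helper: true when both paths can be stat'ed and report the
// same device and inode numbers. A false result only means "not proven to
// be the same"; the file may still be reachable through another mount.
bool same_dev_and_inode(const char* path_a, const char* path_b)
{
    struct stat sa, sb;
    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0) {
        return false;  // missing or inaccessible path
    }
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```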

James Kanze · Jan 28, 2008

It's fairly simple. All you have to do is stat(2) the files and compare
the st_dev and st_ino fields of the stat structures returned by those calls.

If they're the same, the files are part of the same file system
(I'm pretty sure). If they're different, you don't know.

Michal Nazarewicz · Jan 28, 2008

James Kanze said:
If they're the same, the files are part of the same file system
(I'm pretty sure). If they're different, you don't know.

If they are different either their inode number or device number
differ. If both inode and device number are the same the files are the
same. The problem is that you don't know if the files are different if
either inode number or device number differ (as it was discussed earlier
on an example of NFS directory mounted using two different IPs).

James Kanze · Jan 28, 2008

If they are different either their inode number or device number
differ.

If they differ in their inode number, they are different. If
the device number differs, they might be different, or they
might not be. It's a fairly frequent occurrence for the same
file system to be mounted with different inode numbers.

If both inode and device number are the same the files are the
same. The problem is that you don't know if the files are
different if either inode number or device number differ (as
it was discussed earlier on an example of NFS directory
mounted using two different IPs).

That's what I've been saying, and contradicts what you first
said. I think that if the inode numbers are different, the
files are different, but I've seen identical files with
different device numbers.
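
The three outcomes discussed here can be written down as a small classifier. This is only a sketch for POSIX systems; the Identity enum and both function names are assumptions made for the example. It is deliberately more cautious than the reasoning above: any device-number mismatch yields unknown, even when the inode numbers differ too.

```cpp
#include <sys/stat.h>

// Hypothetical three-way answer instead of a plain bool.
enum class Identity { same, different, unknown };

// Classifies two stat results. A device mismatch proves nothing, since one
// file system may be mounted twice under different device numbers.
Identity classify(const struct stat& a, const struct stat& b)
{
    if (a.st_dev != b.st_dev) {
        return Identity::unknown;
    }
    return a.st_ino == b.st_ino ? Identity::same : Identity::different;
}

// Convenience wrapper: a path that cannot be stat'ed also gives unknown.
Identity compare_paths(const char* path_a, const char* path_b)
{
    struct stat sa, sb;
    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0) {
        return Identity::unknown;
    }
    return classify(sa, sb);
}
```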

Pavel · Jan 29, 2008

Suppose that there is a program that takes two files as its command
line arguments. Is there a (cross platform) way to decide whether the
two files are the same? Simple string comparison is not enough as the
two files might be specified as "file.txt" and "./file.txt", or one of
them may be a symlink to the other.

[I've already posted this 30 min ago but it didn't show up in Google
Groups---sorry if some people get it twice.]

This might have some logical solution.
I would imagine that if you open the file in "exclusive-write" mode
and try to open the other one you can check if the files are the same.

What if someone else was entertaining herself by opening one of the files
in "exclusive-write" mode while we were doing the same?

-Pavel

Jerry Coffin · Jan 29, 2008

On Jan 27, 2:24 pm, Michal Nazarewicz wrote:

[ ... ]

If they're the same, the files are part of the same file system
(I'm pretty sure). If they're different, you don't know.

That depends a bit on viewpoint. Quite a few distributed file systems
provide a situation in which what's logically considered a single file
resides on a number of different machines. I.e. you have one logical
file system living on top of a number of physical file systems (so to
speak).

Such a system normally provides some unambiguous way to identify a file
(necessary for its own bookkeeping) but using it isn't portable. Each
system normally has a proxy entry in its own file system, so comparing
files on that system works just fine -- but two device/inode pairs on
two separate systems might actually refer to the same file so writes to
one will show up when reading the other.

James Kanze · Jan 29, 2008

[ ... ]

If they're the same, the files are part of the same file system
(I'm pretty sure). If they're different, you don't know.

That depends a bit on viewpoint. Quite a few distributed file systems
provide a situation in which what's logically considered a single file
resides on a number of different machines. I.e. you have one logical
file system living on top of a number of physical file systems (so to
speak).

Such a system normally provides some unambiguous way to identify a file
(necessary for its own bookkeeping) but using it isn't portable. Each
system normally has a proxy entry in its own file system, so comparing
files on that system works just fine -- but two device/inode pairs on
two separate systems might actually refer to the same file so writes to
one will show up when reading the other.

I'm not sure that that's relevant here. Regardless of where the
files reside, you get all of the files below a single mount
point from a single file server. And you can always get the
same files, mounted elsewhere, through a different server, or a
different connection to the same server. Files accessed through
different mount points have different device numbers.

Note that Windows has similar problems. I don't know the
Windows equivalents of inode numbers and device numbers, but you
can certainly mount the same file through different mount
points, either using SMB or using NFS. And as far as I can
tell, the protocols really provide no way of determining where
the file really comes from.

Jerry Coffin · Jan 30, 2008

[ ... ]

I'm not sure that that's relevant here. Regardless of where the
files reside, you get all of the files below a single mount
point from a single file server. And you can always get the
same files, mounted elsewhere, through a different server, or a
different connection to the same server. Files accessed through
different mount points have different device numbers.

In a distributed file system, you generally have several servers that
all carry the same files, and one file might be accessible from a number
of different servers.

In most cases, you have at least some degree of location transparency --
i.e. it'll typically support some sort of path that gets resolved to a
server/file combination by the file system itself. In most cases, however,
you can also access those files directly from the individual servers as
well...

Note that Windows has similar problems. I don't know the
Windows equivalents of inode numbers and device numbers, but you
can certainly mount the same file through different mount
points, either using SMB or using NFS. And as far as I can
tell, the protocols really provide no way of determining where
the file really comes from.

Oh, absolutely -- I certainly didn't intend to imply that this was
unique to Unix by any means. I just used Unix terminology because that
was already being used in the thread. The same basic problem can arise
in many different systems, though it's also true that there really
aren't that many different OSes any more -- most of what's left is
Windows and various clones of Unix (and somebody who previously dealt
with substantially different systems could be forgiven for thinking of
Windows as a Unix clone...)

Michal Nazarewicz · Jan 30, 2008

James Kanze said:
If they differ in their inode number, they are different. If
the device number differs, they might be different, or they
might not be. It's a fairly frequent occurrence for the same
file system to be mounted with different inode numbers.

I'm not saying that's not the case.

That's what I've been saying, and contradicts what you first
said. I think that if the inode numbers are different, the
files are different, but I've seen identical files with
different device numbers.

And I've never said anything different.
