Deciding whether two files are the same

SzH

Suppose that there is a program that takes two files as its command
line arguments. Is there a (cross platform) way to decide whether the
two files are the same? Simple string comparison is not enough as the
two files might be specified as "file.txt" and "./file.txt", or one of
them may be a symlink to the other.

[I've already posted this 30 min ago but it didn't show up in Google
Groups---sorry if some people get it twice.]
 
Victor Bazarov

SzH said:
Suppose that there is a program that takes two files as its command
line arguments. Is there a (cross platform) way to decide whether the
two files are the same? Simple string comparison is not enough as the
two files might be specified as "file.txt" and "./file.txt", or one of
them may be a symlink to the other.

What does it mean for two files to be "the same"? The same contents?
The same size? The same type? The same name? The same partial path?
The same owner? The same permissions? There is no definition of
"same" in C++ beyond "same object" when speaking of to what pointers
point. *You* need to define "same" when it comes to "files".

V
 
Inquirer

I guess SzH meant paths.
I have the same question. There is something like this in the .NET
Framework (Path.Equals), but I cannot find an equivalent in the Win32
API or any standard libs :(
 
Michael Rohan

What does it mean for two files to be "the same"? The same contents?
The same size? The same type? The same name? The same partial path?
The same owner? The same permissions? There is no definition of
"same" in C++ beyond "same object" when speaking of to what pointers
point. *You* need to define "same" when it comes to "files".

V

Hi,

Given the example in the original post, I believe the "realpath"
function on Unix should be of use here. It will resolve symlinks, etc.,
giving the real path to the file. Two file paths resolve to the same
file if the real paths are the same.

Take care,
Michael.
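
A minimal sketch of the realpath approach described above, assuming a POSIX system. The helper names real_path and same_real_path are made up for illustration. Passing a null buffer to realpath() relies on POSIX.1-2008 behaviour, where the function allocates the result itself.

```cpp
#include <cassert>
#include <stdlib.h>   // realpath() is POSIX and declared in <stdlib.h>
#include <string>

// Hypothetical helper: resolve `path` to its canonical absolute form.
// Returns false if the path cannot be resolved (e.g. it does not exist).
bool real_path(const std::string& path, std::string& out) {
    // With a null buffer, realpath() allocates the result (POSIX.1-2008);
    // the caller has to free() it.
    char* resolved = ::realpath(path.c_str(), nullptr);
    if (resolved == nullptr) return false;
    out.assign(resolved);
    ::free(resolved);
    return true;
}

// Hypothetical helper: true if both paths resolve to the same real path.
bool same_real_path(const std::string& a, const std::string& b) {
    std::string ra, rb;
    return real_path(a, ra) && real_path(b, rb) && ra == rb;
}
```

Note that this catches "file.txt" versus "./file.txt" and symlinks, but two hard links to one file still have different real paths, so they would compare as different.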
 
Victor Bazarov

Inquirer said:
and what would that be in Windows? anybody knows?:)

Somebody in a Windows programming newsgroup might... Here
we don't discuss platform-specific functionality, sorry.

V
 
Eric.Malenfant

Suppose that there is a program that takes two files as its command
line arguments. Is there a (cross platform) way to decide whether the
two files are the same? Simple string comparison is not enough as the
two files might be specified as "file.txt" and "./file.txt", or one of
them may be a symlink to the other.

Out of curiosity, I looked in Boost.Filesystem.

You may be interested in the "equivalent()" function:
http://www.boost.org/libs/filesystem/doc/tr2_proposal.html#Predicate-functions
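
For illustration, here is a sketch using std::filesystem::equivalent, the C++17 standard library function that grew out of the Boost.Filesystem proposal linked above. The thread itself only mentions the Boost version; the wrapper name same_file is an assumption of this example.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: true if both paths resolve to the same file system
// entity. The non-throwing overload is used, so any error (for example a
// path that does not exist) is reported as "not the same".
bool same_file(const std::string& a, const std::string& b) {
    std::error_code ec;
    const bool result = fs::equivalent(fs::path(a), fs::path(b), ec);
    return !ec && result;
}
```

A program taking two file names could call same_file(argv[1], argv[2]) before opening anything and refuse to continue when it returns true.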
 
mike3

What does it mean for two files to be "the same"? The same contents?
The same size? The same type? The same name? The same partial path?
The same owner? The same permissions? There is no definition of
"same" in C++ beyond "same object" when speaking of to what pointers
point. *You* need to define "same" when it comes to "files".

Based on his example it looks like he wants to know if two given
references refer to the same file.
 
mike3

Somebody in a Windows programming newsgroup might... Here
we don't discuss platform-specific functionality, sorry.

However, the original poster asked about cross-platform methods.
Do any exist?
 
SzH

What does it mean for two files to be "the same"? The same contents?
The same size? The same type? The same name? The same partial path?
The same owner? The same permissions? There is no definition of
"same" in C++ beyond "same object" when speaking of to what pointers
point. *You* need to define "same" when it comes to "files".

I meant the following definition of "same": Two files are the same if
when one is changed (e.g. written to), the other changes too. I did
not mean two files with the same content, but the same physical file
on the hard disk.

Suppose that we have a utility that reads from one file and writes to
the other. If by accident the same output file is used for both input
and output, then the contents of the file might get deleted, and the
data might be lost ... A simple check is to compare the file names,
but as I explained in the original message, this is not reliable.
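
On POSIX systems, one common way to approximate this "same physical file" definition is to compare device and inode numbers from stat(). The following is only a sketch under that assumption. The name same_physical_file is hypothetical, and the check may not be reliable on every file system (some network mounts, for instance).

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>

// Hypothetical helper (POSIX only): true if both paths name the same
// underlying file, judged by device and inode number. stat() follows
// symlinks, and hard links share one inode, so both cases are covered.
bool same_physical_file(const std::string& a, const std::string& b) {
    struct stat sa{}, sb{};
    if (::stat(a.c_str(), &sa) != 0) return false;  // missing or inaccessible
    if (::stat(b.c_str(), &sb) != 0) return false;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

If the output file does not exist yet, stat() fails and the function returns false, which is the right answer for the read-one-write-other utility: a file that does not exist cannot be the input.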
 
Victor Bazarov

SzH said:
Across those platforms which do have files and where one can use
std::ofstream file(path_to_file); ?

No such mechanism exists in the language. Besides...

std::ofstream is a stream. It's just associated with an abstract
"file" that can be identified by another abstract - the "name".
It does not really have to be a permanent set of storage units on
some external storage device, as you are probably aware. The
definition you gave "if you change one, the other one changes as
well" has really no meaning to 'ofstream'. You can't detect
a change in the output. So we're likely talking about "files"
that not only can be written to, but also can be read from.

Again, not all systems have that. That's why you need to ask in
the newsgroup dedicated to your system to see if any system-
specific mechanisms are available, and then roll your own, which
for every target platform will be implemented in terms of each
system's special way of determining the equality of two files.

V
 
SzH

No such mechanism exists in the language. Besides...

std::ofstream is a stream. It's just associated with an abstract
"file" that can be identified by another abstract - the "name".
It does not really have to be a permanent set of storage units on
some external storage device, as you are probably aware.

No, I wasn't aware of that ... at least not in the case of ofstream
itself. I assumed that whenever it is available, it is associated
with a physical file, and that it is possible to get into the trouble I
described above (opening the same physical file twice). It seems that we
were not speaking the same language. Okay, so there is no way to do it
with the standard library. That's a reasonable reply.

Again, not all systems have that. That's why you need to ask in
the newsgroup dedicated to your system to see if any system-
specific mechanisms are available, and then roll your own, which
for every target platform will be implemented in terms of each
system's special way of determining the equality of two files.


OK, I'll do that.

Sz
 
darko.trpceski

Suppose that there is a program that takes two files as its command
line arguments.  Is there a (cross platform) way to decide whether the
two files are the same?  Simple string comparison is not enough as the
two files might be specified as "file.txt" and "./file.txt", or one of
them may be a symlink to the other.

[I've already posted this 30 min ago but it didn't show up in Google
Groups---sorry if some people get it twice.]

This might have some logical solution.
I would imagine that if you open one file in "exclusive-write" mode
and then try to open the other one, the second open will fail when
both names refer to the same file.
 
Jerry Coffin

and what would that be in Windows? anybody knows?:)

If you ask about GetFullPathName somewhere like
comp.os.ms-windows.programmer.win32, chances of getting useful help
will be much better.
 
James Kanze

Hi,

Given the example in the original post, I believe the "realpath"
function on Unix should be of use here. It will resolve symlinks, etc.,
giving the real path to the file. Two file paths resolve to the same
file if the real paths are the same.

Take care,
Michael.
 
James Kanze

What does it mean for two files to be "the same"? The same
contents? The same size? The same type? The same name?
The same partial path? The same owner? The same
permissions? There is no definition of "same" in C++ beyond
"same object" when speaking of to what pointers point.
*You* need to define "same" when it comes to "files".
I meant the following definition of "same": Two files are the
same if when one is changed (e.g. written to), the other
changes too. I did not mean two files with the same content,
but the same physical file on the hard disk.

This has been discussed in comp.std.c++ with regards to #pragma
once. The problem is very, very difficult, if not impossible to
solve in the absolute---you'd have to find out which system is
serving the files, and ask it. And even then, it can be tricky,
since more than one file server can be running on the host.

Suppose that we have a utility that reads from one file and
writes to the other. If by accident the same output file is
used for both input and output, then the contents of the file
might get deleted, and the data might be lost ... A simple
check is to compare the file names, but as I explained in the
original message, this is not reliable.

Two frequent solutions for that problem: refuse to overwrite an
existing file, or always write to a temporary, renaming it when
you're through.
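
Both approaches can be sketched roughly as follows. The function names and the ".tmp" suffix are made up for this example. The "x" flag of fopen comes from C11 and is available through <cstdio> in C++17. Replacing an existing file with std::rename is POSIX behaviour and is not guaranteed on every platform.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Option 1: refuse to overwrite. The "x" flag makes fopen fail instead of
// truncating when `path` already exists.
bool write_new_file(const std::string& path, const std::string& data) {
    std::FILE* f = std::fopen(path.c_str(), "wx");
    if (f == nullptr) return false;  // already exists, or cannot be created
    const bool ok = std::fwrite(data.data(), 1, data.size(), f) == data.size();
    return (std::fclose(f) == 0) && ok;
}

// Option 2: write to a temporary next to the target, then rename it.
// The ".tmp" suffix is an arbitrary choice for this sketch.
bool write_via_temporary(const std::string& path, const std::string& data) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out << data;
        out.close();
        if (out.fail()) {
            std::remove(tmp.c_str());
            return false;
        }
    }
    // On POSIX, rename() replaces an existing target; elsewhere it may fail.
    if (std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str());
        return false;
    }
    return true;
}
```

With the second option the input is read completely before the rename happens, so naming the same file for input and output no longer destroys the data.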
 
Alf P. Steinbach

* James Kanze:
Two frequent solutions for that problem: refuse to overwrite an
existing file, or always write to a temporary, renaming it when
you're through.

That's not really a solution, and I hate programs that do that.

Not that I have any general solution, either, but consider:

f1 and f2 are two names for the same file (a.k.a. "hardlinks"),

your program writes file ftemp, deletes f1, renames ftemp to f1,

f1 and f2 are now /not/ two names for the same file.

E.g., in practice, f1 might be "checkable.html" and f2 might be
"runnable.hta" (because W3C validator doesn't like ".hta").


Cheers, & hope this might cause people to think twice before doing that
write-temp-delete-and-rename thing, even if it is a little off-topic!,

- Alf
 
James Kanze

* James Kanze:
That's not really a solution, and I hate programs that do that.

Which one: refusing to overwrite a file that exists, or going
through a temporary. The first is a question of taste. I'll
admit that with shells that offer this feature for redirected
output, I turn it off. But if you're worried about someone
foolishly specifying the same file for input and output, then it
might be appropriate. I use the second a lot, in cases where I
expect to overwrite the input. (But in such cases, either the
program always overwrites the input, or the user tells me
explicitly that he wants to replace the input.)

Not that I have any general solution, either, but consider:
f1 and f2 are two names for the same file (a.k.a. "hardlinks"),
your program writes file ftemp, deletes f1, renames ftemp to f1,
f1 and f2 are now /not/ two names for the same file.

Yes. Hard links introduce any number of such problems. When I
implement such use of a temporary under Unix, I check for them,
and use a physical copy instead.

E.g., in practice, f1 might be "checkable.html" and f2 might be
"runnable.hta" (because W3C validator doesn't like ".hta").
Cheers, & hope this might cause people to think twice before
doing that write-temp-delete-and-rename thing, even if it is a
little off-topic!,

It's a useful remark, however, since we tend to forget. And
I'll bet quite a number of programs don't handle the case
correctly.
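
The hard-link check mentioned above might look something like this. It is a sketch using C++17 std::filesystem rather than the Unix calls the post refers to, and the names has_multiple_links and replace_from_temporary are hypothetical. When the target has more than one link, the bytes are copied into the existing file so that all of its names keep referring to the same file.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// True if `p` exists and has more than one hard link.
bool has_multiple_links(const fs::path& p) {
    std::error_code ec;
    const std::uintmax_t n = fs::hard_link_count(p, ec);
    return !ec && n > 1;
}

// Move the contents of `tmp` into `target`. A plain rename would give
// `target` a new file and detach it from its other names, so a hard-linked
// target is overwritten in place instead.
bool replace_from_temporary(const fs::path& tmp, const fs::path& target) {
    std::error_code ec;
    if (!has_multiple_links(target)) {
        fs::rename(tmp, target, ec);
        return !ec;
    }
    {
        std::ifstream in(tmp, std::ios::binary);
        std::ofstream out(target, std::ios::binary | std::ios::trunc);
        if (!in || !out) return false;
        // Streaming an empty buffer sets failbit, so skip it for empty input.
        if (in.peek() != std::ifstream::traits_type::eof()) out << in.rdbuf();
        out.close();
        if (!out) return false;
    }
    fs::remove(tmp, ec);  // best effort; the copy itself already succeeded
    return true;
}
```

The in-place copy gives up the atomicity of rename, which is the price of keeping f1 and f2 as two names for one file.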
 
