copying files

K

Keith Thompson

Kleuske said:
Don't. It's ineffective and may open up your system to abuse.

It's entirely effective (but non-portable, since the command string
you pass to system() is entirely system-specific).

As for abuse, that's a good point. If someone creates a pdf file
named, say, "foo ; rm -rf ~ ; bar.pdf", then incorporating that name
into the string you pass to system() could be really bad; see also
Little Bobby Tables, <http://xkcd.com/327/>. But if you're aware of
the problem, you can avoid it; if nothing else, you can just ignore
any files whose names contain any characters outside a safe set.

Somebody spent a great deal of time and effort writing an efficient
file copying command for your operating system. Why reinvent
the wheel?
Try fopen, fread, fwrite and fclose. Use a big buffer, since PDF's
(especially with grahics) tend to be big.

fread and fwrite do their own buffering. Copying a character at a time
is likely to be about as efficient as using a big buffer. (Don't take
my word for that; measure it!)
Check for error codes.

Such as the value returned by system().
 
K

Keith Thompson

Hans Vlems said:
Keith, the system(s) we have to work with are connected to disks in a
way that seriously affects performance. I've seen a file copy last a
little over one minute, with a filesize of approx. 2 MB. A copy may
also just fail. The function described in the OP may be used for
several files, say about 20 and all at least 1 MB in size. That is
certainly tot impressive by todays standards and shouldn't be a
challenge for the underlying hardware. Unfortunately, these things do
fail occasionally and when a copy fails then system() does not signal
that failure. I'd rather know about it and hence the desire to
perform the copy in my own code. Malcolm's example demonstrates the
various possibilities to signal an error situation.

(Please don't quote signatures.)

If your OS file copying command can fail to copy the file and
not signal that failure, then you've got some serious problems.
I'm skeptical that reimplementing file copying in C will avoid
those problems. You might get better answers in a forum dedicated to
your operating system.

Personally, I probably wouldn't do something like this in C.
I'd probably use Perl, either with system() to invoke "cp", or
using Perl's standard File::Copy module (the latter isn't as prone
to "foo ; rm -rf ~ ; bar.pdf" attacks). Whether you use Perl or not,
this looks like a job for a scripting language.
 
K

Keith Thompson

Hans Vlems said:
Malcolm,
thanks for the example. Copying the input file one byte per read
(getch) operation may not be the fastest way,

(getc, fgetc, or getchar, not getch)

Don't assume that copying a byte at a time will be slow. stdio does its
own buffering internally; most calls to getc will just read a byte from
a memory buffer.
 
M

Malcolm McLean

Don't assume that copying a byte at a time will be slow.  stdio does its
own buffering internally; most calls to getc will just read a byte from
a memory buffer.
Yes, there's not much point implementing a buffer on top of another
buffer. You are doing the test for EOF on every loop iteration, but
that won't make much of a difference.
 
H

Hans Vlems

int        fcopy( char *infile, char *outfile ) {
FILE    *inptr = fopen(infile,"rb"),     /*open file for binary read*/
   *outptr = fopen(outfile,"wb");        /*and write*/
unsigned char buff[256];           /*block of bytes from infile*/
int     buflen=255, nread=0,nwrite=0,      /*#bytes we try to read/write*/
   nrw = 0;                        /*total bytes read/written*/
if ( inptr!=NULL && outptr!=NULL ) {       /*have opened files*/
  while ( 1 ) {                            /*read & write them till eof*/
    /* --- read bytes from infile --- */
    nread = fread(buff,sizeof(unsigned char),buflen,inptr); /*read*/

sizeof(unsigned char) is 1 by definition.
Is there a particular reason why you are reading one byte less than
the size of the buffer? It seems more logical to use the entire buffer:

  nread = fread(buff, 1, sizeof buff, inptr);
    if ( nread < 1 ) break;             /* no bytes left in file */
    /* --- write bytes to outfile --- */
    nwrite = fwrite(buff,sizeof(unsigned char),nread,outptr); /*write*/
    if ( nwrite != nread ) { nrw=(-1); break; } /*problem writing*/
    nrw += nwrite;                 /*total #bytes*/
    if ( nread < buflen ) break;        /* no bytes left infile */

  if (nread < sizeof buff) break;
    } /* --- end-of-while(1) --- */
  fclose(inptr); fclose(outptr);   /* close files */
  } /* --- end-of-if(fileptrs!=NULL) --- */
return ( nrw );                            /*back to caller with file size*/

Potential file descriptor leak: if one of the files could be opened
but the other couldn't, the opened file is not closed.
Also, in this case 0 is returned, so from the return value you
cannot distinguish between having fopen errors, and having completed
a successful copy of an empty file.






} /* --- end-of-function fcopy() --- */
...which I've snipped from some code that works (with a few
essentially cosmetic changes so it reads okay as a code fragment).

Hi Ike, what I did was a blend of both solutions: I didn't care for
the goto's and the lack of error handling. So the result is pretty
robust. My code doesn't even come near the Internet and system
security is not my concern. Poor platform design, however, is
unfortunately and in this day and age ....
Hans
 
H

Hans Vlems

Given your concerns about safety, you may well be working at the wrong
level.  It's perfectly possible for a C program (using standard C IO) to
signal success without a single byte of data hitting the disk.

If your concern for a safe copy is indeed paramount, you will have ask
about OS-level facilities in a suitable group.  Both the OS and the
file-system types that are involved in the copy may have a bearing on
how to get maximum safety.

The procedure, err function, must copy a file and signal an error when
it trips on one.
That's what it does now. It returns 9 different values, eight of them
indicate an error.
Whether anything ever hit spinning metal is another function's task to
figure out, and I agree
that is beyond the scope of this group.

Hans
beyond the scope of this group a
 
H

Hans Vlems

Avoid system() unless executing a "canned" command supplied by the user..
If you need to spawn a child process with specific arguments, use fork()
and exec*() rather than attempting to construct a shell command.

[...]

This, as well as the rest of your response, relies heavily on the
assumption that the OP is on a Unix-like system.

--
Keith Thompson (The_Other_Keith) (e-mail address removed)  <http://www.ghoti.net/~kst>
    Will write code for food.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

Keith, I have to deal with a Windows Server 2003 box which is hardly
at the top of my list of favorites :)
 
H

Hans Vlems

Hans Vlems said:
[...]>   fpin = fopen(source, "rb2");
[...]
Is "rb2" a typo?
Possibly, but the setting the proper filemode is not my main concern.

If you mean that it's a trivial thing to correct and you're not going to
get it wrong, that's fine.  If you mean that it doesn't matter, it most
certainly does.

--
Keith Thompson (The_Other_Keith) (e-mail address removed)  <http://www.ghoti.net/~kst>
    Will write code for food.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

How on earth could it have meant anything else than the first option???
 
H

Hans Vlems

[...]
Malcolm,
thanks for the example. Copying the input file one byte per read
(getch) operation may not be the fastest way,

(getc, fgetc, or getchar, not getch)

Don't assume that copying a byte at a time will be slow.  stdio does its
own buffering internally; most calls to getc will just read a byte from
a memory buffer.

--
Keith Thompson (The_Other_Keith) (e-mail address removed)  <http://www.ghoti.net/~kst>
    Will write code for food.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

Ah, the beauty of the verb "may"
 
K

Keith Thompson

Hans Vlems said:
Hans Vlems said:
Malcolm McLean <[email protected]> writes:
[...]>   fpin = fopen(source, "rb2");

Is "rb2" a typo?
Possibly, but the setting the proper filemode is not my main concern.

If you mean that it's a trivial thing to correct and you're not going to
get it wrong, that's fine.  If you mean that it doesn't matter, it most
certainly does.

How on earth could it have meant anything else than the first option???

Some people really are that careless. I presume you're not, but it's
hard to be 100% sure, and foolish to assume it. (No offense intended.)
 
H

Hans Vlems

Yes, there's not much point implementing a buffer on top of another
buffer. You are doing the test for EOF on every loop iteration, but
that won't make much of a difference.

Malcolm,
I tried to improve on performance and in an earlier reply I wrote that
the original code you provided was sufficiently fast for the files I
fed it. So it doesn't matter much for my purpose. I'm dealing with
fairly small files here: they're all below 10 MB. What I did want is
to know whether something went wrong or not during a copy operation.
Which is why I can't use system(). It signals the status of passing a
command (script) to the OS. I'd rather stay away from programming DOS
command scripts and error handling therein! IMHO that's why
programming languages were invented... So don't worry, your code waas
very helpful. It's the goto's that had to go...
Hans
 
K

Keith Thompson

Hans Vlems said:
I tried to improve on performance and in an earlier reply I wrote that
the original code you provided was sufficiently fast for the files I
fed it. So it doesn't matter much for my purpose. I'm dealing with
fairly small files here: they're all below 10 MB. What I did want is
to know whether something went wrong or not during a copy operation.
Which is why I can't use system(). It signals the status of passing a
command (script) to the OS. I'd rather stay away from programming DOS
command scripts and error handling therein! IMHO that's why
programming languages were invented... So don't worry, your code waas
very helpful. It's the goto's that had to go...

Calling fclose() and having it return zero to indicate success
doesn't actually guarantee that the bits ended up on the platter.
The C standards says "Any unwritten buffered data for the stream
are delivered to the host environment to be written to the file".

Whatever method you use to copy the files, the only way to be 100%
sure that the file was actually copied is to read the copy and
compare it to the original.
 
N

Nobody

Avoid system() unless executing a "canned" command supplied by the user.
If you need to spawn a child process with specific arguments, use fork()
and exec*() rather than attempting to construct a shell command.
[...]

This, as well as the rest of your response, relies heavily on the
assumption that the OP is on a Unix-like system.

True, but "equivalent" issues exist on other platforms. If anything,
Windows is even worse (both on the system() issue and on filesystem
semantics).
 
M

Malcolm McLean

On 16 feb, 19:11, Malcolm McLean <[email protected]>
What I did want is to know whether something went wrong or not during a copy operation.
fopen return null if it fails.
fgetc() returns EOF on end of file, or error.
So after fgetc() returns EOF, call feof() to see whether the EOF was
generated by a read error or by end of file.
fputc() returns EOF of fail.
fclose() returns EOF on failure. Most people don't bother to check
this. But it's possible to have write error as the last few bytes are
flushed out.

As Keith says, the only way to be absolutely sure the file is
physically written is to read it back in. Even then, it could become
corrupted on disk. But that sort of error is very rare. Almost always
the error will be because fopen() fails, either it has been passed a
bad filename or the device isn't working properly.
 
P

Philip Lantz

Keith said:
Calling fclose() and having it return zero to indicate success
doesn't actually guarantee that the bits ended up on the platter.
The C standards says "Any unwritten buffered data for the stream
are delivered to the host environment to be written to the file".

Whatever method you use to copy the files, the only way to be 100%
sure that the file was actually copied is to read the copy and
compare it to the original.

Reading the copy and comparing it to the original also doesn't actually
guarantee that the bits ended up on the platter.
 
W

Willem

Malcolm McLean wrote:
) As Keith says, the only way to be absolutely sure the file is
) physically written is to read it back in. Even then, it could become
) corrupted on disk.

Or still be hanging in the volatile write cache, still scheduled to be
actually written. Which is more more likely. And if you're writing to
a network share, all bets are off.

If you really really want to be sure a file is actually on the disk
platters, you need OS-specific support.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
H

Hans Vlems

fopen return null if it fails.
fgetc() returns EOF on end of file, or error.
So after fgetc() returns EOF, call feof() to see whether the EOF was
generated by a read error or by end of file.
fputc() returns EOF of fail.
fclose() returns EOF on failure. Most people don't bother to check
this. But it's possible to have write error as the last few bytes are
flushed out.

As Keith says, the only way to be absolutely sure the file is
physically written is to read it back in. Even then, it could become
corrupted on disk. But that sort of error is very rare. Almost always
the error will be because fopen() fails, either it has been passed a
bad filename or the device isn't working properly.

The error handling available for fputc() and fgetc() is indeed very
useful.
COPY&COMPARE is a valid strategy if you're running on unreliable
disks.
In the mean time I got more information from the ICT support people.
Their view
is that the disks themselves are in good shape, but the configuration
of the SAN
and the Citrix farm connected to it were both poorly designed.
They're working to improve the setup but money is tight. That said,
comparing the
two copies might be the best strategy for the moment.
Hans
 
H

Hans Vlems

Reading the copy and comparing it to the original also doesn't actually
guarantee that the bits ended up on the platter.

Correct. As others already mentioned here, there are multiple layares
of caches in the
path between data in memory and data on disk.
 
H

Hans Vlems

Malcolm McLean wrote:

) As Keith says, the only way to be absolutely sure the file is
) physically written is to read it back in. Even then, it could become
) corrupted on disk.

Or still be hanging in the volatile write cache, still scheduled to be
actually written.  Which is more more likely.  And if you're writing to
a network share, all bets are off.

If you really really want to be sure a file is actually on the disk
platters, you need OS-specific support.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

OTOH Willem if you cannot rely on the firmware to put cached data
correctly on disk
then you're definitely in need of a better disk subsystem!
I'm not sure I understand your reference to "OS-specific
support"though.
Do you mean the existence of a record management system (e.g. RMS, a
VMS component)?
Windows and unix(es) don't have this feature and the programmer is
solely resposible
for the desired file structure.
A record management system may be called to report on errors etc. Was
that your point?
Hans
 
W

Willem

Hans Vlems wrote:
)> Malcolm McLean wrote:
)>
)> ) As Keith says, the only way to be absolutely sure the file is
)> ) physically written is to read it back in. Even then, it could become
)> ) corrupted on disk.
)>
)> Or still be hanging in the volatile write cache, still scheduled to be
)> actually written. ?Which is more more likely. ?And if you're writing to
)> a network share, all bets are off.
)>
)> If you really really want to be sure a file is actually on the disk
)> platters, you need OS-specific support.
)
) OTOH Willem if you cannot rely on the firmware to put cached data
) correctly on disk
) then you're definitely in need of a better disk subsystem!

I wasn't talking about the firmware layer, I was talking about
the OS write caching layer.

) I'm not sure I understand your reference to "OS-specific
) support"though.

For example, the sync() and fsync() call in several unix flavours.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,082
Messages
2,570,589
Members
47,212
Latest member
JaydenBail

Latest Threads

Top