File Handling Issues

Ian Esling

I have some code that polls directories; when it finds one or more
files there, it imports the contents into a database. Very
occasionally (about once in every 5000 files) it will pick up a file
and archive it without actually putting the contents into the
database. No error is thrown anywhere, and all the logging looks
like the file's been processed without a hitch. Putting the file back
into the import directory gets it imported and read correctly into the
database.

The logging consists of a couple of statements written to a logfile
saying what files it's found in what directory then the name of each
individual file as it processes them. There's also a table in the
database where we record the filenames, time of processing, how many
records were read in, how many contained errors etc. In the case of
this error occurring the log files look exactly how you'd expect if
the file had been imported correctly and the table in the database
shows it processed fine, however the numbers shown are zero (which is
actually correct, it did only process zero!) when it should have read
in at least one record.

The import process consists of moving each file to be processed from
the import directory into a working directory whilst it's worked on,
then moved again into an archive directory once it's done with.

I'm a bit baffled how this error could have occurred without any
exceptions being thrown or logged; any suggestions welcome. I've got
a theory that it might be due to the file handling in the code, which
looks like this:

for (File file : filesToImport())
{
    importFileHandlingExceptions(file);
}

public void importFileHandlingExceptions(File file)
{
    log.debug("Importing file " + file);
    try
    {
        importFile(file);
    }
    catch (Exception e)
    {
        handleImportException(file, e);
    }
}

public void importFile(File file) throws IOException
{
    file = workingDirectoryCreator.moveFileIntoDir(file);
    importer.importFile(file, summaries);
    file = archiveDirectoryCreator.moveFileIntoDir(file);
    summaries.setArchiveFilename(file.getName());
}

public File moveFileIntoDir(File file) throws IOException
{
    return moveToFile(file,
        unusedFileFinder.findFile(file.getName()));
}

public static File moveToFile(File moveMe, File destinationFile)
    throws IOException
{
    boolean success = moveMe.renameTo(destinationFile);
    if (!success)
    {
        throw new IOException("Unable to move " + inspectFile(moveMe)
            + " to " + inspectFile(destinationFile));
    }
    return destinationFile;
}

What I'm wondering is if it's due to picking up the file at the
beginning (in the for (File file... bit) and then the subsequent
processing is done on that variable being passed around. Occasionally
we might pick up that file before the process that ftps it into the
import directory has actually finished writing it, so at that moment
the file variable is actually holding an empty file. The subsequent
moving of the actual file works OK because they're small files and by
the time that code executes the ftp process has finished with it and
released it, but when we do our subsequent processing we're still
working on the original file variable?

I'm busy working on some test code to try and replicate this but
realise I could well be barking up the wrong tree, and not even sure
what I've just suggested could happen, hopefully someone out there has
encountered something similar to this and could share their experience?
 
Matt Humphrey

Ian Esling said:
What I'm wondering is if it's due to picking up the file at the
beginning (in the for (File file... bit) and then the subsequent
processing is done on that variable being passed around. Occasionally
we might pick up that file before the process that ftps it into the
import directory has actually finished writing it, so at that moment
the file variable is actually holding an empty file.

In this kind of file processing, it is a very common problem that the file
is detected before its contents have been fully added by whatever external
process puts the files there. It's not so much a problem with your program
as with the unsynchronized conflict between the external writer and your reader.
In fact, although you are detecting this problem because you are finding
0-record entries in your database, the truth is that any of your apparently
correct non-empty files may be accidentally truncated because they are
processed before the sender has finished sending the file.

A technique that avoids this problem is to have the sender create the file
with a twist to the name, such as .TMP, or a leading underscore or
something. Set your program to ignore such files. When the sender has
finished copying the file, it renames the file to its correct name.
Renaming is typically an atomic file-system operation so that you will get
valid results no matter when your filesToImport() is called. The file will
always be fully present when your program finds it.
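
The rename technique above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the class TempNameDrop, the ".TMP" suffix and the methods drop() and listReadyFiles() are all made-up names, and it assumes the sender and the reader see the same file system. With an FTP sender, the rename would be issued by the FTP client after the upload completes rather than by Java code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating "write under a temp name, then rename".
final class TempNameDrop
{
    // Assumed marker; any naming twist both sides agree on would do.
    static final String TEMP_SUFFIX = ".TMP";

    // Sender side: write the whole file under a temporary name, then rename it.
    static Path drop(Path importDir, String finalName, String contents) throws IOException
    {
        Path temp = importDir.resolve(finalName + TEMP_SUFFIX);
        Files.write(temp, contents.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException if the
        // file system cannot perform the rename atomically.
        return Files.move(temp, importDir.resolve(finalName), StandardCopyOption.ATOMIC_MOVE);
    }

    // Reader side: only regular files without the temporary suffix are ready.
    static List<Path> listReadyFiles(Path importDir) throws IOException
    {
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(importDir))
        {
            for (Path candidate : stream)
            {
                String name = candidate.getFileName().toString();
                if (Files.isRegularFile(candidate) && !name.endsWith(TEMP_SUFFIX))
                {
                    ready.add(candidate);
                }
            }
        }
        return ready;
    }
}
```

The reader never needs to know how long a transfer takes: a file either still carries the suffix and is skipped, or has been renamed and is complete.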

If you cannot modify the drop program to tell you when it's finished, you
can change your program to keep track of files and the date of last
modification. When that date stops changing, you can safely process the
file. I'm not a fan of that solution because you have to set some arbitrary
time limit as to what constitutes no change.
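
For what it's worth, the "wait until the modification date stops changing" approach might look something like the sketch below. StableFileTracker and its quietMillis parameter are hypothetical names; the quiet period is exactly the arbitrary limit mentioned above, and the sketch also compares the file length as an extra guard against coarse timestamps. The current time is passed in so the logic can be exercised without sleeping.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker: a file counts as stable once its size and
// last-modified time have not changed for at least quietMillis.
final class StableFileTracker
{
    private static final class Snapshot
    {
        final long lastModified;
        final long length;
        final long firstSeenAt;

        Snapshot(long lastModified, long length, long firstSeenAt)
        {
            this.lastModified = lastModified;
            this.length = length;
            this.firstSeenAt = firstSeenAt;
        }
    }

    private final long quietMillis;
    private final Map<String, Snapshot> seen = new HashMap<>();

    StableFileTracker(long quietMillis)
    {
        this.quietMillis = quietMillis;
    }

    // Call once per polling pass; nowMillis is normally System.currentTimeMillis().
    boolean isStable(File file, long nowMillis)
    {
        long modified = file.lastModified();
        long length = file.length();
        Snapshot previous = seen.get(file.getPath());
        if (previous == null || previous.lastModified != modified || previous.length != length)
        {
            // New or still changing: restart the quiet period.
            seen.put(file.getPath(), new Snapshot(modified, length, nowMillis));
            return false;
        }
        return nowMillis - previous.firstSeenAt >= quietMillis;
    }

    // Forget a file once it has been moved out of the import directory.
    void forget(File file)
    {
        seen.remove(file.getPath());
    }
}
```

A polling loop would call isStable(file, System.currentTimeMillis()) for each candidate and only import those that return true, calling forget(file) afterwards.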

Ian Esling said:
The subsequent
moving of the actual file works OK because they're small files and by
the time that code executes the ftp process has finished with it and
released it, but when we do our subsequent processing we're still
working on the original file variable?

Are you absolutely sure that some records were not missing from the end of
the file? Also (in case it's not clear) the file variable does not hold the
contents of the file. It's really just the file name so there's no reason
to believe the file contents will stay the same in the time between when you
get the reference (filesToImport()) and when you actually process the file.
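
A small experiment shows the point about the File variable. The class name FileIsJustAName is made up for illustration; the sketch creates an empty temporary file, appends to it, and asks the same File object for its length before and after.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardOpenOption;

// Hypothetical demo: a java.io.File is a path, not a snapshot of contents.
final class FileIsJustAName
{
    static long[] lengthsBeforeAndAfterAppend() throws IOException
    {
        File file = File.createTempFile("import", ".dat");
        file.deleteOnExit();

        // The file exists but nothing has been written yet.
        long before = file.length();

        // Something else (here, this same method) appends a 9-byte record.
        Files.write(file.toPath(),
            "record 1\n".getBytes(StandardCharsets.US_ASCII),
            StandardOpenOption.APPEND);

        // The very same File object now reports the new size.
        long after = file.length();
        return new long[] { before, after };
    }

    public static void main(String[] args) throws IOException
    {
        long[] lengths = lengthsBeforeAndAfterAppend();
        System.out.println("before=" + lengths[0] + " after=" + lengths[1]);
    }
}
```

So whatever the importer reads depends on what is on disk at the moment it opens the file, not on what was there when the File object was created.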

Ian Esling said:
I'm busy working on some test code to try and replicate this but
realise I could well be barking up the wrong tree, and not even sure
what I've just suggested could happen, hopefully someone out there has
encountered something similar to this and could share their experience?

If you want to try a good test, write a little program that drops the file
very slowly. Create the file, wait a few seconds, add a record (or
whatever), wait a few seconds, etc.
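
A minimal sketch of such a slow dropper is below. SlowDropper, the record text, and the record count and pause in main() are all assumptions; point it at the import directory while the poller is running and see whether the importer ever archives an empty or truncated file.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical test helper that writes a file one record at a time.
final class SlowDropper
{
    static void dropSlowly(Path target, int records, long pauseMillis)
        throws IOException, InterruptedException
    {
        // Opening the writer creates the file, so it is visible (and empty)
        // to the poller before the first record is written.
        try (Writer out = new FileWriter(target.toFile()))
        {
            for (int i = 1; i <= records; i++)
            {
                Thread.sleep(pauseMillis);
                out.write("record " + i + "\n");
                // Flush so each record reaches the file before the next pause.
                out.flush();
            }
        }
    }

    public static void main(String[] args) throws Exception
    {
        // Usage: java SlowDropper /path/to/import/dir/test.dat
        dropSlowly(Paths.get(args[0]), 5, 3000);
    }
}
```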

Matt Humphrey http://www.iviz.com/
 
