Efficiently concatenating contents of multiple files

sasuke

Hello to all Java programmers out there. :)

I was just wondering what would be the most time / space efficient way
of concatenating contents of different files to a single file. Sample
usage would be:
java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

Using threads to open a stream to the source files is out of the question
since the data needs to be written in the order in which it
exists in the source files, i.e. no ad hoc writing. Reading the entire
contents of the file into memory (by using a StringBuffer /
StringBuilder) also isn't a good choice considering that we can come
across really large text files (~10 MB, typical for db dumps). Reading
the source file line by line doesn't seem attractive given that it
would increase I/O and for really large files might turn out to
be an I/O bottleneck. One solution which comes to mind is to read the
file in chunks, i.e. read the data into a char array of 8 KB or a string
array of size 100.

My question here is: is there any ideal solution which comes to
mind when solving this problem or does the solution really depend on
the domain in consideration and the kind of sacrifices we are ready to
make (e.g. lose the ordering of data, memory trade off when reading
entire file in a buffer, I/O hit)?

Pardon me for asking such a trivial / silly question, but it was just a
thought. :)

Regards,
/~sasuke
 
RedGrittyBrick

sasuke said:
Hello to all Java programmers out there. :)

I was just wondering what would be the most time / space efficient way
of concatenating contents of different files to a single file. Sample
usage would be:
java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

The most efficient usage of your time is not to reinvent wheels.

Using threads to open a stream to the source files is out of question
since the data needs to be written in an ordered manner in which it
exists in the source files i.e. no ad hoc writing.

Having multiple threads doing I/O to the same disk is likely to slow
things down.

Reading the entire
contents of the file into memory (by using a StringBuffer /
StringBuilder) also isn't a good choice considering that we can come
across really large text files (~10 MB, typical for db dumps).

I see no benefit in reading a whole file into memory.

Reading
the source file line by line doesn't seem attractive given that it
would increase I/O and again for really large files might turn out to
be an I/O bottleneck.

You don't need the JVM to be doing conversion to UTF-16, or pointless
line-oriented processing (e.g. scanning for line-endings).

One solution which comes to mind is to read the
file in chunks; i.e. read the data in char array of 8KB or a string
array of size 100.

My question here is: is there any ideal solution which comes to
mind when solving this problem

:)

cat sourceFileOne.txt sourceFileTwo.txt ... > targetFile.txt

or

copy sourceFileOne.txt+sourceFileTwo.txt ... targetFile.txt

depending on operating system.

or does the solution really depend on
the domain in consideration and the kind of sacrifices we are ready to
make (e.g. lose the ordering of data, memory trade off when reading
entire file in a buffer, I/O hit)?


I wouldn't reinvent this wheel but if you are doing it I suggest you
treat the files as binary not as text (especially not using anything
that translates encodings). Reading in large fixed-size chunks would
seem to be sensible. Given that the task is I/O bound I wouldn't try too
hard to optimise anything else.
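
For anyone who does want to write it anyway, the advice above (treat the files as binary, copy in large fixed-size chunks) could be sketched roughly as follows. This is only an illustration: the class name ChunkedConcat and the 64 KB buffer size are assumptions, not something from this thread, and the best buffer size is worth measuring on the actual hardware.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ChunkedConcat {
    // Assumed chunk size; not a recommendation from the thread.
    private static final int BUFFER_SIZE = 64 * 1024;

    /** Writes each source, in list order, to target (created or truncated). */
    public static void concat(Path target, List<Path> sources) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path source : sources) {
                try (InputStream in = Files.newInputStream(source)) {
                    int n;
                    // Raw bytes only: no decoding, no line scanning.
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }

    // Usage: java ChunkedConcat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java ChunkedConcat target source...");
            return;
        }
        List<Path> sources = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            sources.add(Paths.get(args[i]));
        }
        concat(Paths.get(args[0]), sources);
    }
}
```

Because one thread reads the sources strictly in argument order, the ordering requirement from the original question is met without any extra bookkeeping.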
 
Abhijat Vatsyayan

Why not use the concat task that comes with Ant? Or, if you can use a shell on
a *nix box, use "cat". Or install the cat binary from Cygwin on the Windows
box (the list goes on). There are many solutions out there, the least
recommended being writing something like this from scratch (unless you
are doing this just for learning or for fun).
Abhijat
 
sasuke

Thanks to all for their replies. True, when programming we must seek
real life solutions to real world problems and the only efficient way
here seems to be making use of platform specific trickery.

I also completely agree with the general consensus that reading /
writing raw bytes is much faster than reading in bytes,
converting them into a string using a given or default encoding, and writing
the string to the target file, where it is again encoded into a byte
array based on the encoding.

A few queries though:

What encoding are your text files in? If the source and target files are
in the same encoding, and do not have a BOM character at the beginning of
the file, then a binary transfer is the way to go. Take a look at
java.nio.channels.FileChannel.transferTo / transferFrom
http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.h...,
long, java.nio.channels.WritableByteChannel)

Isn't this method an abstract method? So it implies that I need to
subclass this class and create my own specialized class which deals
with the content transfer? I wonder how that is any different from
doing it the raw way...
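
On the abstract-method question: FileChannel is indeed an abstract class, but no subclassing is needed, because the JDK hands out concrete instances (from FileInputStream.getChannel() / FileOutputStream.getChannel(), or from FileChannel.open on Java 7 and later). A minimal sketch of a transferTo-based concatenation, assuming Java 7+ and a made-up class name ChannelConcat:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class ChannelConcat {
    /** Appends each source, in list order, to target using FileChannel.transferTo. */
    public static void concat(Path target, List<Path> sources) throws IOException {
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            for (Path source : sources) {
                try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
                    long position = 0;
                    long size = in.size();
                    // transferTo may move fewer bytes than requested, so loop
                    // until the whole source has been handed over.
                    while (position < size) {
                        position += in.transferTo(position, size - position, out);
                    }
                }
            }
        }
    }
}
```

The difference from a hand-written read/write loop is that the copy is delegated to the channel implementation, which may be able to move the bytes without passing them through a Java-side buffer. Whether that actually happens depends on the platform.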
If you need to deal with different encodings (from your example usage, you
might check to see if your source files were using different BOMs), then
reading a block of characters (decoding from source), and writing them
back to the target (encoding them with the target file's encoding) may be
more appropriate. If they all have the same encoding, but use BOMs, then
you can use a binary transfer, skipping the BOM character from all but the
first source file.

BOM? Googling says that this is some sort of Byte order mark but I
don't think I have ever worked with BOM files before. If this is some
special byte which occurs at the start of every file (like some sort
of header) I wonder how you can call them plain text files?
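
To make the BOM point concrete: in UTF-8 the byte order mark, when present, is the three bytes EF BB BF at the very start of the file. The "skip the BOM in all but the first source file" idea quoted above could be sketched like this. It assumes every source is UTF-8; the class name BomAwareConcat and the buffer size are illustrative only, and other encodings use different BOM bytes that this sketch does not handle.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class BomAwareConcat {
    // UTF-8 byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    public static void concat(Path target, List<Path> sources) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        boolean first = true;
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path source : sources) {
                try (InputStream in = new BufferedInputStream(Files.newInputStream(source))) {
                    if (!first) {
                        skipUtf8Bom(in); // keep the BOM of the first file only
                    }
                    first = false;
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }

    /** Consumes a leading UTF-8 BOM if present; otherwise leaves the stream untouched. */
    static void skipUtf8Bom(InputStream in) throws IOException {
        in.mark(UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int total = 0;
        while (total < head.length) {
            int n = in.read(head, total, head.length - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        if (total != head.length || !Arrays.equals(head, UTF8_BOM)) {
            in.reset(); // not a BOM: rewind so these bytes are copied normally
        }
    }
}
```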

Your inputs are much appreciated.

Thanks and regards,
/sasuke
 
Roedy Green

I was just wondering what would be the most time / space efficient way
of concatenating contents of different files to a single file. Sample
usage would be:
java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

1. If you want a platform-specific solution, you could spawn a command
processor shell.

2. The simplest code would just be to read each file with a
BufferedReader using a whacking huge buffer size and write in turn to a
buffered output. See http://mindprod.com/applet/fileio.html for
sample code. That has needless overhead for converting from bytes to
char and back, though in theory you could concatenate files of
different encodings if you knew what they were.

3. If you read the files as raw bytes rather than chars, you know
their precise lengths, and the offset where they will fit in the final
file. You could use random access to implement your thread idea.
However, I doubt the game will be worth the candle unless the files to
be gathered live on different _physical_ drives. All you will succeed
in doing is jerking the heads all over.

4. If you want a canned solution, use the FileTransfer class
downloadable from
http://mindprod.com/products.html#FILETRANSFER

It does it rapidly in large raw-byte chunks.

// test FileTransfer.append
import com.mindprod.filetransfer.FileTransfer;
import java.io.File;

public class Concat
{
    /**
     * test harness to concatenate c onto the end of b, leaving the
     * result in a.
     *
     * @param args not used
     */
    public static void main ( String[] args )
    {
        File a = new File ("C:/temp/temp.txt"); // does not exist yet
        File b = new File ("E:/mindprod/feedback/peaceincorrect.html");
        File c = new File ("E:/mindprod/jgloss/j.html");

        FileTransfer ft = new FileTransfer ( 50000 /* buffsize */ );
        // source, target
        ft.append( b, a );
        ft.append( c, a );
    }
}
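
Option 3 above (raw bytes give known lengths, hence known offsets in the final file) can be illustrated with positional writes. The sketch below uses only the standard library, not the FileTransfer class; the class name OffsetConcat and the buffer size are assumptions. It copies the sources in reverse order on a single thread purely to show that each copy is independent of the others; handing the copyAt calls to worker threads would be the threaded variant, with the caveat about head movement already given.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class OffsetConcat {
    /** Start offset of each source in the concatenated file. */
    public static long[] offsets(List<Path> sources) throws IOException {
        long[] result = new long[sources.size()];
        long next = 0;
        for (int i = 0; i < sources.size(); i++) {
            result[i] = next;
            next += Files.size(sources.get(i));
        }
        return result;
    }

    /** Copies one source into out, starting at the given absolute offset. */
    static void copyAt(Path source, FileChannel out, long offset) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long position = offset;
            while (in.read(buffer) != -1) {
                buffer.flip();
                while (buffer.hasRemaining()) {
                    // Positional write: does not touch the channel's own position.
                    position += out.write(buffer, position);
                }
                buffer.clear();
            }
        }
    }

    public static void concat(Path target, List<Path> sources) throws IOException {
        long[] offsets = offsets(sources);
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            // Reverse order on purpose: the result is the same in any order.
            for (int i = sources.size() - 1; i >= 0; i--) {
                copyAt(sources.get(i), out, offsets[i]);
            }
        }
    }
}
```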
 
Tom Anderson

1. If you want a platform-specific solution, you could spawn a command
processor shell.

2. The simplest code would just be to read each file with a
BufferedReader using a whacking huge buffer size and write in turn to a
buffered output. See http://mindprod.com/applet/fileio.html for sample
code. That has needless overhead for converting from bytes to char and
back, though in theory you could concatenate files of different
encodings if you knew what they were.

I think what I'd do is memory-map the input file using NIO, and then write
the entire thing to the output in one go. And then cross my fingers and
hope that the OS was smart enough to do the right thing here, rather than
attempting to load the whole input file into memory first. If it does do
the right thing, this avoids a lot of copying of bytes to and from java,
and might even avoid any copying across the kernel/userspace border.
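
A rough sketch of that memory-mapping idea, for illustration only: the class name MappedConcat is made up, and whether the OS really avoids extra copies is exactly the fingers-crossed part described above. A single mapping cannot exceed Integer.MAX_VALUE bytes, so larger files are mapped in several windows.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class MappedConcat {
    public static void concat(Path target, List<Path> sources) throws IOException {
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            for (Path source : sources) {
                try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
                    long size = in.size();
                    long position = 0;
                    while (position < size) {
                        // One mapping is limited to Integer.MAX_VALUE bytes.
                        long length = Math.min(Integer.MAX_VALUE, size - position);
                        MappedByteBuffer map =
                                in.map(FileChannel.MapMode.READ_ONLY, position, length);
                        // Hand the mapped region to the output channel as it is.
                        while (map.hasRemaining()) {
                            out.write(map);
                        }
                        position += length;
                    }
                }
            }
        }
    }
}
```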

But yeah, running 'cat' is the right solution here.

tom
 
