Fastest way to split a file by columns?

Kevin

Just want to know what the best way is for this course coding task. :)

Task: to split a big file into many files by its columns. Each
resulting file consists of one column of the original big file.

The original file can be gigabytes in size (so we cannot hold it
in main memory). Each line has exactly 200 columns, separated by
",". The file is a plain text file.

For example, if the input file is:

abc, ee, ef, ww, ee, wwe.
aas, we, 64, www, w3, 46.
qw, 35, qg, d4, q3, a34.
......
......

We need to break it into 6 files, first file is:
abc
aas
qw
....

Second file is:
ee
we
35
....

Third file is:
ef
64
qg
....
etc.

My current method is:
1) create 200 file writers (each with 0.5M file buffer), each
corresponding to one column.
2) read in the original file (with 20M file buffer) line by line,
3) break each line with "," into tokens.
4) write out each token to its corresponding file writer.
This method, on a 1.2G input file with 200 columns, takes about 14
minutes. The physical memory on the machine is 512M, so the above file
buffers should fit into main memory.
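
Roughly, the code has this shape (just a sketch of the four numbered steps,
not the exact code; the file names are placeholders):

<code>
import java.io.*;

// Rough shape of the current method (placeholder file names).
public class ColumnSplit {

    public static void main(String[] args) throws IOException {

        final int colNum = 200;

        // 1) one writer per column, each with a 0.5M buffer
        BufferedWriter[] out = new BufferedWriter[colNum];
        for (int i = 0; i < colNum; i++) {
            out[i] = new BufferedWriter(new FileWriter("col" + i + ".txt"), 512 * 1024);
        }

        // 2) read the original file line by line through a 20M buffer
        BufferedReader in = new BufferedReader(new FileReader("input.txt"), 20 * 1024 * 1024);
        String line;
        while ((line = in.readLine()) != null) {
            // 3) break the line into tokens on ","
            String[] tokens = line.split(",");
            // 4) write each token to its column's file, one token per line
            for (int i = 0; i < tokens.length && i < colNum; i++) {
                out[i].write(tokens[i]);
                out[i].newLine();
            }
        }

        in.close();
        for (int i = 0; i < colNum; i++) {
            out[i].close();
        }
    }
}
</code>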

Is that the fastest method we can get?

How about the task with the reverse goal: if we have the resulting files
above, and we need to merge them back into one big file so that each column
comes from one file. My code, using a similar approach, needs 18 minutes.
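
The merge goes the other way round: open one reader per column file, read one
line from each, and join them with "," (again only a sketch, assuming all
column files have the same number of lines; file names are placeholders):

<code>
import java.io.*;

// Rough shape of the merge: one line from each column file becomes one output line.
public class ColumnMerge {

    public static void main(String[] args) throws IOException {

        final int colNum = 200;

        BufferedReader[] in = new BufferedReader[colNum];
        for (int i = 0; i < colNum; i++) {
            in[i] = new BufferedReader(new FileReader("col" + i + ".txt"));
        }

        BufferedWriter out = new BufferedWriter(new FileWriter("merged.txt"));
        String first;
        while ((first = in[0].readLine()) != null) {
            StringBuffer line = new StringBuffer(first);
            for (int i = 1; i < colNum; i++) {
                line.append(',').append(in[i].readLine());
            }
            out.write(line.toString());
            out.newLine();
        }

        out.close();
        for (int i = 0; i < colNum; i++) {
            in[i].close();
        }
    }
}
</code>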

I can't think of a better way. Any suggestions?
 
Andrey Kuznetsov

Kevin said:
My current method is:
1) create 200 file writers (each with 0.5M file buffer), each
corresponding to one column.
2) read in the original file (with 20M file buffer) line by line,
3) break each line with "," into tokens.
4) write out each token to its corresponding file writer.
....
Is that the fastest method we can get?

First, you don't need such huge buffers - they can make you slower.
Second, you can try Unified I/O - it does all the buffering for you.
In some cases Unified I/O can give you a speedup of up to 120 times.
If you use readLine, you get a String for every line, and that can make it slow.
With Unified I/O you can read a line into a byte array.
It's free and open source.
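
Unified I/O's own API is not shown here, but as a rough standard-library
illustration of the readLine point (every readLine call allocates a new
String), the file can be scanned as raw bytes instead. The file name below is
a placeholder, and the token handling is left out:

<code>
import java.io.*;

// Sketch only: scan the file as raw bytes, with no String created per line.
public class ByteScanSketch {

    public static void main(String[] args) throws IOException {

        InputStream in = new BufferedInputStream(new FileInputStream("input.txt"));
        byte[] chunk = new byte[64 * 1024];
        long commas = 0, lines = 0;
        int n;

        while ((n = in.read(chunk)) > 0) {
            for (int i = 0; i < n; i++) {
                if (chunk[i] == ',') commas++;        // column boundary
                else if (chunk[i] == '\n') lines++;   // line boundary
            }
        }
        in.close();
        System.out.println(lines + " lines, " + commas + " commas");
    }
}
</code>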
 
Rhino

Kevin said:
Task: to split a big file into many files by its columns. Each
resulting file consists of one column of the original big file.
....
Is that the fastest method we can get?

In 25 years of working with computers professionally, a good bit of it as an
instructor of computer courses, I've never seen a requirement to do anything
like what you are talking about, namely slicing a file vertically. I suppose
the main intent is just to make you learn to handle memory and file I/O but
you'd think the instructor of your course could come up with a more
realistic problem that would help you develop those same skills....
 
Jeffrey Schwab

Rhino said:
....

In 25 years of working with computers professionally, a good bit of it as an
instructor of computer courses, I've never seen a requirement to do anything
like what you are talking about, namely slicing a file vertically. I suppose
the main intent is just to make you learn to handle memory and file I/O but
you'd think the instructor of your course could come up with a more
realistic problem that would help you develop those same skills....

?!

So you've never used /bin/cut? You've never had to parse one column out
of a table stored in a text file? You've never had to use blockwise
selection in nedit, emacs, or vi (control-v)?

For almost anyone who does any serious data munging, this comes up all
the time. Files are frequently stored with one record per line, and
each line divided into columns either according to fixed widths, or by
separators: commas in CSV, colons in /etc/passwd, etc. This is
actually one of the most useful and relevant examples I've ever heard of
being assigned to students. It's a heck of a lot more useful than
implementing a Red/Black Tree, or the Game of Life, or most of the other
busy-work I was given as an undergraduate.
 
Chris Uppal

Rhino said:
In 25 years of working with computers professionally, a good bit of it as
an instructor of computer courses, I've never seen a requirement to do
anything like what you are talking about, namely slicing a file
vertically.

It's the kind of thing that comes up quite often in my experience.

For instance:

Extracting data from almost any kind of log, discarding unwanted fields.

Extracting the useful information from compiler error messages.

Pulling relevant information from formatted files (/etc/passwd...).

....

-- chris
 
Chris Uppal

Kevin said:
Task: to split a big file into many files by its columns. Each
resulting file consists of one column of the original big file.
....
My current method is:
1) create 200 file writers (each with 0.5M file buffer), each
corresponding to one column.
2) read in the original file (with 20M file buffer) line by line,
3) break each line with "," into tokens.
4) write out each token to its corresponding file writer.
This method, on a 1.2G input file with 200 columns, takes about 14
minutes. The physical memory on the machine is 512M, so the above file
buffers should fit into main memory.

Is that the fastest method we can get?

I suspect that, as Andrey has already said, your buffers are oversized. If you
arrived at those sizes by experiment then kudos to you, and please ignore the
rest of this paragraph. If not, then I'd try using 1M for all of them, or even
half that. I've never seen a significant performance gain from using buffers
larger than a megabyte (except when writing to a tape drive).

One thing to consider is that as you write the 200 output files, you will be
making the disk heads move a /lot/ and not sequentially, and that may well be
slowing your program down substantially -- or there again, it may not. I'd try
doing this in N passes, each extracting 200/N columns. I don't know what the
best value of N is, it may be 1 as you are currently assuming ;-) but I'd try
increasing it to -- say -- 20 and see what happens. It's even possible that
the optimum will be 200. Considering disk movements can be a big win -- I once
rewrote a single pass program that would have taken >10 hours to run, into a
multi-pass (near-)equivalent that ran in about 3 minutes...
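
A sketch of what an N-pass version might look like (N = 20 here; nothing is
tuned, and the file names are placeholders):

<code>
import java.io.*;

// Sketch of the N-pass idea: each pass writes only 200/N of the columns,
// so far fewer output files are open at once.
public class MultiPassSplit {

    public static void main(String[] args) throws IOException {

        final int colNum = 200;
        final int passes = 20;                    // N: tune by experiment
        final int colsPerPass = colNum / passes;

        for (int pass = 0; pass < passes; pass++) {

            int firstCol = pass * colsPerPass;

            BufferedWriter[] out = new BufferedWriter[colsPerPass];
            for (int i = 0; i < colsPerPass; i++) {
                out[i] = new BufferedWriter(new FileWriter("col" + (firstCol + i) + ".txt"));
            }

            // re-read the whole input on every pass
            BufferedReader in = new BufferedReader(new FileReader("input.txt"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] tokens = line.split(",");
                for (int i = 0; i < colsPerPass; i++) {
                    out[i].write(tokens[firstCol + i]);
                    out[i].newLine();
                }
            }
            in.close();

            for (int i = 0; i < colsPerPass; i++) {
                out[i].close();
            }
        }
    }
}
</code>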

A more sophisticated multi-pass version would be to read in the input file in
giant chunks (100 M, say), and then do multiple passes over that, appending the
data from that chunk to each output file in turn. You might find it worthwhile
splitting up the data in the chunk at the beginning rather than duplicating
that effort on each pass, but that would massively increase your object counts
and thus reduce the size of the chunk that you could use. A tradeoff, and only
experiment (or careful /quantitative/ analysis) can tell you which is better.

You may also be able to micro-optimise the splitting of the file into lines and
sub-strings, but I find it hard to believe that that would have any significant
effect in comparison with IO costs.

-- chris
 
Red Orchid

Kevin said:
Just want to know what the best way is for this course coding task. :)

Task: to split a big file into many files by its columns. Each
resulting file consists of one column of the original big file.

I assume the following:

a) Space is a valid character.
b) The line divider is "\r\n" or "\n".

Maybe it is something like this:

<code>
(this code is not tested).

import java.io.*;
import java.util.Arrays;

public class MainFrame {

    static InputStream in;
    static final int colNum = 200;
    static OutputStream[] out = new OutputStream[colNum];

    public static void main(String[] args) {

        try {
            initializeStream();

            byte[] b = new byte[1];
            ByteBuffer buf = new ByteBuffer(512);
            int colIndex = 0;

            while (true) {

                if (in.read() < 0) {
                    break;
                }
                switch (b[0]) {
                    case ',':
                        // end of a column: write the token (and a newline) to its file
                        out[colIndex].write(buf.get(), 0, buf.size());
                        out[colIndex].write('\n');
                        colIndex++;
                        buf.clear();
                        break;

                    case '\r':
                        // skip carriage returns
                        break;

                    case '\n':
                        // end of a line: flush the last column, then restart at column 0
                        out[colIndex].write(buf.get(), 0, buf.size());
                        out[colIndex].write('\n');
                        buf.clear();
                        colIndex = 0;
                        break;

                    default:
                        // part of the current token
                        buf.add(b[0]);
                        break;
                }
            }
        }
        catch (Exception e) {
            // process exception.
        }
        finally {
            closeStream();
        }
    }

    static void initializeStream() throws Exception {

        in = getInputStream();
        Arrays.fill(out, null);

        for (int i = 0; i < out.length; i++) {
            out[i] = getOutputStream(i);
        }
    }

    static void closeStream() {

        try {
            if (in != null) in.close();

            for (int i = 0; i < out.length; i++) {
                if (out[i] != null) out[i].close();
            }
        }
        catch (Exception e) {
        }
    }

    static InputStream getInputStream() throws Exception {

        File f = new File("your input file");
        return new BufferedInputStream(new FileInputStream(f));
    }

    static OutputStream getOutputStream(int index) throws Exception {

        File f = new File("your output file" + index + ".txt");
        return new BufferedOutputStream(new FileOutputStream(f));
    }

    // Simple growable byte buffer that collects the bytes of the current token.
    static class ByteBuffer {

        int pos;
        byte[] buf;

        public ByteBuffer(int initialCapacity) {
            buf = new byte[initialCapacity];
        }

        void add(byte b) {
            if (pos >= buf.length) {
                resize();
            }
            buf[pos++] = b;
        }

        void resize() {
            // double the capacity, keeping the existing contents
            byte[] b = new byte[buf.length * 2];
            System.arraycopy(buf, 0, b, 0, buf.length);
            buf = b;
        }

        byte[] get() {
            return buf;
        }

        int size() {
            return pos;
        }

        void clear() {
            pos = 0;
        }
    }
}

</code>
 
Rhino

Jeffrey Schwab said:
?!

So you've never used /bin/cut? You've never had to parse one column out
of a table stored in a text file? You've never had to use blockwise
selection in nedit, emacs, or vi (control-v)?
....

Okay, maybe I've taken the assignment too narrowly.

Yes, of course, I've often had to parse lines in files and use the relevant
bits for something; I've also had to load CSV files into databases and deal
with fixed-width columns on many occasions. And I agree that is a useful and
realistic thing to learn.

I just meant that I've never had to slice a file vertically so that each of
the many columns was split into separate files and then joined back together
at some point. I could see that it might be useful to slice out the primary
key column(s) into one file and then put the rest of the columns (combined)
into a second file and then have the key point to the data, as would be the
case in a hash table. But slicing each column into a separate file? I don't
see much point in that.
 
Rhino

Chris Uppal said:
It's the kind of thing that comes up quite often in my experience.
....

See my reply to Jeffrey Schwab for a clarification of what I meant.
 
Jeffrey Schwab

Rhino said:
Okay, maybe I've taken the assignment too narrowly.
....
But slicing each column into a separate file? I don't
see much point in that.

OK, I hear what you're saying. I actually have had to deal with this
case quite a bit, though.

I used to work with a lot of file formats representing digital
electronic signals. In some formats, each column represented a vector
of binary values; for example, a simulation of an AND gate might produce
a file like this, where C is the clock, A and B are inputs, and O is the
output:

CABO
0000
1010
0100
1111

This is how the Avanti/Synopsys .VEC digital format works, except for a
header at the top of the file giving more information, e.g. the number
of picoseconds per clock phase. I think the idea is that it's supposed
to look like a truth table.

Of course, other formats want all values in a given vector to be on the
same line:

C0101
A0011
B0101
O0001

So I ended up doing a lot of column-by-column extraction. Often,
different subsets of the signals would be grouped into separate files,
e.g. all inputs in a stimulus file for a SPICE simulation.

And in case you're wondering, no, I'm not making this stuff up as I go. :)
 
Kevin

This piece of code is really useful.
I tested it: it improves the speed from 14 minutes (using FileReader and
BufferedReader) to 7 minutes on my 1.2G data file.
Thank you a lot. :)

PS: just replace the read at the start of the loop with:

if (in.read(b) < 0) {  // the posted version was missing the b there
    break;
}

and I also increased the buffer from the default to 500k.
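
(For reference, whichever streams are involved, the bigger buffer is just the
optional size argument of the buffered stream constructors; here it is shown
for the input side, with a placeholder file name.)

<code>
import java.io.*;

// Sketch: a 500k buffer on the input stream instead of the default size.
public class BufferSize {

    public static void main(String[] args) throws IOException {

        InputStream in = new BufferedInputStream(
                new FileInputStream("input.txt"), 500 * 1024);
        // ... read and split as before ...
        in.close();
    }
}
</code>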
 
Roedy Green

Rhino said:
I just meant that I've never had to slice a file vertically so that each of
the many columns was split into separate files and then joined back together
at some point.

That is the sort of thing you did all the time with unit record
equipment (punch cards) and later the programs that simulated it.

I wrote a suite of utilities that among other things simulated all the
unit record operations, letting you combine files by selecting columns
and merging, sorting, updating, filtering etc.

To my astonishment, lawyers would glue these together into sequences 50 steps
or so long to accomplish their ends. They would think of each file
operation the way you or I would think of adding a couple of local variables.
Their primitives worked on whole files of 200,000 to 4 million
records.
 
