Kevin
Just want to know what the best way is for this course coding task.
Task: split a big file into many files by its columns. Each
resulting file consists of one column of the original big file.
The original file can be gigabytes in size (so we cannot hold it
in main memory). Each line has exactly 200 comma-separated
columns. The file is a plain text file.
For example, if the input file is:
abc, ee, ef, ww, ee, wwe.
aas, we, 64, www, w3, 46.
qw, 35, qg, d4, q3, a34.
......
......
We need to break it into 6 files, first file is:
abc
aas
qw
....
Second file is:
ee
we
35
....
Third file is:
ef
64
qg
....
etc.
My current method is:
1) create 200 file writers (each with a 0.5M file buffer), one
per column,
2) read the original file (with a 20M file buffer) line by line,
3) split each line on "," into tokens,
4) write each token to its corresponding file writer.
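The steps above can be sketched roughly like this in Python (the function name, file naming scheme, and buffer sizes are illustrative, not the original code; any language with buffered I/O would look similar):

```python
def split_columns(input_path, output_prefix, num_cols=200):
    """Split input_path into num_cols files, one per comma-separated column."""
    # One buffered writer per column, each with a 0.5 MB buffer.
    writers = [open(f"{output_prefix}{i}.txt", "w", buffering=512 * 1024)
               for i in range(num_cols)]
    try:
        # Read the big file line by line with a large (20 MB) input buffer.
        with open(input_path, "r", buffering=20 * 1024 * 1024) as src:
            for line in src:
                tokens = line.rstrip("\n").split(",")
                for writer, token in zip(writers, tokens):
                    writer.write(token + "\n")
    finally:
        for w in writers:
            w.close()
```

The per-writer buffers matter: without them, every token would trigger a small write syscall, and with 200 output files the disk would seek constantly.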
This method, on a 1.2G input file with 200 columns, takes about 14
minutes. The machine has 512M of physical memory, so the file
buffers above should fit in main memory.
Is that the fastest method we can get?
How about the reverse task: given the resulting files above, merge
them back into one big file so that each column comes from one
file. My code, using a similar approach to the above, needs 18 minutes.
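A sketch of that merge direction, in the same illustrative style (the `merge_columns` name and buffer sizes are assumptions, not the original code):

```python
def merge_columns(column_paths, output_path):
    """Merge one-column files back into a single comma-separated file."""
    # One buffered reader per column file.
    files = [open(p, "r", buffering=512 * 1024) for p in column_paths]
    try:
        with open(output_path, "w", buffering=20 * 1024 * 1024) as out:
            # zip reads one line from each column file per output row;
            # all column files are assumed to have the same line count.
            for row in zip(*files):
                out.write(",".join(tok.rstrip("\n") for tok in row) + "\n")
    finally:
        for f in files:
            f.close()
```

The merge is naturally slower than the split: the split does one sequential read and 200 mostly-buffered writes, while the merge interleaves reads from 200 files, so the read side seeks instead of the write side.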
I can't think of a better way. Any suggestions?