Split single file into multiple files based on patterns

satyam · Oct 24, 2012

I have a text file like this

A1980JE39300007 2732 4195 12.527000
A1980JE39300007 3465 9720 22.000000
A1980JE39300007 1853 3278 12.500000
A1980JE39300007 2732 2732 187.500000
A1980JE39300007 19 4688 3.619000
A1980JE39300007 2995 9720 6.667000
A1980JE39300007 1603 9720 30.000000
A1980JE39300007 234 4195 42.416000
A1980JE39300007 2732 9720 18.000000
A1980KK18700010 130 303 4.985000
A1980KK18700010 7 4915 0.435000
A1980KK18700010 25 1620 1.722000
A1980KK18700010 25 186 0.654000
A1980KK18700010 50 130 3.199000
A1980KK18700010 186 3366 4.780000
A1980KK18700010 30 186 1.285000
A1980KK18700010 30 185 4.395000
A1980KK18700010 185 186 9.000000
A1980KK18700010 25 30 3.493000

I want to split the file into multiple files such as A1980JE39300007.txt and A1980KK18700010.txt, where each file contains columns 2, 3 and 4.
Thanks
Satyam

Jason Friedman · Oct 24, 2012

I have a text file like this

A1980JE39300007 2732 4195 12.527000
A1980JE39300007 3465 9720 22.000000
A1980JE39300007 1853 3278 12.500000
A1980JE39300007 2732 2732 187.500000
A1980JE39300007 19 4688 3.619000
A1980KK18700010 30 186 1.285000
A1980KK18700010 30 185 4.395000
A1980KK18700010 185 186 9.000000
A1980KK18700010 25 30 3.493000

I want to split the file and get multiple files like A1980JE39300007.txt and A1980KK18700010.txt, where each file will contain column2, 3 and 4.

Unless your source file is very large this should be sufficient:

$ cat source
A1980JE39300007 2732 4195 12.527000
A1980JE39300007 3465 9720 22.000000
A1980JE39300007 1853 3278 12.500000
A1980JE39300007 2732 2732 187.500000
A1980JE39300007 19 4688 3.619000
A1980JE39300007 2995 9720 6.667000
A1980JE39300007 1603 9720 30.000000
A1980JE39300007 234 4195 42.416000
A1980JE39300007 2732 9720 18.000000
A1980KK18700010 130 303 4.985000
A1980KK18700010 7 4915 0.435000
A1980KK18700010 25 1620 1.722000
A1980KK18700010 25 186 0.654000
A1980KK18700010 50 130 3.199000
A1980KK18700010 186 3366 4.780000
A1980KK18700010 30 186 1.285000
A1980KK18700010 30 185 4.395000
A1980KK18700010 185 186 9.000000
A1980KK18700010 25 30 3.493000

$ python3
Python 3.2.3 (default, Sep 10 2012, 18:14:40)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("source") as reader:
...     for line in reader:
...         file_name, remainder = line.strip().split(None, 1)
...         with open(file_name + ".txt", "a") as writer:
...             print(remainder, file=writer)
...
$ ls *txt
A1980JE39300007.txt A1980KK18700010.txt

$ cat A1980JE39300007.txt
2732 4195 12.527000
3465 9720 22.000000
1853 3278 12.500000
2732 2732 187.500000
19 4688 3.619000
2995 9720 6.667000
1603 9720 30.000000
234 4195 42.416000
2732 9720 18.000000

David Hutto · Oct 24, 2012

I have a text file like this

A1980JE39300007 2732 4195 12.527000
A1980JE39300007 3465 9720 22.000000
A1980JE39300007 1853 3278 12.500000
A1980JE39300007 2732 2732 187.500000
A1980JE39300007 19 4688 3.619000
A1980JE39300007 2995 9720 6.667000
A1980JE39300007 1603 9720 30.000000
A1980JE39300007 234 4195 42.416000
A1980JE39300007 2732 9720 18.000000
A1980KK18700010 130 303 4.985000
A1980KK18700010 7 4915 0.435000
A1980KK18700010 25 1620 1.722000
A1980KK18700010 25 186 0.654000
A1980KK18700010 50 130 3.199000
A1980KK18700010 186 3366 4.780000
A1980KK18700010 30 186 1.285000
A1980KK18700010 30 185 4.395000
A1980KK18700010 185 186 9.000000
A1980KK18700010 25 30 3.493000

I want to split the file and get multiple files like A1980JE39300007.txt and A1980KK18700010.txt, where each file will contain column2, 3 and 4.
Thanks
Satyam

#parse through the lines
turn_text_to_txt = ['A1980JE39300007 2732 4195 12.527000',
                    'A1980JE39300007 3465 9720 22.000000',
                    'A1980JE39300007 1853 3278 12.500000',
                    'A1980JE39300007 2732 2732 187.500000',
                    'A1980JE39300007 19 4688 3.619000',
                    'A1980KK18700010 30 186 1.285000',
                    'A1980KK18700010 30 185 4.395000',
                    'A1980KK18700010 185 186 9.000000',
                    'A1980KK18700010 25 30 3.493000']
#then split each line and open a file for writing to create the file

#then start a count to add an extra number, because some of the files
#you're opening have the same name, which would cause python to
#overwrite the last file with that name.

#So I added an extra integer count after an underscore to keep all
#files, even if they have the same base name.

count = 0

for file_data in turn_text_to_txt:

    #open the file for writing in 'w' mode so it creates the file, and
    #adds in the appropriate data, including the extra count integer just
    #in case there are files with the same name.

    f = open('/home/david/files/%s_%s.txt' % (file_data.split(' ')[0], count), 'w')

    #write the data to the file, however this is in list format, I could
    #go further, but need a little time for a few other things.

    f.write(str(file_data.split(' ')[1:]))

    #close the file
    f.close()

    #increment the count for the next iteration, if necessary, and again,
    #this is just in case the files have the same name, and need an
    #additive.
    count += 1

Full code from above, without comments:

turn_text_to_txt = ['A1980JE39300007 2732 4195 12.527000',
                    'A1980JE39300007 3465 9720 22.000000',
                    'A1980JE39300007 1853 3278 12.500000',
                    'A1980JE39300007 2732 2732 187.500000',
                    'A1980JE39300007 19 4688 3.619000',
                    'A1980KK18700010 30 186 1.285000',
                    'A1980KK18700010 30 185 4.395000',
                    'A1980KK18700010 185 186 9.000000',
                    'A1980KK18700010 25 30 3.493000']
#then split and open a file for writing to create the file
count = 0

for file_data in turn_text_to_txt:

    print('/home/david/files/%s.txt' % (file_data.split(' ')[0]))

    f = open('/home/david/files/%s_%s.txt' % (file_data.split(' ')[0], count), 'w')

    f.write(str(file_data.split(' ')[1:]))

    f.close()

    count += 1

Demian Brecht · Oct 24, 2012

count = 0

Don't use count.

for file_data in turn_text_to_txt:

Use enumerate:

for count, file_data in enumerate(turn_text_to_txt):

f = open('/home/david/files/%s_%s.txt' % (file_data.split(' ')[0], count), 'w')

Use with:

with open('file path', 'w') as f:
    f.write('data')

Not only is it shorter, but it automatically closes the file once you've come out of the inner block, whether successfully or erroneously.

Demian Brecht
@demianbrecht
http://demianbrecht.github.com
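
The two suggestions above (enumerate and with) can be combined in a short sketch. This is not code from the thread: the function name write_numbered_files and the temporary output directory are assumptions made for illustration, and the remainder of each line is written as plain text rather than as a list. Like the code it tidies up, it still produces one file per input line, not one file per key.

```python
import os
import tempfile

# Sample rows copied from the thread.
rows = [
    "A1980JE39300007 2732 4195 12.527000",
    "A1980JE39300007 3465 9720 22.000000",
    "A1980KK18700010 30 186 1.285000",
]


def write_numbered_files(rows, out_dir):
    """Write one file per row, named <key>_<index>.txt, and return the paths."""
    paths = []
    # enumerate supplies the running index, so no manual counter is needed.
    for count, row in enumerate(rows):
        key, remainder = row.split(None, 1)
        path = os.path.join(out_dir, "%s_%s.txt" % (key, count))
        # The with block closes the file even if write() raises.
        with open(path, "w") as f:
            f.write(remainder + "\n")
        paths.append(path)
    return paths


# out_dir is a throwaway temporary directory (an assumption for this sketch).
out_dir = tempfile.mkdtemp()
paths = write_numbered_files(rows, out_dir)
```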

Dennis Lee Bieber · Oct 24, 2012

$ python3
Python 3.2.3 (default, Sep 10 2012, 18:14:40)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("source") as reader:
...     for line in reader:
...         file_name, remainder = line.strip().split(None, 1)
...         with open(file_name + ".txt", "a") as writer:
...             print(remainder, file=writer)

That's a lot of OS file open/closing operations...

I'd be more likely to configure the code as a "standard" "report
control break".

control = None
fin = open("source")
for line in fin:
    newControl, data = line.split(None, 1)  #leave new-line for output
    if control != newControl:  #only open/close files on
                               #change of control break
        if control:
            fout.close()
        fout = open(newControl + ".txt", "a")
        #I'd prefer using "w" IF the input is already sorted
        #that way one knows a new file is created on each run
        #instead of having to delete any existing files from
        #previous runs
        control = newControl
    fout.write(data)
if control:
    fout.close()
fin.close()
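
If the input is not guaranteed to be sorted by the first column, one alternative to the control break is to keep a single open handle per key. The sketch below is an assumption-laden illustration, not code from the thread: the function name split_by_first_column is made up, it relies on contextlib.ExitStack (available from Python 3.3, so newer than the 3.2.3 session shown earlier), and it assumes every non-blank line has at least two columns. It opens each output in "w" mode, so each run starts from fresh files.

```python
import contextlib
import io
import os
import tempfile


def split_by_first_column(lines, out_dir):
    """Write the rest of each line to <out_dir>/<first column>.txt."""
    handles = {}  # key -> open file object
    # ExitStack closes every file registered with it, even if an error occurs.
    with contextlib.ExitStack() as stack:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            key, data = line.split(None, 1)  # data keeps its trailing newline
            if key not in handles:
                path = os.path.join(out_dir, key + ".txt")
                handles[key] = stack.enter_context(open(path, "w"))
            handles[key].write(data)
    return sorted(handles)


# Interleaved keys, to show that sorted input is not required.
sample = io.StringIO(
    "A1980JE39300007 2732 4195 12.527000\n"
    "A1980KK18700010 130 303 4.985000\n"
    "A1980JE39300007 3465 9720 22.000000\n"
)
out_dir = tempfile.mkdtemp()  # throwaway directory for the sketch
keys = split_by_first_column(sample, out_dir)
```

Each file is opened once per run, however the keys are ordered. The trade-off is that a source with very many distinct keys could run into the operating system's limit on open files, in which case the control-break version is the safer choice.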

Alain Ketterlin · Oct 24, 2012

satyam said:
I have a text file like this

A1980JE39300007 2732 4195 12.527000
A1980JE39300007 3465 9720 22.000000
A1980JE39300007 2732 9720 18.000000
A1980KK18700010 130 303 4.985000
A1980KK18700010 7 4915 0.435000 [...]
I want to split the file and get multiple files like
A1980JE39300007.txt and A1980KK18700010.txt, where each file will
contain column2, 3 and 4.

Sorry for being completely off-topic here, but awk has a very convenient
feature to deal with this. Simply use:

awk '{ print $2,$3,$4 > $1".txt"; }' /path/to/your/file

-- Alain.
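
For comparison, a rough Python counterpart of the awk one-liner is sketched here. In awk, the > redirection truncates each output file the first time it is opened during a run and then keeps writing to it, so every run starts from fresh files. The sketch gets a similar effect by collecting rows per key in memory and writing each file once in "w" mode. The function names group_columns and write_groups are hypothetical, and holding everything in memory assumes the source file is not very large.

```python
from collections import defaultdict
import os
import tempfile


def group_columns(lines):
    """Map the first column to a list of 'col2 col3 col4' strings."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or short lines
        groups[fields[0]].append(" ".join(fields[1:4]))
    return groups


def write_groups(groups, out_dir):
    """Write each group to <out_dir>/<key>.txt, replacing any old file."""
    for key, rows in groups.items():
        with open(os.path.join(out_dir, key + ".txt"), "w") as out:
            out.write("\n".join(rows) + "\n")


# Sample rows copied from the thread.
lines = [
    "A1980JE39300007 2732 4195 12.527000",
    "A1980JE39300007 3465 9720 22.000000",
    "A1980KK18700010 130 303 4.985000",
    "A1980KK18700010 7 4915 0.435000",
]
groups = group_columns(lines)
out_dir = tempfile.mkdtemp()  # throwaway directory for the sketch
write_groups(groups, out_dir)
```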

Mark Lawrence · Oct 24, 2012

Alain Ketterlin said:
satyam said:

I have a text file like this

A1980JE39300007 2732 4195 12.527000
A1980JE39300007 3465 9720 22.000000
A1980JE39300007 2732 9720 18.000000
A1980KK18700010 130 303 4.985000
A1980KK18700010 7 4915 0.435000 [...]
I want to split the file and get multiple files like
A1980JE39300007.txt and A1980KK18700010.txt, where each file will
contain column2, 3 and 4.

Sorry for being completely off-topic here, but awk has a very convenient
feature to deal with this. Simply use:

awk '{ print $2,$3,$4 > $1".txt"; }' /path/to/your/file

-- Alain.

Although practicality beats purity

Steven D'Aprano · Oct 24, 2012

I have a text file like this

A1980JE39300007 2732 4195 12.527000 [...]

I want to split the file and get multiple files like A1980JE39300007.txt
and A1980KK18700010.txt, where each file will contain column2, 3 and 4.

Are you just excited and want to tell everyone, or do you actually have a
question?

Have you tried to write some code, or do you just expect others to do
your work for you?

If so, I see that your expectation was correct.

David Hutto · Oct 24, 2012

I have a text file like this

A1980JE39300007 2732 4195 12.527000 [...]

I want to split the file and get multiple files like A1980JE39300007.txt
and A1980KK18700010.txt, where each file will contain column2, 3 and 4.

Are you just excited and want to tell everyone, or do you actually have a
question?

Have you tried to write some code, or do you just expect others to do
your work for you?

If so, I see that your expectation was correct.

Some people learn better from a full example than from any small
challenge thrown in along the way.

I think it should be a little of both, especially if you (acting as an
algorithmist for the OP) only have enough time to throw out untested
pseudo code.

Peter Otten · Oct 24, 2012

satyam said:
I have a text file like this

A1980JE39300007 2732 4195 12.527000
A1980JE39300007 3465 9720 22.000000
A1980JE39300007 1853 3278 12.500000
A1980JE39300007 2732 2732 187.500000
A1980JE39300007 19 4688 3.619000
A1980JE39300007 2995 9720 6.667000
A1980JE39300007 1603 9720 30.000000
A1980JE39300007 234 4195 42.416000
A1980JE39300007 2732 9720 18.000000
A1980KK18700010 130 303 4.985000
A1980KK18700010 7 4915 0.435000
A1980KK18700010 25 1620 1.722000
A1980KK18700010 25 186 0.654000
A1980KK18700010 50 130 3.199000
A1980KK18700010 186 3366 4.780000
A1980KK18700010 30 186 1.285000
A1980KK18700010 30 185 4.395000
A1980KK18700010 185 186 9.000000
A1980KK18700010 25 30 3.493000

I want to split the file and get multiple files like A1980JE39300007.txt
and A1980KK18700010.txt, where each file will contain column2, 3 and 4.
Thanks Satyam

import os
from itertools import groupby
from operator import itemgetter

get_key = itemgetter(0)
get_value = itemgetter(1)

output_folder = "tmp"
with open("infile.txt") as instream:
    pairs = (line.split(None, 1) for line in instream)
    for key, group in groupby(pairs, key=get_key):
        path = os.path.join(output_folder, key + ".txt")
        with open(path, "a") as outstream:
            outstream.writelines(get_value(line) for line in group)

If you are running the code more than once make sure that you remove the
files from the previous run first.
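
One detail worth illustrating: itertools.groupby only merges neighbouring items with equal keys. Presumably that is why the code above opens files in "a" mode: a key that reappears later in the input starts a second group, and append mode keeps the earlier rows. It is also why leftover files from a previous run would be appended to. The sketch below uses made-up interleaved sample rows to show the behaviour, and sorting by key first as one possible way to get exactly one group per key.

```python
from itertools import groupby
from operator import itemgetter

# Interleaved keys: the first key shows up again after the second one.
lines = [
    "A1980JE39300007 2732 4195 12.527000\n",
    "A1980KK18700010 130 303 4.985000\n",
    "A1980JE39300007 3465 9720 22.000000\n",
]
pairs = [line.split(None, 1) for line in lines]

# groupby only merges neighbours, so the interleaved key appears twice.
unsorted_keys = [key for key, _ in groupby(pairs, key=itemgetter(0))]

# sorted() is stable, so rows keep their original order within each key.
sorted_pairs = sorted(pairs, key=itemgetter(0))
grouped = {
    key: [value for _, value in group]
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
}
```

With one group per key, each output file could be opened in "w" mode, which would remove the need to delete files from earlier runs. Sorting does require reading the whole input into memory first.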
