Splitting a file from specific column content


Yigit Turgut

Hi all,

I have a text file approximately 20 MB in size containing about one
million lines. I was doing some processing on the data, but then the
data rate increased and it now takes a very long time to process. I
import it using numpy.loadtxt; here is a fragment of the data:

0.000006 -0.0004
0.000071 0.0028
0.000079 0.0044
0.000086 0.0104
..
..
..

The first column is the timestamp in seconds and the second column is
the data. The file contains 8 seconds of measurement, and I would like
to be able to split it into 3 parts separated at specific time
locations. For example, I want to divide the file into 3 parts: the
first containing 3 seconds of data, the second containing 2 seconds,
and the third containing 3 seconds. Splitting based on file size
doesn't work accurately for this data; some columns end up missing,
etc. I need to split depending on the column content:

1 - read the file until the first character of column 1 is 3 (3 seconds)
2 - save this region to another file
3 - read the file where the first character of column 1 is between 3 and 5 (2 seconds)
4 - save this region to another file
5 - read the file where the first character of column 1 is between 5 and 8 (3 seconds)
6 - save this region to another file

I need to do it exactly this way because numpy.loadtxt and genfromtxt
don't deal well with missing columns / rows. I even tried the
invalid_raise parameter of genfromtxt, but no luck.
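
For reference, that genfromtxt call looks roughly like this (the file
name is only a placeholder); with invalid_raise=False it warns about
and skips rows with the wrong number of columns instead of raising:

import numpy as np

# "data.txt" stands in for the real file name; malformed rows are
# skipped with a warning rather than aborting the load.
data = np.genfromtxt("data.txt", invalid_raise=False)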

I am sure it's a few lines of code for experienced users and I would
appreciate some guidance.
 

Roy Smith

Yigit Turgut said:
I have a text file approximately 20 MB in size containing about one
million lines. [snip] The file contains 8 seconds of measurement, and
I would like to be able to split it into 3 parts separated at specific
time locations. For example, I want to divide the file into 3 parts:
the first containing 3 seconds of data, the second containing 2
seconds, and the third containing 3 seconds.

I would do this with standard unix tools:

grep '^[012]' input.txt > first-three-seconds.txt
grep '^[34]' input.txt > next-two-seconds.txt
grep '^[567]' input.txt > next-three-seconds.txt

Sure, it makes three passes over the data, but for 20 MB of data, you
could have the whole job done in less time than it took me to type this.

As a sanity check, I would run "wc -l" on each of the files and confirm
that they add up to the original line count.
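
If you'd rather do that check from Python, a minimal sketch (file
names taken from the grep commands above):

parts = ["first-three-seconds.txt", "next-two-seconds.txt",
         "next-three-seconds.txt"]

# The three parts together should have exactly as many lines as the input.
total = 0
for name in parts:
    with open(name) as f:
        total += sum(1 for _ in f)

with open("input.txt") as f:
    original = sum(1 for _ in f)

print(total, original, total == original)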
 

MRAB

Yigit Turgut wrote:
[snip]
I need to split depending on the column content:

1 - read the file until the first character of column 1 is 3 (3 seconds)
2 - save this region to another file
3 - read the file where the first character of column 1 is between 3 and 5 (2 seconds)
4 - save this region to another file
5 - read the file where the first character of column 1 is between 5 and 8 (3 seconds)
6 - save this region to another file
[snip]
Here's a solution in Python 3:

input_path = "..."
section_1_path = "..."
section_2_path = "..."
section_3_path = "..."

with open(input_path) as input_file:
    try:
        line = next(input_file)

        # Copy section 1.
        with open(section_1_path, "w") as output_file:
            while line[0] < "3":
                output_file.write(line)
                line = next(input_file)

        # Copy section 2.
        with open(section_2_path, "w") as output_file:
            while line[5] < "5":
                output_file.write(line)
                line = next(input_file)

        # Copy section 3.
        with open(section_3_path, "w") as output_file:
            while True:
                output_file.write(line)
                line = next(input_file)
    except StopIteration:
        pass
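
Note that comparing line[0] against "3" compares the first character
of the timestamp as a string, which is enough here because the
timestamps run from 0 to 8 seconds, so the whole-second part is always
a single digit.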
 

Arnaud Delobelle

MRAB wrote:
Here's a solution in Python 3:
[snip]

Or more succinctly (but not tested):


sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]

with open(input_path) as input_file:
    lines = iter(input_file)
    for end, path in sections:
        with open(path, "w") as output_file:
            for line in lines:
                if line >= end:
                    break
                output_file.write(line)
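
The "\xFF" entry for the last section is just a sentinel that compares
greater than any data line, so the final pass copies whatever is left
of the file.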
 

MRAB

On 22/01/2012 15:39, Arnaud Delobelle wrote:
[snip]

Or more succinctly (but not tested):


sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]

with open(input_path) as input_file:
    lines = iter(input_file)
    for end, path in sections:
        with open(path, "w") as output_file:
            for line in lines:
                if line >= end:
                    break
                output_file.write(line)
Consider the condition "line >= end".

If it's true, then control will break out of the inner loop and start
the inner loop again, getting the next line.

But what of the line which caused it to break out? It'll be lost.
 

Yigit Turgut

Roy Smith wrote:
[snip]

I would do this with standard unix tools:

grep '^[012]' input.txt > first-three-seconds.txt
grep '^[34]' input.txt > next-two-seconds.txt
grep '^[567]' input.txt > next-three-seconds.txt

Sure, it makes three passes over the data, but for 20 MB of data, you
could have the whole job done in less time than it took me to type this.

As a sanity check, I would run "wc -l" on each of the files and confirm
that they add up to the original line count.

This works and is very fast but it missed a few hundred lines
unfortunately.

MRAB wrote:

Here's a solution in Python 3:
[snip]

With the following correction:

while line[5] < "5":
should be
while line[0] < "5":

This works well.

Arnaud Delobelle wrote:
[snip]
Or more succinctly (but not tested):

sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]
[snip]

Good idea. Especially when dealing with variable numbers of sections.
But somehow I got:

("5", "section_2")
TypeError: 'tuple' object is not callable
 

MRAB

On 22/01/2012 16:17, Yigit Turgut wrote:
[snip]

Good idea. Especially when dealing with variable numbers of sections.
But somehow I got:

("5", "section_2")
TypeError: 'tuple' object is not callable
That's due to missing commas:

sections = [
    ("3", "section_1"),
    ("5", "section_2"),
    ("\xFF", "section_3")
]
 

Yigit Turgut

MRAB wrote:
[snip]
That's due to missing commas:

sections = [
    ("3", "section_1"),
    ("5", "section_2"),
    ("\xFF", "section_3")
]

Thank you.
 

Roy Smith

Tim Chase wrote:
[snip]

If you wanted to do it in one pass using standard unix tools, you can use:

sed -n -e'/^[0-2]/w first-three.txt' -e'/^[34]/w next-two.txt' -e'/^[5-7]/w next-three.txt'

-tkc

I stand humbled.
 

Tim Chase

If you wanted to do it in one pass using standard unix tools, you can use:

sed -n -e'/^[0-2]/w first-three.txt' -e'/^[34]/w next-two.txt' -e'/^[5-7]/w next-three.txt'

I stand humbled.

In all likelihood, you stand *younger*, not so much humbled ;-)

-tkc
 

Roy Smith

[snip]
I stand humbled.

In all likelihood, you stand *younger*, not so much humbled ;-)

Oh, yeah? That must explain my grey hair and bifocals. I go back to Unix v6 in 1977. Humbled it is.
 

Arnaud Delobelle

MRAB wrote:
[snip]
Consider the condition "line >= end".

If it's true, then control will break out of the inner loop and start
the inner loop again, getting the next line.

But what of the line which caused it to break out? It'll be lost.

Of course you're correct - my reply was too rushed. Here's a
hopefully working version (but still untested :).

sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]

with open(input_path) as input_file:
    line, lines = "", iter(input_file)
    for end, path in sections:
        with open(path, "w") as output_file:
            output_file.write(line)
            for line in lines:
                if line >= end:
                    break
                output_file.write(line)
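
The output_file.write(line) before the inner loop is what carries the
boundary line over: it is a no-op for the first section, since line
starts out empty, and for each later section it writes the line that
ended the previous one.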
 

Yigit Turgut

[snip]
Oh, yeah? That must explain my grey hair and bifocals. I go back to Unix v6 in 1977. Humbled it is.

Those times were much better IMHO :)
 

Eelco

The grep solution is not cross-platform, and not really an answer to a
question about Python.

The line-by-line iteration examples are inefficient and bad practice
from a numpy/vectorization perspective.

I would advise doing it the numpythonic way (untested code):

import numpy as np

breakpoints = [3, 5]  # section boundaries in seconds: 0-3, 3-5, 5-8
data = np.loadtxt('data.txt')
time = data[:, 0]
indices = np.searchsorted(time, breakpoints)
chunks = np.split(data, indices, axis=0)
for i, d in enumerate(chunks):
    np.savetxt('data' + str(i) + '.txt', d)

Not sure how it compares to the grep solution in terms of performance,
but that should be quite a non-issue for 20 MB of data, and it's sure
to blow the line-by-line iteration out of the water. If you want to be
more efficient, you are going to have to cut the text-to-numeric
parsing out of the loop, which is the vast majority of the
computational load here; whether that's possible at all depends on how
structured your timestamps are. In my opinion, there would have to be
a really compelling performance gain to justify throwing the elegance
of the np.split based solution out of the window.
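
One detail to watch if the output files should look like the input:
np.savetxt writes floats as '%.18e' by default, so you may want to
pass an explicit format, for example (column widths guessed from the
data fragment above):

np.savetxt('data' + str(i) + '.txt', d, fmt='%.6f %.4f')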
 

MRAB

Arnaud Delobelle wrote:
[snip]
Of course you're correct - my reply was too rushed. Here's a
hopefully working version (but still untested :).

sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]
[snip]
Missing commas! :)
 
