file IO

Darren Dale

Can anyone explain this?

I have a file called old.dat with two lines:

1
2

So it's 3 bytes long. I run the following:

import os
f = file('old.dat',mode='r')
olddata = f.readlines()
f.close()

f = file('new.dat',mode='w')
f.writelines(olddata)
f.close()

new.dat is now 4 bytes long. ???

I need to reformat and then save some data. Then I need to be able to
export the reformatted data to a spreadsheet-friendly format. But once I
have simply copied (trying to isolate the problem) the file using the
script above, my export function takes 10x as long as it would have with
the original file. And worse, the output has an extra newline character
added at the end of each line. Any suggestions would really be
appreciated; I am going a bit crazy trying to understand this.

Darren
 
Darren Dale

Darren said:
Can anyone explain this?

I have a file called old.dat with two lines:

1
2

So it's 3 bytes long. I run the following:

import os
f = file('old.dat',mode='r')
olddata = f.readlines()
f.close()

f = file('new.dat',mode='w')
f.writelines(olddata)
f.close()

new.dat is now 4 bytes long. ???

I need to reformat and then save some data. Then I need to be able to
export the reformatted data to a spreadsheet-friendly format. But once I
have simply copied (trying to isolate the problem) the file using the
script above, my export function takes 10x as long as it would have with
the original file. And worse, the output has an extra newline character
added at the end of each line. Any suggestions would really be
appreciated; I am going a bit crazy trying to understand this.

Darren

One more bit of info. The extra newline character is added to output
when I open the rewritten file like this:

import os
from mmap import mmap, ACCESS_READ
f = file('foobar.dat',mode='rU')
fd = f.fileno()
m = mmap(fd, os.fstat(fd).st_size, None, ACCESS_READ)
olddata = []
line = m.readline()
while line:
    olddata.append(line)
    line = m.readline()

using mmap to read the original datafile works. Any thoughts? I would
really like to stick with mmap, my datafiles are the right size to
really benefit.

Darren
 
Jeff Epler

Are you using Windows? If so, the answer is almost certainly
"something to do with carriage returns and binary vs text mode". The
lack of a trailing newline on the last line of your example can also
make for additional trouble (my tests on Unix, with stdio, mmap, and
StringIO, never gave me a 4-byte file, but Windows might give you the
file "a\r\nb" when viewed in binary mode and "a\nb" when viewed in
text mode).

I doubt that the mmap module's readline knows whether the file was
opened in universal newline mode; that is a pure Python invention, while
mmap works on a raw file descriptor.
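
Something along these lines should make the translation visible on a
Windows box (I can't check it here, so treat it as a sketch; 'copy.dat'
is just a scratch name for the demonstration):

# text-mode write on Windows turns '\n' into '\r\n' on disk; reading
# the copy back in binary mode shows where the extra byte comes from
data = open('old.dat', 'r').read()    # text-mode read: "1\n2"
open('copy.dat', 'w').write(data)     # text-mode write: '\n' becomes '\r\n'
raw = open('copy.dat', 'rb').read()   # binary mode shows the real bytes
print repr(raw), len(raw)             # expect '1\r\n2' and 4 on Windows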

On Unix, I don't find that a "while" loop with mmap.readline is any
faster than a "for" loop over a file:

[45426 lines, 409305 bytes]
$ timeit -s "..." "readspeed.read_stdio('/usr/share/dict/words')"
10 loops, best of 3: 34.9 msec per loop
$ timeit -s "..." "readspeed.read_mmap('/usr/share/dict/words')"
10 loops, best of 3: 107 msec per loop

[363416 lines, 3274440 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 0.372s user 0.331s sys 0.031s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 1.080s user 1.013s sys 0.021s

[2907328 lines, 26195520 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 2.603s user 2.308s sys 0.157s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 8.514s user 7.893s sys 0.153s

I didn't have any "bigger-than-RAM text files" around to test.

Testing "biggerfile.txt" with mode "rU" gives real 3.110s, so there is
some penalty from using universal newlines.

------------------------------------------------------------------------
# readspeed.py
from mmap import mmap, PROT_READ
import itertools, os

def consume(iterable):
    for j in iterable: pass

def read_stdio(filename):
    f = open(filename) # open(filename, "rU") is slightly slower
    consume(f)

def read_mmap(filename):
    f = open(filename)
    fd = f.fileno()
    m = mmap(fd, os.fstat(fd).st_size, prot=PROT_READ)
    while 1:
        if not m.readline(): break
------------------------------------------------------------------------

 
Darren Dale

Jeff said:
Are you using Windows? If so, the answer is almost certainly
"something to do with carriage returns and binary vs text mode". The
lack of a trailing newline on the last line of your example can also
make for additional trouble (my tests on Unix, with stdio, mmap, and
StringIO, never gave me a 4-byte file, but Windows might give you the
file "a\r\nb" when viewed in binary mode and "a\nb" when viewed in
text mode).

I doubt that the mmap module's readline knows whether the file was
opened in universal newline mode; that is a pure Python invention, while
mmap works on a raw file descriptor.

I am using Windows (for now), and reading files created on a Linux
machine. I think you are right: it has something to do with mmap and the
\r\n Windows convention. Thank you (very much) for your response... I am
sane again.
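
In case it helps anyone else, something along these lines seems to do
what I want while keeping mmap: open the file in binary mode and strip
the '\r' myself. Only lightly tested here, so treat it as a sketch:

import os
from mmap import mmap, ACCESS_READ

f = file('foobar.dat', mode='rb')   # binary mode: mmap sees the raw bytes
fd = f.fileno()
m = mmap(fd, os.fstat(fd).st_size, None, ACCESS_READ)

olddata = []
line = m.readline()
while line:
    # readline returns the raw bytes, so drop '\r\n' or '\n'
    # and normalize each line to end with a single '\n'
    olddata.append(line.rstrip('\r\n') + '\n')
    line = m.readline()
m.close()
f.close()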

Darren
 
Chris

Jeff Epler said:
Are you using Windows? If so, the answer is almost certainly
"something to do with carriage returns and binary vs text mode". The
lack of a trailing newline on the last line of your example can also
make for additional trouble (my tests on Unix, with stdio, mmap, and
StringIO, never gave me a 4-byte file, but Windows might give you the
file "a\r\nb" when viewed in binary mode and "a\nb" when viewed in
text mode).

I doubt that the mmap module's readline knows whether the file was
opened in universal newline mode; that is a pure Python invention, while
mmap works on a raw file descriptor.

On Unix, I don't find that a "while" loop with mmap.readline is any
faster than a "for" loop over a file:

[45426 lines, 409305 bytes]
$ timeit -s "..." "readspeed.read_stdio('/usr/share/dict/words')"
10 loops, best of 3: 34.9 msec per loop
$ timeit -s "..." "readspeed.read_mmap('/usr/share/dict/words')"
10 loops, best of 3: 107 msec per loop

[363416 lines, 3274440 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 0.372s user 0.331s sys 0.031s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 1.080s user 1.013s sys 0.021s

[2907328 lines, 26195520 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 2.603s user 2.308s sys 0.157s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 8.514s user 7.893s sys 0.153s

I didn't have any "bigger-than-RAM text files" around to test.

Testing "biggerfile.txt" with mode "rU" gives real 3.110s, so there is
some penalty from using universal newlines.

------------------------------------------------------------------------
# readspeed.py
from mmap import mmap, PROT_READ
import itertools, os

def consume(iterable):
    for j in iterable: pass

def read_stdio(filename):
    f = open(filename) # open(filename, "rU") is slightly slower
    consume(f)

def read_mmap(filename):
    f = open(filename)
    fd = f.fileno()
    m = mmap(fd, os.fstat(fd).st_size, prot=PROT_READ)
    while 1:
        if not m.readline(): break


I've come across this in C, now that I'm forced to work under XP
(thank you, Cygwin!).

Open the file 'rb' or 'r+b' and you avoid the entire issue of newlines.
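
For the original copy script that would be something like this (not
tested here, but it should carry the bytes across unchanged):

# binary mode on both ends: no newline translation in either direction
f = file('old.dat', mode='rb')
olddata = f.readlines()
f.close()

f = file('new.dat', mode='wb')
f.writelines(olddata)
f.close()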
 
