file IO

Darren Dale

Can anyone explain this?

I have a file called old.dat with two lines:

1
2

So it's 3 bytes long. I run the following:

import os
f = file('old.dat',mode='r')
olddata = f.readlines()
f.close()

f = file('new.dat',mode='w')
f.writelines(olddata)
f.close()

new.dat is now 4 bytes long. ???

I need to reformat and then save some data. Then I need to be able to
export the reformatted data to a spreadsheet-friendly format. But once I
have simply copied (trying to isolate the problem) the file using the
script above, my export function takes 10x as long as it would have with
the original file. And worse, the output has an extra newline character
added at the end of each line. Any suggestions would really be
appreciated; I am going a bit crazy trying to understand this.

Darren
 
Darren Dale

Darren said:
Can anyone explain this?

I have a file called old.dat with two lines:

1
2

So it's 3 bytes long. I run the following:

import os
f = file('old.dat',mode='r')
olddata = f.readlines()
f.close()

f = file('new.dat',mode='w')
f.writelines(olddata)
f.close()

new.dat is now 4 bytes long. ???

I need to reformat and then save some data. Then I need to be able to
export the reformatted data to a spreadsheet-friendly format. But once I
have simply copied (trying to isolate the problem) the file using the
script above, my export function takes 10x as long as it would have with
the original file. And worse, the output has an extra newline character
added at the end of each line. Any suggestions would really be
appreciated; I am going a bit crazy trying to understand this.

Darren

One more bit of info. The extra newline character is added to output
when I open the rewritten file like this:

import os
from mmap import mmap, ACCESS_READ
f = file('foobar.dat',mode='rU')
fd = f.fileno()
m = mmap(fd, os.fstat(fd).st_size, None, ACCESS_READ)
olddata = []
line = m.readline()
while line:
    olddata.append(line)
    line = m.readline()

using mmap to read the original datafile works. Any thoughts? I would
really like to stick with mmap, my datafiles are the right size to
really benefit.

Darren
 
Jeff Epler

Are you using Windows? If so, the answer is almost certainly
"something to do with carriage returns and binary vs text mode". The
lack of a trailing newline on the last line of your example can also
make for additional trouble (my tests on Unix, with stdio, mmap, and
StringIO, never gave me a 4-byte file, but Windows might give you the
file "a\r\nb" when viewed in binary mode and "a\nb" when viewed in
text mode).

I doubt that the mmap module's readline knows whether the file was
opened in universal newline mode; that is a pure Python invention, while
mmap works on a raw file descriptor.
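
Something along these lines should make the translation visible on a
Windows box (I can't check it here, so treat it as a sketch; 'copy.dat'
is just a scratch name for the demonstration):

# text-mode write on Windows turns '\n' into '\r\n' on disk; reading
# the copy back in binary mode shows where the extra byte comes from
data = open('old.dat', 'r').read()    # text-mode read: "1\n2"
open('copy.dat', 'w').write(data)     # text-mode write: '\n' becomes '\r\n'
raw = open('copy.dat', 'rb').read()   # binary mode shows the real bytes
print repr(raw), len(raw)             # expect '1\r\n2' and 4 on Windows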

On Unix, I don't find that a "while" loop with mmap.readline is any
faster than a "for" loop over a file:

[45426 lines, 409305 bytes]
$ timeit -s "..." "readspeed.read_stdio('/usr/share/dict/words')"
10 loops, best of 3: 34.9 msec per loop
$ timeit -s "..." "readspeed.read_mmap('/usr/share/dict/words')"
10 loops, best of 3: 107 msec per loop

[363416 lines, 3274440 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 0.372s user 0.331s sys 0.031s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 1.080s user 1.013s sys 0.021s

[2907328 lines, 26195520 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 2.603s user 2.308s sys 0.157s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 8.514s user 7.893s sys 0.153s

I didn't have any "bigger-than-RAM text files" around to test.

Testing "biggerfile.txt" with mode "rU" gives real 3.110s, so there is
some penalty from using universal newlines.

------------------------------------------------------------------------
# readspeed.py
from mmap import mmap, PROT_READ
import itertools, os

def consume(iterable):
    for j in iterable: pass

def read_stdio(filename):
    f = open(filename) # open(filename, "rU") is slightly slower
    consume(f)

def read_mmap(filename):
    f = open(filename)
    fd = f.fileno()
    m = mmap(fd, os.fstat(fd).st_size, prot=PROT_READ)
    while 1:
        if not m.readline(): break
------------------------------------------------------------------------

 
Darren Dale

Jeff said:
Are you using Windows? If so, the answer is almost certainly
"something to do with carriage returns and binary vs text mode". The
lack of a trailing newline on the last line of your example can also
make for additional trouble (my tests on Unix, with stdio, mmap, and
StringIO, never gave me a 4-byte file, but Windows might give you the
file "a\r\nb" when viewed in binary mode and "a\nb" when viewed in
text mode).

I doubt that the mmap module's readline knows whether the file was
opened in universal newline mode; that is a pure Python invention, while
mmap works on a raw file descriptor.

I am using Windows (for now), and reading files created on a Linux
machine. I think you are right: it has something to do with mmap and the
\r\n Windows convention. Thank you (very much) for your response... I am
sane again.
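
In case it helps anyone else, something along these lines seems to do
what I want while keeping mmap: open the file in binary mode and strip
the '\r' myself. Only lightly tested here, so treat it as a sketch:

import os
from mmap import mmap, ACCESS_READ

f = file('foobar.dat', mode='rb')   # binary mode: mmap sees the raw bytes
fd = f.fileno()
m = mmap(fd, os.fstat(fd).st_size, None, ACCESS_READ)

olddata = []
line = m.readline()
while line:
    # readline returns the raw bytes, so drop '\r\n' or '\n'
    # and normalize each line to end with a single '\n'
    olddata.append(line.rstrip('\r\n') + '\n')
    line = m.readline()
m.close()
f.close()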

Darren
 
Chris

Jeff Epler said:
Are you using Windows? If so, the answer is almost certainly
"something to do with carriage returns and binary vs text mode". The
lack of a trailing newline on the last line of your example can also
make for additional trouble (my tests on Unix, with stdio, mmap, and
StringIO, never gave me a 4-byte file, but Windows might give you the
file "a\r\nb" when viewed in binary mode and "a\nb" when viewed in
text mode).

I doubt that the mmap module's readline knows whether the file was
opened in universal newline mode; that is a pure Python invention, while
mmap works on a raw file descriptor.

On Unix, I don't find that a "while" loop with mmap.readline is any
faster than a "for" loop over a file:

[45426 lines, 409305 bytes]
$ timeit -s "..." "readspeed.read_stdio('/usr/share/dict/words')"
10 loops, best of 3: 34.9 msec per loop
$ timeit -s "..." "readspeed.read_mmap('/usr/share/dict/words')"
10 loops, best of 3: 107 msec per loop

[363416 lines, 3274440 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 0.372s user 0.331s sys 0.031s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 1.080s user 1.013s sys 0.021s

[2907328 lines, 26195520 bytes]
$ time python -c "import readspeed; readspeed.read_stdio('biggerfile.txt')"
real 2.603s user 2.308s sys 0.157s
$ time python -c "import readspeed; readspeed.read_mmap('biggerfile.txt')"
real 8.514s user 7.893s sys 0.153s

I didn't have any "bigger-than-RAM text files" around to test.

Testing "biggerfile.txt" with mode "rU" gives real 3.110s, so there is
some penalty from using universal newlines.

------------------------------------------------------------------------
# readspeed.py
from mmap import mmap, PROT_READ
import itertools, os

def consume(iterable):
    for j in iterable: pass

def read_stdio(filename):
    f = open(filename) # open(filename, "rU") is slightly slower
    consume(f)

def read_mmap(filename):
    f = open(filename)
    fd = f.fileno()
    m = mmap(fd, os.fstat(fd).st_size, prot=PROT_READ)
    while 1:
        if not m.readline(): break


I've come across this in C, now that I'm forced to work under XP
(thank you, Cygwin!).

Open the file 'rb' or 'r+b' and you avoid the entire issue of newlines.
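
For the original copy script that would be something like this (not
tested here, but it should carry the bytes across unchanged):

# binary mode on both ends: no newline translation in either direction
f = file('old.dat', mode='rb')
olddata = f.readlines()
f.close()

f = file('new.dat', mode='wb')
f.writelines(olddata)
f.close()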
 
