reading file contents to an array (newbie)

Darren Dale

Hello,

I am trying to learn how to read a delimited file into a numarray. I
have a working strategy, but it doesn't seem very elegant: manually
changing whitespace delimiters to commas, evaluating each line into a
tuple, and creating a numarray object from the resulting list.

Could I get some suggestions on how to do this more Pythonically? I have
to read pretty large files, so this approach is probably way too slow.
Here is my code:

from numarray import *
myFile = file('test.dat', mode='rt')
tempData = myFile.readlines()
data = []
for line in tempData:
    line = line.replace(' ', ',')
    line = line.replace('\n', '')
    data.append(eval(line))
data = array(data)

Thanks...
 
Jeff Epler

Here's my solution, using 2.3's "csv" module. Unfortunately, it holds
the whole array in Python lists to pass to the array() constructor.


$ cat dale.ssv
1 2 3
4.0 5.9 6.0e23
$ python dale.py
[[ 1.00000000e+00  2.00000000e+00  3.00000000e+00]
 [ 4.00000000e+00  5.90000000e+00  6.00000000e+23]]
$ cat dale.py
import csv
from numarray import array, Float64

def sniff(f, delimiters=None):
    # grab a small sample from the start of the file, then rewind
    sample = "".join(f.readlines(3))
    f.seek(0, 0)
    return csv.Sniffer().sniff(sample, delimiters)

def file_to_array(f, dialect=None, kind=Float64, conv=float):
    if dialect is None:
        dialect = sniff(f)
    rows = csv.reader(f, dialect)
    return array([map(conv, row) for row in rows], kind)

print file_to_array(open("dale.ssv"))
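If the delimiter is known up front, the sniffing step can be skipped
entirely, since csv.reader accepts format parameters directly. A minimal
sketch along the same lines, reusing dale.ssv from above (note that with
delimiter=' ', runs of multiple spaces would produce empty fields):

import csv
from numarray import array, Float64

# no Sniffer needed: tell the reader the delimiter explicitly
rows = csv.reader(open("dale.ssv"), delimiter=' ')
print array([map(float, row) for row in rows], Float64)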

 
Christopher T King

> Could I get some suggestions on how to do this more Pythonically? I have
> to read pretty large files, so this approach is probably way too slow.
> Here is my code:
>
> from numarray import *
> myFile = file('test.dat', mode='rt')
> tempData = myFile.readlines()
> data = []
> for line in tempData:
>     line = line.replace(' ', ',')
>     line = line.replace('\n', '')
>     data.append(eval(line))
> data = array(data)

First speedup:

Rather than replacing spaces with commas and evaluating the output, use
str.split to split the line up into pieces:

for line in tempData:
    temp = []
    for value in line.split():
        temp.append(int(value))
    data.append(temp)

Second speedup:

Rewrite what I just wrote, this time using a list comprehension. This is
a bit harder to read, but much more efficient:

for line in tempData:
    data.append([int(value) for value in line.split()])

Third speedup:

You don't need to read all the data in from the file beforehand; rather,
you can just write this:

myFile = file('test.dat', mode='rt')
data = []
for line in myFile:
    data.append([float(value) for value in line.split()])

Fourth speedup:

Replace the entire for loop with another list comprehension (this is
starting to get a bit ridiculous, sorry :)):

myFile = file('test.dat', mode='rt')
data = [[float(value) for value in line.split()] for line in myFile]

For greater readability (at the cost of some speed), I might suggest
writing the above using a helper function, so your final version looks
like this:

from numarray import *

def parseline(line):
    return [float(value) for value in line.split()]

myFile = file('test.dat', mode='rt')
data = array([parseline(line) for line in myFile])

Hope this helps (and my quick intro to list comprehensions was somewhat
understandable :p).
 
John Lenton

> For greater readability (at the cost of some speed), I might suggest
> writing the above using a helper function, so your final version looks
> like this:
>
> from numarray import *
>
> def parseline(line):
>     return [float(value) for value in line.split()]
>
> myFile = file('test.dat', mode='rt')
> data = array([parseline(line) for line in myFile])

actually, I find the following more readable, and even faster:

from mmap import mmap, MAP_PRIVATE, PROT_READ
from os import fstat

f = file('test.dat', mode='rt')
fd = f.fileno()
m = mmap(fd, fstat(fd).st_size, MAP_PRIVATE, PROT_READ)

data = []
while True:
    line = m.readline()
    if not line:
        break
    data.extend(map(float, line.split()))

of course the speedup is because of mmap, not because of faster Python
code; however, remember that (once you've got rid of the evil eval) this
is an IO-bound task, so anything you do to speed up the IO (like the
mmap) is a gain. If mmap returned something you could iterate over, you
could probably shave another second off (I shaved 3 seconds off your
example with this, and your example shaved 11 seconds off the original:
on my machine, with my data, and with my wife asking for the computer).

(I'd replace the map with a list comprehension as soon as the conversion
function stopped being implemented in C.)
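Python's mmap objects aren't line iterators, but a small generator can
fake it. A minimal sketch, assuming m is the mmap object from the
snippet above (iterlines is a made-up helper name, not part of any
library):

def iterlines(m):
    # made-up helper: yield lines from an mmap object until EOF
    while True:
        line = m.readline()
        if not line:
            break
        yield line

data = []
for line in iterlines(m):
    data.extend(map(float, line.split()))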

I'd talk about numarray.memmap if I knew it were going to be useful,
but as I don't, I won't.



PS: use mmap! it's not the '70s any more!
 
Christopher T King

> PS: use mmap! it's not the '70s any more!

That's what GNU/HURD thought, and look where it got them ;)

(HURD mmap()s entire partitions, and thusly can't access partitions
greater than 2GB, or whatever the usable address space may be.)

I highly doubt one would reach that 2GB limit in day-to-day file accesses,
though.
 
Darren Dale

> actually, I find the following more readable, and even faster:
>
> from mmap import mmap, MAP_PRIVATE, PROT_READ
> from os import fstat
>
> f = file('test.dat', mode='rt')
> fd = f.fileno()
> m = mmap(fd, fstat(fd).st_size, MAP_PRIVATE, PROT_READ)
>
> data = []
> while True:
>     line = m.readline()
>     if not line:
>         break
>     data.extend(map(float, line.split()))

I will try this out. I am reading three sets of data from a file with
lengthy headers, so it looks like mmap is a really good solution. Thanks
to Jeff and Chris as well for teaching me something new.

One more thing: I am reading into arrays that can be 5000 cells wide
and arbitrarily long (time-resolved scientific data). The datafiles are
organized such that only 16 columns are listed on a line, and a '\'
character indicates that the row continues on the next line of the file.
Do you have ideas of how to quickly reconstruct these rows? I think mmap
gets me half the way there, but should I try to avoid testing each
readline for the presence of a '\' character?
 
John Lenton

> One more thing: I am reading into arrays that can be 5000 cells wide
> and arbitrarily long (time-resolved scientific data). The datafiles are
> organized such that only 16 columns are listed on a line, and a '\'
> character indicates that the row continues on the next line of the file.
> Do you have ideas of how to quickly reconstruct these rows? I think mmap
> gets me half the way there, but should I try to avoid testing each
> readline for the presence of a '\' character?

you can either build up the whole logical line and then split it:

line = m.readline()
while line.endswith("\\\n"):
    line = line[:-2] + m.readline()
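In context, that string-building approach might look something like the
following sketch, assuming m is the mmap object from earlier in the
thread:

data = []
while True:
    line = m.readline()
    if not line:
        break
    # splice continuation lines: [:-2] drops the trailing '\' and newline
    while line.endswith("\\\n"):
        line = line[:-2] + m.readline()
    data.append(map(float, line.split()))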

Or, you could build each row bit by bit, which gets a bit messy but
might be faster (you'd have to test it):

more = True
while more:
    row = []
    while True:
        line = m.readline()
        if not line:
            more = False  # end of file; stop the outer loop too
            break
        # [:16] drops the trailing '\' token from continuation lines
        row.extend(map(float, line.split()[:16]))
        if not line.endswith("\\\n"):
            break
    if row:
        data.append(row)

as usual in this kind of example, the code is very fragile, and you'll
want to generalize it a bit before using it in production.
 
Scott David Daniels

John said:
> PS: use mmap! it's not the '70s any more!

This is hilarious. I first learned about and used memory-mapped I/O in
the '70s on Tenex systems. One of our SAIL compiler's great sources of
speed was that it secretly did all file I/O with memory-mapped files.
The system call, or "JSYS" more precisely, was named MMAP. I think the
odds are at least even that MMAP was _invented_ in the '70s.
 
Darren Dale

> If mmap returned something you could iterate over, you could probably
> shave another second off (I shaved 3 seconds off your example with
> this, and your example shaved 11 seconds off the original: on my
> machine, with my data, and with my wife asking for the computer).

mmap turned out to work really well for me. I cut the time down again by
writing to a buffer 100 rows long, and appending when the buffer fills.
Especially when an array is big, it costs a lot of time to reallocate
the memory required to grow an array.
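A minimal sketch of that buffering idea, reusing the hypothetical
iterlines helper from earlier in the thread (BUFSIZE and the other names
are assumptions, not Darren's actual code):

from numarray import array, concatenate

BUFSIZE = 100   # rows to collect before converting to an array
chunks = []     # finished numarray blocks
buf = []        # plain Python rows waiting to be converted
for line in iterlines(m):
    buf.append([float(v) for v in line.split()])
    if len(buf) == BUFSIZE:
        chunks.append(array(buf))   # convert a whole block at once
        buf = []
if buf:
    chunks.append(array(buf))
data = concatenate(chunks)          # grow the big array only once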
 
