reading large file

guillaume

I have to read and process a large ASCII file containing a mesh: a
list of points and triangles.
The file is 100 MB.

I first tried to do it in memory, but I think I am running out of
memory, so I decided to use the shelve
module to store my points and elements on disk.
However, it is slow. Any hints? I think I still have the same
memory problem, but I don't understand why,
since my aPoint should be removed by the gc.

Do you have any idea?

Thanks

Guillaume

PS: here is the code for your info

import string
import os
import sys
import time
import resource
import shelve
import psyco

psyco.full()

class point:
    def __init__(self,x,y,z):
        self.x = x
        self.y = y
        self.z = z


def SFMImport(filename):
    print 'UNV Import ("%s")' % filename

    db = shelve.open('points.db')

    file = open(filename, "r")

    linenumber = 1
    nbpoints = 0
    nbfaces = 0

    pointList = []
    faceList = []

    # header line: triangle count first, then point count
    line = file.readline()
    words = string.split(line)
    nbpoints = string.atoi(words[1])
    nbtrias = string.atoi(words[0])

    print "found %s points and %s triangles" % (nbpoints, nbtrias)

    t0 = time.time()    # start of the whole point pass
    t1 = time.time()    # start of the current progress interval
    for i in range(nbpoints):
        line = file.readline()
        words = string.split(line)

        # Fortran-style exponents ("1.0D+00") must become "1.0E+00" for atof
        x = string.atof(words[1].replace("D","E"))
        y = string.atof(words[2].replace("D","E"))
        z = string.atof(words[3].replace("D","E"))

        aPoint = point(x, y, z)

        # "as" is a reserved word from Python 2.6 on, so use another name
        key = "point%s" % i

        if (i%250000 == 0):
            print "%7d points <%s>" % (i, time.time() - t1)
            t1 = time.time()

        db[key] = aPoint

    print "%s points read in %s seconds" % (nbpoints, time.time() - t0)
    db.close()

    t1 = time.time()
    t2 = time.time()
    for i in range(nbtrias):
        line = file.readline()
        words = string.split(line)

        i1 = string.atoi(words[0])
        i2 = string.atoi(words[1])
        i3 = string.atoi(words[2])

        faceList.append((i1,i2,i3))

        if (i%100000 == 0):
            print "%s faces <%s>" % (i, time.time() - t1)
            t1 = time.time()

    print "%s faces read in %s seconds" % (nbtrias, time.time() - t2)

    file.close()

def callback(fs):
    filename = fs.filename
    SFMImport(filename)


if __name__ == "__main__":
    # try:
    #     import GUI
    # except:
    #     print "This script is only working with the new GUI module ...."
    # else:
    #     fs = GUI.FileSelector()
    #     fs.activate(callback, fs)
    print sys.argv[0]
    SFMImport(sys.argv[1])
 
Michael Peuser

guillaume said:
I have to read and process a large ASCII file containing a mesh: a
list of points and triangles.
The file is 100 MB.

I first tried to do it in memory, but I think I am running out of
memory, so I decided to use the shelve
module to store my points and elements on disk.
However, it is slow. Any hints? I think I still have the same
memory problem, but I don't understand why,
since my aPoint should be removed by the gc.

What do you expect from shelve? I would recommend converting your data into
a binary format in a first pass (doing all the atoi() calls in this pre-pass),
then using memory-mapped file access to read it in your work pass.

But maybe you need a lot of memory for your internal structures as well. If
you have little RAM (less than 512 MB), the system could do a lot of swapping. You
will notice that when the processor load goes down! The cheapest solution
is generally to double your RAM.

Kindly
Michael P
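
The two-pass idea Michael describes can be sketched roughly as follows, in present-day Python 3 rather than the Python 2 of the original script. This is only an illustration under assumptions: the record layout (three little-endian doubles per point), the helper names convert_points and read_point, and the sample lines are all hypothetical, and the first word of each line is taken to be a point index, as the original code suggests.

```python
import mmap
import os
import struct
import tempfile

# Assumed record layout: x, y, z stored as three little-endian doubles.
POINT = struct.Struct("<3d")


def convert_points(lines, out_path):
    """Pre-pass: parse the ASCII point lines once and write packed records."""
    count = 0
    with open(out_path, "wb") as out:
        for line in lines:
            words = line.split()
            # words[0] is assumed to be the point index, as in the original script.
            x, y, z = (float(w.replace("D", "E")) for w in words[1:4])
            out.write(POINT.pack(x, y, z))
            count += 1
    return count


def read_point(mapped, index):
    """Work pass: decode one point directly from the memory-mapped file."""
    return POINT.unpack_from(mapped, index * POINT.size)


# Hypothetical sample data standing in for the real mesh file.
sample = ["1 0.0D+00 1.0D+00 2.0D+00", "2 1.5D+01 -2.5D-01 3.0D+00"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "points.bin")
    n = convert_points(sample, path)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            second = read_point(mapped, 1)

print(n, second)
```

With a fixed record size, point i lives at byte offset i * POINT.size, so the work pass can fetch any point without parsing text and without holding all points as Python objects.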
 
Paul Rubin

print "found %s points and %s triangles" % (nbpoints, nbtrias)

t1 = time.time()
for i in range(nbpoints):

For another thing, use xrange instead of range here.
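
For context: in Python 2, which the script above uses, range(n) builds a complete list of n integers before the loop starts, while xrange(n) produces the values one at a time. In Python 3, xrange no longer exists and range itself is lazy. The following Python 3 sketch only illustrates the difference in object size; the value of n is arbitrary.

```python
import sys

n = 1000000

eager = list(range(n))  # comparable to Python 2's range(n): a full list
lazy = range(n)         # comparable to Python 2's xrange(n): values on demand

# getsizeof reports the container itself, not the integers it refers to.
print(sys.getsizeof(eager), sys.getsizeof(lazy))

total = 0
for i in lazy:          # looping over either object visits the same values
    total += i
```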
 
Bengt Richter

I have to read and process a large ASCII file containing a mesh: a
list of points and triangles.
The file is 100 MB.

I first tried to do it in memory, but I think I am running out of
memory, so I decided to use the shelve
module to store my points and elements on disk.
However, it is slow. Any hints? I think I still have the same
memory problem, but I don't understand why,
since my aPoint should be removed by the gc.

Do you have any idea?

Since your data is very homogeneous, why don't you store it in a couple of
homogeneous arrays? You could easily create a class to give you convenient
access via indices or iterators etc. Also you could write load and store
methods that could write both arrays in binary to a file. You could
consider doing this as a separate conversion from your source file, and
then run your app using the binary files and wrapper class.

Arrays are described in the array module docs ;-)
I imagine you'd want to use the 'd' type for points and 'l' for faces.

Regards,
Bengt Richter
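
A minimal sketch of the wrapper Bengt suggests might look like this in Python 3. The class name Mesh, its method names, and the file layout (a two-item header followed by both arrays) are assumptions made for illustration; they do not come from the thread. Note that the 'l' typecode has a platform-dependent size, so a file written this way is only meant to be read back on the same kind of machine.

```python
import os
import tempfile
from array import array


class Mesh:
    """Hypothetical wrapper around two flat, homogeneous arrays."""

    def __init__(self):
        self.coords = array("d")  # x0, y0, z0, x1, y1, z1, ...
        self.faces = array("l")   # i0, j0, k0, i1, j1, k1, ...

    def add_point(self, x, y, z):
        self.coords.extend((x, y, z))

    def add_face(self, i, j, k):
        self.faces.extend((i, j, k))

    def point(self, n):
        return tuple(self.coords[3 * n:3 * n + 3])

    def face(self, n):
        return tuple(self.faces[3 * n:3 * n + 3])

    def store(self, path):
        with open(path, "wb") as f:
            # Header: item counts, so load() knows how much to read back.
            array("l", [len(self.coords), len(self.faces)]).tofile(f)
            self.coords.tofile(f)
            self.faces.tofile(f)

    @classmethod
    def load(cls, path):
        mesh = cls()
        with open(path, "rb") as f:
            header = array("l")
            header.fromfile(f, 2)
            mesh.coords.fromfile(f, header[0])
            mesh.faces.fromfile(f, header[1])
        return mesh


# Tiny example: one triangle, written to disk and read back.
mesh = Mesh()
mesh.add_point(0.0, 0.0, 0.0)
mesh.add_point(1.0, 0.0, 0.0)
mesh.add_point(0.0, 1.0, 0.0)
mesh.add_face(0, 1, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mesh.bin")
    mesh.store(path)
    loaded = Mesh.load(path)

print(loaded.point(1), loaded.face(0))
```

Each point costs three machine doubles here instead of a full Python object plus a database entry, which is the main reason this layout tends to use far less memory than one object per point.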
 
Sophie Alléon

Thanks to your comments, it is now possible to read my large file in a
couple of minutes on my machine.

Guillaume
 
Bengt Richter

Thanks to your comments, it is now possible to read my large file in a
couple of minutes on my machine.

Guillaume

Well, so long as you're happy, glad to have played a role ;-)

But I would think that time could still be cut a fair amount. E.g., I imagine just copying
your file at the command line might take 20-25 sec, depending on your system,
and if you have a fast processor, you should be I/O bound much of the time, so most of
the conversions etc. should be able to happen while waiting for the disk.

There doesn't seem to be any way to tell the array module an estimated (over or exact) final capacity
for an array yet to be populated, but I would think such a feature would be good
for your kind of application. Hopefully the fromfile method grows the array with a single
memory allocation, but you can't use it if your data requires conversion or filtering. (scanf/printf-style
per-line conversion from/to ASCII files might be another useful feature?)

Anyway, even as is, I'd bet we could get the time down to under a minute, if it was important.
Of course, a couple of minutes is not bad if you're not going to do it over and over.

Regards,
Bengt Richter
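
One possible workaround for the missing capacity hint, sketched here in Python 3: since this file announces the point count on its first line, the array can be created at its final size from a zero-filled bytes object and then filled by index. The function name read_points and the sample lines are hypothetical, and the first word of each point line is again assumed to be an index.

```python
from array import array


def read_points(lines, nbpoints):
    """Fill a pre-sized array of doubles from ASCII point lines."""
    itemsize = array("d").itemsize
    # Zero bytes decode to 0.0, so this creates 3 * nbpoints zeros up front
    # and avoids growing the array repeatedly while parsing.
    coords = array("d", bytes(itemsize * 3 * nbpoints))
    # range comes first in zip, so no extra line is consumed once it runs out.
    for n, line in zip(range(nbpoints), lines):
        words = line.split()
        base = 3 * n
        # words[0] is assumed to be the point index, as in the original script.
        coords[base] = float(words[1].replace("D", "E"))
        coords[base + 1] = float(words[2].replace("D", "E"))
        coords[base + 2] = float(words[3].replace("D", "E"))
    return coords


# Hypothetical sample: two point lines followed by one triangle line.
sample = iter([
    "1 0.0D+00 1.0D+00 2.0D+00",
    "2 1.5D+01 -2.5D-01 3.0D+00",
    "0 1 0",
])
coords = read_points(sample, 2)
remaining = next(sample)  # the triangle line is still unread

print(list(coords), remaining)
```

Because the point loop stops without pulling an extra line, the same iterator can be handed straight to a second function that reads the triangles.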
 
Adam Przybyla

guillaume said:
I have to read and process a large ASCII file containing a mesh: a
list of points and triangles.
The file is 100 MB.

I first tried to do it in memory, but I think I am running out of
memory, so I decided to use the shelve
module to store my points and elements on disk.
However, it is slow. Any hints? I think I still have the same
memory problem, but I don't understand why,
since my aPoint should be removed by the gc.

Do you have any idea?

... try PyTables ;-) Regards
Adam Przybyla
 
