write a 20GB file

J

Jackie Lee

Hello there,

I have a 22 GB binary file, and I want to change values at specific
positions. Because of the volume of the file, I doubt my code is an
efficient one:

#! /usr/bin/env python
#coding=utf-8
import sys
import struct

try:
    f=open(sys.argv[1],'rb+')
except (IOError,Exception):
    print '''usage:
        scriptname segyfilename
'''
    sys.exit(1)

#skip EBCDIC header
try:
    f.seek(3200)
except Exception:
    print 'Oops! your file is broken..'

#read binary header
binhead = f.read(400)
ns = struct.unpack('>h',binhead[20:22])[0]
if ns < 0:
    print 'file read error'
    sys.exit(1)

#read trace header
while True:
    f.seek(28,1)
    f.write(struct.pack('>h',1))
    f.seek(212,1)
    f.seek(ns*4,1)

f.close()
 
D

Dave Angel

Jackie said:
Hello there,

I have a 22 GB binary file, and I want to change values at specific
positions. Because of the volume of the file, I doubt my code is an
efficient one:

<snip>

I don't see a question anywhere. So perhaps you just want comments on
your code.

1) How do you plan to test this?
2) Consider doing a lot more checking to see that you have in fact a
file of the right type.
3) Fix indentation - perhaps you've accidentally used a tab in the source.
4) Provide a termination condition for the while True loop, which
currently will (I think) go forever, or perhaps until the disk fills up.
5) Depending on the purpose of this file, you should consider making the
changes on a copy, then deleting and renaming (see the sketch after this
list). As it stands, if the
program gets aborted part way through, there's no way to know how far it
got. Since it's just clobbering bytes, it would be safe to rerun the
same program again, but many times that's not the case. And this
program clearly isn't finished yet, so perhaps it's not true here either.
6) I don't see anything inefficient about it. The nature of the problem
is going to be very slow (for small values of ns), but I don't know what
your code could do to speed it up. Perhaps make sure the file is on a
fast drive, and not RAID 5.
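
For point 5, the copy/patch/rename pattern might look roughly like this (only
a sketch - the scratch name is made up, and copying a 22 GB file will itself
take a while):

import os
import shutil
import sys

src = sys.argv[1]            # the SEG-Y file passed on the command line
tmp = src + '.tmp'           # hypothetical scratch name

shutil.copyfile(src, tmp)    # work on a copy, never the original

f = open(tmp, 'rb+')
# ... seek around and patch the copy here ...
f.close()

os.remove(src)               # delete, then rename, as in point 5
os.rename(tmp, src)          # (on POSIX, os.rename alone would replace it)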

DaveA
 
J

Jackie Lee

Thx, Dave,

The code works fine. I just don't know how f.write works. It says that
file.write won't write to the file until file.close or file.flush. So I
don't know if the following one is more efficient (sorry, I forgot to
add a condition to break the loop):

#! /usr/bin/env python
#coding=utf-8
import sys
import struct

try:
    f=open(sys.argv[1],'rb+')
except (IOError,Exception):
    print '''usage:
        scriptname segyfilename
'''
    sys.exit(1)

#skip EBCDIC header
try:
    f.seek(3200)
except Exception:
    print 'Oops! your file is broken..'

#read binary header
binhead = f.read(400)
ns = struct.unpack('>h',binhead[20:22])[0]
if ns < 0:
    print 'file read error'
    sys.exit(1)

#read trace header
while True:
    f.seek(28,1)
    if f.read(2) == '':
        break
    f.seek(-2,1)
    f.write(struct.pack('>h',1))
    f.seek(210,1)
    f.seek(ns*4,1)

f.close()


 
J

J

Thx, Dave,

The code works fine. I just don't know how f.write works. It says that
file.write won't write to the file until file.close or file.flush. So I
don't know if the following one is more efficient (sorry, I forgot to
add a condition to break the loop):

Someone smarter than me can correct me, but file.write() will write
when its buffer is filled, or when close() or flush() is called.
I don't know what the default buffer size for file.write() is, though.
close() flushes the buffer before closing the file, and flush()
flushes the buffer and leaves the file open for further writing.
try:
    f=open(sys.argv[1],'rb+')
except (IOError,Exception):
    print '''usage:
        scriptname segyfilename
'''

You can just add an f.flush() every time you write to the file, but I
tend to open files with a buffer size of 0, like this:

f = open(filename,"rb+",0)

Then again, I don't deal with files of that size, so there could be a
problem with my way once you start scaling up to the 20GB or larger
that you're working with.

Again, I could be wrong about all of that, so if so, I hope someone
will correct me and fix my understanding...

Cheers,

Jeff
 
M

Martin v. Loewis

The code works fine. I just don't know how f.write works. It says that
file.write won't write to the file until file.close or file.flush.

You are misinterpreting the documentation. It certainly won't keep the
entire file in memory. Instead, it has a fixed-size buffer (something
like 8kiB or 32kiB) in which it writes and which it flushes when that
buffer is full.

The comment about flush and close merely refers to the problem that some
data may still be in the buffer at any point in time, unless you just
called close or flush.
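
A quick way to see that buffer at work (only a sketch - the test file name
is made up, and the exact buffer size varies by platform):

import os

f = open('scratch.bin', 'wb')         # throwaway test file
f.write('x' * 100)                    # 100 bytes: they sit in the buffer
print os.path.getsize('scratch.bin')  # almost certainly still 0
f.flush()                             # push the buffer out to the OS
print os.path.getsize('scratch.bin')  # now 100
f.close()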

HTH,
Martin
 
N

Nobody

Someone smarter than me can correct me, but file.write() will write when
its buffer is filled, or when close() or flush() is called.

And, in all probability, seek() will either flush it immediately or cause
the next write() to flush it before writing anything.
 
J

J

And, in all probability, seek() will either flush it immediately or cause
the next write() to flush it before writing anything.

Ahhh... I didn't know that... I thought seek() just moved the pointer
through the file a little further....

Cool.
 
J

Jackie Lee

Thanks to y'all. I should have been more careful reading the documentation.

Cheers
 
N

Nobody

Ahhh... I didn't know that... I thought seek() just moved the pointer
through the file a little further....

Think about how this affects buffering. write() writes at the current file
position. If you write, then seek, then write, it can't just concatenate
the two sets of data, as that would "lose" the seek.

Either the buffer has to contain multiple, distinct sets of data, each
with an associated position, or (far more likely), the original data must
be written to the correct location before the second set of data can be
stored.
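
So a write/seek/write sequence like the one in the trace loop behaves
roughly like this (only a sketch, with a hypothetical file and offsets):

import struct

f = open('data.segy', 'rb+')       # hypothetical existing file
f.write(struct.pack('>h', 1))      # two bytes land in the buffer
f.seek(238, 1)                     # before the position moves, the buffered
                                   # bytes get written at the old offset
                                   # (either now or on the next write)
f.write(struct.pack('>h', 1))      # starts filling the buffer again
f.close()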
 
D

Dave Angel

Nathan said:
This is precisely the situation mmap was made for :) It has almost the same
methods as a file so it should be an easy replacement.

<snip>

Only on a 64bit system, and I'm not sure it's even possible there in
every case. On a 32bit system, it would be impossible to mmap a 20gb
file. You only have 4gb of address space to play with, total.

DaveA
 
P

Patrick Maupin

Only on a 64bit system, and I'm not sure it's even possible there in
every case.  On a 32bit system, it would be impossible to mmap a 20gb
file.  You only have 4gb of address space to play with, total.

DaveA

Well, depending on the OS, I think you could have multiple mappings
per file. So you could maintain your own mapping cache. That could
get a bit ugly, but depending on what you are doing, it might not be
too bad.
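
A rough sketch of one such windowed mapping (the file name is made up, the
offset handed to mmap must be a multiple of mmap.ALLOCATIONGRANULARITY, and
the window has to fit inside the file):

import mmap
import struct

GRAN = mmap.ALLOCATIONGRANULARITY
WINDOW = 64 * 1024 * 1024            # 64 MB windows fit even in 32-bit space

def patch_short(f, pos, value):
    # write a big-endian 16-bit value at absolute byte offset pos
    # through a temporary window mapping
    start = (pos // GRAN) * GRAN     # align the mapping offset
    m = mmap.mmap(f.fileno(), WINDOW, offset=start)
    m[pos - start:pos - start + 2] = struct.pack('>h', value)
    m.close()

f = open('big.segy', 'rb+')          # hypothetical 20+ GB file
patch_short(f, 3600 + 28, 1)         # e.g. a field in the first trace header
f.close()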

Regards,
Pat
 
