Slaunger
Hi comp.lang.python,
I am a novice Python programmer working on a project where I deal with
large binary files (>50 GB each)
consisting of a series of variable sized data packets.
Each packet consists of a small header with size and other information
and a much larger payload containing the actual data.
Using Python 2.5, struct and numpy arrays I am capable of parsing such
a file quite efficiently into Header and Payload objects which I then
manipulate in various ways.
The most time-consuming part of the parsing is converting the payloads
from a proprietary 32-bit float format into the IEEE floats used
internally in Python.
For many use cases I do not want to parse the payload as I pass
through it, because I may want to use the header attributes to select
the 1 in 1000 payloads whose data I actually need to look at, and do
the expensive float conversion only for those.
I would therefore like to have two variants of a Payload class. One
which is instantiated right away with the payload being parsed up in
the float arrays available as instance attributes and another variant,
where the Payload object at the time of instantiation only contains a
pointer to the place (f.tell()) in the file where the payload begins.
Only when the not-yet-existing attribute holding the parsed data is
actually accessed should the data be read, parsed and the attribute
created.
In pseudocode:
class PayloadInstant(object):
    """
    This is a normal Payload, where the data are parsed when
    instantiated.
    """

    @classmethod
    def read_from_file(cls, f, size):
        """
        Returns a PayloadInstant instance with float data parsed
        and immediately accessible in the data attribute.
        Instantiation is slow, but after instantiation access is fast.
        """

    def __init__(self, the_data):
        self.data = the_data


class PayloadOnDemand(object):
    """
    Behaves as a PayloadInstant object, but instantiation is faster
    as only the position of the payload in the file is stored
    initially in the object.
    Only when accessing the initially non-existing data attribute
    are the data actually read and the attribute created and bound to
    the instance.
    This first access will actually be a little slower than in
    PayloadInstant, as the file first has to seek to the correct
    position.
    On later calls the object has as efficient attribute access as
    PayloadInstant.
    """

    @classmethod
    def read_from_file(cls, f, size):
        pos = f.tell()
        f.seek(pos + size)  # Skip to end of payload
        return cls(pos)

    # I probably need some __getattr__ or __getattribute__ magic here...??

    def __init__(self, a_file_position):
        self.file_position = a_file_position
My question is whether this is a Pythonic way to do it, and I would
like a hint as to how to make the hook inside the PayloadOnDemand
class, such that the inner lazy creation of the attribute is
completely hidden from the outside.
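To make the question concrete, here is a minimal sketch of what such a hook might look like using __getattr__, which Python calls only when normal attribute lookup fails. Several things here are assumptions and not part of my real code: the instance also keeps the file object and the payload size (the pseudocode above stores only the position), the _parse method is a hypothetical stand-in that treats the payload as plain little-endian IEEE floats, and the demo reads from an in-memory io.BytesIO buffer (the io module needs Python 2.6 or later).

```python
import io
import struct


class PayloadOnDemand(object):
    """Stores only where the payload lives; parses it on first access."""

    def __init__(self, f, file_position, size):
        self._file = f                  # assumption: file stays open
        self.file_position = file_position
        self.size = size

    @classmethod
    def read_from_file(cls, f, size):
        pos = f.tell()
        f.seek(pos + size)  # skip to end of payload
        return cls(f, pos, size)

    def _parse(self):
        # Hypothetical placeholder for the expensive conversion:
        # the payload is assumed to hold little-endian IEEE floats.
        saved = self._file.tell()
        self._file.seek(self.file_position)
        raw = self._file.read(self.size)
        self._file.seek(saved)  # leave the file position untouched
        return list(struct.unpack("<%df" % (self.size // 4), raw))

    def __getattr__(self, name):
        # Only reached when 'name' is not found the normal way,
        # i.e. before the data attribute has been created.
        if name == "data":
            value = self._parse()
            self.data = value  # bind to the instance; later lookups skip this hook
            return value
        raise AttributeError(name)


# Demo with an in-memory "file" holding three floats (12 bytes).
buf = io.BytesIO(struct.pack("<3f", 1.0, 2.0, 3.0))
payload = PayloadOnDemand.read_from_file(buf, 12)
first = payload.data    # triggers the read and parse
second = payload.data   # plain attribute lookup, no parsing
```

Once the attribute is bound in the instance dictionary, __getattr__ is no longer consulted for it, so later access should cost the same as for an ordinary attribute.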
I guess I could also just make a single class, and let an OnDemand
attribute decide how it should behave.
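The single-class variant might look roughly like the sketch below, using a property instead of __getattr__. The on_demand flag, the convert_floats helper and the plain IEEE float layout are all hypothetical placeholders for illustration, and the demo again uses an in-memory io.BytesIO buffer (Python 2.6 or later).

```python
import io
import struct


def convert_floats(raw):
    # Hypothetical placeholder for the proprietary-to-IEEE conversion;
    # here the bytes are assumed to be little-endian IEEE floats already.
    return list(struct.unpack("<%df" % (len(raw) // 4), raw))


class Payload(object):
    """One class; the on_demand flag decides when parsing happens."""

    def __init__(self, f, file_position, size, data=None):
        self._file = f
        self.file_position = file_position
        self.size = size
        self._data = data  # None means "not parsed yet"

    @classmethod
    def read_from_file(cls, f, size, on_demand=False):
        pos = f.tell()
        if on_demand:
            f.seek(pos + size)  # skip the payload for now
            return cls(f, pos, size)
        return cls(f, pos, size, convert_floats(f.read(size)))

    @property
    def data(self):
        if self._data is None:
            saved = self._file.tell()
            self._file.seek(self.file_position)
            self._data = convert_floats(self._file.read(self.size))
            self._file.seek(saved)  # restore the caller's file position
        return self._data


# Demo: two 8-byte payloads, one parsed eagerly and one lazily.
buf = io.BytesIO(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))
eager = Payload.read_from_file(buf, 8)
lazy = Payload.read_from_file(buf, 8, on_demand=True)
```

The trade-off, as far as I can tell, is that a property runs a small function on every access, whereas the __getattr__ approach pays only on the first access.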
My real application is considerably more complicated than this, but I
think the example captures the problem in a nutshell.
-- Slaunger