How to instantiate in a lazy way?

Slaunger

Hi comp.lang.python,

I am a novice Python programmer working on a project where I deal with
large binary files (>50 GB each)
consisting of a series of variable sized data packets.

Each packet consists of a small header with size and other information
and a much larger payload containing the actual data.

Using Python 2.5, struct and numpy arrays I am capable of parsing such
a file quite efficiently into Header and Payload objects which I then
manipulate in various ways.

The most time consuming part of the parsing is the conversion of a
proprietary form of 32 bit floats into the IEEE floats used internally
in Python in the payloads.

For many use cases I am actually not interested in parsing the payload
right when I pass through it, as I may want to use the attributes of
the header to select the 1 in 1000 payloads whose data I actually have
to look into and run the resource-intensive float conversion on.

I would therefore like to have two variants of a Payload class. One
which is instantiated right away with the payload parsed into float
arrays available as instance attributes, and another variant where
the Payload object at the time of instantiation only contains a
pointer to the place (f.tell()) in the file where the payload begins.
Only when the not-yet-existing attribute for the parsed payload is
actually accessed should the data be read and parsed, and the
attribute created.

In pseudocode:

class PayloadInstant(object):
    """
    This is a normal Payload, where the data are parsed up when
    instantiated
    """

    @classmethod
    def read_from_file(cls, f, size):
        """
        Returns a PayloadInstant instance with float data parsed up
        and immediately accessible in the data attribute.
        Instantiation is slow, but after instantiation access is fast.
        """

    def __init__(self, the_data):
        self.data = the_data

class PayloadOnDemand(object):
    """
    Behaves as a PayloadInstant object, but instantiation is faster
    as only the position of the payload in the file is stored
    initially in the object.
    Only when accessing the initially non-existing data attribute
    are the data actually read and the attribute created and bound
    to the instance.
    This will actually be a little slower than in PayloadInstant, as
    the correct file position has to be sought out first.
    On later calls the object has as efficient attribute access as
    PayloadInstant
    """

    @classmethod
    def read_from_file(cls, f, size):
        pos = f.tell()
        f.seek(pos + size)  # Skip to end of payload
        return cls(pos)

    # I probably need some __getattr__ or __getattribute__ magic
    # here...??

    def __init__(self, a_file_position):
        self.file_position = a_file_position

My question is whether this is a Pythonic way to do it, and I would
then like a hint as to how to make the hook inside the PayloadOnDemand
class, such that the inner lazy creation of the attribute is
completely hidden from the outside.

I guess I could also just make a single class, and let an OnDemand
attribute decide how it should behave.

My real application is considerably more complicated than this, but I
think the example captures the problem in a nutshell.

-- Slaunger
 
Slaunger

Slaunger said:
class PayloadOnDemand(object):
    """
    Behaves as a PayloadInstant object, but instantiation is faster
    as only the position of the payload in the file is stored
    initially in the object.
    Only when accessing the initially non-existing data attribute
    are the data actually read and the attribute created and bound
    to the instance.
    This will actually be a little slower than in PayloadInstant, as
    the correct file position has to be sought out first.
    On later calls the object has as efficient attribute access as
    PayloadInstant
    """

    @classmethod
    def read_from_file(cls, f, size):
        pos = f.tell()
        f.seek(pos + size)  # Skip to end of payload
        return cls(pos)

Extend with ref to file instead:

        return cls(f, pos)

    # I probably need some __getattr__ or __getattribute__ magic
    # there...??

To answer my own rhetorical question, I guess I should do something
like this:

    def __getattr__(self, attr_name):
        """
        Only called if normal attribute lookup fails, i.e. attr_name
        is not in the __dict__ of the instance (nor on the class)
        """
        if attr_name == 'data':
            self.__dict__[attr_name] = read_data(self.f,
                                                 self.file_position)
            return self.__dict__[attr_name]
        # Any other missing attribute is a genuine error
        raise AttributeError(attr_name)

    def __init__(self, a_file_position):
        self.file_position = a_file_position

and then I also need to store a reference to the file in the
constructor...

    def __init__(self, a_file, a_file_position):
        self.f = a_file
        self.file_position = a_file_position

Have I understood correctly how to do it the on demand way?
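
For readers who want to try the idea, here is a minimal runnable sketch. It assumes Python 3 and uses an in-memory io.BytesIO in place of the large file. The read_data helper, the extra size argument and the little-endian float layout are hypothetical stand-ins, not the proprietary format from the post.

```python
import io
import struct


def read_data(f, pos, size):
    """Hypothetical parser: read `size` bytes at `pos` as little-endian floats."""
    f.seek(pos)
    buf = f.read(size)
    return struct.unpack("<%df" % (size // 4), buf)


class PayloadOnDemand(object):
    def __init__(self, a_file, a_file_position, size):
        self.f = a_file
        self.file_position = a_file_position
        self.size = size

    @classmethod
    def read_from_file(cls, f, size):
        pos = f.tell()
        f.seek(pos + size)  # skip to end of payload without parsing it
        return cls(f, pos, size)

    def __getattr__(self, attr_name):
        # Only reached when normal lookup fails, so this runs at most once
        # for "data": afterwards the value lives in the instance __dict__.
        if attr_name == "data":
            value = read_data(self.f, self.file_position, self.size)
            self.__dict__[attr_name] = value
            return value
        raise AttributeError(attr_name)


# Two 8-byte payloads back to back in a fake file
f = io.BytesIO(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))
first = PayloadOnDemand.read_from_file(f, 8)
second = PayloadOnDemand.read_from_file(f, 8)
print(second.data)  # (3.0, 4.0)
```

Only the payload that is actually touched gets parsed; first stays unparsed until its data attribute is read.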

-- Slaunger
 
Slaunger

I wouldn't use __getattr__ unless you've got lots of attributes to
overload.  __getattr__ is a recipe for getting yourself into trouble
in my experience ;-)

Just do it like this...

class PayloadOnDemand(object):
      def __init__(self, a_file, a_file_position):
          self._data = None
          self.f = a_file
          self.file_position = a_file_position

      @property
      def data(self):
          if self._data is None:
              self._data = self.really_read_the_data()
          return self._data

Then you'll have a .data attribute which populates itself the first
time you read it.

If None is a valid value for data then make a sentinel, eg

class PayloadOnDemand(object):
      sentinel = object()

      def __init__(self, a_file, a_file_position):
          self._data = self.sentinel
          self.f = a_file
          self.file_position = a_file_position

      @property
      def data(self):
          if self._data is self.sentinel:
              self._data = self.really_read_the_data()
          return self._data
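
A small sketch of why the sentinel matters, assuming Python 3. The loader callable and the call counter are hypothetical additions for the demonstration; here the loader legitimately returns None, and the property still runs it only once.

```python
class PayloadOnDemand(object):
    _sentinel = object()  # unique marker meaning "not loaded yet"

    def __init__(self, loader):
        self._data = self._sentinel
        self._loader = loader  # hypothetical callable doing the slow read

    @property
    def data(self):
        if self._data is self._sentinel:
            self._data = self._loader()
        return self._data


calls = []


def slow_loader():
    calls.append(1)
    return None  # None is a valid payload value in this example


p = PayloadOnDemand(slow_loader)
p.data
p.data
print(len(calls))  # 1
```

With a plain `self._data is None` check the loader would have run on every access in this case.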

OK, I get it. In my case I have four attributes to create when one of
them is accessed, I do not know if that is a lot of attributes;-) One
thing I like about the __getattr__ is that it is only called that one
single time where an attempt to read a data attribute fails because
the attribute name is not defined in the __dict__ of the object.

With the property methodology you do the value check on each get, which
does not look as "clean". On the other hand, the property methodology
is, I guess, a little less arcane for less experienced Python
programmers to understand when re-reading the code.

What kind of trouble are you referring to in __getattr__? Is it
recursive calls to the method on accessing object attributes in that
method itself or other complications?
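
One common form of the recursion problem can be sketched as follows, assuming Python 3.5 or later (for RecursionError). The class names and the cache attribute are hypothetical. Inside __getattr__, reading any attribute that is itself missing re-enters __getattr__.

```python
class Recursive(object):
    def __getattr__(self, name):
        # BUG: "cache" was never set, so self.cache calls
        # __getattr__("cache"), which reads self.cache again, and so on.
        return self.cache[name]


class Guarded(object):
    def __getattr__(self, name):
        # Reading the instance __dict__ directly cannot re-enter __getattr__.
        cache = self.__dict__.get("cache", {})
        try:
            return cache[name]
        except KeyError:
            raise AttributeError(name)


try:
    Recursive().anything
    outcome = "no error"
except RecursionError:
    outcome = "RecursionError"

print(outcome)                         # RecursionError
print(hasattr(Guarded(), "anything"))  # False
```

The same thing can happen when an object is created without running __init__ (for example during copying or unpickling), because the attributes that __getattr__ relies on do not exist yet.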

On a related issue, thank you for showing me how to use @property as a
decorator - I was not aware of that possibility, just gotta understand
how to decorate a setter and delete method as well, but I should be
able to look that up by myself...
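
For reference, a short sketch of the setter and deleter decorators. Note the assumption: the @data.setter and @data.deleter forms were added in Python 2.6, so on the Python 2.5 mentioned at the start of the thread one would instead write data = property(fget, fset, fdel). The Payload class here is a hypothetical example.

```python
class Payload(object):
    def __init__(self):
        self._data = None

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, value):
        # Runs on "p.data = value"
        self._data = value

    @data.deleter
    def data(self):
        # Runs on "del p.data"; here it just resets the stored value
        self._data = None


p = Payload()
p.data = [1.0, 2.0]
print(p.data)  # [1.0, 2.0]
del p.data
print(p.data)  # None
```

All three methods must share the property's name (data), since each decorator returns a new property object bound to that name.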

-- Slaunger
 
Slaunger

For 4 attributes I'd probably go with the __getattr__.

OK, I'll do that!

Or you could easily write your own decorator to cache the result...

E.g. http://code.activestate.com/recipes/363602/

Cool. I never realized I could write my own decorators!
I've so far only used them for
@classmethod, @staticmethod and stuff like that.

User defined decorators are nice and fun to do as well.

I just hope it will be understandable
in four years also...

Less magic is how I would put it.  Magic is fun to write, but a pain
to come back to.  Over the years I find I try to avoid magic more and
more in python.

Ah, I see. I hope you do not consider user defined decorators
"magic" then? ;-)

Every time I write a __getattr__ I get tripped up by infinite
recursion!  It is probably just me ;-)

And I will probably end up having the same initial problems, but I
found an example
here, which I may try to be inspired from.

http://western-skies.blogspot.com/2008/02/complete-example-of-getattr-in-python.html

Yeah, I just visited that page yesterday!

Again, thank you for your assistance, Nick!
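
As an illustration of the caching decorator idea discussed above, here is a sketch assuming Python 3. It is not a copy of the linked recipe; the name cached_attribute and the Payload class are hypothetical. The decorator is a non-data descriptor (it defines only __get__), so once the value is stored in the instance __dict__, later reads never reach the descriptor again.

```python
class cached_attribute(object):
    """Hypothetical decorator: compute on first access, then cache."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, not an instance
        value = self.func(instance)
        # The instance __dict__ entry shadows this descriptor from now on
        instance.__dict__[self.name] = value
        return value


class Payload(object):
    reads = 0

    @cached_attribute
    def data(self):
        Payload.reads += 1  # count how often the slow path runs
        return [1.0, 2.0, 3.0]


p = Payload()
p.data
p.data
print(Payload.reads)  # 1
```

This gives the "only called once" behaviour of __getattr__ without the per-access check of a plain property. Python 3.8 and later ship functools.cached_property, which works along the same lines.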

-- Slaunger
 
Slaunger

Just wanted to show the end result in its actual implementation!

I ended up *not* making a decorator, as I already had a good idea
about how to do it
using __getattr__

class PayloadDualFrqIQOnDemand(PayloadDualFrqIQ):
    """
    This class has the same interface as its parent,
    but unlike its parent, it is instantiated without
    its payload parsed up in its instance attributes
    Q1, I1, Q2 and I2. Instead it stores a reference to
    the file object in which the Payload data can be
    read, the file position and
    the version of the payload data.

    On accessing one of the data attributes, the actual
    payload data are read from the file, and the reference to
    the file object is unbound.
    The constructor signature is therefore different from its
    parent as it takes the file object, position and version
    as arguments instead of the actual data.
    """

    @classmethod
    def _unpack_from_file(cls, f, samples, ver):
        bytes = samples * cls.bytes_per_sample
        initial_pos = f.tell()
        f.seek(initial_pos + bytes)  # Skip over the payload
        return cls(f, initial_pos, samples, ver)

    @classmethod
    def unpack_from_ver3_file(cls, f, samples):
        return cls._unpack_from_file(f, samples, ver=3)

    @classmethod
    def unpack_from_ver4_file(cls, f, samples):
        return cls._unpack_from_file(f, samples, ver=4)

    data_attr_names = frozenset(["Q1", "I1", "Q2", "I2"])

    def __init__(self, a_file, a_file_position, samples, a_version):
        """
        Returns an instance where the object knows where to
        look for the payload but it will only be loaded on the
        first attempt to read one of the data attributes
        in a "normal" PayloadDualFrqIQ object.
        """
        self.f = a_file
        self.file_position = a_file_position
        self.samples = samples
        self.ver = a_version

    def __getattr__(self, attr_name):
        """
        Checks whether a request for a missing attribute names
        one of the data attributes in a normal PayloadDualFrqIQ
        object.

        If true, the data attributes are created and bound to the
        object using the file object instance, the file position
        and the version.

        It is a prerequisite that the file object is still open.
        The function leaves the file object at the file position
        it had when entering the method.
        """
        cls = self.__class__
        if attr_name in cls.data_attr_names:
            initial_pos = self.f.tell()
            try:
                bytes = self.samples * cls.bytes_per_sample
                self.f.seek(self.file_position)
                buf = self.f.read(bytes)
                if self.ver == 3:
                    bytes_to_data = cls._v3_byte_str_to_data
                elif self.ver == 4:
                    bytes_to_data = cls._v4_byte_str_to_data
                else:
                    raise TermaNotImplemented(
                        "Support for ver. %d not implemented." % self.ver)
                I1, Q1, I2, Q2 = bytes_to_data(buf)
                self.__dict__["I1"] = I1
                self.__dict__["Q1"] = Q1
                self.__dict__["I2"] = I2
                self.__dict__["Q2"] = Q2
                return self.__dict__[attr_name]
            finally:
                # Restore file position
                self.f.seek(initial_pos)
                # Unbind lazy attributes
                del self.f
                del self.ver
                del self.file_position
                del self.samples
        # Not a data attribute: report it as missing instead of
        # silently returning None
        raise AttributeError(attr_name)

This seems to work out well. No infinite loops in __getattr__!

At least it passes the unit test cases I have come up with so far.

No guarantees though, as I may simply not have been smart enough to
bring forth unit test cases which make it crash.

Comments on the code are still appreciated though.

I am still a novice Python programmer, and I may have overlooked
more Pythonic ways to do it.

-- Slaunger
 
George Sakkis

A trivial improvement: replace

    I1, Q1, I2, Q2 = bytes_to_data(buf)
    self.__dict__["I1"] = I1
    self.__dict__["Q1"] = Q1
    self.__dict__["I2"] = I2
    self.__dict__["Q2"] = Q2

with:

    self.__dict__.update(zip(self.data_attr_names, bytes_to_data(buf)))

where data_attr_names = ("I1", "Q1", "I2", "Q2") instead of a
frozenset. A linear search in a size-4 tuple is unlikely to be the
bottleneck with much I/O anyway.
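
A small sketch of the update/zip idiom, assuming Python 3. The Holder class and this toy bytes_to_data are hypothetical stand-ins. The tuple matters for correctness as well as simplicity: zip pairs names and values by position, and a frozenset has no defined iteration order.

```python
class Holder(object):
    # Ordered, so names line up with the values returned by bytes_to_data
    data_attr_names = ("I1", "Q1", "I2", "Q2")


def bytes_to_data(buf):
    """Hypothetical stand-in: de-interleave a flat sequence into 4 channels."""
    return tuple(buf[i::4] for i in range(4))


h = Holder()
# dict.update accepts an iterable of (key, value) pairs
h.__dict__.update(zip(h.data_attr_names, bytes_to_data(list(range(8)))))
print(h.I1, h.Q2)  # [0, 4] [3, 7]
```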

George
 
Slaunger

Thank you for this little hint, George.
I've never used update on a dict and the zip function before.
This is a nice application of these functions.

And I agree, performance is not an issue when choosing a tuple instead
of a frozenset. The bytes_to_data function (implemented in the parent
class) is the performance bottleneck in the actual application.

-- Slaunger
 
Slaunger

self.data_attr_names should do instead of cls.data_attr_names unless
you are overriding it in the instance (which you don't appear to be).

Yeah, I know. I just like the cls notation for code readability
because it tells you that it is a class attribute, which is not
instance-dependent.

That may be a legacy from my Java past, where I used to do it that way.
I know perfectly well that self. would do it. I just find that
notation a little misleading.

                 I1, Q1, I2, Q2 = bytes_to_data(buf)
                 self.__dict__["I1"] = I1
                 self.__dict__["Q1"] = Q1
                 self.__dict__["I2"] = I2
                 self.__dict__["Q2"] = Q2
                 return self.__dict__[attr_name]

I think you want setattr() here - __dict__ is an implementation detail
- classes with __slots__ for instance don't have a __dict__.  I'd
probably do this

Oh my, I did not know that. __slots__?? Something new I got to
understand.
But you are right. Thanks!

                   for k, v in zip(("I1", "Q1", "I2", "Q2"), bytes_to_data(buf)):
                       setattr(self, k, v)
                   # object defines __getattribute__ (it has no __getattr__)
                   return object.__getattribute__(self, attr_name)
And perhaps even more readable (how I do it now; no need to call
__getattr__ for an attribute which is already there):

    ...
    for attr, value in zip(cls.data_attr_names, bytes_to_data(buf)):
        setattr(self, attr, value)

    return getattr(self, attr_name)

:)
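
To make the __slots__ remark concrete, here is a minimal sketch assuming Python 3; the Slotted class is a hypothetical example. An instance of a class that defines __slots__ (and whose bases add no __dict__) has no __dict__ at all, yet setattr() and getattr() work as usual.

```python
class Slotted(object):
    __slots__ = ("I1", "Q1")  # fixed attribute set, no per-instance __dict__


s = Slotted()
print(hasattr(s, "__dict__"))  # False

# setattr works for slots, whereas s.__dict__["I1"] = ... would fail
for name, value in zip(("I1", "Q1"), ([1.0], [2.0])):
    setattr(s, name, value)

print(getattr(s, "Q1"))  # [2.0]
```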

I would probably factor out the contents of the if statement into a
separate method, but that is a matter of taste!

Agreed. I thought about that myself for better code readability.

As a final comment, I have actually refactored the code quite a bit,
as I have to do this ...OnDemand trick for several classes which have
different data attributes with different names.

In this process I have actually managed to isolate all the ...OnDemand
stuff in an abstract PayloadOnDemand class.

I can now use this "decorator-like" helper class to very easily make
an ...OnDemand variant of a class by just doing multiple inheritance,
with no implementation:

    class PayloadBaconAndEggsOnDemand(PayloadOnDemand, PayloadBaconAndEggs): pass

I guess this somewhat resembles the decorator approach - I just could
not figure out how to make a general purpose decorator.

For this to actually work the "instant" PayloadBaconAndEggs class
simply has to define
and implement a few class attributes and static utility functions for
the unpacking.
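
The actual PayloadOnDemand class is not shown in the thread, so the following is only a guess at its general shape, assuming Python 3. The constructor arguments, the data_attr_names tuple, the bytes_to_data static method and the two-float layout are all hypothetical.

```python
import io
import struct


class PayloadOnDemand(object):
    """Hypothetical mixin: defers parsing until a data attribute is read."""

    def __init__(self, a_file, a_file_position, size):
        self.f = a_file
        self.file_position = a_file_position
        self.size = size

    def __getattr__(self, attr_name):
        cls = type(self)
        if attr_name not in cls.data_attr_names:
            raise AttributeError(attr_name)
        self.f.seek(self.file_position)
        values = cls.bytes_to_data(self.f.read(self.size))
        for name, value in zip(cls.data_attr_names, values):
            setattr(self, name, value)
        return getattr(self, attr_name)


class PayloadBaconAndEggs(object):
    """The "instant" class supplies the attribute names and the unpacker."""

    data_attr_names = ("bacon", "eggs")

    @staticmethod
    def bytes_to_data(buf):
        return struct.unpack("<2f", buf)

    def __init__(self, bacon, eggs):
        self.bacon = bacon
        self.eggs = eggs


class PayloadBaconAndEggsOnDemand(PayloadOnDemand, PayloadBaconAndEggs):
    pass


f = io.BytesIO(struct.pack("<2f", 1.5, 2.5))
p = PayloadBaconAndEggsOnDemand(f, 0, 8)
print(p.eggs)  # 2.5
```

Listing PayloadOnDemand first in the bases puts its __init__ and __getattr__ ahead of the instant class in the method resolution order, while the class attributes are still found on PayloadBaconAndEggs.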

-- Slaunger
 
Slaunger

I quite like it... It looks in the instance, then in the class which I
find to be very elegant - you can set a default in the class and
override it on a per object or per subclass basis.

In principle yes.

In the particular case in which it is used I happen to know that it
would not make sense to have a different attribute at the instance
level.

That is, however, quite hard to realize for outside reviewers based on
the small snippets I have revealed here. So, I certainly understand
your viewpoint.

The cls notation sort of emphasizes that instances are not supposed to
override it (for me at least), and if they did, it would be ignored.
In other applications, I would use self. too.
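
The difference between the two notations can be sketched like this, assuming Python 3; the Payload class and its method names are hypothetical. A lookup through self checks the instance first and then the class, while a lookup through the class ignores any instance-level override.

```python
class Payload(object):
    data_attr_names = ("I1", "Q1")

    def via_self(self):
        # Instance __dict__ first, then the class
        return self.data_attr_names

    def via_cls(self):
        # Class only: an instance-level override is ignored
        return self.__class__.data_attr_names


p = Payload()
p.data_attr_names = ("override",)
print(p.via_self())  # ('override',)
print(p.via_cls())   # ('I1', 'Q1')
```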

I wrote that call through object to stop recursion troubles.  If
you are sure you've set the attribute then plain getattr() is OK I
guess...

Ah, OK. I am sure, and my unit tests verify my assurance.

You've reinvented a Mixin class!

http://en.wikipedia.org/wiki/Mixin

It is a good technique.

Wow, there is a name for it! I did not know that.

Hmm... I never really took the time to study those GoF design
patterns.

(I am a physicist after all... and not really a programmer)

I guess I could save a lot of time constantly re-inventing the wheel.

Are there any good design pattern books focused on applications in
Python?
(Actually, I will post that question in a separate thread)

Once again, I am extremely pleased with your very thoughtful comments,
Nick. Thanks!

-- Slaunger
 
