multi-Singleton-like using __new__

Freek Dijkstra

Is there a best practice on how to override __new__?

I have a base class, RDFObject, which is instantiated using a unique
identifier (a URI in this case). If an object with a given identifier
already exists, I want to return the existing object, otherwise, I
want to create a new object and add this new object to a cache. I'm
not sure if there is a name for such a creature, but I've seen the
name MultiSingleton in the archive.

This is not so hard; this can be done by overriding __new__(), as long
as I use a lock in case I want my code to be multi-threading
compatible.

import threading
threadlock = threading.Lock()

class RDFObject(object):
    _cache = {}   # class variable is shared among all RDFObject instances
    def __new__(cls, *args, **kargs):
        assert len(args) >= 1
        uri = args[0]
        if uri not in cls._cache:
            threadlock.acquire() # thread lock
            obj = object.__new__(cls)
            cls._cache[uri] = obj
            threadlock.release() # thread unlock.
        return cls._cache[uri]
    def __init__(self, uri):
        pass
    # ...

However, I have the following problem:
The __init__-method is called every time you call RDFObject().

The benefit of this multi-singleton is that I can put this class in a
module, call RDFObject(someuri), and simply keep adding state to it
(which is what we want). If it already had some state, good, that is
retained. If it had none yet, fine: we get a new object.
For example:

x = RDFObject(someuri)
x.myvar = 123
# ...later in the code...
y = RDFObject(someuri)
assert(y.myvar == 123)

My fellow programmers and I tend to forget about the __init__() catch.
For example, when we subclass RDFObject:

class MySubclass(RDFObject):
    def __init__(self, uri):
        RDFObject.__init__(self, uri)
        self.somevar = []

Now, this does not work. The list is unwantedly re-initialized on the
second call:

x = MySubclass(someotheruri)
x.somevar.append(123)
# ...later in the code...
y = MySubclass(someotheruri)
assert(y.somevar[0] == 123)

So I'm wondering: is there a best practice that allows the behaviour
we're looking for? (I can think of a few things, but I consider them
all rather ugly). Is there a good way to suppress the second call
to __init__() from the base class? Perhaps even without overriding
__new__?
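
The behaviour described above can be reproduced in a few lines. This is a minimal sketch in Python 3 syntax, not the poster's code: the class name CachedByKey, the attribute items and the example URN are hypothetical stand-ins. It only shows that __init__ runs again when __new__ hands back a cached instance.

```python
class CachedByKey(object):
    """Hypothetical stand-in for RDFObject; the name is illustrative."""
    _cache = {}

    def __new__(cls, key):
        # Return the cached instance if one exists for this key.
        if key not in cls._cache:
            cls._cache[key] = object.__new__(cls)
        return cls._cache[key]

    def __init__(self, key):
        # Runs on every CachedByKey(key) call, cached or not.
        self.items = []


first = CachedByKey("urn:example:a")
first.items.append(123)
second = CachedByKey("urn:example:a")

print(first is second)  # same object both times
print(second.items)     # the list was reset by the second __init__ call
```

The instance is shared, but its state is not preserved, which is exactly the catch the question is about.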
 
Guilherme Polo

Would something like this be acceptable?

class memoize(object):
    def __init__(self, func):
        self._memoized = {}
        self._func = func

    def __get__(self, instance, *args):
        self._instance = instance
        return self

    def __call__(self, *args, **kwargs):
        uri = args[0]
        if uri not in self._memoized:
            self._memoized[uri] = self._func(self._instance, *args, **kwargs)

        return self._memoized[uri]

class Memoize(type):
    @memoize
    def __call__(cls, *args, **kwargs):
        return super(Memoize, cls).__call__(*args, **kwargs)

class RDFObject(object):
    __metaclass__ = Memoize
    def __init__(self, uri):
        self.uri = uri
        print self.__class__, uri

class Test2(RDFObject):
    def __init__(self, uri):
        super(Test2, self).__init__(uri)
        self.mylist = []


x = Test2("oi")
print x.uri
x.mylist.append(32)

y = Test2("oi")
print y.uri
print y.mylist

z = Test2("oa")
print z.uri
print z.mylist

I haven't used this before, but it looks like some people have done
similar things. There are some possible problems you will encounter
with this: for example, if you create RDFObject("oa") after creating
Test2("oa"), you get the cached Test2 instance back, so it will contain
the mylist attribute.
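
The caveat about a cache shared between classes can be seen in a compact sketch. This is Python 3 syntax and a simplified metaclass rather than the descriptor-based code in this post; the names MemoizeByUri, Base and Child are hypothetical.

```python
class MemoizeByUri(type):
    """Metaclass caching instances by first argument (hypothetical name)."""
    _instances = {}  # one dict shared by every class using this metaclass

    def __call__(cls, uri, *args, **kwargs):
        if uri not in MemoizeByUri._instances:
            # type.__call__ runs __new__ and __init__ once per uri.
            MemoizeByUri._instances[uri] = super().__call__(uri, *args, **kwargs)
        return MemoizeByUri._instances[uri]


class Base(metaclass=MemoizeByUri):
    def __init__(self, uri):
        self.uri = uri


class Child(Base):
    def __init__(self, uri):
        super().__init__(uri)
        self.mylist = []


child = Child("oa")
child.mylist.append(32)
again = Base("oa")  # same uri, different class

print(again is child)        # the cached Child instance comes back
print(type(again).__name__)  # so Base("oa") is really a Child
```

Because the key is the uri alone, whichever class asks first decides the type of the cached object.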
 
Arnaud Delobelle

You need to override the __call__ method of the metaclass. By default
it calls cls.__new__ then cls.__init__, e.g.:

class RDFObject(object):

    # This metaclass prevents __init__ from being automatically called
    # after __new__
    class __metaclass__(type):
        def __call__(cls, *args, **kwargs):
            return cls.__new__(cls, *args, **kwargs)
    # ...

HTH
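
As a rough illustration of the default behaviour being overridden here, the following Python 3 sketch spells out approximately what type.__call__ does. The helper default_call and the class Point are hypothetical, and the real implementation is in C with more checks than shown.

```python
def default_call(cls, *args, **kwargs):
    """Rough Python rendering of type.__call__ (hypothetical helper)."""
    obj = cls.__new__(cls, *args, **kwargs)
    # __init__ runs only when __new__ returned an instance of cls.
    if isinstance(obj, cls):
        obj.__init__(*args, **kwargs)
    return obj


class Point(object):
    def __new__(cls, x):
        return object.__new__(cls)

    def __init__(self, x):
        self.x = x


p = default_call(Point, 3)
print(p.x)
```

Replacing __call__ on the metaclass with a version that stops after __new__ is what removes the automatic second step.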
 
J Peyret

I think the metaclass stuff is a bit too black magic for a pretty
simple requirement.

Txs in any case for showing me the __init__ issue, I wasn't aware of
it.

Here's a workaround - not exactly elegant in terms of OO, with the
isInitialized flag, but it works.

>>> class RDFObject(object):
...     cache = {}
...     isInitialized = False
...     def __new__(cls, uri, *args, **kwds):
...         try:
...             return cls.cache[uri]
...         except KeyError:
...             print "cache miss"
...             res = cls.cache[uri] = object.__new__(cls)
...             return res
...     def __init__(self, uri):
...         if self.isInitialized:
...             return
...         print "__init__ for uri:%s" % (uri)
...         self.isInitialized = True
...
cache miss
__init__ for uri:1
cache miss
__init__ for uri:2
<__main__.RDFObject object at 0x87a9f8c>

Some things to keep in mind:

- Might as well give uri its place as a positional param. Testing
len(args) is hackish, IMHO.

- Same with using try/except KeyError instead of 'in cls.cache'.
has_key might be better if you insist on look-before-you-leap, because
'in cls.cache' probably expands to 'uri in cls.cache.keys()', which can
be rather bad for perf if the cache is very big, i.e. dict lookups
are faster than scanning long lists.

- I took out the threading stuff - dunno threading and I was curious
if that was causing __init__ twice. It wasn't, again txs for showing
me something I dinna know.

- isInitialized is a class variable. __init__ looks it up from the
class on new instances, but immediately rebinds it on the instance
when assigning self.isInitialized = True. On an instance that gets
re-__init__-ed, self.isInitialized exists, so the lookup doesn't
propagate up to the class variable (which is still False).
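
The last point, a class attribute being shadowed by an instance attribute, can be shown on its own. A minimal Python 3 sketch with hypothetical names (Flagged, is_initialized, mark):

```python
class Flagged(object):
    is_initialized = False  # class attribute, the default for every instance

    def mark(self):
        # Assignment through self creates an instance attribute that
        # shadows the class attribute; the class value is untouched.
        self.is_initialized = True


a = Flagged()
b = Flagged()
a.mark()

print(a.is_initialized)        # True, found on the instance
print(b.is_initialized)        # False, falls back to the class
print(Flagged.is_initialized)  # False
```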

Cheers
 
Matt Nordhoff

J said:
- Same with using try/except KeyError instead of 'in cls.cache'.
has_key might be better if you insist on look-before-you-leap, because
'in cls.cache' probably expands to 'uri in cls.cache.keys()', which can
be rather bad for perf if the cache is very big, i.e. dict lookups
are faster than scanning long lists.

Not true. 'in' is (marginally) faster than has_key. It's also been
preferred to has_key for a while now.

$ python -m timeit -s "d = dict.fromkeys(xrange(5))" "4 in d"
1000000 loops, best of 3: 0.233 usec per loop
$ python -m timeit -s "d = dict.fromkeys(xrange(5))" "d.has_key(4)"
1000000 loops, best of 3: 0.321 usec per loop
$ python -m timeit -s "d = dict.fromkeys(xrange(500000))" "499999 in d"
1000000 loops, best of 3: 0.253 usec per loop
$ python -m timeit -s "d = dict.fromkeys(xrange(500000))" "d.has_key(499999)"
1000000 loops, best of 3: 0.391 usec per loop
$ python -m timeit -s "d = dict.fromkeys(xrange(500000))" "1000000 in d"
1000000 loops, best of 3: 0.208 usec per loop
$ python -m timeit -s "d = dict.fromkeys(xrange(500000))" "d.has_key(1000000)"
1000000 loops, best of 3: 0.324 usec per loop

FWIW, as comparison:

$ python -m timeit -s "l = range(500000)" "0 in l"
10000000 loops, best of 3: 0.198 usec per loop
$ python -m timeit -s "l = range(500000)" "499999 in l"
10 loops, best of 3: 19.8 msec per loop

(Python 2.5.1, Ubuntu. Of course, timings vary a bit, but not much. At
worst, in and has_key are "about the same".)
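
For readers on Python 3, where has_key no longer exists, a comparable measurement can be sketched with the timeit module. This is an assumption-laden stand-in, not a rerun of the timings above: the explicit method call d.__contains__(k) takes the place of has_key, and the loop count is arbitrary. Absolute numbers will differ by machine and interpreter.

```python
import timeit

setup = "d = dict.fromkeys(range(500000)); l = list(range(500000))"
loops = 100  # kept small so the list scan finishes quickly

# Operator on a dict, explicit method call on a dict, operator on a list.
for stmt in ("499999 in d", "d.__contains__(499999)", "499999 in l"):
    seconds = timeit.timeit(stmt, setup=setup, number=loops)
    print("%-25s %.3f usec per loop" % (stmt, seconds / loops * 1e6))
```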
 
Steven D'Aprano

At worst, in and has_key are "about the same".

Except that using has_key() means making an attribute lookup, which takes
time.

I'm kinda curious why you say they're "about the same" when your own
timing results contradict that. Here they are again, exactly as you
posted them:


$ python -m timeit -s "d = dict.fromkeys(xrange(5))" "4 in d"
1000000 loops, best of 3: 0.233 usec per loop
$ python -m timeit -s "d = dict.fromkeys(xrange(5))" "d.has_key(4)"
1000000 loops, best of 3: 0.321 usec per loop

For a small dict, a successful search using in is about 1.4 times faster
than using has_key().


$ python -m timeit -s "d = dict.fromkeys(xrange(500000))" "499999 in d"
1000000 loops, best of 3: 0.253 usec per loop
$ python -m timeit -s "d = dict.fromkeys(xrange(500000))" "d.has_key(499999)"
1000000 loops, best of 3: 0.391 usec per loop

For a large dict, a successful search using in is about 1.5 times faster
than using has_key().


$ python -m timeit -s "d = dict.fromkeys(xrange(500000))" "1000000 in d"
1000000 loops, best of 3: 0.208 usec per loop
$ python -m timeit -s "d = dict.fromkeys(xrange(500000))" "d.has_key(1000000)"
1000000 loops, best of 3: 0.324 usec per loop

For a large dict, an unsuccessful search using in is also about 1.5 times
faster than using has_key().


Or, to put it another way, has_key() takes about 40-60% longer than in.

Now, if you want to argue that the difference between 0.3 microseconds
and 0.2 microseconds is insignificant, I'd agree with you -- for a single
lookup. But if you have a loop where you're doing large numbers of
lookups, using in will be a significant optimization.
 
Gabriel Genellina

On Fri, 08 Feb 2008 22:04:26 -0200, Matt Nordhoff wrote:
Not true. 'in' is (marginally) faster than has_key. It's also been
preferred to has_key for a while now.

I don't understand your assertion, given your own timings below. I'd say
that `has_key` is about 50% slower than `in`, not "marginally".
 
Matt Nordhoff

Steven said:
Except that using has_key() means making an attribute lookup, which takes
time.

I was going to say that, but doesn't 'in' require an attribute lookup of
some sort too, of __contains__ or whatever? has_key is probably now just
a wrapper around that, so it would be one more attribute lookup. What's
the difference in performance between attribute lookups from Python code
and from the internal C code?

Steven said:
Now, if you want to argue that the difference between 0.3 microseconds
and 0.2 microseconds is insignificant, I'd agree with you -- for a single
lookup. But if you have a loop where you're doing large numbers of
lookups, using in will be a significant optimization.

I meant that it's (usually) insignificant, and barely larger than the
usual inaccuracy due to my system's load (I'm tired; what's the word for
that?).
 
Freek Dijkstra

Thanks. I first considered the isInitialized flag too, but discarded it
because of the following (sloppy) coding, which still leads to double
initializations:

class MySubclass(RDFObject):
    def __init__(self, uri):
        self.somevar = []
        RDFObject.__init__(self, uri)
    # ....

In this case, somevar is re-initialized before the return statement in
the parent.

I'll use the __metaclass__ solution. Here is my (non-threaded) version:

class RDFObject(object):
    _cache = {}   # class variable is shared among all RDFObject instances
    class __metaclass__(type):
        def __call__(cls, *args, **kwargs):
            return cls.__new__(cls, *args, **kwargs)
    def __new__(cls, uri, *args, **kargs):
        if uri not in cls._cache:
            obj = object.__new__(cls)
            cls._cache[uri] = obj
            obj.__init__(uri, *args, **kargs)
        return cls._cache[uri]
    def __init__(self, uri):
        self.uri = uri
        print self.__class__, uri
    # ...

Thanks for the other 'things to keep in mind'. I found those all very
informative!

Two more things for all lurkers out there:

- If you override __call__ in __metaclass__ to only call __new__, but
skip __init__, you must make sure to call __init__ elsewhere. For
example, from __new__:

    if uri not in cls._cache:
        obj = object.__new__(cls)
        cls._cache[uri] = obj
        obj.__init__(uri, *args, **kargs)

- My "thread safe" code was in fact NOT thread safe. It can still
create the same object twice in the (very unlikely) event that thread
#2 executes "cls._cache[uri] = obj" just between the lines "if uri not
in cls._cache:" and "threadlock.acquire()" in thread #1 (bonus points
for those who caught this one).
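
One way to close the race described in the second point is to do the membership test and the insertion under the same lock. The following is a sketch in Python 3 syntax, not a tested replacement for the code in this thread: the class-level _lock attribute and the example URN are assumptions, and the sketch leaves the repeated __init__ call aside.

```python
import threading


class RDFObject(object):
    _cache = {}
    _lock = threading.Lock()  # assumed: one lock guarding the whole cache

    def __new__(cls, uri):
        # Check and insert under the same lock, so no other thread can
        # slip in between the membership test and the assignment.
        with cls._lock:
            if uri not in cls._cache:
                cls._cache[uri] = object.__new__(cls)
            return cls._cache[uri]

    def __init__(self, uri):
        self.uri = uri


results = []

def worker():
    results.append(RDFObject("urn:example:shared"))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(set(id(obj) for obj in results)))  # number of distinct objects
```

Taking the lock on every call costs a little on cache hits; that trade-off is accepted here for simplicity.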

Perhaps indeed the try...except KeyError is even prettier (no idea
about speed, but let's rename this thread if we want to discuss
performance measurements).

Thanks all!
Freek
 
Steven D'Aprano

I was going to say that, but doesn't 'in' require an attribute lookup of
some sort too, of __contains__ or whatever?


I don't believe so. I understand that the in operator at the C level
knows about built-in types, and does something like this pseudo-code:

case type(source):
    dict: return MAP_IN(source, target)
    list: return LIST_IN(source, target)
    str: return STR_IN(source, target)
    otherwise:
        return source.__contains__(target)

Except all in C, so it's fast.

I understand that the dict has_key() method is essentially a wrapper
around MAP_IN (or whatever it's actually called), and so is also fast,
just not quite as fast because of the relatively slow look-up to get to
has_key() in the first place.
 
J Peyret

If you want to subclass, my initial example did not cover that. This
will, or at least, I don't have any problems with similar code:

...     def __new__(cls, uri, *args, **kwds):
...         try:
...             return cls.cache[(cls, uri)]  # notice that I added the class itself as a key.
...         except KeyError:
...             print "cache miss"
...             res = cls.cache[(cls, uri)] = object.__new__(cls)
...             return res
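
The effect of adding the class to the key can be checked with a short Python 3 sketch. The subclass MySubclass and the example URN are illustrative assumptions; the point shown is only that a base class and its subclass no longer share a cache entry for the same uri.

```python
class RDFObject(object):
    cache = {}

    def __new__(cls, uri, *args, **kwds):
        key = (cls, uri)  # the class is part of the key
        try:
            return cls.cache[key]
        except KeyError:
            obj = cls.cache[key] = object.__new__(cls)
            return obj

    def __init__(self, uri, *args, **kwds):
        self.uri = uri


class MySubclass(RDFObject):
    pass


base = RDFObject("urn:example:a")
sub = MySubclass("urn:example:a")

print(base is sub)                         # False: separate entries per class
print(sub is MySubclass("urn:example:a"))  # True: cached within the class
```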

Your mileage apparently varies, but I'm still not sold on using a
metaclass. Just because it is a bit fancy for the requirements,
IMHO. Later you may have cached/singleton classes that really need a
metaclass for something else, with subclasses of _those_ that don't
want caching. That's definitely just my $.02 there, I don't
necessarily expect others to share my prejudices and misgivings.

Anyway, I should post answers more often - I learned more today than
by asking questions ;-) Txs all, for the has_key/in performance
pointers, they are exactly contrary to what I would have thought.

Cheers
 
Hrvoje Niksic

Matt Nordhoff said:
I was going to say that, but doesn't 'in' require an attribute lookup of
some sort too, of __contains__ or whatever?

It doesn't. Frequent operations like __contains__ are implemented as
slots in the C struct describing the type. The code that implements
"obj1 in obj2" simply executes something like:

result = obj2->ob_type->tp_as_sequence->sq_contains(obj2, obj1); // intentionally simplified

User-defined types (aka new-style classes) contain additional magic
that reacts to assignment to YourType.__contains__ (typically performed
while the class is being built) by setting "sq_contains" to a C wrapper
that calls the intended function and interprets the result.
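
A small Python 3 sketch makes the slot behaviour visible from the Python side: the in operator consults the type, not the instance, and assigning __contains__ on the class is picked up. The class Bag and its contents are hypothetical.

```python
class Bag(object):
    def __init__(self, items):
        self.items = list(items)

    def __contains__(self, item):
        return item in self.items


bag = Bag([1, 2, 3])

# An instance attribute named __contains__ is ignored by the operator.
bag.__contains__ = lambda item: True
before = 99 in bag
print(before)

# Assigning on the class updates the slot, so the operator sees it.
Bag.__contains__ = lambda self, item: True
after = 99 in bag
print(after)
```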
 
Arnaud Delobelle

Freek Dijkstra wrote: [...]
I'll use the __metaclass__ solution. Here is my (non-threaded) version:

Great! There's nothing voodoo about metaclasses.

Freek Dijkstra wrote: [...]
Perhaps indeed the try...except KeyError is even prettier (no idea
about speed, but let's rename this thread if we want to discuss
performance measurements).

I would go for something like (untested):

def __new__(cls, uri, *args, **kwargs):
    obj = cls._cache.get(uri, None)
    if obj is None:
        obj = cls._cache[uri] = object.__new__(cls)
        obj.__init__(uri, *args, **kwargs)
    return obj

It doesn't look up the cache so much and I think it reads just as well.
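
Put together, the approach the thread converged on might look as follows in Python 3 syntax, where the metaclass is passed as a keyword instead of a __metaclass__ attribute. This is a sketch under assumptions: the metaclass name CallNewOnly, the somevar attribute and the example URN are illustrative, and threading is left out as in the versions above.

```python
class CallNewOnly(type):
    """Metaclass whose __call__ skips the automatic __init__ (hypothetical name)."""

    def __call__(cls, *args, **kwargs):
        return cls.__new__(cls, *args, **kwargs)


class RDFObject(metaclass=CallNewOnly):
    _cache = {}

    def __new__(cls, uri, *args, **kwargs):
        obj = cls._cache.get(uri)
        if obj is None:
            obj = cls._cache[uri] = object.__new__(cls)
            obj.__init__(uri, *args, **kwargs)  # runs once per uri
        return obj

    def __init__(self, uri):
        self.uri = uri
        self.somevar = []


x = RDFObject("urn:example:a")
x.somevar.append(123)
y = RDFObject("urn:example:a")

print(y is x, y.somevar)
```

Since __init__ is called only on a cache miss, state added to the object survives later lookups with the same uri.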
 
