creating garbage collectable objects (caching objects)

News123

Hi.

I started playing with PIL.

I'm performing operations on multiple images and would like to compromise
between speed and memory requirements.

The fast approach would load all images upfront and then create the
multiple result files. The problem is that I do not have enough memory to
load all the files.

The slow approach is to load each potential source file only when it is
needed and to release it immediately afterwards (leaving it up to the gc
to free the memory when needed).



The question I have is whether there is any way to tell Python that
certain objects could be garbage collected if needed, and to ask Python at
a later time whether the object has been collected so far (the image has
to be reloaded) or not (the image would not have to be reloaded).


# Fastest approach:
imgs = {}
for fname in all_image_files:
    imgs[fname] = Image.open(fname)
for creation_rule in all_creation_rules():
    img = Image.new(...)
    for img_file in creation_rule.input_files():
        img = do_somethingwith(img, imgs[img_file])
    img.save()


# Slowest approach:
for creation_rule in all_creation_rules():
    img = Image.new(...)
    for img_file in creation_rule.input_files():
        src_img = Image.open(img_file)
        img = do_somethingwith(img, src_img)
    img.save()



# What I'd like to do is something like:
imgs = GarbageCollectable_dict()
for creation_rule in all_creation_rules():
    img = Image.new(...)
    for img_file in creation_rule.input_files():
        if img_file in imgs:  # if I'm lucky, the object is still there
            src_img = imgs[img_file]
        else:
            src_img = Image.open(img_file)
        img = do_somethingwith(img, src_img)
    img.save()



Is this possible?

Thanks in advance for an answer, or any other ideas on how I could do
smart caching without hogging all the system's memory.
 
Terry Reedy

News123 said:
[snip]

The question I have is whether there is any way to tell Python that
certain objects could be garbage collected if needed, and to ask Python at
a later time whether the object has been collected so far (the image has
to be reloaded) or not (the image would not have to be reloaded).

See the weakref module. But note that in CPython, objects are collected
as soon as there are no normal references, not when 'needed'.
[snip]
 
Simon Forman

News123 said:
[snip]

The slow approach is to load each potential source file only when it is
needed and to release it immediately afterwards (leaving it up to the gc
to free the memory when needed).

[snip]

Maybe I'm just being thick today, but why would the "slow" approach be
slow? The same amount of I/O and processing would be done either way,
no?
Have you timed both methods?

That said, take a look at the weakref module Terry Reedy already
mentioned, and maybe the gc (garbage collector) module too (although
that might just lead to wasting a lot of time fiddling with stuff that
the gc is supposed to handle transparently for you in the first place.)
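
For what it's worth, a minimal timing harness along the lines Simon
suggests (a sketch using time.perf_counter from Python 3.3+;
build_all_preloaded and build_all_reloading are hypothetical stand-ins
for the two approaches, not names from this thread):

import time

def timed(label, fn, *args, **kwargs):
    # Run fn, print the elapsed wall-clock time, and return its result.
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print("%s took %.2f s" % (label, time.perf_counter() - t0))
    return result

# timed("fast (preload all images)", build_all_preloaded)
# timed("slow (reload per rule)", build_all_reloading)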
 
Dave Angel

News123 said:
[snip]

The question I have is whether there is any way to tell Python that
certain objects could be garbage collected if needed, and to ask Python at
a later time whether the object has been collected so far (the image has
to be reloaded) or not (the image would not have to be reloaded).

[snip]
You don't say what implementation of Python, nor on what OS platform.
Yet you're asking how to influence that implementation.

In CPython, version 2.6 (and probably most other versions, but somebody
else would have to chime in) an object is freed as soon as its reference
count goes to zero. So the garbage collector is only there to catch
cycles, and it runs relatively infrequently.

So, if you keep a reference to an object, it'll not be freed.
Theoretically, you can use the weakref module to keep a reference
without inhibiting the garbage collection, but I don't have any
experience with the module. You could start by studying its
documentation. But probably you want a weakref.WeakValueDictionary.
Use that in your third approach to store the cache.

If you're using Cython or Jython, or one of many other implementations,
the rules will be different.

The real key to efficiency is usually managing locality of reference.
If a given image is going to be used for many output files, you might
try to do all the work with it before going on to the next image. In
that case, it might mean searching all_creation_rules for rules which
reference the file you've currently loaded. Measurement is key.
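
A minimal sketch of the cache Dave suggests (assuming PIL is available;
note the CPython caveat raised later in this thread: an entry disappears
as soon as the last strong reference to its image is dropped):

import weakref
from PIL import Image

_cache = weakref.WeakValueDictionary()  # values may be collected at any time

def open_cached(fname):
    # Return the cached Image if it is still alive, else reload it.
    img = _cache.get(fname)  # None if never loaded or already collected
    if img is None:
        img = Image.open(fname)
        _cache[fname] = img
    return img

Calling open_cached in place of Image.open in the inner loop gives the
third approach; whether it ever hits the cache depends on something else
still holding a strong reference, which is exactly the point made further
down the thread.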
 
News123

Dave said:
You don't say what implementation of Python, nor on what OS platform.
Yet you're asking how to influence that implementation.

Sorry, my fault. I'm using CPython under Windows and under Linux.
In CPython, version 2.6 (and probably most other versions, but somebody
else would have to chime in) an object is freed as soon as its reference
count goes to zero. So the garbage collector is only there to catch
cycles, and it runs relatively infrequently.

If CPython frees objects as early as possible (as soon as the refcount is
0), then weakref will not really help me. In this case I'd have to
elaborate a cache-like structure.
[snip]

The real key to efficiency is usually managing locality of reference.
If a given image is going to be used for many output files, you might
try to do all the work with it before going on to the next image. In
that case, it might mean searching all_creation_rules for rules which
reference the file you've currently loaded, measurement is key.

Changing the order of the images to be calculated is key, and I'm working
on that.

As a first step I can reorder the image creation such that all output
images that depend on only one input image will be calculated one after
the other.

So for this case I can transform:

# Slowest approach:
for creation_rule in all_creation_rules():
    img = Image.new(...)
    for img_file in creation_rule.input_files():
        src_img = Image.open(img_file)
        img = do_somethingwith(img, src_img)  # wrong indentation in OP
    img.save()


into:

src_img = Image.open(img_file)
for creation_rule in all_creation_rules_with_on_src_img():
    img = Image.new(...)
    img = do_somethingwith(img, src_img)
    img.save()
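
One hedged way to build that reordering, reusing the placeholder names
from the pseudocode above (all_creation_rules, input_files,
do_somethingwith and Image.new(...) are still pseudocode):

from collections import defaultdict

# Group the single-input rules by their source file so that each source
# image is opened exactly once.
rules_by_src = defaultdict(list)
for creation_rule in all_creation_rules():
    inputs = creation_rule.input_files()
    if len(inputs) == 1:
        rules_by_src[inputs[0]].append(creation_rule)

for img_file, rules in rules_by_src.items():
    src_img = Image.open(img_file)
    for creation_rule in rules:
        img = Image.new(...)
        img = do_somethingwith(img, src_img)
        img.save()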


What I was more concerned about is a group of output images depending on
TWO or more input images.

Depending on the platform (and the images) I might not be able to preload
all two (or more) images.

So, as CPython's garbage collection always takes place immediately, I'd
like to pursue something else. I can create a cache which caches input
files as long as Python leaves at least n MB available for the rest of
the system.

For this I have to know how much RAM is still available on the system.

I'll start looking into this.
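
For the "how much RAM is still available" part, one option is the
third-party psutil package (a sketch; psutil and the 200 MB threshold are
assumptions, not something from this thread):

import psutil  # third-party: pip install psutil

MIN_FREE = 200 * 1024 * 1024  # keep at least 200 MB free for the system

def may_cache_more():
    # True while caching another image still leaves MIN_FREE bytes free.
    return psutil.virtual_memory().available > MIN_FREE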

thanks again



N
 
Dave Angel

News123 said:
[snip]

If CPython frees objects as early as possible (as soon as the refcount is
0), then weakref will not really help me. In this case I'd have to
elaborate a cache-like structure.

[snip]
As I said earlier, I think weakref is probably what you need. A weakref
is still a reference from the point of view of the ref-counting, but not
from the point of view of the garbage collector. Have you read the help
on the weakref module? In particular, did you read PEP 0205?
http://www.python.org/dev/peps/pep-0205/

An object cache is one of the two reasons for the weakref module.
 
Gabriel Genellina

Dave Angel wrote:
As I said earlier, I think weakref is probably what you need. A weakref
is still a reference from the point of view of the ref-counting, but not
from the point of view of the garbage collector. Have you read the help
on the weakref module? In particular, did you read PEP 0205?
http://www.python.org/dev/peps/pep-0205/

You've misunderstood something. A weakref is NOT "a reference from the
point of view of the ref-counting", it adds zero to the reference count.
When the last "real" reference to some object is lost, the object is
destroyed, even if there exist weak references to it. That's the whole
point of a weak reference. The garbage collector isn't directly related.

py> from sys import getrefcount as rc
py> class X(object): pass
...
py> x=X()
py> rc(x)
2
py> y=x
py> rc(x)
3
py> import weakref
py> r=weakref.ref(x)
py> r
<weakref at 00BE56C0; to 'X' at 00BE4F30>
py> rc(x)
3
py> del y
py> rc(x)
2
py> del x
py> r
<weakref at 00BE56C0; dead>

(remember that getrefcount, like any function, holds a temporary reference
to its argument, so the number it returns is one more than the expected
value)
An object cache is one of the two reasons for the weakref module.

...when you don't want the object to stay artificially alive just because
it's referenced in the cache. But the OP wants a different behavior, it
seems: a standard dictionary where images are removed when they're no
longer needed (or a memory limit is hit).
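
A sketch of the kind of plain-dictionary cache Gabriel describes, evicting
by a simple size limit rather than a memory measurement, using Python 3's
OrderedDict (the class name and the limit of 8 are illustrative
assumptions):

from collections import OrderedDict
from PIL import Image

class BoundedImageCache:
    # An ordinary dict-based cache: strong references, explicit eviction.
    def __init__(self, max_items=8):
        self.max_items = max_items
        self._items = OrderedDict()

    def open(self, fname):
        img = self._items.get(fname)
        if img is not None:
            self._items.move_to_end(fname)   # mark as recently used
            return img
        img = Image.open(fname)
        self._items[fname] = img
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict least recently used
        return img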
 
Dave Angel

Gabriel said:
You've misunderstood something. A weakref is NOT "a reference from the
point of view of the ref-counting", it adds zero to the reference count.

[snip]
Thanks for correcting me. As I said earlier, I have no experience with
weakref. The help and the PEP did sound to me like it would work for
his needs.

So how about adding an attribute in the large object that refers to the
object itself? Then the ref count will never go to zero, but it can be
freed by the gc. Also store the ref in a WeakValueDictionary, and you
can find the object without blocking its collection.

And no, I haven't tried it, and wouldn't unless a machine had nothing
important running on it. Clearly, the gc might not be able to keep up
with this kind of abuse. But if gc is triggered by any attempt to make
too large an object, it might work.

DaveA
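
A sketch of the self-reference idea Dave floats (untested, exactly as he
warns; CachedImage and loader are illustrative names, not thread API):

import gc
import weakref

class CachedImage:
    # The self-reference keeps the refcount above zero, so only the
    # cycle-detecting gc, not refcounting, can ever free the object.
    def __init__(self, data):
        self.data = data
        self._keepalive = self

cache = weakref.WeakValueDictionary()

def load(fname, loader):
    img = cache.get(fname)  # None once the gc has collected the cycle
    if img is None:
        img = CachedImage(loader(fname))
        cache[fname] = img
    return img

# gc.collect() reclaims every CachedImage cycle with no outside references,
# at which point its WeakValueDictionary entry disappears.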
 
