Multiple interpreters retaining huge amounts of memory


Bronner, Gregory

I have an application that simultaneously extends and embeds the Python
interpreter.

It is threaded, but all Python calls are performed in one thread.
Several interpreters are running simultaneously -- the application
receives an event, activates a particular interpreter, and calls some
Python code.

An interpreter's life cycle is to start, load a bunch of extension
modules, run intermittently for 30-40 minutes, and end.

At some point, the application calls Py_EndInterpreter on each
interpreter.
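
[For concreteness, the lifecycle being described looks roughly like this
(a minimal sketch for CPython 2.x embedding; error handling is omitted
and "some_swig_module" is a placeholder):]

#include <Python.h>

/* One session: create a sub-interpreter, load extensions, run, tear down. */
void run_session(void)
{
    PyThreadState *main_ts = PyThreadState_Get();
    PyThreadState *ts = Py_NewInterpreter();    /* becomes the current state */

    PyRun_SimpleString("import some_swig_module");  /* placeholder module */

    /* ... dispatch events into this interpreter for 30-40 minutes ... */

    Py_EndInterpreter(ts);        /* leaves the current thread state NULL */
    PyThreadState_Swap(main_ts);  /* restore the main interpreter's state */
}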

My memory allocation goes up by about 1MB per interpreter, of which I
know that 2k (SWIG types) are really leaked. gc.garbage doesn't have any
cycles.


Is there some way to track references per interpreter, or to get the
memory allocator to set up separate arenas per interpreter so that it
can remove all allocated memory when the interpreter exits?

Thanks




Martin v. Löwis

> Is there some way to track references per interpreter, or to get the
> memory allocator to set up separate arenas per interpreter so that it
> can remove all allocated memory when the interpreter exits?

No. The multi-interpreter feature doesn't really work, so you are
basically on your own. If you find out what the problem is, please
submit patches to bugs.python.org.

In any case, the strategy you propose (with multiple arenas) would *not*
work, since some objects have to be shared across interpreters.

Regards,
Martin
 

Graham Dumpleton

> No. The multi-interpreter feature doesn't really work, so you are
> basically on your own. If you find out what the problem is, please
> submit patches to bugs.python.org.
>
> In any case, the strategy you propose (with multiple arenas) would *not*
> work, since some objects have to be shared across interpreters.
>
> Regards,
> Martin

The multi-interpreter feature has some limitations, but if you know
what you are doing and your application can run within those
limitations, then it works fine.

If you are going to make a comment such as 'multi-interpreter feature
doesn't really work' you really should substantiate it by pointing to
where it is documented what the problems are or enumerate yourself
exactly what the issues are. There is already enough FUD being spread
around about the ability to run multiple sub interpreters in an
embedded Python application, so adding more doesn't help.

Oh, it would also be nice to know exactly what embedded systems you
have developed which make use of multiple sub interpreters so we can
gauge with what standing you have to make such a comment.

Graham
 

Martin v. Löwis

> If you are going to make a comment such as 'multi-interpreter feature
> doesn't really work' you really should substantiate it by pointing to
> where it is documented what the problems are or enumerate yourself
> exactly what the issues are. There is already enough FUD being spread
> around about the ability to run multiple sub interpreters in an
> embedded Python application, so adding more doesn't help.

I don't think the limitations have been documented in a systematic
manner. Some of the problems I know of are:

- objects can easily get shared across interpreters, and often are.
  This is particularly true for static variables that extensions keep,
  and for static type objects.
- Py_EndInterpreter doesn't guarantee that all objects are released,
  and may leak. This is the problem that the OP seems to have.
  All it does is to clear modules, sys, builtins, and a few other
  things; it is then up to reference counting and the cycle GC
  whether this releases all memory or not.
- the mechanism of PEP 311 doesn't work for multiple interpreters.
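
[To illustrate the first two points, this is the kind of extension-module
pattern that defeats Py_EndInterpreter's cleanup -- a hedged sketch with
invented names, not code from any particular module:]

#include <Python.h>

static PyObject *cached_config;   /* C static: lives for the whole process */

static PyObject *get_config(PyObject *self, PyObject *args)
{
    if (cached_config == NULL)
        cached_config = PyDict_New();  /* created in the first interpreter */
    Py_INCREF(cached_config);
    return cached_config;  /* the same object is handed to every interpreter */
}

static PyMethodDef methods[] = {
    {"get_config", get_config, METH_NOARGS, "Return the (shared) config dict."},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC initexample(void)   /* Python 2-style module init */
{
    Py_InitModule("example", methods);
}

/* Py_EndInterpreter clears the interpreter's modules, but nothing ever
   drops the reference held in cached_config, so the dict (and anything
   put into it) survives every interpreter shutdown. */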

> Oh, it would also be nice to know exactly what embedded systems you
> have developed which make use of multiple sub interpreters so we can
> gauge with what standing you have to make such a comment.

I have never used that feature myself. However, I wrote PEP 3121
to overcome some of its limitations.

Regards,
Martin
 

Graham Dumpleton

Nice to see that your comments do come from some understanding of the
issues. There have been a number of times in the past when people have
gone off saying things about multiple interpreters when they didn't
really know what they were talking about and were just echoing what
someone else had said. Some of the things being said were often just
wrong though. It just gets annoying. :-(

Anyway, a few comments below with pointers to some documentation on
various issues, plus details of other issues I know of.

> I don't think the limitations have been documented in a systematic
> manner. Some of the problems I know of are:
> - objects can easily get shared across interpreters, and often are.
>   This is particularly true for static variables that extensions keep,
>   and for static type objects.

Yep, but that is basically a problem with how people write C extension
modules; i.e., they don't write them with multiple interpreters in mind.

Until code was fixed recently in trunk, one high-profile module which
had this sort of problem was psycopg2. I am not sure if there has been
an official release yet which includes the fix. From memory, the problem
they had was that a static variable was caching a reference to the
type object for Decimal from the interpreter which first loaded and
initialised the module. That type object was then used to create
instances of Decimal type which were passed to other interpreters.
These Decimal instances would then fail isinstance() checks within
those other interpreters.
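
[Reconstructed from the description above -- this is not psycopg2's
actual code -- the pattern is roughly:]

/* A static caches the Decimal class from whichever interpreter first
   calls this; every other interpreter has its own decimal module, so
   instances built here fail isinstance() checks there. */
static PyObject *decimal_type;

static PyObject *make_decimal(PyObject *self, PyObject *args)
{
    char *s;
    if (!PyArg_ParseTuple(args, "s", &s))
        return NULL;
    if (decimal_type == NULL) {
        PyObject *m = PyImport_ImportModule("decimal");
        if (m == NULL)
            return NULL;
        decimal_type = PyObject_GetAttrString(m, "Decimal");
        Py_DECREF(m);
    }
    return PyObject_CallFunction(decimal_type, "s", s);
}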

Some details about this in section 'Multiple Python Sub Interpreters'
of:

http://code.google.com/p/modwsgi/wiki/ApplicationIssues

That section of documentation also highlights some of the other errors
that can arise where file objects in particular are somehow shared
between interpreters, plus issues when unmarshalling data.

You might also read section 'Application Environment Variables' of
that document. This talks about the problem of leakage of environment
variables between sub interpreters. There probably isn't much that one
can do about it as one needs to push changes to os.environ into C
environment variables so various system library calls will get them,
but still quite annoying that the variables set in one interpreter
then show up in interpreters created after that point. It means that
environment variable separation for changes made unique to a sub
interpreter is impossible.
> - Py_EndInterpreter doesn't guarantee that all objects are released,
>   and may leak. This is the problem that the OP seems to have.
>   All it does is to clear modules, sys, builtins, and a few other
>   things; it is then up to reference counting and the cycle GC
>   whether this releases all memory or not.

There is another problem with deleting interpreters and then creating
new ones. This is where a C extension module doesn't hold its own
references to static Python objects it creates. When the interpreter is
destroyed and the objects that can be destroyed are destroyed, it may
destroy those objects which are referenced by the static variables.
When a subsequent interpreter is created which tries to use the same C
extension module, that static variable now contains a dangling invalid
pointer to unused or reused memory.
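
[A minimal sketch of that failure mode, with invented names; the fix is
shown in the comment:]

static PyObject *cached_default;   /* survives across interpreters */

static PyObject *set_default(PyObject *self, PyObject *args)
{
    PyObject *obj;
    if (!PyArg_ParseTuple(args, "O", &obj))
        return NULL;
    cached_default = obj;          /* BUG: borrows, never Py_INCREFs */
    Py_RETURN_NONE;
}

/* When Py_EndInterpreter tears everything down, obj can be freed, and
   the next interpreter to import this module dereferences freed memory
   through cached_default.  The fix is to own the reference:

       Py_XINCREF(obj);
       Py_XDECREF(cached_default);
       cached_default = obj;
*/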

PEP 3121 could help with this by making it more obvious what
requirements exist on C extension modules to cope with such issues.

I don't know whether it is a fundamental problem with the tool or how
people use it, but Pyrex-generated code seems to also do this. This
was showing up in PyProtocols in particular when attempts were made to
recycle interpreters within the lifetime of a process. Other packages
having the problem were psycopg2 again, lxml and possibly the Subversion
bindings. Some details on this can be found in section 'Reloading
Python Interpreters' of that document.

> - the mechanism of PEP 311 doesn't work for multiple interpreters.

Yep, and since SWIG defaults to using it, it means that SWIG-generated
code can't be used in anything but the main interpreter. The Subversion
bindings seem to have a lot of issues possibly related to this as
well. Some details on this can be found in section 'Python Simplified
GIL State API' of that document.
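
[To sketch why -- the callback and its Python code are hypothetical --
PEP 311's simplified GIL API always attaches the calling thread to the
main interpreter, which is exactly what SWIG-generated wrappers rely on:]

/* A typical PEP 311-style callback.  PyGILState_Ensure always binds the
   thread to the *main* interpreter's state, so any Python it runs sees
   the main interpreter's modules, not those of the sub interpreter the
   callback was logically made on behalf of. */
static void on_event(void)
{
    PyGILState_STATE gstate = PyGILState_Ensure();
    PyRun_SimpleString("handle_event()");   /* hypothetical handler */
    PyGILState_Release(gstate);
}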

> I have never used that feature myself. However, I wrote PEP 3121
> to overcome some of its limitations.

As well as the above there are a number of other issues as well. Ones
I can remember right now are as follows.

First is that one can't use different versions of a C extension module
in different sub interpreters. This is because the first one loaded
effectively gets priority. I am not even sure you get an error when
another interpreter tries to load a different version; it just assumes
the one already loaded is okay. This can mean one may get a set of
Python wrappers which doesn't match the C extension module. In other
words, C extension modules are global to the process and not local to a
sub interpreter.

I know I have talked about this one numerous times, but can't seem to
see where I cover it in the documentation I pointed at. I'll have to
make sure I add it if it isn't there.

Second issue is that when you call Py_EndInterpreter, it doesn't do
some of the stuff that would be done if it were the main interpreter.
The two main culprits are that it doesn't try to stop non-daemonised
Python threads and it doesn't call functions registered with the
atexit module.

One might argue that it shouldn't be calling atexit-registered
functions as the process isn't being shut down, but in Python such
functions are really called at the point the main interpreter is
being destroyed, not the process. As such, it may be appropriate that
such registered functions be called for a specific sub interpreter as
well, but obviously only for callbacks registered in that sub interpreter.

One of the reasons for calling atexit-registered functions for a sub
interpreter is to terminate daemonised threads. If one isn't able to
kill off daemonised threads created within a sub interpreter, then they
can keep running while and after the sub interpreter is being
destroyed. This could result in just a Python exception occurring in
that thread, causing it to exit, but it can also crash the whole
process.

To ensure proper cleanup of sub interpreters when they are destroyed,
and to allow hosted applications to do whatever they may want to do on
exit, I found it necessary to do these two things explicitly, when
possibly the Python internals should provide a means, even if
optional, to do it.
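
[A sketch of doing those two things explicitly, just before
Py_EndInterpreter. atexit._run_exitfuncs is a private CPython 2.x
function, so this is a workaround rather than a supported API, and
mod_wsgi's actual code differs:]

/* Run with the sub-interpreter's thread state current. */
static void shutdown_interpreter(PyThreadState *ts)
{
    PyThreadState_Swap(ts);
    PyRun_SimpleString(
        /* fire this interpreter's atexit callbacks */
        "try:\n"
        "    import atexit\n"
        "    atexit._run_exitfuncs()\n"
        "except Exception:\n"
        "    pass\n"
        /* wait for this interpreter's non-daemon threads */
        "try:\n"
        "    import threading\n"
        "    for t in threading.enumerate():\n"
        "        if t is not threading.currentThread() and not t.isDaemon():\n"
        "            t.join()\n"
        "except Exception:\n"
        "    pass\n");
    Py_EndInterpreter(ts);
}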

Anyway, have a read through that document as you might find a few
interesting things in there about the current problems. Some stuff
isn't necessarily documented, as the code for the package this relates
to just works around the issues so that everything works as one would
expect. For example, the atexit-registered functions being called for
sub interpreters.

In general what I have found is that as long as you are aware of the
limitations, multiple interpreters are still usable. The one thing I
would avoid is trying to recycle sub interpreters. Once they are
created, the only safe thing to do is to destroy them on process exit
and no sooner. Otherwise you get the issues the OP is seeing, but also
some of the issues I describe above.

Hope you find this and the referenced document interesting. :)

Graham
 

Martin v. Löwis

>> - objects can easily get shared across interpreters, and often are.
>
> Yep, but that is basically a problem with how people write C extension
> modules; i.e., they don't write them with multiple interpreters in mind.

I still consider it a bug in Python, and the multiple-interpreter
feature, not so much in the extension modules. Of course, they
may have bugs on top of that, but in general, they have no way
of cleaning up when an interpreter shuts down (until PEP 3121
gets implemented).

> Some details about this in section 'Multiple Python Sub Interpreters'
> of:
>
> http://code.google.com/p/modwsgi/wiki/ApplicationIssues

A common concern is that people think that the multiple-interpreters
feature is a security mechanism, i.e. works as a sandbox. Maybe that's
more a communication problem than an actual problem with the feature,
however, it can't be emphasized enough that the feature is *not*
a security mechanism: it is possible to get at all objects even of
"other" interpreters.

> You might also read section 'Application Environment Variables' of
> that document. This talks about the problem of leakage of environment
> variables between sub interpreters. There probably isn't much that one
> can do about it as one needs to push changes to os.environ into C
> environment variables so various system library calls will get them,
> but still quite annoying that the variables set in one interpreter
> then show up in interpreters created after that point. It means that
> environment variable separation for changes made unique to a sub
> interpreter is impossible.

That's not really true. You can't use os.environ for that, yes. However,
you can pass explicit environment dictionaries to, say, os.execve. If
some library relies on os.environ, you could hack around this aspect
and do

os.environ = dict(os.environ)

Then you can customize it. Of course, changes to this dictionary now
won't be reflected into the C library's environ, so you'll have to
use execve now (but you should do so anyway in a multi-threaded
application with changing environments).
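
[The same idea at the C level, as a minimal sketch: the environment
travels with the call instead of through the process-global environ.]

#include <unistd.h>

void spawn_with_env(void)
{
    char *argv[] = { "/usr/bin/env", NULL };
    char *envp[] = { "XYZ=ABC", NULL };   /* explicit, per-call environment */

    /* Replaces the current process image; in practice you would fork()
       first and call this in the child. */
    execve("/usr/bin/env", argv, envp);
}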

> There is another problem with deleting interpreters and then creating
> new ones. This is where a C extension module doesn't hold its own
> references to static Python objects it creates.

Right - that's a clear bug in the module, though. If the Python
documentation is not sufficiently clear about the requirement that
_every_ assignment to a PyObject* needs to be accompanied by a
Py_INCREF, feel free to contribute patches to make that more clear.

> I don't know whether it is a fundamental problem with the tool or how
> people use it, but Pyrex-generated code seems to also do this.

I've never used Pyrex myself, but I would be surprised if it really
had such a severe refcounting error.

> Yep, and since SWIG defaults to using it, it means that SWIG-generated
> code can't be used in anything but the main interpreter. The Subversion
> bindings seem to have a lot of issues possibly related to this as
> well.

Please understand that, when this PEP was written, this issue was
explicitly discussed, and developers explicitly agreed "the multi-
interpreters feature is broken, anyway, so don't let that issue
stop us from providing PEP 311".

> First is that one can't use different versions of a C extension module
> in different sub interpreters. This is because the first one loaded
> effectively gets priority.

That's not supposed to happen, AFAICT. The interpreter keeps track of
loaded extensions by file name, so if the different version lives in
a different file, that should work fine.

Are you using sys.setdlopenflags by any chance? Setting the flags
to RTLD_GLOBAL could have that effect; you'd get the init function
of the first module always. By default, Python uses RTLD_LOCAL,
so it should be able to keep the different versions apart (on
Unix with libdl; on Windows, symbol resolution is per-DLL anyway).

Kind regards,
Martin
 

Graham Dumpleton

> That's not really true. You can't use os.environ for that, yes.

Which bit isn't really true? When you do:

os.environ['XYZ'] = 'ABC'

this results in a corresponding call to:

putenv('XYZ=ABC')

as well as setting the value in the os.environ dictionary. os.environ
is an instance of os._Environ, which (on Unix) is defined roughly as:

class _Environ(UserDict.IterableUserDict):
    def __setitem__(self, key, item):
        putenv(key, item)
        self.data[key] = item

Because os.environ is set from the current copy of the C environ at the
time the sub interpreter is created, a sub interpreter created at a
later point will have XYZ show up in its os.environ.

> However, you can pass explicit environment dictionaries to, say,
> os.execve. If some library relies on os.environ, you could hack
> around this aspect and do
>
> os.environ = dict(os.environ)
>
> Then you can customize it. Of course, changes to this dictionary now
> won't be reflected into the C library's environ, so you'll have to
> use execve now (but you should do so anyway in a multi-threaded
> application with changing environments).

As a platform provider and not the person writing the application, I
can't really do it that way and effectively force people to change
their code to make it work. It also isn't just exec that is the issue,
as there are other system calls which can rely on the environment
variables.

The only half-reasonable solution I have ever been able to dream up is
that just prior to first initialising Python, a snapshot of the C
environment is taken, and as sub interpreters are created, os.environ
is replaced with a new instance of the _Environ wrapper which uses the
initial snapshot rather than the environment as it is at the time. At
least then each sub interpreter gets a clean copy of what existed when
the process first started.
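
[A sketch of that approach for CPython 2.x on Unix, where os._Environ
wraps a plain dict. os._Environ is private, so this is a workaround
rather than a supported API; error checking is omitted:]

#include <Python.h>
#include <stdlib.h>
#include <string.h>

extern char **environ;
static char **env_snapshot;

/* Call once, before Py_Initialize(): deep-copy the C environment. */
void snapshot_environ(void)
{
    int i, n = 0;
    while (environ[n])
        n++;
    env_snapshot = malloc((n + 1) * sizeof(char *));
    for (i = 0; i < n; i++)
        env_snapshot[i] = strdup(environ[i]);
    env_snapshot[n] = NULL;
}

/* Call with a newly created sub-interpreter's thread state current. */
void install_snapshot_environ(void)
{
    PyObject *os, *d, *env;
    char **e;

    os = PyImport_ImportModule("os");
    d = PyDict_New();
    for (e = env_snapshot; *e; e++) {          /* split "NAME=value" pairs */
        char *eq = strchr(*e, '=');
        PyObject *k, *v;
        if (eq == NULL)
            continue;
        k = PyString_FromStringAndSize(*e, eq - *e);
        v = PyString_FromString(eq + 1);
        PyDict_SetItem(d, k, v);
        Py_DECREF(k);
        Py_DECREF(v);
    }

    /* Wrap the clean snapshot in a fresh _Environ so putenv() still runs. */
    env = PyObject_CallMethod(os, "_Environ", "O", d);
    PyObject_SetAttrString(os, "environ", env);

    Py_XDECREF(env);
    Py_DECREF(d);
    Py_DECREF(os);
}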

Even this isn't really a solution though, as changes to os.environ by
sub interpreters still end up getting reflected in the C environment,
and so the C environment becomes an accumulation of settings from
different code sets with a potential for conflict at some point.

Luckily this issue hasn't presented itself as a big enough problem
at this point to really be concerned about.

> That's not supposed to happen, AFAICT. The interpreter keeps track of
> loaded extensions by file name, so if the different version lives in
> a different file, that should work fine.
>
> Are you using sys.setdlopenflags by any chance? Setting the flags
> to RTLD_GLOBAL could have that effect; you'd get the init function
> of the first module always. By default, Python uses RTLD_LOCAL,
> so it should be able to keep the different versions apart (on
> Unix with libdl; on Windows, symbol resolution is per-DLL anyway).

That may be true, but I have seen enough people raise strange problems
that I at least counsel people not to rely on being able to import
different versions in different sub interpreters.

The problems may well just fall into the other categories we have been
discussing. Within Apache at least, another source of problems which
can arise is that Apache, or other Apache modules (e.g. PHP), can
directly link to shared libraries which are then loaded at global
context. Even if a Python module tries to isolate itself, one can
still end up with conflicts between the version of a shared library
that the module may want to use and what something else has already
loaded. The loader scope doesn't always protect against this.

It is also always hard when you aren't yourself having the problem and
you are relying on others to try and debug their problem for you. More
often than not the amount of information they provide isn't that good
and even when you ask them to try specific things for you to test out
ideas, they don't. So often one can never uncover the true problem,
and it has thus become simpler to limit the source of potential
problems and just tell them to avoid doing it. :)

Graham
 

Martin v. Löwis

>> It means that...
>
> Which bit isn't really true?

The last sentence ("It means that...").

> When you do:
>
> os.environ['XYZ'] = 'ABC'
>
> this results in a corresponding call to:
>
> putenv('XYZ=ABC')

Generally true, but not when you did

os.environ = dict(os.environ)

Furthermore, you can make changes to environment variables
without changing os.environ, which does allow for environment
variable separation across subinterpreters.

> As a platform provider and not the person writing the application, I
> can't really do it that way and effectively force people to change
> their code to make it work. It also isn't just exec that is the issue,
> as there are other system calls which can rely on the environment
> variables.

Which system calls specifically?

> It is also always hard when you aren't yourself having the problem and
> you are relying on others to try and debug their problem for you. More
> often than not the amount of information they provide isn't that good
> and even when you ask them to try specific things for you to test out
> ideas, they don't. So often one can never uncover the true problem,
> and it has thus become simpler to limit the source of potential
> problems and just tell them to avoid doing it. :)

You do notice that my comment in that direction (avoid using multiple
interpreters) started that subthread, right :-?

Regards,
Martin
 

Graham Dumpleton

>> Which bit isn't really true?
>
> The last sentence ("It means that...").
>
>> When you do:
>>
>> os.environ['XYZ'] = 'ABC'
>>
>> this results in a corresponding call to:
>>
>> putenv('XYZ=ABC')
>
> Generally true, but not when you did
>
> os.environ = dict(os.environ)
>
> Furthermore, you can make changes to environment variables
> without changing os.environ, which does allow for environment
> variable separation across subinterpreters.
>
>> As a platform provider and not the person writing the application, I
>> can't really do it that way and effectively force people to change
>> their code to make it work. It also isn't just exec that is the issue,
>> as there are other system calls which can rely on the environment
>> variables.
>
> Which system calls specifically?

For a start, os.system(). The call itself may not rely on environment
variables, but users can expect environment variables they set in
os.environ to be inherited by the program they are running.

There would similarly be issues with use of popen2 module
functionality, because it doesn't provide a means of specifying a
caller-specific environment and just inherits that of the current
process.

Yes you could rewrite all these with execve in some way, but as I said
it isn't something you can really enforce on someone, especially when
they might be using a third party package which is doing it and it
isn't even their own code.

> You do notice that my comment in that direction (avoid using multiple
> interpreters) started that subthread, right :-?

I was talking about avoiding use of different versions of a C
extension module in different sub interpreters, not multiple sub
interpreters as a whole.

Graham
 

Bronner, Gregory

What objects need to be shared across interpreters?

My thought was to add an interpreter number to the PyThreadState structure, to increment it when Py_NewInterpreter is called, and to keep track of the interpreter that creates each object. On deletion, all memory belonging to these objects would be freed.

Thoughts?


Martin v. Löwis

> What objects need to be shared across interpreters?
>
> My thought was to add an interpreter number to the PyThreadState
> structure, to increment it when Py_NewInterpreter is called, and to
> keep track of the interpreter that creates each object. On deletion,
> all memory belonging to these objects would be freed.
>
> Thoughts?

That won't work, unless you make *massive* changes to Python.
There are many global objects that are shared across interpreters:
Py_None, Py_True, PyExc_ValueError, PyInt_Type, and so on. They
are just C globals, and there can be only a single one of them.

If you think you can fix that, start by changing Python so that
Py_None is per-interpreter, then continue with PyBaseObject_Type.
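
[For reference, this is roughly how CPython 2.x defines Py_None, in
Objects/object.c and Include/object.h (simplified): a single statically
allocated C global, shared by every interpreter in the process.]

/* Objects/object.c (simplified) */
PyObject _Py_NoneStruct = {
    _PyObject_EXTRA_INIT
    1, &PyNone_Type          /* reference count, type */
};

/* Include/object.h */
#define Py_None (&_Py_NoneStruct)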

Regards,
Martin
 

Rhamphoryncus

> The multi-interpreter feature has some limitations, but if you know
> what you are doing and your application can run within those
> limitations, then it works fine.

I've been wondering about this for a while. Given the severe
limitations of it, what are the use cases where multiple interpreters
do work? All I can think of is that it keeps separate copies of
loaded Python modules, but since you shouldn't be monkey-patching them
anyway, why should you care?
 

Bronner, Gregory

On the off chance that anyone is still following this:
I've got a relatively simple example of a program that loads 100
interpreters (sequentially) which all load the same SWIG module, do
something trivial, and exit.

Each cycle leaks (or loses) 132k, which is a significant hit -- in my
real program the hit is around 800k/interpreter.

I ran it through Purify (after rebuilding Python with the puremodule, no
pymalloc, no optimization, no threads, and debugging), and while the
results are somewhat ambiguous, it appears that Py_EndInterpreter isn't
cleaning up:

A) The site module
B) The builtins module

Is there some way to properly clean these up prior to the end of
Py_EndInterpreter? It seems to zap everything whether or not it can
correctly clean up the interpreter/modules, and after it runs, all
pointers have been destroyed.


Thanks






Martin v. Löwis

> Each cycle leaks (or loses) 132k, which is a significant hit -- in my
> real program the hit is around 800k/interpreter.
>
> I ran it through Purify (after rebuilding Python with the puremodule, no
> pymalloc, no optimization, no threads, and debugging), and while the
> results are somewhat ambiguous, it appears that Py_EndInterpreter isn't
> cleaning up:
>
> A) The site module
> B) The builtins module
>
> Is there some way to properly clean these up prior to the end of
> Py_EndInterpreter?

You might be misinterpreting what you are seeing. Can you provide
that test case so that others are able to reproduce your results?

I would guess that the error is in the SWIG module, not in the
cleanup of the site or builtins modules. They should clean up fine.
If there were a systematic error with the cleanup of these modules,
the presence of the SWIG module should be irrelevant.

Regards,
Martin
 
