what's the point of rpython?

Rhamphoryncus

I'm not sure whether multicore processors share a cache or, if not, have
some other on-chip mechanism. Multiprocessor machines, however, are a
different matter...

They share some, but also have some exclusive. How much of which
depends entirely on which CPU you have.
 
Paul Rubin

The cache coherency mechanism automatically prevents two or
more processors that have cached the same area of memory from
simultaneously modifying data in that area.

The same cache coherency mechanism that prevents ordinary "unlocked"
instructions from simultaneously modifying the same cache line on
two different processors also provides the guarantee with "locked"
instructions. There's no additional hardware locks involved, and no
additional communication required.

The cache coherency mechanism is what Scott described as
"communicat[ing] to other processors that it (and it alone) owns the
increment target". The cache coherency mechanism is not a trivial
thing at all. It introduces its own hazards and delays, and it is
getting more complicated all the time as processors and caches get
faster and larger. Some time ago, cpu's hit their megahertz limits
and that's why we're using multicores now. Some PL researchers think
cache coherency is going to be the next limit, and are advocating
languages like Erlang, which avoid use of shared memory and have
separate heaps per thread; or alternatively, approaches like the MS
Singularity research OS which relies on something like a linear type
system to statically ensure that a given object is accessible to only
one thread at a time. (That approach allows transferring objects
between threads with no locks or copying required).
 
Paul Rubin

Rhamphoryncus said:
a) The contended case is the issue, not the uncontended case. An
uncontended lock is just constant overhead, not a barrier to
scalability

a1) Really what matters is the actual mix between contended and
uncontended accesses, and the synchronization strategy affects the
amount of contention. For example, the traditional locking strategy
involves acquiring a lock before reading the object, so two
simultaneous read-only accesses would create lock contention. With
STM, only updates acquire a lock, so multiple read-only threads can
access the object simultaneously with no contention.
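The read side of that strategy can be illustrated without STM. The sketch below is a hypothetical minimal reader-writer lock in Python (no fairness or writer-priority handling, and the `RWLock` name is illustrative): any number of readers hold it concurrently, and only a writer needs exclusive access.

```python
import threading

class RWLock:
    """Minimal reader-writer lock: any number of readers may hold it
    at once, while a writer needs exclusive access.  A sketch only --
    no fairness or writer-priority handling."""

    def __init__(self):
        self._count_lock = threading.Lock()   # guards _readers
        self._write_lock = threading.Lock()   # held while readers or a writer are active
        self._readers = 0

    def acquire_read(self):
        with self._count_lock:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()    # first reader locks out writers

    def release_read(self):
        with self._count_lock:
            self._readers -= 1
            if self._readers == 0:
                self._write_lock.release()    # last reader readmits writers

    def acquire_write(self):
        self._write_lock.acquire()

    def release_write(self):
        self._write_lock.release()
```

Note that readers still touch `_count_lock` briefly, so "no contention" really means "no reader blocks for the duration of another reader's access", which is the property that matters for scalability.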

a2) Constant factors matter!!! If using a different mechanism makes
Python 2x faster, that's a very powerful reason to favor the other
mechanism.
b) Locks on linux (since the switch to futexes) are pretty close to
free when uncontended.

I thought futexes use a traditional locked read-modify-write
instruction which is 50-100 cycles, cheap compared to a system call
but quite expensive compared with a normal 1-cycle non-locked
operation.

The second issue has *nothing* to do with refcounts. It is the
objects themselves. A classic example is trying to do "i += 1", which
breaks down into "i = i + 1", which isn't an atomic operation.

Oh, I see what you mean; yes, you're right, it's best to avoid doing
those sorts of operations on shared objects too frequently.
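That non-atomicity is easy to see in CPython itself: the stdlib dis module shows that the augmented assignment compiles to separate load, add, and store instructions, and a thread switch can land between any of them.

```python
import dis

def bump(i):
    i += 1       # compiles to separate load / add / store steps
    return i

ops = [ins.opname for ins in dis.get_instructions(bump)]
print(ops)       # the load, the add, and the store appear as distinct opcodes
```

(The exact opcode names vary by Python version, e.g. INPLACE_ADD vs. BINARY_OP, but the load/add/store split is the same.)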
Java's ConcurrentHashMap gets this right.

I'll have to look that up, it sounds interesting. I've been staying
away from Java but a situation is coming up where I may have to (ugh)
start using it. Thanks.
 
Rhamphoryncus

a1) Really what matters is the actual mix between contended and
uncontended accesses, and the synchronization strategy affects the
amount of contention.  For example, the traditional locking strategy
involves acquiring a lock before reading the object, so two
simultaneous read-only accesses would create lock contention.  With
STM, only updates acquire a lock, so multiple read-only threads can
access the object simultaneously with no contention.  

Aye, but my point is really about the myth of lock-free algorithms
being uncontending — it's simply not true, and CAN'T be true. A write
is inherently a mutually exclusive operation. There's all sorts of
ways to avoid contending for reads, spread out the writes and have a
single thread coalesce them, etc, but fundamentally the write will
involve some mutual exclusion.
 
Aahz

The performance decrease is an artifact of CPython's rather primitive
storage management (reference counts in every object). This is
pervasive and can't really be removed. But a new implementation
(e.g. PyPy) can and should have a real garbage collector that doesn't
suffer from such effects.

CPython's "primitive" storage management has a lot to do with the
simplicity of interfacing CPython with external libraries. Any solution
that proposes to get rid of the GIL needs to address that.
 
Aahz

In fact all it does is talk to its cache. From the "Intel 64 and
IA-32 Architectures Software Developer's Manual, Volume 3A: System
Programming Guide, Part 1":

For the P6 and more recent processor families, if the area of
memory being locked during a LOCK operation is cached in the
processor that is performing the LOCK operation as write-back
memory and is completely contained in a cache line, the processor
may not assert the LOCK# signal on the bus. Instead, it will
modify the memory location internally and allow its cache
coherency mechanism to ensure that the operation is carried
out atomically. This operation is called "cache locking." The
cache coherency mechanism automatically prevents two or more
processors that have cached the same area of memory from
simultaneously modifying data in that area.

The same cache coherency mechanism that prevents ordinary "unlocked"
instructions from simultaneously modifying the same cache line on
two different processors also provides the guarantee with "locked"
instructions. There's no additional hardware locks involved, and no
additional communication required.

IIRC, it was Bruce Eckel I heard talking about discovering all kinds of
nasty Java thread bugs because cache coherency wasn't working the way the
Java developers thought it did.... This is apparently something very
difficult to get correct.
 
MRAB

Paul said:
The cache coherency mechanism automatically prevents two or more
processors that have cached the same area of memory from
simultaneously modifying data in that area.

The same cache coherency mechanism that prevents ordinary
"unlocked" instructions from simultaneously modifying the same
cache line on two different processors also provides the guarantee
with "locked" instructions. There's no additional hardware locks
involved, and no additional communication required.

The cache coherency mechanism is what Scott described as
"communicat[ing] to other processors that it (and it alone) owns the
increment target". The cache coherency mechanism is not a trivial
thing at all. It introduces its own hazards and delays, and it is
getting more complicated all the time as processors and caches get
faster and larger. Some time ago, cpu's hit their megahertz limits
and that's why we're using multicores now. Some PL researchers think
cache coherency is going to be the next limit, and are advocating
languages like Erlang, which avoid use of shared memory and have
separate heaps per thread; or alternatively, approaches like the MS
Singularity research OS which relies on something like a linear type
system to statically ensure that a given object is accessible to
only one thread at a time. (That approach allows transferring
objects between threads with no locks or copying required).

How much difference would it make if the reference counts weren't in
cached memory? I'm thinking that an object could have a pointer to its
reference count, which would be stored elsewhere in some uncached memory.
 
Ross Ridge

Ross Ridge said:
The same cache coherency mechanism that prevents ordinary "unlocked"
instructions from simultaneously modifying the same cache line on
two different processors also provides the guarantee with "locked"
instructions. There's no additional hardware locks involved, and no
additional communication required.

Paul Rubin said:
The cache coherency mechanism is what Scott described as
"communicat[ing] to other processors that it (and it alone) owns the
increment target". The cache coherency mechanism is not a trivial
thing at all. It introduces its own hazards and delays, and it is
getting more complicated all the time as processors and caches get
faster and larger.

*sigh* Could you please just read what I wrote above? The LOCK prefix
makes NO DIFFERENCE to anything you mentioned. When the current CPython
implementation increments a reference count, it doesn't use the LOCK
prefix, but the cache line modified by the instruction still has to be
owned by the processor. If it did use the LOCK prefix there would be
no change in how the cache coherency mechanism worked. There'd be
no additional complications, hazards or delays in how the processor
communicates with other processors.

Ross Ridge
 
Carl Banks

CPython's "primitive" storage management has a lot to do with the
simplicity of interfacing CPython with external libraries.  Any solution
that proposes to get rid of the GIL needs to address that.

I recently was on a long road trip, and was not the driver, and with
nothing better to do thought quite a bit about this.

I concluded that, aside from one major trap, it wouldn't really be
more difficult to interface Python to external libraries, just
differently difficult. Here is briefly what I came up with:

1. Change the singular Python type into three metatypes:
immutable_type, mutable_type, and mutable_dict_type. (In the latter
case, the object itself is immutable but the dict can be modified.
This, of course, would be the default metaclass in Python.) Only
mutable_types would require a mutex when accessing.

2. API wouldn't have to change much. All regular API would assume
that objects are unlocked (if mutable) and in a consistent state.
It'll lock any mutable objects it needs to access. There would also
be a low-level API that assumes the objects are locked (if mutable)
and does not require objects to be consistent. I imagine most
extensions would call the standard API most of the time.

3. If you are going to use the low-level API on a mutable object, or
are going to access the object structure directly, you need to acquire
the object's mutex. Macros such as Py_LOCK(), Py_LOCK2(), Py_UNLOCK()
would be provided.

4. Objects would have to define a method, to be called by the GC, that
marks every object it references. This would be a lot like the
current tp_visit, except it has to be defined for any object that
references another object, not just objects that can participate in
cycles. (A conservative garbage collector wouldn't suffice for Python
because Python quite often allocates blocks but sets the pointer to an
offset within the block. In fact, that's true of almost any Python-
defined type.) Unfortunately, references on the stack would need to
be registered as well, so "PyObject* p;" might have to be replaced
with something like "Py_DECLARE_REF(PyObject,p);" which magically
registers it. Ugly.

5. Py_INCREF and Py_DECREF are gone.

6. GIL is gone.

So, you gain the complexity of a two-level API, having to lock mutable
objects sometimes, and defining more visitor methods than before, but
you don't have to keep INCREFs and DECREFs straight, which is no small
thing.

The major trap is the possibility of deadlock. To help minimize the
risk there would be macros to lock multiple objects at once. Py_LOCK2
(a,b), which guarantees that if another thread is calling Py_LOCK2
(b,a) at the same time, it won't result in a deadlock. What's
disappointing is that the deadlocking possibility is always with you,
much like the reference counts are.
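A common way to obtain the Py_LOCK2(a,b) guarantee is to always acquire the two mutexes in a canonical order, for instance by address. Here is a sketch of that idea in Python (the lock2/unlock2 names are illustrative, not a proposed API):

```python
import threading

def lock2(a, b):
    # Acquire both locks in a canonical (id-based) order, so that
    # lock2(a, b) in one thread and lock2(b, a) in another always
    # take the locks in the same order and therefore cannot deadlock.
    first, second = (a, b) if id(a) <= id(b) else (b, a)
    first.acquire()
    second.acquire()

def unlock2(a, b):
    # Release order does not matter for deadlock avoidance.
    a.release()
    b.release()
```

Deadlock needs a cycle in the waits-for graph; imposing a single global acquisition order makes such cycles impossible, though it only helps when all lock sites go through the ordered primitive.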


Carl Banks
 
Paul Rubin

CPython's "primitive" storage management has a lot to do with the
simplicity of interfacing CPython with external libraries. Any solution
that proposes to get rid of the GIL needs to address that.

This, I don't understand. Other languages like Lisp and Java and
Haskell have foreign function interfaces that are easier to program than
Python's, -and- they don't use reference counts. There's usually some
primitive to protect objects from garbage collection while the foreign
function is using them, etc. The Java Native Interface (JNI) and the
Haskell FFI are pretty well documented. The Emacs Lisp system is not
too hard to figure out from examining the source code, etc.
 
Rhamphoryncus

I recently was on a long road trip, and was not the driver, and with
nothing better to do thought quite a bit about this.

I concluded that, aside from one major trap, it wouldn't really be
more difficult to interface Python to external libraries, just
differently difficult.  Here is briefly what I came up with:

1. Change the singular Python type into three metatypes:
immutable_type, mutable_type, and mutable_dict_type.  (In the latter
case, the object itself is immutable but the dict can be modified.
This, of course, would be the default metaclass in Python.)  Only
mutable_types would require a mutex when accessing.

2. API wouldn't have to change much.  All regular API would assume
that objects are unlocked (if mutable) and in a consistent state.
It'll lock any mutable objects it needs to access.  There would also
be a low-level API that assumes the objects are locked (if mutable)
and does not require objects to be consistent.  I imagine most
extensions would call the standard API most of the time.

3. If you are going to use the low-level API on a mutable object, or
are going to access the object structure directly, you need to acquire
the object's mutex. Macros such as Py_LOCK(), Py_LOCK2(), Py_UNLOCK()
would be provided.

4. Objects would have to define a method, to be called by the GC, that
marks every object it references.  This would be a lot like the
current tp_visit, except it has to be defined for any object that
references another object, not just objects that can participate in
cycles.  (A conservative garbage collector wouldn't suffice for Python
because Python quite often allocates blocks but sets the pointer to an
offset within the block.  In fact, that's true of almost any Python-
defined type.)  Unfortunately, references on the stack would need to
be registered as well, so "PyObject* p;" might have to be replaced
with something like "Py_DECLARE_REF(PyObject,p);" which magically
registers it.  Ugly.

5. Py_INCREF and Py_DECREF are gone.

6. GIL is gone.

So, you gain the complexity of a two-level API, having to lock mutable
objects sometimes, and defining more visitor methods than before, but
you don't have to keep INCREFs and DECREFs straight, which is no small
thing.

The major trap is the possibility of deadlock.  To help minimize the
risk there would be macros to lock multiple objects at once.  Py_LOCK2
(a,b), which guarantees that if another thread is calling Py_LOCK2
(b,a) at the same time, it won't result in a deadlock.  What's
disappointing is that the deadlocking possibility is always with you,
much like the reference counts are.

IMO, locking of the object is a secondary problem. Python-safethread
provides one solution, but it's not the only conceivable one. For the
sake of discussion it's easier to assume somebody else is solving it
for you.

Instead, focus on just the garbage collection. What are the practical
issues of modifying CPython to use a tracing GC throughout? It
certainly is possible to write an exact GC in C, but the stack
manipulation would be hideous. It'd also require significant rewrites
of the entire code base. On top of that, the performance impact is
unclear (it could be far worse for a single-threaded program), with no
straightforward way to make it a compile-time option.

Got any ideas for that?
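For what it's worth, CPython already pairs reference counting with a small tracing collector for the one case pure refcounting can't handle, reference cycles; the stdlib gc module shows that tracing side in isolation:

```python
import gc

class Node:
    pass

a, b = Node(), Node()
a.other, b.other = b, a    # reference cycle: the refcounts can never drop to zero
del a, b                   # pure reference counting would leak both objects
found = gc.collect()       # the tracing cycle detector finds and frees them
```

The question above is essentially whether that tracing machinery could take over the whole job, rather than just cycles.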
 
Kay Schluehr

Whatever sufficiently sophisticated topic was initially discussed, it
all ends up in a request for removing reference counting and the
GIL.
 
Chris Rebert

Whatever sufficiently sophisticated topic was initially discussed, it
all ends up in a request for removing reference counting and the
GIL.

+1 QOTW

- Chris
 
Carl Banks

On Jan 22, 6:00 am, (e-mail address removed) (Aahz) wrote:
IMO, locking of the object is a secondary problem.  Python-safethread
provides one solution, but it's not the only conceivable one.  For the
sake of discussion it's easier to assume somebody else is solving it
for you.

That assumption might be good for the sake of the discussion *you*
want to have, but it's not for the discussion I was having, which was to
address Aahz's claim that GIL makes extension writing simple by
presenting a vision of what Python might be like if it had a mark-and-
sweep collector. The details of the GC are a small part of that and
wouldn't affect my main point even if they are quite different than I
described. Also, extension writers would have to worry about locking
issues here, so it's not acceptable to assume somebody else will solve
that problem.

Instead, focus on just the garbage collection.
[snip rest of threadjack]

You can ignore most of what I was talking about and focus on
technicalities of garbage collection if you want to. I will not be
joining you in that discussion, however.


Carl Banks
 
Paul Rubin

Carl Banks said:
3. If you are going to use the low-level API on a mutable object, or
are going to access the object structure directly, you need to acquire
the object's mutex. Macros such as Py_LOCK(), Py_LOCK2(), Py_UNLOCK()
would be provided.

You mean every time you access a list or dictionary or class instance,
you have to acquire a mutex? That sounds like a horrible slowdown.
 
Carl Banks

You mean every time you access a list or dictionary or class instance,
you have to acquire a mutex?  That sounds like a horrible slowdown.

Yes, and it's never going to happen in CPython any other way. It's
considered a bug if Python code can segfault the interpreter; all
runtime errors are supposed to raise exceptions. The only way to
ensure that won't happen is to make sure that only one thread can
access the internals of a mutable object at a time.

BTW, class instances are usually immutable and thus don't require a
mutex in the system I described.


Carl Banks
 
Philip Semanchuk

Whatever sufficiently sophisticated topic was the initially discussed
it ends all up in a request for removing reference counting and the
GIL.

Is this a variant of Godwin's Law for Python?
 
Kay Schluehr

Is this a variant of Godwin's Law for Python?

Definitely. It's a stable fixed-point attractor. No matter how often
it has been discussed to death in past months, the likelihood that
someone mentions the GIL or ref-counting approaches 1. This is
particularly remarkable because it is inversely proportional to the
observable activity in this domain, so there is really no news.

Another similarly strange phenomenon: whenever Xah Lee posts one of his
infamous rants, it attracts at least a dozen newsgroup readers who
try to persuade each other not to respond, which inevitably grows
his thread and keeps it alive for a long time.
 
Steve Holden

Paul said:
You mean every time you access a list or dictionary or class instance,
you have to acquire a mutex? That sounds like a horrible slowdown.

Indeed it would, but hey, let's not let that stop us repeating the
thinking that's gone into CPython over the last fifteen years. "Those
who cannot remember the past are condemned to repeat it".

regards
Steve
 
