Re: 2.6, 3.0, and truly independent interpreters
- From: Rhamphoryncus <rhamph@xxxxxxxxx>
- Date: Fri, 24 Oct 2008 14:16:50 -0700 (PDT)
On Oct 24, 3:02 pm, Glenn Linderman <v+pyt...@xxxxxxxxxxxx> wrote:
On approximately 10/23/2008 2:24 PM, came the following characters from the
keyboard of Rhamphoryncus:
On Oct 23, 11:30 am, Glenn Linderman <v+pyt...@xxxxxxxxxxxx> wrote:
On approximately 10/23/2008 12:24 AM, came the following characters from
the keyboard of Christian Heimes
Andy wrote:
I'm very - not absolutely, but very - sure that Guido and the initial
designers of Python would have added the GIL anyway. The GIL makes
Python faster on single core machines and more stable on multi core
machines.
Actually, the GIL doesn't make Python faster; it is a design decision that
reduces the overhead of lock acquisition, while still allowing use of global
variables.
Using finer-grained locks has higher run-time cost; eliminating the use of
global variables has a higher programmer-time cost, but would actually run
faster and more concurrently than using a GIL. Especially on a
multi-core/multi-CPU machine.
Those "globals" include classes, modules, and functions. You can't
have *any* objects shared. Your interpreters are entirely isolated,
much like processes (and we all start wondering why you don't use
processes in the first place.)
Or use safethread. It imposes safe semantics on shared objects, so
you can keep your global classes, modules, and functions. Still need
garbage collection though, and on CPython that means refcounting and
the GIL.
Another peeve I have is his characterization of the observer pattern.
The generalized form of the problem exists in both single-threaded
sequential programs, in the form of unexpected reentrancy, and message
passing, with infinite CPU usage or infinite number of pending
messages.
So how do you get reentrancy in a single-threaded sequential program? I
think only via recursion? Which isn't a serious issue for the observer
pattern. If you add interrupts, then your program is no longer sequential.
Sorry, I meant recursion. Why isn't it a serious issue for
single-threaded programs? Just the fact that it's much easier to
handle when it does happen?
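To make the recursion case concrete, here is a minimal sketch of unexpected reentrancy in a single-threaded observer setup. The `Subject` class and the `clamp` observer are hypothetical names invented for illustration; they are not from any library mentioned in this thread.

```python
class Subject:
    """Hypothetical observable value that notifies observers on every change."""

    def __init__(self):
        self.value = 0
        self.observers = []

    def set(self, value):
        self.value = value
        # Iterate over a copy so observers may register others safely.
        for observer in list(self.observers):
            observer(self, value)


def clamp(subject, value):
    # An observer that "corrects" the value by calling back into the subject.
    # This re-enters Subject.set() while the outer call is still notifying.
    if value > 10:
        subject.set(10)


seen = []  # notifications received by the second observer, in order

subject = Subject()
subject.observers.append(clamp)
subject.observers.append(lambda s, v: seen.append(v))

subject.set(42)
print(subject.value, seen)
```

The second observer hears about 10 first and 42 afterwards, so the last notification it receives is stale even though no threads or interrupts are involved.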
Try looking at it on another level: when your CPU wants to read from a
bit of memory controlled by another CPU, it sends that CPU a message
requesting the data. The other CPU sends back a message containing
that memory. It also notes that we have it, in case it wants to modify
it later. We also note where we got it, in case we want to modify it
(and not wait for the other CPU to do modifications for us).
I understand that level... one of my degrees is in EE, and I started college
wanting to design computers (at about the time the first microprocessor chip
came along, and they, of course, have now taken over). But I was side-lined
by the malleability of software, and have mostly practiced software during
my career.
Anyway, that is the level that Herb Sutter was describing in the Dr Dobbs
articles I mentioned. And the overhead of doing that at the level of a cache
line is high, if there is lots of contention for particular memory locations
between threads running on different cores/CPUs. So to achieve concurrency,
you must not only limit explicit software locks, but must also avoid memory
layouts where data needed by different cores/CPUs are in the same cache
line.
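The layout point can be sketched with plain address arithmetic. This assumes a 64-byte cache line, which is common but hardware-specific; the helper names below are made up for illustration, and Python itself gives no control over where objects land in memory.

```python
CACHE_LINE_BYTES = 64  # assumption: typical size, not guaranteed on any given CPU


def cache_line_index(address, line_bytes=CACHE_LINE_BYTES):
    """Return which cache line a byte address falls into."""
    return address // line_bytes


def share_cache_line(addr_a, addr_b, line_bytes=CACHE_LINE_BYTES):
    """True if two byte addresses would contend for the same cache line."""
    return cache_line_index(addr_a, line_bytes) == cache_line_index(addr_b, line_bytes)


def padded_offsets(count, item_bytes, line_bytes=CACHE_LINE_BYTES):
    """Offsets for `count` items, each starting on its own cache line."""
    stride = -(-item_bytes // line_bytes) * line_bytes  # round up to a line multiple
    return [i * stride for i in range(count)]


# Four 8-byte per-thread counters packed back to back: all in one line.
packed = [i * 8 for i in range(4)]
# The same counters padded so each core touches a different line.
padded = padded_offsets(4, 8)

print(share_cache_line(packed[0], packed[3]))
print(share_cache_line(padded[0], padded[1]))
```

Padding trades memory for the absence of cross-core invalidation traffic; whether that trade pays off depends on how often the counters are written.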
I suspect they'll end up redesigning the caching to use a size and
alignment of 64 bits (or smaller). Same cache line size, but with
masking.
You still need to minimize contention of course, but that should at
least be more predictable. Having two unrelated mallocs contend could
suck.
Message passing vs shared memory isn't really a yes/no question. It's
about ratios, usage patterns, and tradeoffs. *All* programs will
share data, but in what way? If it's just the code itself you can
move the cache validation into software and simplify the CPU, making
it faster. If the shared data is a lot more than that, and you use it
to coordinate accesses, then it'll be faster to have it in hardware.
I agree there are tradeoffs... unfortunately, the hardware architectures
vary, and the languages don't generally understand the hardware. So then it
becomes an OS API, which adds the overhead of an OS API call to the cost of
the synchronization... It could instead be (and in clever applications is) a
non-portable assembly-level function that wraps an OS locking or waiting
API.
In practice I highly doubt we'll see anything that doesn't extend
traditional threading (posix threads, whatever MS has, etc).
Nonetheless, while putting the shared data accesses in hardware might be
more efficient per unit operation, there are still tradeoffs: A software
solution can group multiple accesses under a single lock acquisition; the
hardware probably doesn't have enough smarts to do that. So it may well
require many more hardware unit operations for the same overall concurrently
executed function, and the resulting performance may not be any better.
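As a rough illustration of grouping accesses under one lock acquisition, consider the sketch below. The `Counter` class and its `acquisitions` field are hypothetical bookkeeping added only to make the difference visible; it uses the standard `threading.Lock`.

```python
import threading


class Counter:
    """Hypothetical shared counter that records how often its lock is taken."""

    def __init__(self):
        self.lock = threading.Lock()
        self.total = 0
        self.acquisitions = 0  # illustration only, not something real code needs

    def add_each(self, values):
        # One lock acquisition per access, like a per-operation hardware primitive.
        for v in values:
            with self.lock:
                self.acquisitions += 1
                self.total += v

    def add_batch(self, values):
        # One lock acquisition covering the whole group of accesses.
        with self.lock:
            self.acquisitions += 1
            for v in values:
                self.total += v


fine = Counter()
fine.add_each(range(1000))

coarse = Counter()
coarse.add_batch(range(1000))

print(fine.total, fine.acquisitions)
print(coarse.total, coarse.acquisitions)
```

Both versions compute the same total; the batched one pays the synchronization cost once, at the price of holding the lock longer.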
Speculative ll/sc? ;)
Sidestepping the whole issue, by minimizing shared data in the application
design, avoiding not only software lock calls but also hardware cache
contention, is going to provide the best performance... it isn't the things
you do efficiently that make software fast; it is the things you don't do
at all.
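A minimal sketch of that design style: each worker accumulates into a local variable and publishes one result at the end, so no lock is taken during the work. The function and variable names are made up for illustration. On CPython the GIL still serializes the bytecode, so this shows the structure rather than a measured speedup.

```python
import threading


def partial_sum(chunk, results, index):
    # All intermediate state is local to this worker; nothing is shared.
    local = 0
    for v in chunk:
        local += v
    # Single write at the end, into a slot no other worker touches.
    results[index] = local


data = list(range(1000))
workers = 4
chunks = [data[i::workers] for i in range(workers)]  # disjoint slices
results = [0] * workers

threads = [
    threading.Thread(target=partial_sum, args=(chunk, results, i))
    for i, chunk in enumerate(chunks)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(results)  # the only point where the partial results are combined
print(total)
```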
Minimizing contention, certainly. Minimizing the shared data itself
is iffier though.