Re: 2.6, 3.0, and truly independent interpreters



On Oct 24, 3:02 pm, Glenn Linderman <v+pyt...@xxxxxxxxxxxx> wrote:
On approximately 10/23/2008 2:24 PM, came the following characters from the
keyboard of Rhamphoryncus:

On Oct 23, 11:30 am, Glenn Linderman <v+pyt...@xxxxxxxxxxxx> wrote:


On approximately 10/23/2008 12:24 AM, came the following characters from
the keyboard of Christian Heimes:

Andy wrote:
I'm very - not absolute, but very - sure that Guido and the initial
designers of Python would have added the GIL anyway. The GIL makes
Python faster on single core machines and more stable on multi core
machines.

Actually, the GIL doesn't make Python faster; it is a design decision that
reduces the overhead of lock acquisition, while still allowing use of global
variables.

Using finer-grained locks has higher run-time cost; eliminating the use of
global variables has a higher programmer-time cost, but would actually run
faster and more concurrently than using a GIL. Especially on a
multi-core/multi-CPU machine.

Those "globals" include classes, modules, and functions. You can't
have *any* objects shared. Your interpreters are entirely isolated,
much like processes (and we all start wondering why you don't use
processes in the first place.)

Or use safethread. It imposes safe semantics on shared objects, so
you can keep your global classes, modules, and functions. Still need
garbage collection though, and on CPython that means refcounting and
the GIL.


Another peeve I have is his characterization of the observer pattern.
The generalized form of the problem exists in both single-threaded
sequential programs, in the form of unexpected reentrancy, and message
passing, with infinite CPU usage or infinite number of pending
messages.


So how do you get reentrancy in a single-threaded sequential program? I
think only via recursion? Which isn't a serious issue for the observer
pattern. If you add interrupts, then your program is no longer sequential.

Sorry, I meant recursion. Why isn't it a serious issue for
single-threaded programs? Just the fact that it's much easier to
handle when it does happen?
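
To make the recursion case concrete, here is a minimal sketch of unexpected reentrancy in a purely single-threaded observer setup. The names (Subject, clamp, record) are hypothetical and only for illustration; nothing here comes from the articles under discussion.

```python
class Subject:
    """Minimal observer-pattern subject (hypothetical, for illustration)."""

    def __init__(self):
        self.value = 0
        self.observers = []
        self.log = []

    def set_value(self, value):
        self.value = value
        # Observers are notified synchronously, so any of them may call
        # set_value() again before this loop has finished.
        for callback in list(self.observers):
            callback(self)
        self.log.append(("done", value))


seen = []


def clamp(subject):
    # An observer that "corrects" the subject, re-entering set_value().
    if subject.value > 10:
        subject.set_value(10)


def record(subject):
    # An observer that only records the value it is told about.
    seen.append(subject.value)


subject = Subject()
subject.observers += [clamp, record]
subject.set_value(42)

print(seen)         # [10, 10]
print(subject.log)  # [('done', 10), ('done', 42)]
```

The record observer is notified twice and never sees 42, and the outer call finishes after the nested one. No threads are involved; the recursion alone is enough to surprise an observer that assumes one notification per change.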


Try looking at it on another level: when your CPU wants to read from a
bit of memory controlled by another CPU it sends them a message
requesting they get it for us. They send back a message containing
that memory. They also note we have it, in case they want to modify
it later. We also note where we got it, in case we want to modify it
(and not wait for them to do modifications for us).


I understand that level... one of my degrees is in EE, and I started college
wanting to design computers (at about the time the first microprocessor chip
came along, and they, of course, have now taken over). But I was side-lined
by the malleability of software, and have mostly practiced software during
my career.

Anyway, that is the level that Herb Sutter was describing in the Dr Dobbs
articles I mentioned. And the overhead of doing that at the level of a cache
line is high, if there is lots of contention for particular memory locations
between threads running on different cores/CPUs. So to achieve concurrency,
you must not only limit explicit software locks, but must also avoid memory
layouts where data needed by different cores/CPUs are in the same cache
line.
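
As a rough illustration of the layout point, the sketch below uses ctypes to show where two counters land relative to an assumed 64-byte cache line. The line size, the struct names, and the line_index helper are all assumptions for illustration, and the result only holds if the struct itself starts on a line boundary.

```python
import ctypes

CACHE_LINE = 64  # assumed line size in bytes; real hardware varies


class PackedCounters(ctypes.Structure):
    # Two counters side by side: both fall in the same cache line.
    _fields_ = [("a", ctypes.c_int64), ("b", ctypes.c_int64)]


class PaddedCounters(ctypes.Structure):
    # Padding pushes the second counter onto the next cache line.
    _fields_ = [
        ("a", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_int64))),
        ("b", ctypes.c_int64),
    ]


def line_index(struct_type, field):
    """Cache-line index of a field, relative to the start of the struct."""
    return getattr(struct_type, field).offset // CACHE_LINE


packed_lines = (line_index(PackedCounters, "a"), line_index(PackedCounters, "b"))
padded_lines = (line_index(PaddedCounters, "a"), line_index(PaddedCounters, "b"))

print(packed_lines)  # (0, 0): a and b share a line
print(padded_lines)  # (0, 1): a and b sit on different lines
```

If one core updates a while another updates b, the packed layout makes them contend for the same line even though the data is logically unrelated; the padded layout trades memory for avoiding that.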

I suspect they'll end up redesigning the caching to use a size and
alignment of 64 bits (or smaller). Same cache line size, but with
masking.

You still need to minimize contention of course, but that should at
least be more predictable. Having two unrelated mallocs contend could
suck.


Message passing vs shared memory isn't really a yes/no question. It's
about ratios, usage patterns, and tradeoffs. *All* programs will
share data, but in what way? If it's just the code itself you can
move the cache validation into software and simplify the CPU, making
it faster. If the shared data is a lot more than that, and you use it
to coordinate accesses, then it'll be faster to have it in hardware.


I agree there are tradeoffs... unfortunately, the hardware architectures
vary, and the languages don't generally understand the hardware. So then it
becomes an OS API, which adds the overhead of an OS API call to the cost of
the synchronization... It could instead be (and in clever applications is) a
non-portable assembly level function that wraps an OS locking or waiting
API.

In practice I highly doubt we'll see anything that doesn't extend
traditional threading (posix threads, whatever MS has, etc).


Nonetheless, while putting the shared data accesses in hardware might be
more efficient per unit operation, there are still tradeoffs: A software
solution can group multiple accesses under a single lock acquisition; the
hardware probably doesn't have enough smarts to do that. So it may well
require many more hardware unit operations for the same overall concurrently
executed function, and the resulting performance may not be any better.
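
A small sketch of the grouping argument, assuming a hypothetical CountingLock wrapper around threading.Lock so the number of acquisitions can be observed. It counts lock operations only; it says nothing about actual hardware cost.

```python
import threading


class CountingLock:
    """Wraps threading.Lock and counts acquisitions (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1  # safe: the lock is held at this point
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False


def add_each(total, lock, items):
    # One lock acquisition per shared access.
    for x in items:
        with lock:
            total[0] += x


def add_batch(total, lock, items):
    # Many shared accesses grouped under a single acquisition.
    with lock:
        for x in items:
            total[0] += x


def run(worker, n_threads=4, n_items=1000):
    total, lock = [0], CountingLock()
    threads = [
        threading.Thread(target=worker, args=(total, lock, range(n_items)))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0], lock.acquisitions


print(run(add_each))   # (1998000, 4000)
print(run(add_batch))  # (1998000, 4)
```

Both versions compute the same total; the batched one does it with 4 acquisitions instead of 4000, at the price of holding the lock longer each time.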

Speculative ll/sc? ;)


Sidestepping the whole issue, by minimizing shared data in the application
design, avoiding not only software lock calls but also hardware cache
contention, is going to provide the best performance... it isn't the things
you do efficiently that make software fast; it is the things you don't do
at all.
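
One way to picture the "don't share at all" design is the sketch below: each worker touches only its own chunk and local variables, and the partial results are combined once at the end. The function names are hypothetical, and under CPython's GIL this shows the structure only, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    # Only locals are touched here: no lock and no shared accumulator.
    subtotal = 0
    for x in chunk:
        subtotal += x
    return subtotal


def parallel_sum(items, n_workers=4):
    # Split the input into one private chunk per worker.
    chunks = [items[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # The only combining step happens once, in the calling thread.
        return sum(pool.map(partial_sum, chunks))


print(parallel_sum(list(range(1000))))  # 499500
```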

Minimizing contention, certainly. Minimizing the shared data itself
is iffier though.