Re: Lock-free reference counting

From: Jon Harrop <j...@xxxxxxxxxxxxxxxxx>
Incrementing and decrementing scattered memory locations is the
main reason reference counting is so slow (i.e. cache incoherence).

Suppose you have an extra thread dedicated to updating reference
counts. All the other processes pass signals/messages/events to it
requesting such-and-such reference count to be adjusted up or down.
The rc-thread then organizes these requests according to which page
of memory they apply to, and executes them in correct sequence in
large blocks of such adjustments per page, i.e. doing a whole bunch
of RC adjustments on this one page, then a whole bunch on this
other page, etc. To improve locality of reference when going from
the requested events to per-page events, several whole pages of
requests, still in original sequence, would be queued up in
consecutive memory locations. Then, in a batch, the RC thread
would stable-sort the data into target-location sequence using a
stable-sort algorithm that has good locality of reference but
guaranteed O(n log n) time, such as my or Knuth's in-place array
merge-sort algorithms. Then a sequential pass through this sorted
array would cancel adjacent opposites, leaving only the net
adjustments as a much smaller sorted array. The return value from
this sequential pass would be a priority list showing how many
updates are needed for each page of heap. Priority might be
determined by the largest number of heap objects needing
adjustment, or by the largest number of *downward* adjustments,
because those are the most likely to drive a reference count to
zero and hence reclaim memory immediately.
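The sort-and-cancel pass might look like this (a minimal sketch;
the struct and function names are mine, not from any real
collector, and std::stable_sort stands in for the in-place
merge-sort mentioned above):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One queued request: adjust the reference count of the object at
// address `addr` by `delta` (+1 or -1).
struct RcRequest {
    std::uintptr_t addr;
    int delta;
};

// Stable-sort a batch of requests by target address, then merge
// adjacent entries for the same address, cancelling opposites.
// The result is the much smaller list of *net* adjustments, still
// sorted by address (and therefore grouped page by page).
std::vector<RcRequest> net_adjustments(std::vector<RcRequest> batch) {
    std::stable_sort(batch.begin(), batch.end(),
                     [](const RcRequest& a, const RcRequest& b) {
                         return a.addr < b.addr;
                     });
    std::vector<RcRequest> net;
    for (const RcRequest& r : batch) {
        if (!net.empty() && net.back().addr == r.addr) {
            net.back().delta += r.delta;
            if (net.back().delta == 0)
                net.pop_back();          // opposites cancel out
        } else {
            net.push_back(r);
        }
    }
    return net;
}
```

Net effect: a batch of raw +1/-1 requests collapses to a short,
address-sorted list of net deltas, which is exactly what a
per-page update pass wants to consume.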

In this way you need only a single reference count for each heap
object shared by multiple threads, any thread can request
adjustment of a reference count at any time without waiting for a
lock, and there's very little added contention for swapping of
pages, so the scheme is unlikely to induce thrashing.

If there's a global clock accessible to each process instantly,
then each request for RC adjustment can be timestamped, and when
the RC thread is processing such requests it can make sure never
to process any adjustment request until that clock time has
already passed, thereby avoiding accidentally processing a later
decrement before an earlier increment and mistaking an RC to have
reached zero when in fact it should still be positive. To make
sure a resource doesn't remain active indefinitely after its RC
has already reached zero, just because some other pages of memory
have more adjustments hence higher priority for being processed,
the timestamp of the oldest request for each page could be part
of the priority calculation. For example, the number of decrement
requests within each page of heap could usually determine the
priority of processing, except that any page with a
still-unprocessed decrement request older than ten milliseconds
could automatically get urgent priority.
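The priority rule with the ten-millisecond escape hatch could be
sketched like this (all names here are hypothetical bookkeeping
of mine, not from any real system):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

using Clock = std::chrono::steady_clock;

// Per-page bookkeeping kept by the RC thread.
struct PageQueue {
    int pending_decrements;    // decrements waiting for this page
    Clock::time_point oldest;  // timestamp of oldest unprocessed request
};

// Higher value = process sooner.  The decrement count usually
// decides, but any page whose oldest request has waited more than
// ten milliseconds jumps to the front of the line.
std::int64_t page_priority(const PageQueue& page, Clock::time_point now) {
    if (now - page.oldest > std::chrono::milliseconds(10))
        return std::numeric_limits<std::int64_t>::max();  // urgent
    return page.pending_decrements;
}
```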

Have I addressed all the potential bugs in my idea, or did I
overlook something?

Of course a reference-count system would be useful only for
applications where long-lived non-permanent data is guaranteed not
to contain any long-lived pointer loops, so that eventually all
loops within that data will be gone and hence the reference count
will then be *correct* and hence allow reclaiming any
no-longer-accessible data. Thus pointer loops would be allowed for
efficiency only locally within an algorithm, and such loops must be
broken up before returning to the caller. Note carefully the
distinction between the requirement (in C) of explicitly calling
free() on any object before discarding the last pointer to it,
and my requirement here of simply breaking up pointer loops
before discarding the last pointer. The former causes a bug if
the user has been using a debugger to look at the object and
wants to continue looking at it, but is now instead looking at
unallocated heap, or even worse, at memory that has been recycled
for some other use. With the latter, if the user is using a
debugger to hold onto the object, it'll just look different with
the loops removed, but won't be unallocated heap or recycled
memory.
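Here's a tiny illustration of that discipline using standard C++
std::shared_ptr (the names are mine): an algorithm links two
nodes into a loop for its own convenience, then breaks the loop
before handing the structure back, so plain reference counting
can reclaim everything once the caller drops its last pointer:

```cpp
#include <cassert>
#include <memory>

// A node that an algorithm may link into a cycle temporarily.
struct Node {
    std::shared_ptr<Node> next;
};

// Build a two-node loop a -> b -> a for local convenience, then
// break the loop before returning, so that the caller's eventual
// drop of the last pointer reclaims both nodes by plain RC.
std::shared_ptr<Node> build_without_leaking() {
    std::shared_ptr<Node> a = std::make_shared<Node>();
    std::shared_ptr<Node> b = std::make_shared<Node>();
    a->next = b;
    b->next = a;      // temporary loop, fine while working locally
    // ... the algorithm would use the loop here ...
    b->next.reset();  // break the loop before returning to caller
    return a;
}
```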

What do you mean? The object is destroyed immediately when the last
reference to it dies.

Which is actually moot for some purposes:
- An object has a reference to a system/network resource, and
destroying the object involves closing the system/network
resource, and:
- It costs money, or interferes with other access, to keep the
resource alive, so it's nice to close it as soon as possible
to minimize such cost.
- It *must* be closed before some *other* resource is closed,
such as when moving data from one place to another you really
want to close the output stream and make sure the data is
fully written to output device in a permanently stable way
*before* even starting to delete the input file.
The second sub-case is best handled by a call-back or event
signalling when the output really has already been safely written,
which is then joined with other events relating to the main
computing process being finished and ready to delete the input
file, after which the delete task may then be initiated. The
first sub-case can be handled by a concept of "urgency" for some
tasks, whereby the urgent task has higher priority (whenever it
isn't blocked, such as waiting for a disk sector to reach a head,
or for a disk already busy with another process's request) and
hence really does finish as soon as reasonably possible.
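A sketch of the second sub-case's ordering requirement (the
function name is mine; note that a real implementation would also
fsync(2) the output descriptor on POSIX before trusting the data
to be permanently stable, since flushing and closing only empties
the user-space buffers):

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

// Copy `src` to `dst`, making sure the output is fully flushed
// and closed *before* the input file is deleted.  Deleting the
// input only happens after the output streams' destructors have
// run, never concurrently with the write.
bool move_via_copy(const std::string& src, const std::string& dst) {
    {
        std::ifstream in(src, std::ios::binary);
        std::ofstream out(dst, std::ios::binary);
        if (!in || !out) return false;
        out << in.rdbuf();
        if (!out.flush()) return false;
    }   // both streams closed here, before any deletion below
    return std::filesystem::remove(src);  // delete input afterwards
}
```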

By "moot" I mean that it logically *can't* be destroyed
immediately, because it *must* wait for the system/network resource
to be completely closed. The key is that it be closed (and object
destroyed) as soon as possible (first sub-case) or that we *know*
for sure when it has been closed (and object destroyed) before
proceeding with another task. By saying you want it destroyed
"immediately", you are making up a strawman requirement that isn't
actually possible and isn't necessary to achieve the desired
protection against misbehaviour.

That is a common misconception. Counts are decremented in
destructors when smart pointers fall out of scope but falling out
of scope is not the same as the last point of reference. For
example, this keeps "a" alive for the duration of the call to "g"
even though "a" is unreferenced:
ptr<int> a = ...;
// No references to "a" exist past here
g(); // Presumably this takes some time to run, causing a problem?
// Count of "a" is decremented here and "a" is freed.

Change that code to read as follows:
ptr<int> a = ...;
// No references to "a" exist past here
a = NULL; // Bug fixed by REM, 2008.Nov.23 12:53 PST, that will be $5 please
// Count of "a" is decremented here and "a" is freed.
g(); // Presumably this takes some time to run, no longer a problem?
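For comparison, the same early release works with standard C++
std::shared_ptr, where reset() plays the role of the a = NULL
assignment (the helper names here are mine, and g is a stand-in
for the long-running call above):

```cpp
#include <cassert>
#include <memory>

// Stand-in for the long-running call in the example above.
void g() { /* presumably takes some time */ }

// With std::shared_ptr, reset() drops the count at the point of
// last use, instead of waiting for the end of the scope.
bool released_before_g() {
    std::shared_ptr<int> a = std::make_shared<int>(42);
    std::weak_ptr<int> watch = a;  // observes without owning
    // No references to "a" are needed past here.
    a.reset();                     // count hits zero; the int is freed now
    bool freed = watch.expired();  // true: storage already reclaimed
    g();                           // runs with the memory long gone
    return freed;
}
```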

In contrast, real GCs act upon the true network of live
references at run-time and are completely oblivious to the notion
of scope from the source code (which is incidental, after all).

I agree. Although it is *possible* to write some applications in a
way that strictly holds activity of data to dynamic in/out scope of
pointers to them as they get put on the stack and later unwound
from the stack, that's like a straitjacket for some algorithms,
making it quite painful to need to adhere to that paradigm. Having
a true GC, even one based on reference counts, is liberating!

So real GCs can and do collect values sooner than reference
counted smart pointers in C++.

s/can and do/can and might sometimes/

Sometimes ordinary RC collects the data first (such as if you
assign NULL as soon as you no longer need a pointer, and there are
no inaccessible pointer loops).
Sometimes traditional Lisp-style GC collects the data first (such
as whenever there are inaccessible pointer loops).
Sometimes smart pointers in C++ collect the data first (such as
when there are multiple threads and a complicated RC system is
used that avoids locks except when actually reclaiming; that
system might occasionally reclaim later than C++'s immediate
decrements do).

This is not a question of subjective belief: reference counting
was dismissed as a serious form of garbage collection decades ago,
following extensive performance analysis. Since then, a widening
memory gap has worsened its problems.

Do you have any understanding of *why* the performance is so bad,
that you could explain here? For example, is it because whenever a
reference count must be adjusted, the page on which it resides must
be marked impure, which occurs more often in practice than sweeping
of pages by the GC, thus increases contention for swapping pages?
Or because the need to avoid inaccessible pointer loops causes the
applications themselves to need to be written in more inefficient
ways? Or what?? Perhaps some of the newer ideas for implementing RC
would avoid the slow-up and make RC viable, if only somebody here
could say succinctly what that slow-up is with RC.