Re: Observer pattern limitations




Dmitry A. Kazakov wrote:
On 16 Jul 2006 19:49:27 -0700, David Barrett-Lennard wrote:

Dmitry A. Kazakov wrote:

Finalization triggers the process. A weak reference held by the storage
object becomes a notification, before actual finalization starts. This
gives the storage object an opportunity to sync the memory object (which is
still fully functional), just before it gets finalized and then freed.

Here's another attempt to understand...

An OODBMS typically has special internal data structures that allow it
to keep track of what has been faulted into memory. For simplicity I
won't assume a page based approach. Basically this takes the form of
an invertible mapping between OIDs and virtual address (ie memory)
locations called a Resident Object Table (ROT). The map doesn't
contain entries for objects that don't reside in memory. Therefore
failure to find an entry for given OID tells you that the object
doesn't reside in memory. The ROT is used for swizzling/unswizzling.

An important concept is that the ROT only holds weak references to the
objects in memory. This allows objects in memory to be evicted, say
using an LRU policy. Note however that objects that have been changed
must be pinned in memory.


You seem to be breaking the system up into separate "spaces" in which
objects reside. A space may be associated with the virtual address
space of a process, or it may be associated with an OODB on disk. This
allows you to say that an object "resides on disk".

I think you assume an object is usually tied to the space in which it
resides. ie it can't generally migrate. Nevertheless, you allow an
object in one space to act as a proxy for an object in another space.
In particular, for when an object in memory is regarded as a proxy for
an object on disk.

Yes

You say that storage objects (ie objects on disk) contain weak
references to their associated proxies (if any) in memory.

As "storage object" I referred the thing you called ROT.

Surely
this is only conceptual because you don't want objects on disk to
directly store memory address values. The idea is simply that
conceptually a proxy is not kept alive just because of its association
to a storage object. Otherwise all proxies would be forced into memory
at all times.

However, you surely must have some kind of ROT, and that would indeed
really contain weak references to the (proxy) objects in memory. Is
that what you mean?

Yes.

Now let's talk about finalization. Everything begins with a GC of the
objects in memory. In particularly this includes the proxy objects.
The ROT only contains weak references so proxies may easily become
unreachable even though their associated objects on disk are not
garbage. This was discussed in your example with objects A,B,C.

A relevant question is whether this is a stop the world or incremental
collector and how it relates to the status of all the threads in the
process. Depending on the approach taken, it may be the case that
threads are in the middle of running their respective "algorithms" when
they are blocked by the tracing collector.

No, the threads access an object through strong references, therefore
nothing wrong can happen.

I'm only leading to the recovery to a consistent cut after power
failure scenario. If the process continues to run then there is no
problem as long as concurrency is handled properly.

Then I don't use asynchronous GC (I am not
convinced that it is a good idea). So I prefer reference counting based GC.

Ok, if you like. It still doesn't solve the problem of how to find a
consistent cut.


This raises the difficult
question of how to find a consistent cut. IMO the only efficient,
reasonable solution is to force threads to use explicit transactions in
order to access persistent objects. A transaction defines an atomic
unit of work, and is a basis for allowing the persistent objects to
recover after power failure to a consistent state.

This issue relates to the ideology of orthogonal persistence. I
believe the idea is flawed. It doesn't recognise the fundamental
dichotomy between independents and dependents. It can't be made
efficient. It is incompatible with schema evolution. FYI, an
interesting paper that addresses part of the problem is called
"Concurrency -= the fly in the ointment". Look it up in google if
you haven't read it already! What is you position on this?

Well, Java it is not a language where concurrency is approached in any more
or less consistent way. So my position could be well expressed by an
appropriate citation from Bob Badour... (:-))

Do you believe explicit transactions are a good idea?


Anyway, after performing a trace the collector has found some objects
in memory that are unreachable. You say these are then "finalized" in
the manner used by Java. The proxy objects are treated specially. At
this point, finalization (at least conceptually) triggers GC in the
storage space, possibly causing objects to be deleted from disk. This
is also where modifications of (proxy) objects in memory are
"synced" (ie written to disk).

Can you please explain in more detail how you ensure the system will
recover to a consistent cut?

"Recover" should refer to a definite class of errors. See for example:

http://www.cs.fsu.edu/~baker/ada/arm_95/RM-1-1-5.html

Otherwise it is like: how you ensure the system to recover after being
exploded by an atomic bomb? Answer: keep it on other continent.

I'm talking about something like a simple power failure that
doesn't damage the hardware. Well written software ensures recovery
back to a consistent cut.


In my system memory-resident C is
finalized. Upon this the DB GC is run to determine whether C has to be
removed from the DB. If B exists, C is synced to DB. If not, C is removed
from DB.

IMO there are two important views of the system, and therefore two
types of GC. There is a "transient view" which considers all objects
in memory (both independents and dependents), and a "persistent view"
which only considers the independents.

If you take the pure functional dependency perspective on the dichotomy
of dependents versus independents, then *by definition*, dependents
contain references to independents (or other dependents) but never the
reverse. Dependents only exist for the purpose of caching the results
of calculations. The ability to navigate from independents to
dependences simply exists to allow cached results to be found or to be
marked as dirty.

There is a theoretical sense in which all the dependents can be
completely eliminated from the system without impacting the program
functionality. After all, caching is only for performance. For that
reason, it is possible to persist the independents without the
dependents, and it is also possible to GC the independents without the
dependents. ie by specifically not following the navigation links from
independents toward dependents. These are exactly the "observer
pattern" links - ie when a subject is able to notify its observers by
storing pointers to them.

Yes. But note that the same object can migrate from one camp to another.
Unfortunately, it is impossible to stay fully functional, because some
objects are quite large for "cut'n'paste."

I see your argument, but nevertheless I don't agree. Can you back it
up with a more concrete example?

A training set of many thousands examples of hundred of features. Each
example is a vector of feature values. Basically a training set is a
relational table. When you make a set persist, you should also do all the
features it refers to. A feature is like an ADT. There are discrete,
continuous, linguistic, evaluated etc features with different domain sets.


Consider a component that calculates a (sampled) Mandelbrot image.
[...]

Yep, it is like an evaluated feature, which value is evaluated each time
you touch it, by classification of the current set using the classifier it
depends on. Basically it is a lazy formula. BTW, here the independence
analysis plays a crucial role. There are dynamic constraints which
determine the accuracy of evaluation. I.e. though all around is strictly
immutable, still you cannot evaluate it once. Constraints are propagating
along the edges of dependencies, as you navigate the DAG.

Now back to your claim that the dependents can migrate from one camp to
another: I don't think that's reasonable. As I see it, if the
user really wants an editable Mandelbrot image (eg that can be
"touched up") then the user needs to explicitly copy the calculated
image as a value type, into an Image component that represents a
"real" sampled image. In the Image component the pixel values now
represent independent values that can be edited by the user and need to
persist.

The image can be used in another image or as a background of a widget. Same
with features. A training set may include values of a classificatory
feature. You can put a formula into a spread*** column. That would make
the table dependent of the formula.

If a dependent is used by another dependent then it remains a
dependent. In my terminology a dependent has *inputs*. These can be
dependents or independents.

A dependent *never* becomes an independent. A dependent can be copied
in order to create an independent that initially has the same value.
However after that point they may quickly diverge.

Trying to reuse a dependent as an independent seems a bit rude!
That's like stealing a cached value out of a cache. Is that what
you're suggesting?


What I mean is that dependents include things like screen pixel values.
I assume you agree it would be very inefficient to persist snapshots
of the screen. How do you reconcile that?

I'm not sure what you mean. But bit blit is that thing. This is a different
theme, but I designed a widget library for real-time wave forms and list
views using this approach.

What about the calculated Mandelbrot image, in the example above? Do
you agree you may not want that to persist?

This is a wrong question. Persistence (or not) is a requirement.
Requirements are not discussed (normally (:-)). Whether an *implementation*
of the object is a persistent lazy formula or a persistent bitmap is
irrelevant to the mechanism of persistence. [ Here, again, your platonic
Weltanschauung hits you back. As long as both bitmap and a formula expose
the same behavior, I don't care.]

A client of an object doesn't care whether dependents persist.
However, IMO the framework provided persistence system has a more
intimate, low level relationship with the object. Nevertheless (and
perhaps paradoxically) the persistence system is of course layered
below the application classes.


Do you mean writing modified
persistent objects back to disk?

Yes, if the finalized object persists and was modified.

How do you ensure integrity if objects are only written to disk when
they are finalised? What if there is a power failure, so your data is
in an inconsistent state?

It is not, because if a reference to a proxy object is held , that means
that this object is potentially being modified. To sync it prematurely
would mean to bring the system in an inconsistent state.

I don't see that as being the problem.

Neither I do. For the persistent storage everything is all atomic as long
as at least one strong reference exists.

Yes I agree, but the underlying persistent system is ignorant of the
high level design and invariants. How does it ensure the system always
recovers to a consistent cut after a power failure?

See above. "Power failure" is not a term. You have first define it in the
terms of a [programming] language in which you describe the system as a
whole. Note that this necessarily includes the scope where the persistent
objects exist. That description can be analysed and then some questions can
be answered. How would it recover if I switch +5V and -12V stripes of the
RAID controller? What would happen after placing abrasive material under
the disk heads?

Let's be quite specific then. There is only one hard-disk which only
contains one read/write head. When the power fails no further data is
written to disk. This may even happen in the middle of writing a 512
byte disk sector. All previous writes up to that point remain intact.

I would expect well written software to guarantee recovery back to a
consistent cut without excessive data loss in that scenario.

Tell me how this can work with your approach!


I have. For all intensive purposes it is just as fast as a system
semaphore.

That's interesting. Do you use Windows or POSIX? Can you outline the
algorithm?

I use the semaphore in Win32. I last measured it some time ago on a
2GHz Pentium IV. It can be locked then unlocked in 2us. I used the
standard algorithm for exclusive write or shared read locks and
obtained the same performance because the time is dominated by the
semaphore. Note that once a shared read lock has already been gained,
subsequent shared read locks are granted far more quickly because they
only lock a Win32 critical section to protect the read count, and a
critical section is about 25 times faster than a semaphore on that
machine.

This is what I'am doing too. First take a critical section, then look at
the write count, increase the read count then release it. If write count is
set, then release and wait for an event. If write is required, then spin
for a mutex, look for the read count ... It is much slower than plain
critical sections. Especially because there are tricky places where race
conditions need to be pounded out. Someday I'll give it another try in Ada.

I don't have an explicit write count. I presume that exists within
the Win32 semaphore. This is only used to avoid having a thread with
exclusive access block itself if it nests calls to gain an exclusive
lock.

Here is the algo in pseudo C :-

int readCount = 0; // Number of readers
CriticalSection cs; // Protect readCount
Semaphore wr = 1; // exclusive writer or else multiple
readers

void write()
{
wait(wr); // lock out all readers, or another writer
< allow exclusive access >
signal(wr); // up for grabs
}

void read()
{
{
Lock lock(cs);
if (readCount++ == 0) wait(wr);
}
< allow read access >
{
Lock lock(cs);
if (--readCount == 0) signal(wr);
}
}

I see.

Note that a semaphore must be used, not a Win32 mutex because the
latter assumes that the thread that releases the mutex was the same
thread that locked it. That doesn't generally happen in shared read
modes because the first thread to get the read lock won't always be
the last thread that releases the read lock.

I.e. you still have critical section + semaphore. In one of my
implementations, a read access can be granted without entering critical
section. This is why the write count is used. A potential reader
incremented (atomically) the read count and then looked if it really gained
the access by looking at the counters. Then I want that a thread that
already has write access wouldn't be blocked on either read or write
access. Plus there is an integrated soft deadlock breaking. All together it
is too slow. In general, I don't like it as a primitive synchronization
object, it is too low level to my taste.

In the case of intervals the result of a*b depends on the joint
distribution of a and b over R. When they are independent, then the
distribution can be represented as a multiplicative of individual
distributions (min is the multiplication operation in that case). If a and
b are not independent (as in the case of x*x), then the result of * is
inaccurate (it remains an estimation from above, in some sense, which makes
interval arithmetic sound, but that's another story).

I don't know what point you're making here.

That immutability does not automatically solve all problems with
dependencies.

If you could formalize that better maybe you have a reason why FP
isn't so popular. At the moment I don't really know what you're
saying.

Cheers,
David Barrett-Lennard

.


Quantcast