Re: A critique of test-first...

From: Nick Landsberg (SPAMhukolauTRAP_at_SPAMworldnetTRAP.att.net)
Date: 11/20/04


Date: Sat, 20 Nov 2004 13:52:19 GMT

CTips wrote:

> Nick Landsberg wrote:
>
>> CTips wrote:
>>
>>> Andrew McDonagh wrote:
>>>
>>>> CTips wrote:
>>>>
>>>> snipped
>>>>
>>>>> I suspect that I'd call your program on the small size. Unless of
>>>>> course, the program is a distributed fault-tolerant program with
>>>>> support for things like recovery after network partitioning. Or
>>>>> unless it requires switch-level 5-9s reliability. In which case it
>>>>> definitely counts as a medium sized program.
>>>>>
>>>>
>>>> It is a distributed fault tolerant system, and requires 5-9s
>>>> availability - not reliability.
>>>>
>>>> Strange, I'd never use 5-9s availability as a measurement of
>>>> application size, cause I can get that with a simple HelloWorld app.
>>>
>>>
>>>
>>>
>>> Remember the word "switch-level 5-9s reliability" - that means that
>>> the service will be unavailable something like 9 hours per year; this
>>> will include any scheduled maintenance on the servers, as well as
>>> any software updates, apart from the usual possibility of actual
>>> computer and/or network failures.
>>
>>
>>
>> 5-9's is a little over 5 minutes a year.
>
>
> Yeah...sorry, I had confused 3-9s and 5-9s - my bad.
>
>> Not achievable on a single system with commercially
>> available hardware that I know of. A distributed
>> FT system is the way to go (as noted below).
>
>
> Well, it needs to be parallel yes; distributed makes the problem a whole
> lot harder (by distributed I mean multiple machines connected by a
> network, by parallel I mean multiple CPUs in the same nest). In a single
> machine with multiple CPUs (such as a zSeries mainframe), the machine is
> a controlled environment which can be exactly characterized for MTBF,
> usually by the manufacturer.
>
> Clusters are somewhat similar, except that the characterization has to
> take into account the network, but presumably the network is completely
> under your control and is not likely to change.
>
> In a distributed environment the environment is much less fixed. There
> may be hardware on the links which are not under your control. The links
> themselves may change. Roughly speaking in the first two cases, the
> system is synchronous, while in the distributed case, the system is
> asynchronous. That causes both theoretical and practical difficulties.
>

Tell me about it. You're preaching to the choir here (at
least in my case). Simple Markov models may be developed
for such cases which, in turn, drive the "implied"
requirements, e.g. MTBF and MTTR (mean time between
failure and mean time to restore). Testing MTBF is
a bear, but testing MTTR is eminently possible through
fault injection techniques.
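
A rough sketch of the arithmetic behind these numbers, assuming the
simplest such model: a two-state (up/down) chain whose steady-state
availability is MTBF / (MTBF + MTTR). The function names and the
10,000-hour MTBF below are illustrative assumptions, not figures from
any real system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for N nines (5 -> 99.999% availability)."""
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Two-state up/down model: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours: float, target_availability: float) -> float:
    """Largest MTTR that still meets the target availability for a given MTBF."""
    return mtbf_hours * (1.0 - target_availability) / target_availability


if __name__ == "__main__":
    # 3-9s is roughly the "9 hours per year" figure; 5-9s is about 5 minutes.
    print(downtime_minutes_per_year(3) / 60, "hours/year at 3-9s")
    print(downtime_minutes_per_year(5), "minutes/year at 5-9s")

    # Hypothetical component with a 10,000-hour MTBF (an assumed value).
    mttr = max_mttr_hours(10_000, 0.99999)
    print(mttr * 3600, "seconds of MTTR allowed per failure")
```

Under these assumptions the MTTR requirement falls out of the
availability target once an MTBF is assumed, and the restore time is
the part that fault injection can measure directly.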

>>>
>>> It's in the context of what stresses it puts on the system; if you're
>>> designing for 5-9 availability on a distributed fault-tolerant system
>>> [even if you're building on top of some framework like Horus/Isis],
>>> it gets pretty complex.
>>>
>>> Some of the nastier problems
>>> - what happens if the network partitions, and someone makes an
>>> update? How do you reconcile the information when they rejoin?
>>> - How about things like a bad router table somewhere?
>>> - What kind of failure detectors are you using?
>>
>>
>>
>> Let's add "fault isolation" to your "fault detection"
>> line above. Also add "automatic failover" and "failback"
>> (without hysteresis). Reboot is NOT an option.
>>
>>> - Are you dealing with Byzantine failure models? What resilience are
>>> you targeting?
>>> - What happens when A can talk to B and C (and vice versa), but the
>>> link between B & C is very slow?
>>>
>>> Hats off to you if you've already built the infrastructure to test
>>> some of those corner cases, let alone derive the code from them.
>>
>>
>>
>> Amen to that remark, CTips.
>>
>> These are examples of what some
>> folks call "non-functional" requirements and others
>> call "supra-functional" requirements. Others call
>> them "implied" and "derived" requirements based on
>> customer expectations. If you're working on a UI
>> on an unstable OS in the first place, they will
>> probably curse the OS rather than you.
>
>
> <Gasp> No, really - someone's going to use Windows for 5-9 stuff?

Does the phrase "over my dead body" ring a bell?

>
>> I will say, however, that 99.99% of developers may never
>> have to program in this environment, so it, itself, may
>> be a "corner case." At the 5-9's level that you are talking
>> about, testing is absolutely essential, not just unit
>> testing, but stress testing and stability testing.
>> (Two related but subtly different things.) I also
>> fail to see how unit testing alone can drive the code
>> for this case. I feel that this case needs up-front design
>> rather than a reactive approach.
>
>
> However, Andrew McDonagh's team has either achieved 5-9 availability with
> TDD, or feels that they will get there by the time they finish. It would
> be interesting to hear how they spec'd the machines and network to get
> to that level - are they using dedicated, non-IP connections between
> zSeries or Tandem machines? And it would be equally interesting to hear
> about the analysis they had to do to hit that.

I would also be interested in seeing the results of that.

>
> Also, they're using Java in at least part of their apps. That's an
> interesting choice. One hopes that they will use compiled java with
> compiled libraries statically linked in, or are using controlled /
> dedicated systems. IMHO, there is very little chance of hitting 5-9s
> using a JVM (except on a dedicated system) - too much chance of someone
> changing the operating environment.

On a dedicated system, I have seen at least one project achieve
that (after beating on SUN to fix a bug in the GC, to no avail,
and then spending many staff months finding and fixing the memory
leak). Even so, they had to restart the JVM once every three months,
with a 30-second restart time. The customer grudgingly accepted the
"solution" since it could be scheduled, rather than happening during
the busy hour.
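
For scale, a small sketch of what those restarts cost against a 5-9s
budget, assuming the scheduled restarts are counted as downtime (as
they would be under the switch-level definition quoted earlier in the
thread). The names are illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # leap years ignored


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year, in seconds, for N nines of availability."""
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


def restart_downtime_seconds(restarts_per_year: int, seconds_per_restart: float) -> float:
    """Total yearly downtime spent on scheduled restarts."""
    return restarts_per_year * seconds_per_restart


if __name__ == "__main__":
    budget = downtime_budget_seconds(5)      # about 315 seconds
    used = restart_downtime_seconds(4, 30)   # one restart every three months
    print(f"budget {budget:.2f}s, restarts {used:.0f}s, left {budget - used:.2f}s")
```

On that accounting the four restarts use roughly two of the five or so
minutes available each year, which leaves little margin for unplanned
failures.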

NPL

-- 
"It is impossible to make anything foolproof
because fools are so ingenious"
  - A. Bloch

