Re: A critique of test-first...
From: CTips (ctips_at_bestweb.net)
Date: 11/20/04
- Next message: Siddharth Taneja: "puzzle links?"
- Previous message: democratix: "Re: Supershell (Windows XP, VB)"
- In reply to: Nick Landsberg: "Re: A critique of test-first..."
- Next in thread: Nick Landsberg: "Re: A critique of test-first..."
- Reply: Nick Landsberg: "Re: A critique of test-first..."
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Fri, 19 Nov 2004 22:35:10 -0500
Nick Landsberg wrote:
> CTips wrote:
>
>> Andrew McDonagh wrote:
>>
>>> CTips wrote:
>>>
>>> snipped
>>>
>>>> I suspect that I'd call your program on the small size. Unless of
>>>> course, the program is a distributed fault-tolerant program with
>>>> support for things like recovery after network partitioning. Or
>>>> unless it requires switch-level 5-9s reliability. In which case it
>>>> definitely counts as a medium sized program.
>>>>
>>>
>>> It is a distributed fault tolerant system, and requires 5-9s
>>> availability - not reliability.
>>>
>>> Strange, I'd never use 5-9s availability as a measurement of
>>> application size, cause I can get that with a simple HelloWorld app.
>>
>>
>>
>> Remember the word "switch-level 5-9s reliability" - that means that
>> the service will be unavailable something like 9 hours per year; this
>> will include any scheduled maintenance on the servers, as well as any
>> software updates, apart from the usual possibility of actual computer
>> and/or network failures.
>
>
> 5-9's is a little over 5 minutes a year.
Yeah...sorry, I had confused 3-9s and 5-9s - my bad.
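For anyone who wants to check the arithmetic, here is a minimal Python sketch. The function name is made up for illustration, and it assumes a 365-day year and that "N nines" means an unavailability of 10^-N.

```python
# Downtime budget implied by "N nines" of availability.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(nines: int) -> float:
    """Return the allowed downtime in minutes per year for N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = downtime_minutes_per_year(n)
        print(f"{n}-9s: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Under those assumptions 5-9s works out to roughly 5.26 minutes a year, and 3-9s to roughly 8.76 hours, which is where the "9 hours" figure came from.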
> Not achievable on a single system with commercially
> available hardware that I know of. A distributed
> FT system is the way to go (as noted below).
Well, it needs to be parallel, yes; distributed makes the problem a whole
lot harder (by distributed I mean multiple machines connected by a
network, by parallel I mean multiple CPUs in the same nest). In a single
machine with multiple CPUs (such as a zSeries mainframe), the machine is
a controlled environment which can be exactly characterized for MTBF,
usually by the manufacturer.
Clusters are somewhat similar, except that the characterization has to
take into account the network, but presumably the network is completely
under your control and is not likely to change.
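To make "take into account the network" concrete, here is a rough Python sketch of the usual series/parallel availability arithmetic. It assumes component failures are independent, which real clusters often violate, and the function names and figures are purely illustrative.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components must be up: the availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one component must be up (assumes independent failures)."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Hypothetical figures: two 3-9s machines, one 4-9s network.
    machines = parallel(0.999, 0.999)
    cluster = series(machines, 0.9999)
    print(f"redundant machines alone: {machines:.6f}")
    print(f"machines plus network:    {cluster:.6f}")
```

With these made-up numbers the redundant pair alone comes out near 0.999999, but putting the 0.9999 network in series drags the whole thing below 0.9999. The weakest series element sets the ceiling.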
In a distributed environment the environment is much less fixed. There
may be hardware on the links which is not under your control. The links
themselves may change. Roughly speaking, in the first two cases the
system is synchronous, while in the distributed case the system is
asynchronous. That causes both theoretical and practical difficulties.
>>
>> It's in the context of what stresses it puts on the system; if you're
>> designing for 5-9 availability on a distributed fault-tolerant system
>> [even if you're building on top of some framework like Horus/Isis], it
>> gets pretty complex.
>>
>> Some of the nastier problems
>> - what happens if the network partitions, and someone makes an update?
>> How do you reconcile the information when they rejoin?
>> - How about things like a bad router table somewhere?
>> - What kind of failure detectors are you using?
>
>
> Let's add "fault isolation" to your "fault detection"
> line above. Also add "automatic failover" and "failback"
> (without hysteresis). Reboot is NOT an option.
>
>> - Are you dealing with Byzantine failure models? What resilience are
>> you targeting?
>> - What happens when A can talk to B and C (and vice versa), but the
>> link between B & C is very slow?
>>
>> Hats off to you if you've already built the infrastructure to test
>> some of those corner cases, let alone derive the code from them.
>
>
> Amen to that remark, CTips.
>
> These are examples of what some
> folks call "non-functional" requirements and others
> call "supra-functional" requirements. Others call
> them "implied" and "derived" requirements based on
> customer expectations. If you're working on a UI
> on an unstable OS in the first place, they will
> probably curse the OS rather than you.
<Gasp> No, really - someone's going to use Windows for 5-9 stuff?
> I will say, however, that 99.99% of developers may never
> have to program in this environment, so it, itself, may
> be a "corner case." At the 5-9's level that you are talking
> about, testing is absolutely essential, not just unit
> testing, but stress testing and stability testing.
> (Two related but subtly different things.) I also
> fail to see how unit testing alone can drive the code
> for this case. I feel that this case needs up-front design
> rather than a reactive approach.
However, Andrew McDonagh's team has either achieved 5-9 availability with
TDD, or feels that they will get there by the time they finish. It would
be interesting to hear how they spec'd the machines and network to get
to that level - are they using dedicated, non-IP connections between
zSeries or Tandem machines? And it would be equally interesting to hear
about the analysis they had to do to hit that.
Also, they're using Java in at least part of their apps. That's an
interesting choice. One hopes that they are using compiled Java with
compiled libraries statically linked in, or are using controlled /
dedicated systems. IMHO, there is very little chance of hitting 5-9s
using a JVM (except on a dedicated system) - too much chance of someone
changing the operating environment.