Re: How have code analysis tools changed the way you work?
From: Nick Landsberg (hukolau_at_NOSPAM.att.net)
Date: 06/16/04
- Previous message: Richard Harter: "Re: Creating an operating system"
- In reply to: James Rogers: "Re: How have code analysis tools changed the way you work?"
- Next in thread: James Rogers: "Re: How have code analysis tools changed the way you work?"
- Reply: James Rogers: "Re: How have code analysis tools changed the way you work?"
- Reply: Programmer Dude: "Re: How have code analysis tools changed the way you work?"
Date: Wed, 16 Jun 2004 11:07:03 GMT
James Rogers wrote:
> Nick Landsberg <hukolau@NOSPAM.att.net> wrote in
> news:%LMzc.73797$Gx4.18401@bgtnsc04-news.ops.worldnet.att.net:
>
>
>>Pintos aside, there *is* a market for highly
>>reliable computing, and it's based on corporate
>>revenue. (The almighty $$$ wins again!)
>>
>>Not more than 3 hours ago, I left a conference
>>call where the customer questioned our proposals
>>especially regarding reliability. It seems we
>>misinterpreted their request for 99.999% reliability.
>>(That's 5 minutes a year application down-time,
>>give or take a few seconds.)
>
>
> When you have all that worked out please let us know
> which OS provides 99.999% reliability. Very few OS
> vendors will even mention reliability in anything
> more than a qualitative (it has HIGH reliability)
> manner. Microsoft did publish a study on their OS's
> including NT and 2000. They were apparently proud that
> Win 2000 has an MTBF of 2000 hours.
>
> I do not care how well you write your application.
> It can never be more reliable than the OS it
> executes on.
Absolutely correct! But there is a form of safety
in numbers. We run Solaris on Sun hardware.
The O/S and hardware come up to about 8 hours
a year down time (based on real-world statistics,
not glossies). In the simplest case, we duplex
the hardware and run the application on both
boxes at about 35% utilization. When one
fails, the other box picks up the slack running
at about 70% utilization. In more complex cases,
we do N+K sparing: N+K machines are deployed, and
any N of them can accommodate the expected load,
so up to K boxes can fail.
And we compute the appropriate K for each value
of N based on the expected failure rates.
Oh, and that 8 hours a year includes 4 hours
per incident travel time for parts replacement,
if necessary. It depends on what kind of service
contract the customer is willing to pay for or how many
spare parts they stock. If the customer is
willing to pay for this level of reliability,
they are usually willing to stock enough
spare parts to actually build a machine on-site,
although that has never been necessary in my
experience.
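The arithmetic behind the duplex and N+K cases can be sketched as follows. This is only an illustration under simplifying assumptions, not the actual calculation used for these systems: boxes are assumed to fail independently, failover is assumed instant and perfect, and the function names are made up for the example. The 8 hours a year and the 99.999% target are the figures quoted above.

```python
from math import comb

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_minutes_per_year(availability):
    """Yearly downtime, in minutes, implied by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


def box_availability(down_hours_per_year):
    """Availability of a single box from its yearly downtime in hours."""
    return 1.0 - down_hours_per_year / HOURS_PER_YEAR


def n_plus_k_availability(n, k, a):
    """Probability that at least n of n + k independent boxes are up."""
    total = n + k
    return sum(comb(total, up) * a**up * (1 - a)**(total - up)
               for up in range(n, total + 1))


def smallest_k(n, a, target):
    """Smallest number of spares k that meets the target availability."""
    k = 0
    while n_plus_k_availability(n, k, a) < target:
        k += 1
    return k


if __name__ == "__main__":
    a = box_availability(8)  # one box: about 8 hours a year down
    print(downtime_minutes_per_year(0.99999))  # the 5-9's budget in minutes
    for n in (1, 2, 5, 10):
        print(n, smallest_k(n, a, 0.99999))
```

Under these assumptions a single box falls well short of 5-9's, while a duplexed pair (N=1, K=1) clears it. Real figures would also have to cover correlated failures, software faults, and operator error.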
>
> Will you be writing your own OS, database system,
> networking, and domain-specific application?
> Will the product be delivered in this decade?
> What special care will you take to ensure 99.999%
> reliability or better for all your software? Do you
> have experience that those techniques and tools will
> actually produce your desired results? How will you
> handle associated requirements such as preventive
> maintenance of hardware, upgrades to software,
> upgrades to security, file system backups, and so forth,
> while maintaining 99.999% uptime on all systems?
Regarding software reliability, part of the answer is
in the above. Part of the answer is also that
we have a preliminary 72-hour stability test
at full busy-hour load. Exit criteria include
such things as no abnormal terminations (core dumps)
and zero memory growth. Only then do we start the
14 day stability run.
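One way to check exit criteria like these automatically is sketched below. It is an assumption-laden illustration, not the actual test harness: the sample format (hours elapsed, resident memory in MB), the function names, and the small noise tolerance are all hypothetical.

```python
def memory_growth_per_hour(samples):
    """Least-squares slope (MB per hour) through (hour, resident_mb) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def passes_exit_criteria(samples, core_dumps, tolerance_mb_per_hour=0.01):
    """True if there were no abnormal terminations and no memory growth.

    The tolerance is a hypothetical allowance for measurement noise.
    """
    if core_dumps != 0:
        return False
    return memory_growth_per_hour(samples) <= tolerance_mb_per_hour


if __name__ == "__main__":
    # Hourly samples over a 72-hour run: one flat, one leaking 1 MB per hour.
    flat = [(h, 512.0) for h in range(73)]
    leaking = [(h, 512.0 + h) for h in range(73)]
    print(passes_exit_criteria(flat, core_dumps=0))
    print(passes_exit_criteria(leaking, core_dumps=0))
```

Fitting a slope over the whole run, rather than comparing the first and last samples, keeps a single noisy reading from deciding the result.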
Regarding the maintenance and such, the 5-9's
requirement does not include planned down time.
We have 3 hours every three months in which to
do software upgrades, hardware PM, etc.
Backups are done live with techniques like
rotating log files before backup, having
multiple database checkpoints on disk, etc.
We use mostly off-the-shelf components (e.g.
the DBMS) but have put a wrapper around them
to do restart and recovery in our proprietary
fashion.
All of that has already been factored in to the
arithmetic for the reliability.
>
> What is the expected operational lifetime of such a
> system? Will the development effort take longer than
> the lifetime of the system? How will such a system
> accommodate changes in application load over the life
> of the system? Will you be able to increase system
> capabilities over time with no downtime?
Operational lifetime: several (5-8) years. The
infrastructure development is done and has been
stable for years. It's been ported several
times to more modern hardware and OS's in the
past. The infrastructure includes a feature for
"software hot slide." (We have guidelines
for writing the software to allow for a hot slide,
and hot slide is part of the testing process.)
We even accommodate limited database schema
changes between versions of software.
Additional load is handled by adding additional
hardware. The software must tolerate having
multiple instances of itself running on the
same local network or even on the same box.
We also enforce a load-shedding strategy when
the box goes into overload, e.g. if CPU > X%,
refuse Y% of new requests for Z minutes. Repeat
or escalate if necessary.
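The load-shedding rule above can be sketched roughly as follows. This is a minimal illustration, not the proprietary implementation: the class name, the parameters, and the random per-request refusal are assumptions. X, Y, and Z map to cpu_threshold, refuse_fraction, and window_seconds. "Repeat" is modeled by re-arming the window whenever overload is seen again; escalation is not modeled.

```python
import random


class LoadShedder:
    """If CPU > X%, refuse a fraction Y of new requests for Z seconds."""

    def __init__(self, cpu_threshold, refuse_fraction, window_seconds,
                 rng=random.random):
        self.cpu_threshold = cpu_threshold      # X, in percent
        self.refuse_fraction = refuse_fraction  # Y, as a fraction 0..1
        self.window_seconds = window_seconds    # Z, in seconds
        self.rng = rng                          # injectable for testing
        self.shed_until = 0.0                   # no shedding window active

    def observe_cpu(self, cpu_percent, now):
        """Start (or restart) a shedding window when CPU exceeds X."""
        if cpu_percent > self.cpu_threshold:
            self.shed_until = now + self.window_seconds

    def accept(self, now):
        """Return True if a new request arriving at `now` should be served."""
        if now >= self.shed_until:
            return True
        # Inside the window: refuse roughly refuse_fraction of requests.
        return self.rng() >= self.refuse_fraction


if __name__ == "__main__":
    shedder = LoadShedder(cpu_threshold=80.0, refuse_fraction=0.25,
                          window_seconds=300.0)
    shedder.observe_cpu(92.0, now=0.0)
    served = sum(shedder.accept(now=1.0) for _ in range(10000))
    print(served)  # roughly 75% of requests are served during the window
```

Passing the clock in as `now` rather than reading it inside the class keeps the sketch deterministic enough to test.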
>
> Do the reliability requirements account for natural
> disasters such as earthquakes on the left coast or
> hurricanes on the right coast, or are they calculated
> solely against hardware and software failures due to
> mechanical wear and software defects? What about power
> failures on the left coast? Do you need UPS systems
> that can operate indefinitely in the event of a power
> grid failure?
Natural disasters - yes.
Power Failures - yes, to a limited extent. Our
customers provide the UPS's since they are the
ones concerned about reliability. A system
which is supposed to act as failover/failsafe
for another one is wired to a different UPS.
There is also battery backup provided (these
critters run on 48 Volts DC) and there is a
generator in the basement. We rejected a certain
manufacturer's RAID arrays once, because although
they had 3 power supplies, they only had a single
power cord.
We also have statistics which indicate that
only 20% of system downtime is due to hardware
failures, while 40% is due to software and
another 40% is due to operator error. All
of this is factored in.
>
> Your requirements sound like requirements for a space
> vehicle. Most space vehicles have a lifetime from 5
> to 20 years with no hope of upgrading or replacing
> hardware. The hardware alone for such a space
> vehicle can easily reach or exceed $500,000,000.00.
> If the software fails the system reliability suffers
> badly. I have heard rumors of software achieving
> 99.999% reliability. I have never seen proof that such
> software actually exists.
::grin:: You use such a system every day, Jim.
Your telephone is connected to a switch which is
nothing more than a computer system consisting
of hardware and software which completes your
calls. 5-9's (99.999%) reliability is the *entry*
criterion for the market.
NPL
P.S. - I didn't mean to give the impression
that it's *easy* to deliver this kind of system.
Fully 80% (or more) of the code is the infrastructure
which keeps the other 20% running in case
of failures.
-- "It is impossible to make anything foolproof because fools are so ingenious" - A. Bloch