Re: How have code analysis tools changed the way you work?

From: Nick Landsberg (hukolau_at_NOSPAM.att.net)
Date: 06/17/04


Date: Thu, 17 Jun 2004 14:00:52 GMT

James Rogers wrote:

> Nick Landsberg <hukolau@NOSPAM.att.net> wrote in
> news:rRVzc.21630$Di3.6291@bgtnsc05-news.ops.worldnet.att.net:
>
>
>>James Rogers wrote:
>>
>>>Nick Landsberg <hukolau@NOSPAM.att.net> wrote in
>>>news:%LMzc.73797$Gx4.18401@bgtnsc04-news.ops.worldnet.att.net:
>>>
>>>
>>>
>>>>Pintos aside, there *is* a market for highly
>>>>reliable computing, and it's based on corporate
>>>>revenue. (The almighty $$$ wins again!)
>>>>
>>>>Not more than 3 hours ago, I left a conference
>>>>call where the customer questioned our proposals
>>>>especially regarding reliability. It seems we
>>>>misinterpreted their request for 99.999% reliability.
>>>>(That's 5 minutes a year application down-time,
>>>>give or take a few seconds.)
>>>
>>>
>>>When you have all that worked out please let us know
>>>which OS provides 99.999% reliability. Very few OS
>>>vendors will even mention reliability in anything
>>>more than a qualitative (it has HIGH reliability)
>>>manner. Microsoft did publish a study on their OS's
>>>including NT and 2000. They were apparently proud that
>>>Win 2000 has an MTBF of 2000 hours.
>>>
>>>I do not care how well you write your application.
>>>It can never be more reliable than the OS it
>>>executes on.
>>
>>Absolutely correct! But there is a form of safety
>>in numbers. We run Solaris on Sun hardware.
>>The O/S and hardware come up to about 8 hours
>>a year down time (based on real-world statistics,
>>not glossies). In the simplest case, we duplex
>>the hardware and run the application on both
>>boxes at about 35% utilization. When one
>>fails, the other box picks up the slack running
>>at about 70% utilization. In more complex cases,
>>we do N+K sparing, where N machines can accommodate
>>the expected load even if K boxes have failed.
>>And we compute the appropriate K for each value
>>of N based on the expected failure rates.
>
>
> And this is where you seem to have a difference of
> understanding with your customers. According to your
> previous message they want 99.999% reliability from
> each node in your solution. The result, of course,
> would be much higher than 5-9's for a geographically
> distributed redundant system.
>

Yes, you are correct again. The original RFI
said 99.999% but did not say "each site."
This makes the problem more difficult, mainly
in the area of database replication. Not
unsolvable, just more difficult and more
costly.
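
To make the "per site" versus "whole system" distinction concrete, here is a rough sketch of the arithmetic. It takes the 8 hours a year per box quoted above and assumes that failures of the two boxes are independent (no common cause), which is an idealization. The function names are only illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24


def availability_from_downtime(hours_down_per_year):
    """Fraction of the year a single box is up."""
    return 1.0 - hours_down_per_year / HOURS_PER_YEAR


def downtime_minutes_per_year(availability):
    """Expected unplanned downtime implied by an availability figure."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


def redundant_availability(single, copies):
    """Availability when any one of `copies` boxes can carry the load.

    Assumes independent failures: the service is down only when
    every copy is down at the same time.
    """
    return 1.0 - (1.0 - single) ** copies


single = availability_from_downtime(8.0)   # one box, 8 hours/year down
pair = redundant_availability(single, 2)   # simple duplex

print(f"5-9's budget: {downtime_minutes_per_year(0.99999):.2f} min/year")
print(f"single box:   {downtime_minutes_per_year(single):.0f} min/year")
print(f"duplex pair:  {downtime_minutes_per_year(pair):.2f} min/year")
```

Under those assumptions a duplexed pair comes in well under the 5 minute budget, while a single box misses it by roughly two orders of magnitude. That is why a per-site reading of the requirement is so much more expensive.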

>
>>Oh, and that 8 hours a year includes 4 hours
>>per incident travel time for parts replacement,
>>if necessary. It depends on what kind of service
>>contract the customer is willing to pay for or how many
>>spare parts they stock. If the customer is
>>willing to pay for this level of reliability,
>>they are usually willing to stock enough
>>spare parts to actually build a machine on-site,
>>although that has never been necessary in my
>>experience.
>>
>
>
> Of course, 8 hours a year is not the 5 minutes per
> year you mentioned earlier. You cover this with
> redundancy, which works as long as the system does
> not fail from a common cause (such as all the
> systems being damaged by the same natural disaster).

That's why the geographic redundancy requirement
is there. Our customers understand that. We also
impressed upon them (long ago) that they have to
live with the transactions in progress being lost
when the system goes down because replaying those
same transactions on the other side might cause
the same thing to happen there.

A curious anomaly is that one of our larger
customers has their operations center sitting
right on top of the Hayward Fault (near San
Francisco). One thinks that they should have
considered that *before* choosing the site
for their operations center. Oh well.

>
>
>>Regarding the maintenance and such, the 5-9's
>>requirement does not include planned down time.
>>We have 3 hours every three months in which to
>>do software upgrades, hardware PM, etc.
>
>
> I suspected as much. There are a number of ways to
> measure up-time. It is common to claim that planned
> down-time is not a reduction in reliability. It is
> also common for a customer to accept that logic. In
> practice this exception can become a very large
> loop-hole. For instance, when it becomes clear that
> a system is about to fail, I have seen organizations
> declare an "emergency" planned down-time. Since it is
> planned, it is excluded from reliability statistics.
> I have also seen systems with terrible reliability
> use a similar strategy. One system, using an older
> version of Microsoft Windows, experienced frequent
> and serious memory leaks. Their solution was to
> reboot the systems every week. Since the reboot was
> scheduled they claimed to have improved the reliability
> of the system. The memory leaks caused system failure
> about every 3 weeks. The solution actually reduced the
> availability of the system more than the problem, but
> the reduction was ignored because the down-time was
> planned.

Yep, upon occasion we have done that particular
dance too, but there is only so long that you
can do it before it gets old. We had a similar
memory leak issue with one piece of Java code.
(Bug in the GC.) We were losing about 2 MB a day.
The JVM heap was sized at 512 MB, but normal operation
was at 256 MB. Increasing it beyond that would
have caused paging activity which would impact
response times. (Average 300 ms., with 500 ms.
or less 95% of the time.)

One of our managers tried to convince that particular
customer that it was no big deal since that would
only require a reboot every 90 days or so, but the
customer would not listen, even though the JVM
restart would only take a minute and could be
scheduled during low traffic periods. The
remark which cut the deepest was something along
the lines of "we expected a carrier grade solution,
not a PC-grade solution."
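
For what it's worth, the "90 days or so" figure can be checked with simple arithmetic. The sketch below assumes the leak is linear at 2 MB a day and that the JVM starts each cycle at the 256 MB normal working set. Both are simplifications, and the function names are made up for illustration.

```python
def days_until_heap_full(heap_mb, baseline_mb, leak_mb_per_day):
    """Days before a steady leak consumes all of the spare heap."""
    return (heap_mb - baseline_mb) / leak_mb_per_day


def heap_in_use_mb(days, baseline_mb, leak_mb_per_day):
    """Heap occupied after `days` of leaking at a constant rate."""
    return baseline_mb + days * leak_mb_per_day


# Figures from the post: 512 MB heap, 256 MB normal use, 2 MB/day leak.
print(days_until_heap_full(512, 256, 2))   # days until the heap is exhausted
print(heap_in_use_mb(90, 256, 2))          # MB in use at a 90-day restart
```

With those figures the heap would run out after 128 days, so a restart every 90 days would leave about 76 MB of margin.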

In our case, the "planned downtime" is most often
spelled out up front, e.g. a 3-hour maintenance window
every 3 months. This includes any software updates
which cannot be accomplished with "hot slide"
and any database schema changes. The customer
gets to declare when planned downtime happens.
We have no say in the matter. The customers'
technicians do all the maintenance and upgrade
activities themselves, according to the
documentation which we must produce for them.
If it doesn't work right, the Vice-President
usually gets an irate phone call.

>
>
>>Operational lifetime: several (5-8) years. The
>>infrastructure development is done and has been
>>stable for years. It's been ported several
>>times to more modern hardware and OS's in the
>>past. The infrastructure includes a feature for
>>"software hot slide" (We have guidelines
>>for writing the software to allow for a hot slide
>>and hot-slide is a part of the testing process.)
>>We even accommodate limited database schema
>>changes between versions of software.
>>Additional load is handled by adding additional
>>hardware. The software must tolerate having
>>multiple instances of itself running on the
>>same local network or even on the same box.
>>We also enforce a load-shedding strategy when
>>the box goes into overload, e.g. if CPU > X%,
>>refuse Y% of new requests for Z minutes. Repeat
>>or escalate if necessary.
>>
>
>
> I am convinced that you have a very nice distributed,
> load-balanced system including geographic redundancy
> and automatic fail-over / fail-back capabilities.

Thank you. Much appreciated. :)

We're still trying to get better at it, too.

>
>
>>>Your requirements sound like requirements for a space
>>>vehicle. Most space vehicles have a lifetime from 5
>>>to 20 years with no hope of upgrading or replacing
>>>hardware. The hardware alone for such a space
>>>vehicle can easily reach or exceed $500,000,000.00.
>>>If the software fails the system reliability suffers
>>>badly. I have heard rumors of software achieving
>>>99.999% reliability. I have never seen proof that such
>>>software actually exists.
>>
>>::grin:: You use such a system every day, Jim.
>>
>>Your telephone is connected to a switch which is
>>nothing more than a computer system consisting
>>of hardware and software which completes your
>>calls. 5-9's (99.999%) reliability is the *entry*
>>criteria into the market.
>
>
> I was thinking of the customer's requirement for
> 5-9's at each node.

As noted above, it was unclear in the RFI, so that
now we have to come up with a slightly different
(costlier) proposal, once we figure out how to do
multi-way data replication without having
performance go into the dumper or segment the
database in some creative manner.

>
> I have worked in the telecom industry. I understand
> their use of multiple redundancy and automatic fail-over
> systems.

I gather that, after this exchange, you
believe that our outfit also understands
them fairly well. Actually, I wasn't at all taken
aback by your initial post, Jim. Questioning
how it could be done, or more precisely, questioning
if we knew just how big a thing we were biting
off. You asked the same kinds of questions I would
have asked, had someone (who I did not know) claimed to be
able to get to that level of reliability.

When all you've ever played in is a sandbox,
it's kind of hard to imagine the amount of
sand in the desert. Obviously, you yourself
have been in the desert before, too. See
you at the Oasis :)

>
> That is not quite the same as a single application process
> achieving 99.999% reliability. Your solution may achieve many
> of the same goals for the customer, but it does require a
> significant investment in facilities, hardware, networking,
> and redundant staffing at geographically distributed locations.

Absolutely no argument there.

Given that, at best, commercial hardware/software
is only 3-9's, you do need the significant
investment to get to 5-9's.
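
As an illustration of how the K in N+K sparing might be chosen, here is a minimal sketch using a plain binomial model. It assumes each box is independently unavailable 0.1% of the time (the 3-9's figure) and looks for the smallest K such that "more than K boxes down at once" happens less than 0.001% of the time (5-9's). This is a textbook simplification, not our actual procedure. It ignores common-cause failures and repair times, and the function names are hypothetical.

```python
from math import comb


def prob_more_than_k_down(total_boxes, k, unavailability):
    """Binomial probability that more than k of total_boxes are down at once."""
    u = unavailability
    return sum(
        comb(total_boxes, j) * u**j * (1.0 - u) ** (total_boxes - j)
        for j in range(k + 1, total_boxes + 1)
    )


def spares_needed(n, unavailability, target_unavailability, max_k=20):
    """Smallest K such that N+K boxes leave N working often enough."""
    for k in range(max_k + 1):
        if prob_more_than_k_down(n + k, k, unavailability) <= target_unavailability:
            return k
    raise ValueError("target not reachable within max_k spares")


# 3-9's boxes (0.001 unavailable), 5-9's target (0.00001 unavailable).
for n in (1, 10):
    print(n, "working boxes needed ->", spares_needed(n, 1e-3, 1e-5), "spare(s)")
```

With these inputs a single working box needs one spare (the simple duplex case) and ten working boxes need two.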

A quick computation of some other clusters
we have provided indicated a revenue loss
to the customer of about $100,000 per hour of
downtime if an outage happened during busy hour.
So, there *is* a business case for this
degree of reliability and our customers
realize that it costs a whole lot of money
to guard against failures.

>
>
>>NPL
>>
>>P.S. - I didn't mean to give the impression
>>that it's *easy* to deliver this kind of system.
>>Fully 80% (or more) of the code is the infrastructure
>>which keeps the other 20% running in case
>>of failures.
>
>
> Yes, the old 80/20 rule.
>
> Do you also provide a segregation of business-critical
> functionality from non-critical functionality? In a
> telecom example this would allow the network to continue
> delivering service without interruption even if the
> company HR systems went down for a day.
>
> Extremely high availability is costly. It makes most
> sense for a company to run only critical applications
> on such a system, leaving the less critical applications
> on cheaper, and less reliable, systems.
>
> Jim Rogers

That's what all of our customers do (segregate
mission critical and non-critical apps). These systems
are usually dedicated to a single logical application per
cluster (multiple procs). The HR stuff runs somewhere else.
Even so, as part of the overload control sub-system
mentioned above, the escalation strategy provides
the capability to shut down or delay non-critical
processes, e.g. non-essential statistics gathering,
hourly reports and such.
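
The "if CPU > X%, refuse Y% of new requests for Z minutes, repeat or escalate" rule can be sketched roughly as follows. This is only an illustration of the idea, not our production code. The class name, parameters, and the choice to escalate by adding another Y% each window are all assumptions, and shutting down non-critical processes is left out.

```python
import random


class LoadShedder:
    """Refuse a fraction of new requests while the box is in overload."""

    def __init__(self, cpu_threshold, shed_step, window_s, rng=random.random):
        self.cpu_threshold = cpu_threshold  # X: CPU percent that counts as overload
        self.shed_step = shed_step          # Y: fraction of requests refused per step
        self.window_s = window_s            # Z: how long each shedding window lasts
        self.rng = rng                      # injectable for deterministic tests
        self.shed_fraction = 0.0
        self.window_ends = 0.0

    def observe(self, cpu_percent, now):
        """Feed in a periodic CPU reading taken at time `now` (seconds)."""
        if now < self.window_ends:
            return  # current shedding window is still in force
        if cpu_percent > self.cpu_threshold:
            # Start shedding, or escalate if the last window was not enough.
            self.shed_fraction = min(1.0, self.shed_fraction + self.shed_step)
            self.window_ends = now + self.window_s
        else:
            self.shed_fraction = 0.0  # overload has cleared

    def admit(self):
        """Decide whether to accept one new request."""
        return self.rng() >= self.shed_fraction


shedder = LoadShedder(cpu_threshold=80, shed_step=0.25, window_s=60)
shedder.observe(cpu_percent=90, now=0)
print(shedder.shed_fraction)  # a quarter of new requests are now refused
```

Refusing a random fraction rather than every request keeps most traffic flowing while the box recovers. A real system would also need to decide which requests are safe to refuse.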

Thanks for the feedback, Jim.

NPL.

-- 
"It is impossible to make anything foolproof
because fools are so ingenious"
  - A. Bloch

