Re: How have code analysis tools changed the way you work?

From: Nick Landsberg (hukolau_at_NOSPAM.att.net)
Date: 06/16/04

  • Next message: Nick Landsberg: "Re: Creating an operating system"
    Date: Wed, 16 Jun 2004 11:07:03 GMT
    
    

    James Rogers wrote:
    > Nick Landsberg <hukolau@NOSPAM.att.net> wrote in
    > news:%LMzc.73797$Gx4.18401@bgtnsc04-news.ops.worldnet.att.net:
    >
    >
    >>Pintos aside, there *is* a market for highly
    >>reliable computing, and it's based on corporate
    >>revenue. (The almighty $$$ wins again!)
    >>
    >>Not more than 3 hours ago, I left a conference
    >>call where the customer questioned our proposals
    >>especially regarding reliability. It seems we
    >>misinterpreted their request for 99.999% reliability.
    >>(That's 5 minutes a year application down-time,
    >>give or take a few seconds.)
    >
    >
    > When you have all that worked out please let us know
    > which OS provides 99.999% reliability. Very few OS
    > vendors will even mention reliability in anything
    > more than a qualitative (it has HIGH reliability)
    > manner. Microsoft did publish a study on their OS's
    > including NT and 2000. They were apparently proud that
    > Win 2000 has an MTBF of 2000 hours.
    >
    > I do not care how well you write your application.
    > It can never be more reliable than the OS it
    > executes on.

    Absolutely correct! But there is a form of safety
    in numbers. We run Solaris on Sun hardware.
    The O/S and hardware come up to about 8 hours
    a year down time (based on real-world statistics,
    not glossies). In the simplest case, we duplex
    the hardware and run the application on both
    boxes at about 35% utilization. When one
    fails, the other box picks up the slack running
    at about 70% utilization. In more complex cases,
    we do N+K sparing, where N machines can accommodate
    the expected load even if K boxes have failed.
    And we compute the appropriate K for each value
    of N based on the expected failure rates.
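
    A rough sketch of that arithmetic, in Python. It assumes that N
    boxes are needed to carry the load, that K more are installed as
    spares, and that boxes fail independently of one another. The
    real calculation also folds in software and operator error, so
    treat this as an illustration only. All function names are made
    up for the example.

```python
from math import comb

HOURS_PER_YEAR = 8760.0


def box_unavailability(down_hours_per_year: float) -> float:
    """Fraction of the year a single box is expected to be down."""
    return down_hours_per_year / HOURS_PER_YEAR


def downtime_minutes_per_year(unavailability: float) -> float:
    """Convert an unavailability fraction into minutes per year."""
    return unavailability * HOURS_PER_YEAR * 60.0


def system_unavailability(n: int, k: int, q: float) -> float:
    """Probability that more than k of the n + k boxes are down at once.

    Assumes independent failures (binomial model); q is the per-box
    unavailability.
    """
    total = n + k
    return sum(
        comb(total, j) * q**j * (1.0 - q) ** (total - j)
        for j in range(k + 1, total + 1)
    )


def spares_needed(n: int, q: float, target: float) -> int:
    """Smallest k for which the n + k system meets the target unavailability."""
    k = 0
    while system_unavailability(n, k, q) > target:
        k += 1
    return k


if __name__ == "__main__":
    q = box_unavailability(8.0)      # about 8 hours a year per box
    five_nines = 1.0 - 0.99999       # roughly 5.26 minutes a year
    for n in (1, 2, 10):
        print(n, spares_needed(n, q, five_nines))
```

    With 8 hours a year per box and a five-nines target, this model
    gives K=1 for N=1 (the duplex case) and K=2 for N=10.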

    Oh, and that 8 hours a year includes 4 hours
    per incident travel time for parts replacement,
    if necessary. It depends on what kind of service
    contract the customer is willing to pay for or how many
    spare parts they stock. If the customer is
    willing to pay for this level of reliability,
    they are usually willing to stock enough
    spare parts to actually build a machine on-site,
    although that has never been necessary in my
    experience.

    >
    > Will you be writing your own OS, database system,
    > networking, and domain-specific application?
    > Will the product be delivered in this decade?
    > What special care will you take to ensure 99.999%
    > reliability or better for all your software? Do you
    > have experience that those techniques and tools will
    > actually produce your desired results? How will you
    > handle associated requirements such as preventive
    > maintenance of hardware, upgrades to software,
    > upgrades to security, file system backups, and so forth,
    > while maintaining 99.999% uptime on all systems?

    Regarding software reliability, part of the answer is
    in the above. Part of the answer is also that
    we have a preliminary 72-hour stability test
    at full busy-hour load. Exit criteria include
    such things as no abnormal terminations (core dumps)
    and zero memory growth. Only then do we start the
    14 day stability run.

    Regarding the maintenance and such, the 5-9's
    requirement does not include planned down time.
    We have 3 hours every three months in which to
    do software upgrades, hardware PM, etc.
    Backups are done live with techniques like
    rotating log files before backup, having
    multiple database checkpoints on disk, etc.

    We use mostly off-the-shelf components (e.g.
    the DBMS) but have put a wrapper around them
    to do restart and recovery in our proprietary
    fashion.

    All of that has already been factored into the
    arithmetic for the reliability.

    >
    > What is the expected operational lifetime of such a
    > system? Will the development effort take longer than
    > the lifetime of the system? How will such a system
    > accommodate changes in application load over the life
    > of the system? Will you be able to increase system
    > capabilities over time with no downtime?

    Operational lifetime: several (5-8) years. The
    infrastructure development is done and has been
    stable for years. It's been ported several
    times to more modern hardware and OS's in the
    past. The infrastructure includes a feature for
    "software hot slide". (We have guidelines
    for writing the software to allow for a hot slide,
    and hot slide is part of the testing process.)
    We even accommodate limited database schema
    changes between versions of software.
    Additional load is handled by adding additional
    hardware. The software must tolerate having
    multiple instances of itself running on the
    same local network or even on the same box.
    We also enforce a load-shedding strategy when
    the box goes into overload, e.g. if CPU > X%,
    refuse Y% of new requests for Z minutes. Repeat
    or escalate if necessary.
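
    A minimal sketch of that X/Y/Z rule, in Python. This is not our
    production code: the class name, the parameters, and the
    escalation step (add a fixed percentage each time the box is
    still overloaded when a window expires) are all assumptions made
    for the example. The clock and random source are injectable so
    the behaviour can be tested.

```python
import random
import time


class LoadShedder:
    """If CPU > X%, refuse Y% of new requests for Z seconds; repeat or escalate."""

    def __init__(self, cpu_limit_pct, shed_pct, window_s, step_pct=0.0,
                 clock=time.monotonic, rng=random.random):
        self.cpu_limit_pct = cpu_limit_pct   # X: overload threshold
        self.shed_pct = shed_pct             # Y: initial refusal percentage
        self.window_s = window_s             # Z: length of a shedding window
        self.step_pct = step_pct             # hypothetical escalation step
        self.current_pct = 0.0               # refusal percentage in force
        self._shed_until = None
        self._clock = clock
        self._rng = rng

    def shedding(self) -> bool:
        """True while a shedding window is still running."""
        return self._shed_until is not None and self._clock() < self._shed_until

    def report_cpu(self, cpu_pct: float) -> None:
        """Feed in a CPU sample; starts, repeats, escalates, or ends shedding."""
        if self.shedding():
            return  # let the current window run its course
        if cpu_pct > self.cpu_limit_pct:
            if self.current_pct > 0.0:
                # Still overloaded after a window: repeat, escalating by step_pct.
                self.current_pct = min(100.0, self.current_pct + self.step_pct)
            else:
                self.current_pct = self.shed_pct
            self._shed_until = self._clock() + self.window_s
        else:
            self.current_pct = 0.0
            self._shed_until = None

    def admit(self) -> bool:
        """Decide whether to accept one new request."""
        if not self.shedding():
            return True
        return self._rng() * 100.0 >= self.current_pct
```

    Requests already in progress are left alone; only new requests
    are refused, which matches the rule as stated above.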

    >
    > Do the reliability requirements account for natural
    > disasters such as earthquakes on the left coast or
    > hurricanes on the right coast, or are they calculated
    > solely against hardware and software failures due to
    > mechanical wear and software defects? What about power
    > failures on the left coast? Do you need UPS systems
    > that can operate indefinitely in the event of a power
    > grid failure?

    Natural disasters - yes.

    Power Failures - yes, to a limited extent. Our
    customers provide the UPS's since they are the
    ones concerned about reliability. A system
    which is supposed to act as failover/failsafe
    for another one is wired to a different UPS.
    There is also battery backup provided (these
    critters run on 48 Volts DC) and there is a
    generator in the basement. We rejected a certain
    manufacturer's RAID arrays once, because although
    they had 3 power supplies, they only had a single
    power cord.

    We also have statistics which indicate that
    only 20% of system downtime is due to hardware
    failures, while 40% is due to software and
    another 40% is due to operator error. All
    of this is factored in.

    >
    > Your requirements sound like requirements for a space
    > vehicle. Most space vehicles have a lifetime from 5
    > to 20 years with no hope of upgrading or replacing
    > hardware. The hardware alone for such a space
    > vehicle can easily reach or exceed $500,000,000.00.
    > If the software fails the system reliability suffers
    > badly. I have heard rumors of software achieving
    > 99.999% reliability. I have never seen proof that such
    > software actually exists.

    ::grin:: You use such a system every day, Jim.

    Your telephone is connected to a switch which is
    nothing more than a computer system consisting
    of hardware and software which completes your
    calls. 5-9's (99.999%) reliability is the *entry*
    criterion into the market.

    NPL

    P.S. - I didn't mean to give the impression
    that it's *easy* to deliver this kind of system.
    Fully 80% (or more) of the code is the infrastructure
    which keeps the other 20% running in case
    of failures.

    -- 
    "It is impossible to make anything foolproof
    because fools are so ingenious"
      - A. Bloch
    
