Re: "Am I still working okay?" asked the micro controller...

From: Don Taylor (dont_at_agora.rdrop.com)
Date: 05/19/04


Date: Wed, 19 May 2004 14:25:45 -0500


"SelfTest" <SelfTEst> writes:
>Say we have a micro controller with limited memory.
>Say it will perform some realtime control of something.

>How to make a SW for a micro controller, that in addition to its normal
>operation (control of something), from time to time it will also check
>itself if it is doing okay or not ? How a program can test itself? Can
>some one suggest any intelligent method (other than watch dog) ?

I worked on a project substantially larger than a single microcontroller
but the idea we applied might be appropriate. We took a very hard line
on this and the charter of the group was that there were going to be no
bugs delivered to the customers. In some of the functions that we wrote
it was feasible to write one, or a small number of, "sanity checks":
small tests that would evaluate whether the arguments being passed and/or
the state variables had values that were appropriate at that moment.

If a sanity check failed we displayed "Fatal Error nnnnn", where nnnnn
was the program counter at the point where the check failed, and then
we halted the processor.
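
A minimal sketch in C of what such a check might look like. This is not
the code from that project: the macro name SANITY_CHECK, the handler
fatal_error and the motor example are all made up for illustration, and
the source line number stands in for the program counter because reading
the PC is not portable C.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical handler. On a real target this would write to the display
   and halt the processor; here it prints and exits so the sketch can run
   on a desktop machine. */
static void fatal_error(unsigned code)
{
    printf("Fatal Error %05u\n", code);
    exit(EXIT_FAILURE);
}

/* Assumption: __LINE__ is used as "nnnnn" in place of the program counter.
   Either one lets you map the number back to a single check. */
#define SANITY_CHECK(cond) \
    do { if (!(cond)) fatal_error((unsigned)__LINE__); } while (0)

/* Hypothetical limit and state variable for the example. */
enum { MOTOR_RPM_MAX = 6000 };
static int current_rpm = 0;

/* The test itself is kept as a small predicate so it is easy to read
   and easy to reuse for both arguments and state. */
static int rpm_is_sane(int rpm)
{
    return rpm >= 0 && rpm <= MOTOR_RPM_MAX;
}

static void set_motor_rpm(int rpm)
{
    SANITY_CHECK(rpm_is_sane(rpm));          /* argument passed in by the caller */
    SANITY_CHECK(rpm_is_sane(current_rpm));  /* state we already hold */
    current_rpm = rpm;
}
```

Calling set_motor_rpm(1500) passes both checks; calling it with -1 would
print the line number of the first check and stop.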

This had a number of interesting and sometimes unexpected consequences.
The first was that it quickly became the case that nobody wanted to be
the one responsible for passing bad data to someone else's sanity check.
That seemed to result in people being much more careful that they would
not pass bad data. Secondly, it became a very popular thing for people
to carefully craft these checks to keep themselves from being responsible
for a failure. Thirdly, in an embedded environment where everyone is in
a panic to get all the work done, when the box just locks up and you
know it is going to take hours to figure out what just happened, it
seems much more reasonable to hit the reset button and try to get on
with your own work. But when "Fatal Error nnnnn" pops up and in seconds
you can look at the build file and tell exactly where the error happened
and which sanity check failed, you are much more likely to yell
"FATAL ERROR NNNNN!" over the wall. Everybody on the team would cringe,
hoping it wasn't them who had just called that function with bad data.
And the person who had just observed this, plus the person who had
inserted that sanity check, were both "the good guys." This soon led to
adding a sanity check whenever we found the box had crashed in some
strange way and it took hours to realize we hadn't caught some bad case.

But this then led us to being able to test in a novel way. We wrote
some code on a test harness that would hammer the box with random input.
It would poke buttons and send in commands and present data, pretty
much completely randomly, but at 100 commands/second! Within seconds
of trying this a check blew up and we had another Fatal Error nnnnn.
But that let us find and fix an oversight quickly. After a number of
iterations we got to the point where this would run all weekend with
zero failures.
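
A rough sketch in C of that kind of random hammering, run against a toy
device rather than real hardware. Everything here is hypothetical: the
volume-control "device", the command names and the check IDs are invented
for illustration, and a small xorshift generator is used so that a
failing run can be reproduced from the same seed.

```c
#include <stdint.h>

/* Small xorshift32 generator: deterministic, so a failure can be replayed. */
static uint32_t rng_state = 2463534242u;

static uint32_t rng_next(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}

/* Hypothetical device under test: a volume control with range 0..10. */
enum { CMD_UP, CMD_DOWN, CMD_SET, CMD_COUNT };
enum { VOLUME_MAX = 10 };
static int volume = 0;

/* Returns 0 if all is well, or a nonzero check ID at the point where the
   real box would have displayed a fatal error and halted. */
static int device_command(int cmd, int arg)
{
    switch (cmd) {
    case CMD_UP:   if (volume < VOLUME_MAX) volume++; break;
    case CMD_DOWN: if (volume > 0) volume--; break;
    case CMD_SET:  if (arg >= 0 && arg <= VOLUME_MAX) volume = arg; break;
    default:       return 1;  /* check 1: unknown command reached the handler */
    }
    if (volume < 0 || volume > VOLUME_MAX)
        return 2;             /* check 2: state variable out of range */
    return 0;
}

/* Throw random commands and arguments at the device. Stops at the first
   failed check and returns its ID, or returns 0 if none failed. */
static int hammer(unsigned long iterations)
{
    for (unsigned long i = 0; i < iterations; i++) {
        int cmd = (int)(rng_next() % (uint32_t)CMD_COUNT);
        int arg = (int)(rng_next() % 64u) - 32;  /* includes out-of-range values */
        int code = device_command(cmd, arg);
        if (code != 0)
            return code;
    }
    return 0;
}
```

The point of the design is that the harness does not need to know what
correct behaviour looks like; the sanity checks inside the device decide
that, and the harness only has to keep the inputs coming.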

Then the decision was made: we were going to leave all these checks in
the code, live, when we shipped it. Another team working across the
wall with a similar product was horrified, "You don't want your
customers to know you have BUGS, DO YOU?!?!" And our reply was that
they were going to know one way or the other. We shipped. And we
waited. And we waited. All the checks apparently had made us find
almost all the bugs before it went out the door.

One afternoon I did get a call from the marketing rep. He had a message
from the marketing secretary. She had a message from the receptionist.
She had a call from Hughes. They had been using this and it had popped
up "Fatal Error nnnnn" and just locked up. They were so astonished that
they went over to another building, got a camera, brought it back and
took a picture. Then they called. And I got nnnnn from 1500 miles away.
In 30 seconds I knew which check had failed, knew that it was a single
variable, knew it must have been out of range and I could now hammer
the box until I could figure out a way to find and fix that. I did.

After 18 months, with 2000 units of the product in the field being used
by people pretty much full time, we had 3 Fatal Errors reported. I
thought that was pretty much all of them that were ever seen, because
the manual told customers that if they ever saw this they should call
this phone number and tell us that number so we could fix it for them.
I found and fixed those 3 and a number of others that I knew about but
no customer would likely ever see.

The guys across the wall had ten times the support team and didn't
even bother about bugs that didn't crash the box, and when it did crash,
they just cycled the power and went on. I even tried to get marketing
to offer a campaign: I'd PAY customers for the first Fatal Error found.
They squashed that; it would have made the other team look bad.

One other item that helped with the sanity checks: we filled all memory
with 0xAAAA initially, and filled it again whenever some memory was
released. That oddball value was unlikely to be a reasonable value for
most state variables, so uninitialized or stale data was more likely to
fail a sanity check.
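
A small sketch in C of how the fill pattern and a sanity check work
together. The struct, its field ranges and the function names are all
hypothetical and chosen only for illustration; the one detail taken from
the above is the 0xAAAA fill value.

```c
#include <stddef.h>
#include <stdint.h>

#define FILL_PATTERN 0xAAAAu

/* Fill a region, one 16-bit word at a time, with the oddball pattern. */
static void fill_words(uint16_t *mem, size_t count)
{
    for (size_t i = 0; i < count; i++)
        mem[i] = FILL_PATTERN;
}

/* Hypothetical state record. Both fields have small valid ranges, so
   0xAAAA (43690) is far outside either of them. */
struct channel_state {
    uint16_t gain;  /* valid range 0..100 */
    uint16_t mode;  /* valid range 0..3 */
};

static int channel_state_is_sane(const struct channel_state *s)
{
    return s->gain <= 100 && s->mode <= 3;
}

/* On release, overwrite the record so that any later use of the stale
   data trips the sanity check instead of quietly "working". */
static void channel_state_release(struct channel_state *s)
{
    fill_words((uint16_t *)s, sizeof *s / sizeof(uint16_t));
}
```

A record that has been filled but never initialized fails
channel_state_is_sane(), and so does one that is used after
channel_state_release(). With zero-filled memory both of those mistakes
would have passed the range checks.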


