Re: books for embedded software development
- From: Alessandro Basili <alessandro.basili@xxxxxxx>
- Date: Fri, 16 Dec 2011 06:49:14 +0100
On 12/14/2011 9:22 PM, Don Y wrote:
1. not fulfilling the specification; the software is supposed to provide
compressed data for stellar fields and it simply does not.
Is there a *real* specification? Or, just a general "goal"?
I.e., is there anything to test against?
And indeed I'm working on that as well, at least to get the numbers right.
I believe by now the expectations are way beyond what the current hardware design can deliver.
2. "unstable"; after few hours of operation in non-compressed mode the
software hangs and a hardware reset is needed.
(speaking with zero knowledge of the application) This is suggestive
of memory (management) problems, counter overflows or "deadly embrace".
(I mention these simply to get a feel for what "services" your
application can benefit from)
The DSP program has bootstrap code which loads a very minimal utility
program that we call "loader". The loader is capable of executing a few
commands like "write flash", "read flash" (which we use to upload new
software) and "jump to main", where "main" is our main application.
After a few hours running in the main application, it looks like the hardware
goes back to the loader, as if there were a hard RESET that would bootstrap
the loader. In the main program there is no intention to jump back to
the loader, that's why this looks a bit strange.
We experienced the same behavior on the ground, so the magic bit flip is
harder to sell!
3. structureless; after a code review it is clear that debugging it
would be more costly than redesigning it from scratch.
(sigh) That is probably the case. Even (just a) disciplined developer
would impart *some* structure to his/her code. And, from your other
comments, it seems like this was an "ad hoc" team effort so any
attempt at structure was lost in the noise.
In the developer's defense I must admit he had neither experience nor a
mentor. I must say though that I believe a good amount of
self-criticism often compensates for the lack of knowledge, once you realize
you don't know much after all.
4. full of logical flaws; synchronization problems are the most common
mistakes, but interrupt service routines are excessively and needlessly
long, essentially preventing any time analysis.
Synchronization problems can be avoided with a disciplined design.
This would also help identify cases of priority inversion.
I am actually very sensitive to this problem. I do believe that for
this kind of application the complexity of multi-tasking or
multi-threading is not necessary and a simple hierarchical state machine
may get the job done, but since I have to serve the serial port in a
timely fashion I'm not quite sure I could control the timing of the FSM.
I personally believe that interrupts should be setting flags and that's
it; that way the synchronization is handled entirely at the FSM level
and I don't have to chase funny combinations of interrupts occurring at
various moments (how would I test that???).
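A rough sketch of what I mean, in C (all names are hypothetical, and the
"ISR" here is just a function a host PC demo can call to simulate the
interrupt):

#include <stdint.h>
#include <stdio.h>

static volatile uint8_t uart_rx_flag = 0;   /* set by the ISR, cleared by the FSM */
static volatile uint8_t uart_rx_byte = 0;   /* latest byte captured by the ISR */

/* ISR: capture the data, raise the flag, return immediately. */
void uart_isr(uint8_t data_register_value)
{
    uart_rx_byte = data_register_value;
    uart_rx_flag = 1;
}

/* One FSM step, called from the main loop: all processing lives here. */
void fsm_step(void)
{
    if (uart_rx_flag) {
        uart_rx_flag = 0;                           /* consume the event */
        printf("got byte 0x%02X\n", uart_rx_byte);  /* placeholder for real handling */
    }
    /* ... other states / transitions ... */
}

int main(void)
{
    uart_isr(0x5A);   /* simulate one interrupt */
    fsm_step();       /* the FSM picks it up on its next pass */
    return 0;
}

The only shared state is a pair of volatile byte-wide flags; on most targets a
single byte is written atomically, so there is nothing to protect with critical
sections and no timing-dependent interaction to reproduce in a test.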
Long ISRs are usually a consequence of a lack of structure. No
mechanism to systematically pass data/events between foreground
and background so the "processing" finds its way into the ISR.
5. overly complicated in the commanding interface; the software tries to
handle a command queue with no control over the queue itself (queue
reset, queue status).
This also seems like it should have been apparent in the specification
(else how does the commanding agent know what rules *it* must live by??)
What was specified in the specs was that for every command there should
be a reply, but unfortunately here also it was not clear when the reply
would come, since some processes are slower than others and so on. Given
the fact that the software had a "command queue" it would have been
possible to ignore the reply to the first command and keep sending the
following ones.
During the implementation, commands were added whenever the developer
needed one more command to accomplish what he wanted to do. The result
is a bunch of commands, each of them with its own parameter interface
and no clear indication of how each of them is processed.
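Just to illustrate what a uniform commanding interface could look like (this is
a hypothetical framing, not the existing on-board format): every command carries
an opcode, a sequence number and a payload length, and the reply echoes the
sequence number, so the sender can match replies to commands even when some
processes are slower than others.

#include <stdint.h>

typedef struct {
    uint8_t  opcode;      /* which command */
    uint8_t  sequence;    /* echoed verbatim in the reply */
    uint16_t length;      /* payload bytes that follow this header */
} cmd_header_t;

typedef struct {
    uint8_t  opcode;      /* same opcode as the command */
    uint8_t  sequence;    /* same sequence number as the command */
    uint8_t  status;      /* e.g. 0 = accepted, 1 = completed, 2 = rejected */
    uint8_t  reserved;
} cmd_reply_t;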
6. lacking control over the CCD; there's a register implemented in the
hardware which gives control over the CCD functionalities, but has been
ignored in the current implementation.
<frown> Sorry, I don't understand the role it plays (in the hardware
or system itself). On the surface, it seems to suggest: "How the
hell could the device *function* if it has no control over its CCD?"
My bad, I should have elaborated on this a little more. The CCD is
actually only the sensor, while an FPGA controls the way the chip shoots
the picture and samples the data, and then an ADC converts it to a digital
picture. This FPGA is controlled via a register that is write/read
accessible from the DSP. So if I want to integrate the light over a longer
time I could send a command to the FPGA by simply writing a few bits in this
register.
Now we are trying to assess what the register looks like (by reading the
VHDL); of course no documentation provides any detail about it.
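Once the VHDL tells us the layout, driving it from the DSP should be little
more than a read-modify-write of a memory-mapped register. A sketch, with a
made-up address and a made-up bit field (to be replaced by whatever the VHDL
actually says):

#include <stdint.h>

#define FPGA_CCD_CTRL      (*(volatile uint16_t *)0x00A00000u)  /* hypothetical address */
#define CCD_INT_TIME_MASK  0x00F0u                               /* hypothetical field */
#define CCD_INT_TIME_SHIFT 4

void ccd_set_integration_time(uint16_t code)
{
    uint16_t reg = FPGA_CCD_CTRL;                 /* read current settings */
    reg &= (uint16_t)~CCD_INT_TIME_MASK;          /* clear the integration-time field */
    reg |= (uint16_t)((code << CCD_INT_TIME_SHIFT) & CCD_INT_TIME_MASK);
    FPGA_CCD_CTRL = reg;                          /* write the new settings back */
}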
7. reinventing the wheel with basic functions; the C-runtime library
provided by AD is completely ignored and a long list of functions has
been implemented apparently for no reason.
Are you *sure* that the library is disused OUT OF IGNORANCE? (from
your other comments, this seems like it is probably the case) But,
be aware that sometimes vendor supplied libraries are provided as
check-off items and often not suited to a particular environment
I understand that, but I would at least start by reading the source
code of the basic functions. They might be a check-off item, but I
believe they are worth using as a first approximation.
8. not utilizing the available bandwidth; there's a serial port through
which the DSP can write its data to an output buffer, but the bandwidth
available is reduced to essentially 256/1920 due to the handshake
protocol implemented (where on the other side no one is really
performing any hand shake).
Huh? I assume you mean the "other party" has no *need* for the
handshaking (so it is wasted overhead)? Is the "handshaking"
intended as a pacing mechanism or as an acknowledgement/verification mechanism?
I believe it was intended as a pacing mechanism, since nobody is
verifying anything on the "other party". But the format of the message
didn't allow more than 256 bytes, effectively reducing what we can send
to 256 bytes out of the 1920 bytes/sec available.
We have a GPS on board which is continuously sending data and may vary
the type of messages it sends according to configuration. In this
case there's no pacing, but it fully utilizes the bandwidth.
This limit poses a very hard constraint on
the science data, to the point where the information is not enough to
reconstruct pointing direction with the accuracy needed.
Meaning you can't get data often enough to point the satellite
(or your instrument therein) *at* the correct target?
We reconstruct pointing on the ground, i.e. every picture comes with a
timestamp and the N brightest stars in the field of view. The bigger
N is, the better the ground algorithm recognizes the stellar field.
Moreover, the higher the sampling frequency the higher the accuracy
(less need for interpolation). But all these factors increase the
volume of data that needs to be transferred.
10. not designed for testing; there are a lot of functions that are not
observable and there's no logging mechanism in the code for tracing
either. It's what I usually call a "plug and pray" system.
I think there are other items that would call for a redesign, such as lack of
documentation, lack of a revision control system, lack of test campaigns,
lack of tools to work with the software, ...
All these are "strongly desired" -- especially with the stakes as they
are (I suspect it is very difficult/costly to get instruments flying!)
Regarding testing, I had in mind to add a tracing mechanism (a sort of
printf) that would fill a log with some useful information that can be
dumped regularly or on request. The implementation shouldn't add too
much overhead, but I believe that if used with care it can give great
insight into the program flow. As an example it would be possible to log how
much time is spent in each function.
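A minimal sketch of what I have in mind (the event codes and the timestamp
source are placeholders; on the real DSP the timestamp would come from a
free-running timer register):

#include <stdint.h>

#define TRACE_DEPTH 256          /* power of two so the index wraps cheaply */

typedef struct {
    uint32_t timestamp;          /* timer ticks when the event was logged */
    uint16_t event_id;           /* e.g. FUNC_ENTER, FUNC_EXIT, CMD_RECEIVED */
    uint16_t arg;                /* small payload: function id, opcode, ... */
} trace_entry_t;

static trace_entry_t trace_buf[TRACE_DEPTH];
static volatile uint16_t trace_head = 0;

static uint32_t read_timer(void) { return 0; /* placeholder for a timer register */ }

void trace(uint16_t event_id, uint16_t arg)
{
    uint16_t i = trace_head;
    trace_buf[i].timestamp = read_timer();
    trace_buf[i].event_id  = event_id;
    trace_buf[i].arg       = arg;
    trace_head = (uint16_t)((i + 1u) & (TRACE_DEPTH - 1u));   /* overwrite oldest */
}

Dumping trace_buf on request then gives the after-the-fact view, and bracketing
a function with a pair of trace() calls gives its execution time for free.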
The shuttle flight to launch our experiment costs 500M$. The cost of the
experiment is estimated at 2B$.
Then, add to that list any "issues" that have been uncovered but
not yet (satisfactorily) addressed in the conceptual design of
the device. E.g., any "mysterious/anomalous" behaviors that
may have been noticed (and possibly "resolved themselves" *or*
were handled by resetting the device).
IMO, you'll get more *practical* return for this time invested
than rethinking an implementation methodology (which you might
I agree that redesign is not the solution in itself; if you want, it is just a
means to get it working. We normally follow an "iterative and incremental"
model, in order not to invest too much time in the "wrong design" and to
leave room for adjustment along the way.
I tend to favor heavily front-loaded processes -- putting lots of
effort into nailing down all the details in a specification which
can then be followed, almost blindly. But this requires folks
who are good at challenging assumptions to be able to foresee
the things that can go wrong -- and fortify the specification
against them. Frankly, I don't know how else to develop especially
in an environment where you have no physical control over the
"what if's" (what if the spacecraft is pointed the wrong way?
what if communications are interrupted at this point? etc.)
I understand your point and I'm not denying that investing in a
well defined set of specs and a good design pays off later on. I also
believe you have to factor in the personal background each member of the
team has. It is very hard to change the way people work; after all we
human beings are fundamentally lazy animals ;-)
In addition to serving as a map that "implementors" (coders?) can
"just follow", a good spec gives you a contract that you can
test against. And, for teams large enough to support a division
of functions, lets the "test team" start designing test platforms
in which the *desired* functionality can be verified as well as
*stressed* in ways that might not have been imagined. This can
speed up deployment (if all goes well) and/or bring problems in
the design (or staff!) out into the open, early enough that
you can address them ("Sheesh! Bob writes crappy code! Maybe
we should think of having Tom take over some of his duties?"
or "Ooops! This compression algorithm takes far too long to
execute. Perhaps we should aim for lower compression rates
in favor of speedier results?")
This is why we actually prefer an iterative and incremental
approach: the early testing makes us go back to better define the
specs and adjust the aim along the way.
A waterfall model may result in problems if the specs are not
thoroughly checked and at the same time are engraved in stone.
Actually I tried to get the chance to have some good references to foster
That can be a good approach -- depending on the dynamics of
your particular team. What you want to avoid is the distraction
of people focussing on "arguing with the examples/guidelines"
instead of learning from them and modifying them to fit *your* needs.
[think about how "coding guidelines" end up diverting energy
into arguing about silly details -- instead of recognizing the
*need* for SOME SORT OF 'standard']
Gee, that's another thing that knocked me off. I don't blame people who have a
different coding style, as long as they have one. Lower and upper case
seem to be one of people's favorite pastimes. It seems to me they let FATE
decide what the case of the next letter in the word will be.... arghhh!
I like to approach designs by enumerating the "things it must do"
from a functional perspective. I.e., "activities": control the
transducer, capture data from the transducer, process that data,
transmit that data... (an oversimplification of your device -- but
I don't know enough about it to comment in detail). Note these
are all verbs -- active.
Personally I believe that block diagrams fulfill the need pretty well;
if some timing is needed, then a waveform-like drawing with a
cause-effect relationship between signals may help a lot in understanding it.
What I've always seriously doubted is a flow chart of the program. They
rarely match what the program is doing (also because it would be nice to
see how to include your interrupts in a flow chart) and they often give the
impression that once you have it done the software is "automatically
generated". I personally have never seen a flow chart which corresponds
1:1 to the program; maybe it's just my lack of experience.
Then, identify the communications between these "activities".
And, the resource requirements of each.
[this is all informal, "shirt-cuff" at this point]
This gives me an idea of how finely I can partition the design.
The resource requirements tell me what constraints exist in
terms of how much can happen concurrently. E.g., do I have
enough memory/CPU to *collect* (new) data while processing
(previous) data AND transmitting (old) data? If not, what
value judgement can I make to best use the resources that I *do*
have to maximize the functionality of the device? (e.g., if
I have lots of memory but very little CPU, I might prefer
to gather as much raw data *now* -- while some observable event
is happening -- and worry about processing and transmitting it later).
Believe it or not, the memory mapping of the board was the first document
I wrote and it is still incomplete (I'm not yet sure about a few FPGA registers!).
This simple document allowed me to understand how the memory is intended
to be used.
This gives my first rough partitioning of tasks/threads/processes
and "memory regions". It also shows where the data is flowing and
any other communication paths (even those that are *implicit*).
Synchronization needs then become obvious. And, performance
bottlenecks can be evaluated with an eye towards how the
design can be changed to improve that.
I think synchronization is really complex whenever you get down to the
multi-threading business and/or have multiple interrupts to service. Given
the old technology and, luckily, very little support for an OS (I haven't
found any), I was aiming for a very simple, procedural design which
I believe would be much easier to test and to check against the specs.
To back up this motivation a bit more, I just finished writing an
extremely simple program to toggle a flag through the timer counter
interrupt. The end result is that I failed to get the period I wanted and,
moreover, it is clear that interrupts are lost from time to time.
Since in this last case I was kicking the dog with this flag, I actually
couldn't care less if I lost an interrupt, as long as the period is
enough to keep the dog quiet. But I got discouraged by a post on
comp.dsp which stated: "This is embedded ABC basics: don't kick a dog in
the interrupts.", but no motivation was given.
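I suppose the usual motivation (I'm only guessing, the post didn't say) is that
if the ISR services the watchdog, the dog stays quiet even when the main loop
is hung, so the one fault the watchdog exists to catch is never caught.
Something along these lines instead, with hypothetical names:

#include <stdint.h>

static volatile uint32_t tick_count = 0;

/* Timer ISR: it keeps firing even if the main loop is hung,
 * so it must NOT be the thing that services the watchdog. */
void timer_isr(void)
{
    tick_count++;
}

/* Placeholder: on the real hardware this writes the watchdog register. */
static void kick_watchdog(void)
{
}

void main_loop(void)
{
    uint32_t last_kick = 0;
    for (;;) {
        /* ... FSM steps, serial handling, science processing ... */
        if ((tick_count - last_kick) >= 10u) {   /* e.g. every 10 timer ticks */
            kick_watchdog();   /* reached only if the loop is still turning */
            last_kick = tick_count;
        }
    }
}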
Now my point is, how much time should I invest in making it work rather
than exploring a totally different path? If I had infinite time I
would probably try to make this stupid interrupt work the way I expect,
but these details may delay the project a lot, if not irreversibly.
E.g., if an (earth-based) command station has to *review*/analyze
data from the device before it can reposition/target it, then
the time from collection thru processing and transmission
is a critical path limiting how quickly the *overall* system
(satellite plus ground control) can react. If the times
associated with any of those tasks are long-ish, you can
rethink those aspects of the design with an eye towards
short-cutting them. So, perhaps providing a mechanism
to transmit *unprocessed* data (if the processing activity
was the bottleneck) or collect over an abbreviated time window
(if the collection activity was the bottleneck).
This reinforces my personal opinion that having the design in block-diagram
form would make these kinds of "shifts" easier to see.
Annotating the functions with processing times may give a lot more detail to
help avoid or bypass bottlenecks.
Once the activities and communications are identified, I
look to see what services I want from an OS -- and the
resources available for it. IMO, choice of services goes
a *LONG* way to imposing structure on an application!
And, it acts as an implicit "language" by which the
application can communicate with other developers about
what its doing at any particular place.
E.g., do you need fat pipes for your communications?
Are you better off passing pointers to memory regions
to reduce bcopy()'s? Can you tolerate shared instances
of data? Or do you *need* private instances? How finely
must you resolve time? What are the longest intervals
that you need be concerned with?
Here I'm a strong supporter of statically allocated memory, unless it is
not enough. Dynamically allocated memory is one of the things most
programmers end up getting wrong, often without even noticing it. Countless
times I have seen a malloc in the middle of a function which was
never free'd at the end.
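A sketch of the kind of thing I mean, with made-up sizes: every buffer exists
at link time and "allocation" is just marking a slot as in use, so there is
nothing to forget to free back to a heap.

#include <stddef.h>
#include <stdint.h>

#define POOL_SLOTS    8
#define SLOT_BYTES  256

static uint8_t pool_data[POOL_SLOTS][SLOT_BYTES];
static uint8_t pool_used[POOL_SLOTS];        /* 0 = free, 1 = in use */

uint8_t *pool_acquire(void)
{
    size_t i;
    for (i = 0; i < POOL_SLOTS; i++) {
        if (!pool_used[i]) {
            pool_used[i] = 1;
            return pool_data[i];
        }
    }
    return NULL;                             /* pool exhausted: caller must handle it */
}

void pool_release(uint8_t *buf)
{
    size_t i;
    for (i = 0; i < POOL_SLOTS; i++) {
        if (pool_data[i] == buf) {
            pool_used[i] = 0;
            return;
        }
    }
}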
Shared memory is also something that should be avoided unless needed,
IMO. What is the advantage of sharing memory when there's enough
for private instances?
I would also set aside some resources for (one or more)
"black boxes". These can be invaluable in post-mortems.
What do you mean by post-mortem?
Ideally, if you have a ground-based prototype that you
can modify, consider adding additional memory to it
(even if it is "secondary" memory) for these black boxes.
Having *lots* of data can really help identify what is
happening when things turn ugly. (much easier than
trying to reproduce a particular experiment -- which
might not be possible! -- so you can reinstrument it!)
Litter your code with invariant assertions so you see
every "can't happen" when it *does* happen! :>
I agree, and this is why I believe I would need to add a logging
capability, in order to see what happened after the fact.
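For the "can't happen" cases, a trivial assertion macro could feed the same
log instead of halting, so the event survives for the post-mortem dump (this
builds on the tracing sketch above; the names are again hypothetical):

#include <stdint.h>

void trace(uint16_t event_id, uint16_t arg);   /* from the tracing sketch above */

#define EVT_ASSERT_FAILED  0xDEADu

#define ASSERT(cond, code)                          \
    do {                                            \
        if (!(cond)) {                              \
            trace(EVT_ASSERT_FAILED, (code));       \
        }                                           \
    } while (0)

/* usage: ASSERT(queue_len <= QUEUE_MAX, 0x0001); */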
I never thought about changing the hardware, since I've always believed
that adding hardware prevents me from building the tools and maybe
the software structure. A good example is the emulator. I see people
increasingly developing with the emulator; then they have to unplug it
and give the product away, and now they don't have any tool to assess its
state/functionality, since they were always used to working with the emulator
and cannot work without it anymore.
Finally, testing is critical. The goal *should* be to
"break" the device. Really! And then look at the
conditions in which it did "break" and see how those
relate to your actual deployment ("Well, the system
crapped out when the 40KHz signal from the switching
power supply was coupled to the NMI pin...")
Likely here we have some experience in "breaking" things, but jokes
aside I like the idea of testing not to check if it works but to check
when it *does not* work. A simple example in that regard: we have a list
of commands on board the main computer that cannot exceed 256 items.
Since everybody knew that 256 was the limit, no one ever tried to send
more, up to the point when by mistake it happened and the main
application had to be rebooted, due to a problem on the 256th item!
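A tiny host-side sketch of that kind of boundary test (the queue here is a
stand-in, not the on-board code): push exactly 256 items, then check that the
257th is rejected cleanly.

#include <assert.h>
#include <stdio.h>

#define MAX_ITEMS 256

static int queue[MAX_ITEMS];
static int queue_len = 0;

static int queue_push(int item)
{
    if (queue_len >= MAX_ITEMS)
        return -1;              /* reject instead of corrupting memory */
    queue[queue_len++] = item;
    return 0;
}

int main(void)
{
    int i;
    for (i = 0; i < MAX_ITEMS; i++)
        assert(queue_push(i) == 0);     /* items 1..256 must be accepted */
    assert(queue_push(999) == -1);      /* item 257 must be rejected cleanly */
    printf("boundary test passed\n");
    return 0;
}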
Testing in order to break it actually builds up the reliability of the
software, which otherwise looks fragile, so depending on everything else