Re: Zero operand CPUs
- From: Jeff Fox <fox@xxxxxxxxxxxxxxxxxxx>
- Date: Tue, 17 Mar 2009 11:17:22 -0700 (PDT)
On Mar 17, 1:24 am, Jonathan Bromley <jonathan.brom...@xxxxxxxxxxxxx>
wrote:
Strange. I'm sure you too can see what to do about this -
same as you had to do on the AMD29K which used some of its
huge register set to cache the top of the stack. Implement
the stack in on-chip RAM, using circular addressing. When
the on-chip stack threatens to overflow, "spill" some of
the oldest part of it to main memory, using a fast block
write operation. CPU operations then continue to use the
on-chip stack but the circular addressing no longer overflows.
Similarly, when the stack threatens to underflow, "fill"
from main memory. Way back then, you could get quite good
performance if you were careful to align the spill/fill
operations with a DRAM page. The 29K used software trap
routines to do the spill/fill, but I'm sure you could
do it at least partly in hardware without too much trouble.
Spill/fill can then be done speculatively, in the background,
when there is spare bandwidth on the memory interface.
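A minimal sketch of the spill/fill scheme described above, in Python rather than hardware. The class name, the cache size of 8 cells and the block size of 4 cells are assumptions chosen for illustration; they are not the parameters of the 29K or of any Forth chip.

```python
class StackCache:
    """Top of stack held in a small circular buffer, backed by main memory."""

    def __init__(self, size=8, block=4):
        assert 0 < block <= size
        self.size = size        # cells held on chip (assumed value)
        self.block = block      # cells moved per spill or fill (assumed value)
        self.buf = [0] * size   # on-chip circular buffer
        self.top = 0            # index of the next free slot, modulo size
        self.count = 0          # live cells currently on chip
        self.memory = []        # spilled cells in main memory, oldest first
        self.spills = 0
        self.fills = 0

    def push(self, value):
        if self.count == self.size:
            self._spill()       # make room before the buffer wraps onto live data
        self.buf[self.top] = value
        self.top = (self.top + 1) % self.size
        self.count += 1

    def pop(self):
        if self.count == 0:
            if not self.memory:
                raise IndexError("stack underflow")
            self._fill()
        self.top = (self.top - 1) % self.size
        self.count -= 1
        return self.buf[self.top]

    def _spill(self):
        # Block-write the oldest cells; the oldest sits `count` slots behind `top`.
        oldest = (self.top - self.count) % self.size
        for i in range(self.block):
            self.memory.append(self.buf[(oldest + i) % self.size])
        self.count -= self.block
        self.spills += 1

    def _fill(self):
        # Block-read the most recently spilled cells back in just below `top`.
        n = min(self.block, len(self.memory))
        cells = self.memory[-n:]
        del self.memory[-n:]
        for i, cell in enumerate(cells):
            self.buf[(self.top - n + i) % self.size] = cell
        self.count = n
        self.fills += 1


stack = StackCache(size=8, block=4)
for i in range(20):
    stack.push(i)
print(stack.spills, "spills for 20 pushes")
```

This sketch spills only when the buffer is already full and fills only when it is empty. A speculative design of the kind suggested above would start the transfer earlier, when a threshold is crossed and the memory interface is idle.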
The first generation zero operand Forth machine, Novix,
used three memory busses to be able to manipulate the
data stack, the return stack, and main memory in a
single cycle. The three memory busses with stack pointers
made it easy to switch tasks very quickly.
The 29K description also describes the stack operation on
Chuck Moore's second generation Forth chip, Sh-Boom.
Sh-Boom got 100 Forth MIPS back in 1988, when my
Intel machines got only a few Forth MIPS.
There is a discussion of spill/fill in Koopman's book
on stack machines, where he shows how often stacks spill
depending on how many cells are cached in registers. However,
Moore rejected hardware spill/fill in favor of software
spill/fill in his full custom VLSI zero operand designs.
Unfortunately the stack cache trashes multi-threading
performance, because there is so much context to swap.
That's a good point. Unless you have banks of registers,
stack cells cached in registers reduce task switching
performance. However, the fourth generation machines
were designed for multiprocessing and not
so much for multitasking. Task switches and
interrupts need memory cycles, which take a lot of
time compared to the few picoseconds needed for a
dedicated processor to react to an event.
I guess the correct compromise these days would be very
different, with the stack cache probably about 16 words.
According to Koopman's research, caching eight cells
will result in a spill about 1% of the time. My experience
as director of software at the iTV corporation, developing
Internet appliances, was that it was much less than that
with well designed code.
With only a small stack cache you can keep several processes'
stacks in the on-chip memory (that's harder to plan,
of course, but may still be helpful particularly in
a small system).
Since spill/fill happened so infrequently in this kind of
software, the decision was made to handle it in software, with
code written to do it when needed. For that once-in-a-decade
spill/fill we used software.
I believe the real issue is that the focus is on building a complex
machine and trying "techniques" to make it simple and fast. The more
I look at things like this, the more I am convinced that the Moore
philosophy is right. You can achieve performance by adding more and
more complexity, or you can simplify to the point of inherent speed.
The philosophy mentioned says that there are a dozen things you can
do to simplify the design to reduce cost and power use and increase
speed. To get 700 MIPS in 0.18u using only 20k transistors without
pipelining or memory caching, you have to have a simple design. To
burn thirty times less energy on a given computation than an MSP430,
to get response to events in a few nanoseconds, or to fit a hundred
cores on a tiny low power embedded chip requires a simple design.
Having stacks in registers, packing multiple opcodes per word,
and decoding opcodes while they execute are all examples of
the techniques used.
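To illustrate packing multiple opcodes per word, here is a small Python sketch. The 5-bit opcode width, the four slots per word and the use of opcode 0 as padding are assumptions for the example, not the layout of any particular chip.

```python
OPCODE_BITS = 5   # assumed opcode width
SLOTS = 4         # assumed number of opcodes per instruction word
NOP = 0           # assumed padding opcode


def pack(opcodes):
    """Pack opcodes into words, first opcode in the most significant slot."""
    words = []
    for i in range(0, len(opcodes), SLOTS):
        group = opcodes[i:i + SLOTS]
        group += [NOP] * (SLOTS - len(group))   # pad the last word
        word = 0
        for op in group:
            if not 0 <= op < (1 << OPCODE_BITS):
                raise ValueError(f"opcode out of range: {op}")
            word = (word << OPCODE_BITS) | op
        words.append(word)
    return words


def unpack(word):
    """Return the opcodes of one word in execution order."""
    mask = (1 << OPCODE_BITS) - 1
    return [(word >> (OPCODE_BITS * (SLOTS - 1 - slot))) & mask
            for slot in range(SLOTS)]


words = pack([1, 2, 3, 4, 5])
print([unpack(w) for w in words])   # [[1, 2, 3, 4], [5, 0, 0, 0]]
```

With a layout like this, one memory fetch delivers up to four instructions, so the processor can run several opcodes between accesses to program memory.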
Of course, one of the design points of the third generation
zero operand Forth machines was that the design fell out of
greatly simplifying the compiler. The first cross compiler
added 300 bytes to a Forth system. And the idea
was to simplify both the hardware and the software.
Always provided you have sufficiently smart compilers to
convert complicated real-world code into a suitable stream of
your simple instructions. But in general I think I agree.
Compilers _are_ pretty smart these days.
The compilers the OP was talking about are very simple. The
Forth compilers for the high performance full custom VLSI
Forth chips are relatively simple. I will admit that with
many cores executing streamed instructions, the instruction
stream packet builders are useful. But it is nothing like
dealing with deep pipelines and multi-level memory caches
on complicated processors.
There is a pretty simple, almost one-to-one correspondence
between the Forth source code and the object code. What the OP
mentioned is in contrast to the complex smart compilers that are
used and needed with complex pipelined and cached architectures.
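A rough Python sketch of what an almost one-to-one compiler can look like: each source word is looked up in a table and emitted as one opcode, and a number becomes a literal opcode followed by its value. The opcode numbers, the `LIT` opcode and the function name are hypothetical and are not taken from any real Forth chip or compiler.

```python
# Hypothetical opcode numbers, for illustration only.
OPCODES = {"dup": 1, "drop": 2, "swap": 3, "over": 4,
           "+": 5, "xor": 6, "@": 7, "!": 8}
LIT = 9   # assumed "push the next cell as a literal" opcode


def compile_forth(source):
    """Translate whitespace-separated Forth words into a flat opcode list."""
    code = []
    for token in source.split():
        if token in OPCODES:
            code.append(OPCODES[token])      # one source word, one opcode
        else:
            try:
                value = int(token)
            except ValueError:
                raise ValueError(f"unknown word: {token}") from None
            code.extend([LIT, value])        # number: literal opcode plus value
    return code


print(compile_forth("dup + 3 swap"))   # [1, 5, 9, 3, 3]
```

There is no register allocation, instruction scheduling or other optimization pass here; the object code simply follows the order of the source.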
The full custom VLSI CAD software used to create
Moore's zero operand designs is a good example of his
approach to keeping software simple. The compilers,
OS, and chip design, layout, simulation, and design
rule check software, sufficient for multi-mega transistor
chip design, along with several chip designs and documentation,
fit easily on one floppy disk. This kind of software is a
natural fit to the kind of hardware being designed in this
process.
Best Wishes