Re: non load/store architecture?



Brandon J. Van Every wrote:
David Brown wrote:
But in a lot of code it does matter.

Hold onto that thought, that it matters in "a lot" of code.

Take the
simple C code "x = 123456;", where "x" is a 32-bit global variable. On
the ColdFire, this compiles to:

move.l #123456,%d0
move.l %d0,x

Two instructions, each 6 bytes long, each executing in 1 clock (plus a
write access to memory).

On the PPC, this compiles to:

lis %r0,0x1
ori %r0,%r0,57920
lis %r9,x@ha
stw %r0,x@l(%r9)

(See what I mean about ColdFire code being nice and clear?)
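For readers unfamiliar with the PPC sequence above, here is a minimal sketch of how the 32-bit constant and the address of x are split into 16-bit halves. The function names are hypothetical, chosen only for this illustration. The @ha/@l arithmetic follows the usual convention that the store displacement is sign-extended, so the high half is adjusted upwards when the low half is negative.

```python
def split_lis_ori(value):
    """Split a 32-bit constant into the (high, low) halves used by lis + ori."""
    value &= 0xFFFFFFFF
    high = value >> 16        # placed in the upper 16 bits by lis
    low = value & 0xFFFF      # OR-ed into the lower 16 bits by ori (zero-extended)
    return high, low


def split_ha_l(address):
    """Split a 32-bit address into (@ha, signed @l) parts for lis + stw."""
    address &= 0xFFFFFFFF
    low = address & 0xFFFF
    # The 16-bit displacement in stw is sign-extended by the processor.
    signed_low = low - 0x10000 if low & 0x8000 else low
    # @ha adds 0x8000 first, so the high half is bumped by one when the low half is negative.
    high_adjusted = ((address + 0x8000) >> 16) & 0xFFFF
    return high_adjusted, signed_low


high, low = split_lis_ori(123456)
print(hex(high), low)  # 0x1 57920, matching the lis/ori operands above
```

The address 0x12348000 used in the check below is an arbitrary example, not the real address of x.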

Sure, but I got used to the load low, load high drill on the Alpha
easily enough.

That's four instructions, each 4 bytes, each executing in 1 clock (plus
a write access to memory).

The ColdFire generates more compact code, running at twice the speed for
the same clock. That's what I mean by greater work done per clock.
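The comparison can be tallied with a few lines of Python. This is only a sketch using the instruction sizes and the one-clock-per-instruction figure quoted above. It ignores the memory write and any pipeline effects, and the names SEQUENCES and cost are made up for the illustration.

```python
# Instruction sizes in bytes, taken from the figures quoted in this thread.
SEQUENCES = {
    "ColdFire": [6, 6],      # move.l #imm,%d0 ; move.l %d0,x
    "PPC": [4, 4, 4, 4],     # lis ; ori ; lis ; stw
}


def cost(sizes, clocks_per_instruction=1):
    """Return (total code bytes, total clocks) for a straight-line sequence."""
    return sum(sizes), len(sizes) * clocks_per_instruction


for name, sizes in SEQUENCES.items():
    code_bytes, clocks = cost(sizes)
    print(f"{name}: {code_bytes} bytes, {clocks} clocks")
# ColdFire: 12 bytes, 2 clocks
# PPC: 16 bytes, 4 clocks
```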

So it's faster at loading an immediate 32-bit constant. Big deal!
That is not an important job. The RISC tradeoff is legitimate here.

The load was, of course, just a simple example.

You don't really need to load immediate constants very much, so it is
acceptable to do an "instruction dance," that doesn't typically run
slower anyways, because the pipelining masks the latency. So you get
to keep all your instructions 4 bytes long, which simplifies your
decoder, and also your code cache alignment.


Pipelining increases latency, but lets the instructions overlap, which may or may not hide the extra delay, depending on when the loaded constant is needed in the following instruction stream, and how superscalar the CPU is (you know this better than I do, but I'm learning a little here).

Consistent instruction lengths certainly simplify the decoder and alignments - but it costs code space when 32 bits is too much, and extra instructions when 32 bits is too little. Is this cost worth paying? As always, it depends on the typical use of the processor, and on the details of the implementation.

Certainly the x86's widely variable instruction lengths with only 8-bit alignment are a poor match for caches and fast decoders. The ColdFire has 16-bit alignment, and instructions are at most 3 words long - a compromise solution. The original m68k design, especially the advanced cores like the 68040, had several more addressing modes that could lead to longer instructions. When designing the ColdFire, Freescale (then Motorola) removed the more complex modes exactly so that they could get a faster, simpler decoder and execution engine.

The Alpha had instructions for loading immediate constants that were
more likely to matter. 16-bit constants could be done in 1
instruction, and 3 or 4 bit constants were typically part of the
instruction itself.


And the ColdFire can load 8-bit constants as part of a single 16-bit instruction word (v4 cores can also store a 3-bit constant directly to memory in a single instruction, without passing through a register).

It's certainly the case that many data constants are small, and it's important to optimise the ISA for that case. But addresses are often 32-bit (small data segments can help here - I don't know about the Alpha, but the PPC certainly uses them), and need to be loaded. The code "x = 1;" takes 3 instructions on the PPC, and 1 on the v4 ColdFire (12 bytes vs. 6 bytes, and 3 clocks vs. 1, excluding the actual write). Of course, real code sequences are unlikely to exhibit such a difference.


Now I suppose if you design CPUs with almost no cache, you might care
about instructions being small. But then, you're not designing a
performance CPU anyways. So who's gonna care about the performance?
"Good" won't mean optimization, it'll mean low power or cheap to
manufacture or something.

We are clearly coming at this from different experiences, if OpenGL
drivers on an Alpha are typical for your programming, while I work
mostly with smaller processors (the ColdFire I am using at the moment
has no cache - all its flash and sram are internal, with single cycle
access). But performance is very important to small systems - high
performance means you can use slower clock speeds, leading to lower
power, lower EMI, and cheaper components. It might not be the most
important factor, but it is still there.

Yep, very different. Part of why I started posting here, to figure out
what's different about "the kind of ASM I know" vs. "the kind of ASM
embedded engineers typically do."


It's always interesting to hear different viewpoints. I have never used an Alpha (though I've read nice things about it), and my PPC experience is as a microcontroller core.

Even on cached processors, small code means better use of the cache.
Critical loops will (should!) fit within even a small instruction cache,
but programs consist of more than their critical loops. A complete
instruction cache miss might mean a stall of a hundred or more clock
cycles (which might be worth twice that in instruction counts on a
superscalar processor) - there is a reason why more expensive processors
have larger instruction caches. More compact code gives the same
benefits as a larger cache.

Not unless you can *really* compact the code.


I could not give you any figures without making things up. But are you familiar with the ARM, and its Thumb mode? The ARM is a 32-bit pure RISC design (16 registers, each 32 bits wide, and a fixed 32-bit instruction size with "Ra = Rb op Rc" style instructions). It also has a "Thumb" mode, in which the instructions are 16 bits wide, with immediate data or addresses as extension words, and using a smaller register set and "Ra = Ra op Rb" instructions. These Thumb instructions are translated into full ARM instructions by an extra decoder. The reason for having the Thumb mode is to get significantly smaller code, for embedded systems. In general, the Thumb code is slightly slower than pure ARM code, but if the bandwidth to the code store (i.e., Flash) is low, then the Thumb code is faster.

So somebody at ARM thought a compromise ISA (closer to the ColdFire than the full ARM) was worth the effort, at least for embedded systems.
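The Thumb trade-off can be illustrated with a toy model. This is a sketch under loudly stated assumptions, not a measurement. The instruction counts (Thumb needing 30% more instructions than ARM for the same work) and the fetch bandwidths are invented for the illustration. The model also assumes one clock per instruction once fetched, and that fetch and execution overlap perfectly.

```python
def run_time_cycles(instr_count, bytes_per_instr, fetch_bytes_per_cycle):
    """Cycles for a straight-line block, limited by execution or by code fetch."""
    execute_cycles = instr_count * 1  # assumption: 1 clock per instruction
    fetch_cycles = instr_count * bytes_per_instr / fetch_bytes_per_cycle
    return max(execute_cycles, fetch_cycles)


# Hypothetical workload: these counts are illustrative, not measured.
ARM_INSTRS = 1000     # 4-byte instructions
THUMB_INSTRS = 1300   # 2-byte instructions, assumed 30% more of them

for bandwidth in (4.0, 1.0):  # bytes fetched from the code store per clock
    arm = run_time_cycles(ARM_INSTRS, 4, bandwidth)
    thumb = run_time_cycles(THUMB_INSTRS, 2, bandwidth)
    print(f"{bandwidth} bytes/clock: ARM {arm:.0f} cycles, Thumb {thumb:.0f} cycles")
```

With these made-up numbers, ARM code wins when the code store delivers a full 32-bit word per clock, and Thumb code wins when fetch bandwidth drops to one byte per clock, which is the shape of the argument above.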

mvh.,

David



Cheers,
Brandon Van Every
