Re: non load/store architecture?




David Brown wrote:
But in a lot of code it does matter.

Hold onto that thought, that it matters in "a lot" of code.

Take the simple C code "x = 123456;", where "x" is a 32-bit global
variable. On the ColdFire, this compiles to:

    move.l #123456,%d0
    move.l %d0,x

Two instructions, each 6 bytes long, each executing in 1 clock (plus a
write access to memory).

On the PPC, this compiles to:

    lis %r0,0x1
    ori %r0,%r0,57920
    lis %r9,x@ha
    stw %r0,x@l(%r9)

(See what I mean about ColdFire code being nice and clear?)

Sure, but I got used to the load low, load high drill on the Alpha
easily enough.
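
For anyone who hasn't met the load-high / load-low drill, here is a minimal C sketch of the arithmetic behind the PPC sequence quoted above. The helper names (hi16, lo16, ha16, lis_ori, ha_lo_address) are made up for illustration. ha16 assumes the usual PowerPC ELF meaning of @ha, where the high half is bumped by one whenever the low half will be sign-extended as a negative displacement.

```c
#include <stdint.h>

/* Upper and lower 16-bit halves of a 32-bit value. */
uint16_t hi16(uint32_t v) { return (uint16_t)(v >> 16); }
uint16_t lo16(uint32_t v) { return (uint16_t)(v & 0xFFFFu); }

/* "High adjusted" half (assumed @ha semantics): compensates for the
   low half being sign-extended when used as a store displacement. */
uint16_t ha16(uint32_t v) { return (uint16_t)((v + 0x8000u) >> 16); }

/* Value left in rD by "lis rD,hi" followed by "ori rD,rD,lo". */
uint32_t lis_ori(uint16_t hi, uint16_t lo)
{
    uint32_t r = (uint32_t)hi << 16;  /* lis: immediate goes in the top half */
    return r | lo;                    /* ori: OR in the zero-extended low half */
}

/* Address formed by "lis rA,ha" followed by "stw rS,lo(rA)". */
uint32_t ha_lo_address(uint16_t ha, uint16_t lo)
{
    uint32_t disp = lo;
    if (lo & 0x8000u)
        disp |= 0xFFFF0000u;          /* sign-extend the 16-bit displacement */
    return ((uint32_t)ha << 16) + disp;  /* wraps modulo 2^32 */
}
```

With 123456 (0x1E240), hi16 gives 0x1 and lo16 gives 57920, which are the operands of the lis and ori in the listing.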

That's four instructions, each 4 bytes, each executing in 1 clock (plus
a write access to memory).

The ColdFire generates more compact code, running at twice the speed for
the same clock. That's what I mean by greater work done per clock.
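
The tally behind that claim can be written out as a small C sketch. It only uses the figures given in the post (instruction counts, bytes per instruction, 1 clock each); the struct and function names are hypothetical, and the memory write is left out because both sequences pay for it.

```c
/* One straight-line instruction sequence: how many instructions,
   and how many bytes each one occupies. */
struct seq {
    unsigned count;
    unsigned bytes_each;
};

/* Total code size of the sequence in bytes. */
unsigned seq_bytes(struct seq s) { return s.count * s.bytes_each; }

/* Assumes 1 clock per instruction, as stated in the post, and
   ignores the memory write that both sequences share. */
unsigned seq_clocks(struct seq s) { return s.count; }

const struct seq coldfire_store = { 2, 6 };  /* move.l, move.l */
const struct seq ppc_store      = { 4, 4 };  /* lis, ori, lis, stw */
```

That works out to 12 bytes and 2 clocks on the ColdFire against 16 bytes and 4 clocks on the PPC, for this one statement.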

So it's faster at loading an immediate 32-bit constant. Big deal!
That is not an important job. The RISC tradeoff is legitimate here.
You don't really need to load immediate constants very much, so it is
acceptable to do an "instruction dance" that doesn't typically run
slower anyway, because the pipelining masks the latency. So you get
to keep all your instructions 4 bytes long, which simplifies your
decoder, and also your code cache alignment.

The Alpha had instructions for loading immediate constants that were
more likely to matter. 16-bit constants could be done in 1
instruction, and 3 or 4 bit constants were typically part of the
instruction itself.
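
A rough way to see which constants get the one-instruction treatment is to check whether they fit the immediate field. This is only a sketch: fits_signed is a hypothetical helper, and it assumes the field is a signed two's-complement one, as with a 16-bit load-address displacement.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v is representable as a two's-complement field of the
   given width. Valid for widths from 1 to 32 bits. */
bool fits_signed(int32_t v, unsigned bits)
{
    int64_t lo = -((int64_t)1 << (bits - 1));      /* most negative value */
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;   /* most positive value */
    return v >= lo && v <= hi;
}
```

Under that assumption, fits_signed(1000, 16) is true, while fits_signed(123456, 16) is false, so the constant from the example above would still need the two-step sequence.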

Now I suppose if you design CPUs with almost no cache, you might care
about instructions being small. But then, you're not designing a
performance CPU anyways. So who's gonna care about the performance?
"Good" won't mean optimization, it'll mean low power or cheap to
manufacture or something.


We are clearly coming at this from different experiences, if OpenGL
drivers on an Alpha are typical for your programming, while I work
mostly with smaller processors (the ColdFire I am using at the moment
has no cache - all its flash and sram are internal, with single cycle
access). But performance is very important to small systems - high
performance means you can use slower clock speeds, leading to lower
power, lower EMI, and cheaper components. It might not be the most
important factor, but it is still there.

Yep, very different. Part of why I started posting here, to figure out
what's different about "the kind of ASM I know" vs. "the kind of ASM
embedded engineers typically do."


Even on cached processors, small code means better use of the cache.
Critical loops will (should!) fit within even a small instruction cache,
but programs consist of more than their critical loops. A complete
instruction cache miss might mean a stall of a hundred or more clock
cycles (which might be worth twice that in instruction counts on a
superscalar processor) - there is a reason why more expensive processors
have larger instruction caches. More compact code gives the same
benefits of a larger cache.
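
To put rough numbers on the cache argument, here is a back-of-the-envelope C sketch. Both functions are hypothetical helpers; the 100-cycle stall and the factor of two come from the paragraph above, and everything else (line size, code size) is an assumed input.

```c
/* Issue slots lost to one instruction-cache miss: stall cycles times
   the number of instructions the core could have issued per cycle. */
unsigned lost_issue_slots(unsigned stall_cycles, unsigned issue_width)
{
    return stall_cycles * issue_width;
}

/* Cache lines occupied by a run of code, rounding up. */
unsigned lines_needed(unsigned code_bytes, unsigned line_bytes)
{
    return (code_bytes + line_bytes - 1) / line_bytes;
}
```

For instance, lost_issue_slots(100, 2) gives 200, the "twice that" figure. If some routine were 3000 bytes in one encoding and 4000 in another (made-up sizes in the 12:16 ratio of the example above), lines_needed with 32-byte lines gives 94 lines versus 125.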

Not unless you can *really* compact the code.


Cheers,
Brandon Van Every
