Re: PIC vs ARM assembler (no flamewar please)




"David Brown" <david@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:45dd6efc$0$24608$8404b019@xxxxxxxxxxxxxxxxxx
Wilco Dijkstra wrote:
"David Brown" <david@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:45dc2bf5$0$31548$8404b019@xxxxxxxxxxxxxxxxxx
Wilco Dijkstra wrote:
"David Brown" <david@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:45db1975$0$31521$8404b019@xxxxxxxxxxxxxxxxxx

I still don't see the point of trying to make black-and-white
classifications of cpus as *either* CISC, *or* RISC. You could divide
them into load/store and non-load/store architectures, which is perhaps
the most important difference (although there are no doubt hybrids there
too). Using that definition, the msp430 is CISC - but it has plenty of
RISC features (such as 16 registers - a lot for its size).

I was precisely using a scale between 0 and 10 to avoid black/white
classifications. Scores between 4 and 6 are in a grey area indeed.
MSP430 and CF/68K score well below 4, so are clearly CISCs
irrespective of what the marketing departments claim.

* load/store architecture: no
The 68k can handle both operands of an ALU instruction in memory, which
is CISC. The ColdFire can have one in memory, one in a register, which
is again half-way.

The ColdFire is no different from 68K in this aspect. Most ALU operations
can do read/modify/write to memory and the move instruction can access
two memory operands.

IIRC, the 68k could do some ALU operations with both operands in memory
(such as ADDX), and MOVE operations can use any addressing mode for both
operands. The CF is more limited to simplify decoding and operand fetch.

Yes, on CF only MOVE can have 2 memory operands. But almost all ALU
operations can still read/modify/write memory.

Another example of the simplifications is that the CF no longer supports
byte or (16-bit) word sizes for most operations - about the only
instructions that support sizes other than the native 32 bits are MOVEs.
So for other data sizes, you effectively have a load/store architecture.

Yes. I'm not sure why they didn't remove the 32-bit operations too;
if they had, it would definitely have simplified the hardware. The removal
of 16-bit memory operations has little effect otherwise (they could have
kept them for better 68K compatibility).

I've worked for years with the 68332, and in recent times I've worked with
the ColdFire. I've studied generated assembly code, often made with the
same compiler, from the same source code. There is no doubt whatsoever -
the generated CF code makes much heavier use of register-to-register
instructions, with code strategies more reminiscent of compiler-generated
RISC code. This is partly because some of the more costly memory
operation capabilities were dropped from the 68k, and partly because the
CF is more heavily optimised for such RISC style instructions. If you
were to think of the CF as a RISC core with a bit too few registers, but
some added direct memory modes to compensate, you'd program fairly optimal
code - the same is not true for the 68k.

I think the RISC-style code would run pretty well on 68K, especially
on the later implementations. Compilers have improved a lot since
those early days, and keeping variables cached in registers is pretty
much essential nowadays. So while it made sense to use complex
instructions at the time on the 68000, it probably doesn't anymore.

Unaligned accesses are non-trivial so most RISCs left it out. However
modern CPUs nowadays have much of the required logic (due to
hit-under-miss, OoO execution etc), so a few RISCs (ARM and POWER)
have added this. Hardware designers still hate its complexity, but it is
often a software requirement. Quite surprisingly it gives huge speedups
in programs that use memcpy a lot.

That *is* surprising - the memcpy() implementations I have seen either use
byte for byte copying, or use larger accesses if the pointers are (or can
be) properly aligned.

The difficult case is when source and destination are not aligned.
A good memcpy never uses byte copy, not even in this case.
Unaligned accesses allow this case to be sped up to almost the
same speed as word aligned copies (only one of the pointers is
unaligned).
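
As a rough illustration of the case described above (word-aligned destination, arbitrarily aligned source), here is a minimal C sketch. The helper names are hypothetical and this is not any particular library's memcpy. It assumes a CPU with hardware unaligned loads, where a compiler will typically turn the 4-byte memcpy into a single load; without that support, each output word would need two aligned loads plus shifts and an OR to merge them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable way to express an unaligned 32-bit load in C. On CPUs with
   hardware unaligned access, compilers typically emit a single load. */
static uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: copy n bytes to a word-aligned destination from a
   source of any alignment, moving one word per iteration. */
static void copy_words_unaligned_src(uint32_t *dst, const unsigned char *src,
                                     size_t n)
{
    size_t words = n / 4;
    for (size_t i = 0; i < words; i++)
        dst[i] = load_u32_unaligned(src + 4 * i);
    /* Copy the remaining 0..3 tail bytes. */
    memcpy((unsigned char *)dst + 4 * words, src + 4 * words, n % 4);
}
```

The inner loop does one load and one store per word, which is the same shape as the fully aligned copy.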

The instruction set for the PPC contains much more complicated
instructions than the CF. The 68k has things like division
instructions, which the CF has dropped.

What PPC instructions are complex? PPC is a subset of POWER
just like CF is a subset of 68K, so most of the complex instructions
were left out.

The mask and rotation instructions are examples of complex ALU
instructions, and there are several multi-cycle data movement instructions
(such as the load multiple word, and the load string word).

The mask and rotate are not very complex. Many RISCs have similar
operations, including bitfield insert and extract, and execute them in
a single cycle. The same is true for ARM's shift and ALU instructions.
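
To make the "rotate and mask" idea concrete, here is a small C sketch of a bitfield extract expressed as a rotate followed by an AND. The function names are made up for illustration, and the bit numbering (bit 0 = least significant) is an assumption of the sketch, not the numbering used by any particular instruction set.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Extract `width` bits starting at bit `lsb` (bit 0 = least significant):
   rotate the field down to bit 0, then AND with a mask of `width` ones. */
static uint32_t extract_field(uint32_t x, unsigned lsb, unsigned width)
{
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return rotl32(x, 32 - lsb) & mask;
}
```

Both steps are simple combinational operations (a barrel shifter and an AND), which is why such instructions can fit in a single cycle.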

Load/store multiple is indeed complex, but it is one of the most useful
instructions that exist. They are perfect for memcpy and for efficiently
saving and restoring a large number of registers on function entry/exit at
virtually no codesize cost. Some implementations even transfer 2 registers
per cycle, thereby doubling memory bandwidth. For Thumb-2 I invented a
special variant where you combine 2 load instructions to consecutive
addresses into a single instruction.

So their cost/benefit ratio is so good that it's a no-brainer. A CPU
can treat them as a sequence of loads or stores so it fits fine in a
typical RISC pipeline.
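
A toy C model of that "sequence of loads" view, assuming an ARM-style 16-bit register list where the lowest-numbered register is loaded from the lowest address. The function name and the model itself are illustrative only, not a description of any real implementation.

```c
#include <stdint.h>

/* Toy model: expand a load-multiple with a 16-bit register list into
   individual word loads from consecutive addresses, lowest register first.
   Returns the number of words transferred. */
static unsigned load_multiple(uint32_t regs[16], const uint32_t *mem,
                              uint16_t reglist)
{
    unsigned count = 0;
    for (unsigned r = 0; r < 16; r++) {
        if (reglist & (1u << r))
            regs[r] = mem[count++];   /* one ordinary load per set bit */
    }
    return count;
}
```

Each loop iteration is just an ordinary load, so the pipeline only needs a small sequencer to issue them back to back.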

A far more useful (and precise) distinction would be to look at the
implementation - does the architecture use microcoded instructions? RISC
cpus, in general, do not - that is one of the guiding principles of
using RISC in the first place. Traditional CISC use microcode
extensively. The 68k used microcode for many instructions - the CF does
not.

This is misguided. RISC *enables* simple non-microcoded
implementations. One can make a micro code implementation of a
RISC, but that doesn't make it any less RISC.

Again, I don't see RISC vs. CISC as a black and white division, but as a
set of characteristics. Microcoding is a CISC characteristic - it is
perfectly possible to have a mostly RISC core with CISCy microcode.

What I mean is that micro code is an implementation detail like pipelining.
Implementations vary over time depending on the available chip technology.
In the early days of RISC, pipelining, caches and no micro code were indeed
RISC characteristics. Few would call a CPU RISC today just because it is
pipelined or has caches... Nowadays CPUs micro sequence complex
instructions rather than micro code them.

* calls place return address in a register: no
More generally speaking, CISC has specific purpose registers, while RISC
has mostly general purpose registers. Yes, the CF has extra
functionality on A7 to make it a stack pointer. Putting the return
address in a register, as done in RISC cpus, is not an advantage - it is
a consequence of not having a dedicated stack.

It is an advantage as it avoids unnecessary memory traffic - a key
goal of RISC.

It avoids an extra memory write (and subsequent read) in leaf functions,
at the cost of extra instruction fetches for the code to save and restore
the link register for non-leaf functions. I can't give you a detailed
analysis of the costs and benefits here, but I'd be surprised if it is a
distinct advantage.

It is. Most functions are leaf functions, so as long as you don't need the
register you avoid having to save/restore it, thus speeding up the call
and return. When calling another function you need to save it indeed,
but you can save several registers in one go using the load/store
multiple instructions. Returning and reloading is done in a single
instruction again. At worst (when you don't already need to save
some registers) it takes one extra instruction; on average it is a win.

I don't have any details of the CF pipeline. But a mispredicted branch
that hits the instruction prefetch cache (thus avoiding instruction
fetches) executes in 3 cycles. That's definitely a short pipeline.

I think you mean CF v2 which has a 4 stage pipeline. It achieves
about the same performance as the 3 stage ARM7. However
memory instructions are so slow (it's more a "micro coded" than a
pipelined implementation) that it is better to avoid them altogether.

ColdFire v4 uses a 10 stage pipeline to execute "most" instructions
in 1 cycle (I don't think it can do 2 memory accesses per cycle).
It is claimed to give similar performance to the 5 stage ARM9.

This clearly shows that ColdFire needs far more pipeline stages than
a RISC to get similar performance, while a simpler micro coded
implementation has fewer pipestages.

Embedding large immediates in the instruction stream is good for code size
if there is no need to share them. If they are shared, then the typical
RISC arrangement of reading the values from code memory using a pointer
register and 16-bit displacement is more code efficient (for 3 or more
uses of the 32-bit data), but less bandwidth efficient (taking a 32-bit
instruction and a 32-bit read, compared to a single 48-bit instruction).

If literals aren't shared you break even on codesize on Thumb/Thumb-2.
My statistics showed that on ARM literals are shared over 3 times on
average (sharing happens across functions within source files), making
it a definite win (3 * 48 > 3 * 32 + 32).
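
The arithmetic can be checked with a few lines of C. The function names are hypothetical; the sizes are the ones used in this thread (a 48-bit instruction with an embedded 32-bit immediate, versus a load instruction per use plus one shared 32-bit literal).

```c
/* Code size in bits when every use embeds the 32-bit constant in a
   48-bit instruction. */
static unsigned inline_bits(unsigned uses)
{
    return 48u * uses;
}

/* Code size in bits when every use is a load instruction of
   `load_insn_bits` bits and all uses share one 32-bit literal. */
static unsigned pooled_bits(unsigned uses, unsigned load_insn_bits)
{
    return load_insn_bits * uses + 32u;
}
```

With 32-bit loads, three uses cost 144 bits inline against 128 bits pooled, and two uses break even at 96 bits. With a 16-bit Thumb load, a single unshared use already breaks even at 48 bits.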

My point is not that the CF is a RISC core - I never claimed it was. But
neither is it a CISC core in comparison to, say, the x86 architecture.
If there were such a thing as a scale running from pure RISC to pure
CISC, then the CF lies near the middle. It is not as RISCy as the ARM,
but is somewhat RISCier than the original 68k.

I agree CF is less CISCy than 68K but it is still more CISCy than x86.

I must have misread that - are you saying the CF (and 68k) is more CISCy
than the x86 ??

Yes. x86 has at most one memory operand while CF/68K need 2.
IMO they should either have kept full binary compatibility or removed all
of the complex instructions.

Wilco

