Re: PIC vs ARM assembler (no flamewar please)



Wilco Dijkstra wrote:
"David Brown" <david@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message news:45dc2bf5$0$31548$8404b019@xxxxxxxxxxxxxxxxxx
Wilco Dijkstra wrote:
"David Brown" <david@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message news:45db1975$0$31521$8404b019@xxxxxxxxxxxxxxxxxx

RISC and CISC are about instruction set architecture, not implementation
(although it does have an effect on the implementation).

The whole point of RISC is to be able to make a more efficient implementation - it is an architectural design philosophy aimed at making small and fast (clock speed) implementations.

That's a good summary.


It's nice that we don't entirely disagree!

The ColdFire core is very much such a mixed chip - in terms of the ISA, it is noticeably more RISCy than the 68k (especially the later cores with their more complex addressing modes), and in terms of its implementation, it is even more so. Even the original 68k, with its multiple registers and (mostly) orthogonal instruction set is pretty RISCy.
Well, let's look at 10 features that are typical for most RISCs today:

* large uniform register file: no (8 data + 8 address registers)
Typical CISC is 4 to 8 registers, each with specialised uses. Thus the 68k is far from typical CISC, and is much more in the middle.

There are various CISCs (eg VAX, MSP430) that have 16 registers,
while most RISCs have 32 or more.


I still don't see the point of trying to make black-and-white classifications of cpus as *either* CISC, *or* RISC. You could divide them into load/store and non-load/store architectures, which is perhaps the most important difference (although there are no doubt hybrids there too). Using that definition, the msp430 is CISC - but it has plenty of RISC features (such as 16 registers - a lot for its size).

* load/store architecture: no
The 68k can handle both operands of an ALU instruction in memory, which is CISC. The ColdFire can have one in memory, one in a register, which is again half-way.

The ColdFire is no different from 68K in this aspect. Most ALU operations
can do read/modify/write to memory and the move instruction can access
two memory operands.


IIRC, the 68k could do some ALU operations with both operands in memory (such as ADDX), and MOVE operations can use any addressing mode for both operands. The CF is more limited to simplify decoding and operand fetch.

Another example of the simplifications is that the CF no longer supports byte or (16-bit) word sizes for most operations - about the only instructions that support sizes other than the native 32 bits are MOVEs. So for other data sizes, you effectively have a load/store architecture.

I've worked for years with the 68332, and in recent times I've worked with the ColdFire. I've studied generated assembly code, often made with the same compiler, from the same source code. There is no doubt whatsoever - the generated CF code makes much heavier use of register-to-register instructions, with code strategies more reminiscent of compiler-generated RISC code. This is partly because some of the 68k's more costly memory operation capabilities were dropped in the CF, and partly because the CF is more heavily optimised for such RISC style instructions. If you were to think of the CF as a RISC core with a bit too few registers, but some added direct memory modes to compensate, you'd program fairly optimal code - the same is not true for the 68k.

* naturally aligned load/store: no
That is purely an implementation issue for the memory interface. It is common that RISC cpus, in keeping with the aim of a small, neat and fast implementation, insist on aligned access. But it is not a requirement - IIRC, some PPC implementations can access non-aligned data in big-endian mode. The ColdFire is certainly more efficient with aligned accesses, but they are not a requirement.

Unaligned accesses are non-trivial so most RISCs left it out. However
modern CPUs nowadays have much of the required logic (due to
hit-under-miss, OoO execution etc), so a few RISCs (ARM and POWER)
have added this. Hardware designers still hate its complexity, but it is
often a software requirement. Quite surprisingly it gives huge speedups
in programs that use memcpy a lot.


That *is* surprising - the memcpy() implementations I have seen either use byte for byte copying, or use larger accesses if the pointers are (or can be) properly aligned.
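To make the comparison concrete, here is a rough cost model in C of the strategy described above: copy byte for byte unless both pointers can be brought to the same word alignment, in which case copy whole words. The function names, the 4-byte word size and the "one step per load/store pair" measure are all assumptions for illustration, not taken from any real memcpy() or from measurements.

```c
#include <stddef.h>
#include <stdint.h>

#define WORD_BYTES 4u  /* assumed word size for this sketch */

/* Load/store pairs needed when word accesses must be naturally aligned.
   Hypothetical model: word copies are only used if source and destination
   share the same misalignment; otherwise every byte is copied separately. */
size_t copy_steps_aligned_only(uintptr_t dst, uintptr_t src, size_t n)
{
    if (dst % WORD_BYTES != src % WORD_BYTES)
        return n;                      /* mutually misaligned: bytes only */

    /* Bytes copied singly until both pointers reach a word boundary. */
    size_t head = (WORD_BYTES - dst % WORD_BYTES) % WORD_BYTES;
    if (head > n)
        head = n;

    size_t rest = n - head;
    return head + rest / WORD_BYTES + rest % WORD_BYTES;
}

/* Load/store pairs needed when the hardware accepts unaligned word
   accesses: words can be used regardless of the pointer values. */
size_t copy_steps_unaligned_ok(size_t n)
{
    return n / WORD_BYTES + n % WORD_BYTES;
}
```

Under this model a 16-byte copy between mutually misaligned buffers takes 16 steps with aligned-only accesses and 4 with unaligned word accesses. That gap is one plausible source of the speedup, although the model ignores any per-access penalty for unaligned words and the shift-and-merge tricks some libraries use.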

* simple addressing modes: no (9 variants, yes for ColdFire?)
...
All in all, the CF modes are only marginally more complex than the PPC modes.

It's the (d8 + Ax + Ri*SF) mode that places it in the complex camp.
The first not only uses a separate extension word that needs
decoding but also must perform a shift and 2 additions...


Yes, that's a complex one, and it's slightly surprising that it survived the jump from 68k to CF. I think it was included as it is the only mode that can get its address from the sum of two registers, which is a common requirement (the PPC has such an addressing mode). Since an extension word is needed, the 68k architecture put the extra bits to good use - a scale factor of 1, 2, 4 or 8, and the remaining bits giving an offset which is probably seldom used.
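As a sketch of the work this mode implies, the effective address can be written out in C. The function name and parameter layout are made up for illustration; the arithmetic follows the description above (one shift for the scale factor, two additions, and a sign-extended 8-bit displacement).

```c
#include <assert.h>
#include <stdint.h>

/* Effective address for the (d8 + Ax + Ri*SF) mode.
   ax    - base address register value
   ri    - index register value
   scale - scale factor: 1, 2, 4 or 8
   d8    - signed 8-bit displacement from the extension word
   Hypothetical helper, not part of any toolchain. */
uint32_t indexed_ea(uint32_t ax, uint32_t ri, unsigned scale, int8_t d8)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    /* The scale factor is a power of two, so it reduces to a shift. */
    unsigned shift = (scale == 1) ? 0 : (scale == 2) ? 1 : (scale == 4) ? 2 : 3;

    /* Two additions; the displacement is sign-extended to 32 bits and
       the sum wraps modulo 2^32 like the hardware adder. */
    return ax + (ri << shift) + (uint32_t)(int32_t)d8;
}
```

For example, indexed_ea(0x1000, 3, 4, -4) gives 0x1008: the address of element 3 of a 32-bit array at 0x1000, minus 4.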

The instruction set for the PPC contains much more complicated instructions than the CF. The 68k has things like division instructions, which the CF has dropped.

What PPC instructions are complex? PPC is a subset of POWER
just like CF is a subset of 68K, so most of the complex instructions
were left out.


The mask and rotation instructions are examples of complex ALU instructions, and there are several multi-cycle data movement instructions (such as the load multiple word, and the load string word).
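For readers who don't know the PPC, here is a C sketch of what a rotate-and-mask instruction in the style of rlwinm computes in a single operation. The helper names are invented for this example, and it assumes the PowerPC convention of numbering bits from 0 (most significant) to 31 (least significant).

```c
#include <stdint.h>

/* Rotate a 32-bit value left by sh bits (sh taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned sh)
{
    sh &= 31u;
    return sh ? (x << sh) | (x >> (32u - sh)) : x;
}

/* Mask with ones from bit mb through bit me, where bit 0 is the MSB.
   If mb > me the run of ones wraps around through bit 31 and bit 0. */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
    uint32_t from_mb = 0xFFFFFFFFu >> mb;          /* bits mb..31 set */
    uint32_t up_to_me = 0xFFFFFFFFu << (31u - me); /* bits 0..me set  */
    return (mb <= me) ? (from_mb & up_to_me) : (from_mb | up_to_me);
}

/* Rotate left, then AND with the mask: one instruction on the PPC. */
uint32_t rotate_and_mask(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}
```

As an illustration, rotate_and_mask(0x12345678, 8, 24, 31) extracts the top byte and yields 0x12, which would otherwise take a shift and a separate mask.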

A far more useful (and precise) distinction would be to look at the implementation - does the architecture use microcoded instructions? RISC cpus, in general, do not - that is one of the guiding principles of using RISC in the first place. Traditional CISC use microcode extensively. The 68k used microcode for many instructions - the CF does not.

This is misguided. RISC *enables* simple non-microcoded
implementations. One can make a micro code implementation of a
RISC, but that doesn't make it any less RISC.


Again, I don't see RISC vs. CISC as a black and white division, but as a set of characteristics. Microcoding is a CISC characteristic - it is perfectly possible to have a mostly RISC core with CISCy microcode.

* calls place return address in a register: no
More generally speaking, CISC cpus have specific purpose registers, while RISC cpus have mostly general purpose registers. Yes, the CF has extra functionality on A7 to make it a stack pointer. Putting the return address in a register, as done in RISC cpus, is not an advantage - it is a consequence of not having a dedicated stack.

It is an advantage as it avoids unnecessary memory traffic - a key
goal of RISC.


It avoids an extra memory write (and subsequent read) in leaf functions, at the cost of extra instruction fetches for the code to save and restore the link register for non-leaf functions. I can't give you a detailed analysis of the costs and benefits here, but I'd be surprised if it is a distinct advantage.

If we add in some other features that are a little more implementation dependent (and therefore entirely relevant, since that is the reason for RISC in the first place), things are a bit different:

* Single-cycle register-only instructions: yes
* Short execution pipeline: yes
* (Mostly) microcode-free core: yes
* Short and fast instruction decode: half point
* Low overhead branches: yes
* Stall-free for typical instruction streams: yes

Suddenly the scores are looking a bit different.

I don't see how the scores change at all. Most of the features you
mention are "yes" for 68K implementations (except for the original
68000 which scores 4 out of 6), ColdFire and ARM.


Exactly the point - when you include these typical RISC features as well as your chosen features, the CF scores much more like the ARM. I'm not claiming in any way that the CF is RISCier than the ARM, or even *as* RISCy - just that it has far more typical RISC features than you give it credit for.

Perhaps we could compare the CF to traditional CISC features:

* Specialised accumulator: no

Many famous CISCs are not accumulator based, eg PDP, VAX, 68K,
System/360 etc. Accumulators are typically used in 8-bitters where
most instructions are 1 or 2 bytes for good codesize.


Specialised accumulators are a typical CISC feature, even though they are by no means universal.

* Microcoded instructions: no

Implementation detail. CF is still complex enough that micro
coded implementations might be a good choice.

* Looped instructions: no

Loop mode is just an implementation optimization that could be done
on any architecture.

* Direct memory-to-memory operations: no

Eh, what does move.l (a0),(a1) do? It's valid on CF.


I intended to refer to ALU operations, sorry.

* Bottlenecks due to register or flag conflicts: not often
* Long pipelines: no

Longer than an equivalent RISC (mainly due to needing 2 memory
accesses per instruction and more complex decoding). And likely
longer than a simpler microcoded implementation.


Are you making this up out of thin air?

I don't have any details of the CF pipeline. But a mispredicted branch that hits the instruction prefetch cache (thus avoiding instruction fetches) executes in 3 cycles. That's definitely a short pipeline.

A fair proportion of CF instructions are single-word, and a single memory access reads two such instructions. I'd estimate that you'd have slightly less than one memory access per instruction on average, but of course that's highly code dependent. Instructions are aligned with their extension words as they are loaded into the prefetch cache, so decoding is not any more complicated or time-consuming than for a RISC instruction set - the coding format is nice and regular.


As I said, with the Thumb-2, the ARM is gaining the CISC feature of variable length instructions - I did not say it is changing into a CISC architecture. The real world is grey - there is no dividing line between CISC and RISC, merely a collection of characteristics that some chips have and others don't.

Sure, there is always a grey area in the middle, but most ISAs
clearly fall in either camp. If you use my rules, can you mention one
that scores 4 or 5?


I wouldn't use your rules - they are picked specifically to match your argument (and even then, you placed the ARM Thumb at 6). Add in the six I picked, and the ColdFire is at 8 out of 16. Of course, my rules, like yours, are arbitrary and unweighted, so they hardly count as an objective or quantitative analysis.

Most ISAs can certainly be classified as roughly RISC or roughly CISC - I'll not deny that, and given a choice of merely RISC or CISC, I'd classify the CF as CISC without hesitation. All I am trying to say is that there are characteristics that are typical for each camp, and that architectures frequently use characteristics from the "opposing" camp to make a better chip. The CF has a lot more RISC features than most CISC devices, and the ARM is picking up a few more CISC features with their newer developments. My original statement, that the inclusion of variable-length instructions in Thumb-2 makes the ARM more like the CF, is true.

Adding these variable length instructions is a good thing, if it doesn't cost too much at the decoder. It increases both code density and instruction speed, since it opens the path for 32-bit immediate data (or addresses) to be included directly in a single instruction.

Actually, embedding large immediates in the instruction stream is
bad for codesize because they cannot be shared. For Thumb-2 the
main goal was to allow access to 32-bit ARM instructions for cases
where a single 16-bit instruction was not enough. Thumb-2 doesn't
have immediates like 68K/CF.


Embedding large immediates in the instruction stream is good for code size if there is no need to share them. If they are shared, then the typical RISC arrangement of reading the values from code memory using a pointer register and 16-bit displacement is more code efficient (for 3 or more uses of the 32-bit data), but less bandwidth efficient (taking a 32-bit instruction and a 32-bit read, compared to a single 48-bit instruction).

Of course, that would require support for 48-bit instructions rather than just 32-bit, which might not be worth the cost.
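The arithmetic behind the "3 or more uses" figure can be spelled out in a few lines of C. The sizes are the ones given above (a 48-bit instruction with an inline 32-bit immediate, against a 32-bit load instruction plus one shared 32-bit literal); the function names are made up for this sketch, and real encodings and literal pool alignment would change the details.

```c
#include <stddef.h>

/* Code bytes when every use embeds the 32-bit value in a 48-bit instruction. */
size_t inline_code_bytes(size_t uses)
{
    return 6 * uses;
}

/* Code bytes when each use is a 32-bit load instruction and all uses
   share a single 32-bit literal stored in code memory. */
size_t pooled_code_bytes(size_t uses)
{
    return uses ? 4 * uses + 4 : 0;
}

/* Bytes fetched per use: the instruction itself, plus the literal read
   in the pooled case. */
size_t inline_fetch_bytes_per_use(void) { return 6; }
size_t pooled_fetch_bytes_per_use(void) { return 4 + 4; }
```

With one use the inline form takes 6 bytes against 8, with two uses both take 12, and from three uses onwards the pooled form is smaller (16 against 18), while each pooled use still costs 8 bytes of bandwidth against 6.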

My point is not that the CF is a RISC core - I never claimed it was. But neither is it a CISC core in comparison to, say, the x86 architecture. If there were such a thing as a scale running from pure RISC to pure CISC, then the CF lies near the middle. It is not as RISCy as the ARM, but is somewhat RISCier than the original 68k.

I agree CF is less CISCy than 68K but it is still more CISCy than x86.

I must have misread that - are you saying the CF (and 68k) is more CISCy than the x86 ??

If it dropped 2 memory operands, removed ALU+memory operations,
32-bit immediates and the absolute and (d8 + Ax + Ri*SF) addressing
modes then I would agree it is a RISC...


That's true - but then it would not be nearly as good a core. The fact that there are some truly horrible CISC architectures does not mean that all things RISC are better!

mvh.,

David


Wilco

