Re: New ARM Cortex Microcontroller Product Family from STMicroelectronics





On Jun 24, 11:45 am, wilco.dijks...@xxxxxxxxxxxx wrote:
On 23 Jun, 03:10, rickman <gnu...@xxxxxxxxx> wrote:
I don't follow what you are saying at all. Branch prediction relates
to pipelining. I don't see how it relates to wait states.

Adding a wait state is the same as increasing the pipeline depth, and
branch prediction coupled with prefetching can hide some of that latency.

I don't see how that is true at all. When you add a waitstate you
freeze all stages of the pipeline while you wait for the Flash to
finish the access.

I don't know exactly how the Cortex works, but I worked on the internals
of another 32-bit RISC core.

Thanks for the rundown on this alternative CPU. Sounds a bit like the
National 32 bit CPU with variable length instructions. That was
supposed to be a fast CPU, but not a commercial success. If there had
been a longer term commitment, it may have grown in popularity. But
the realities of the commercial CPU market allowed it to pass on to
the CPU boneyard.

It was not a Series 32000 CPU. The Series 32000 has (IIRC)
instruction sizes which varied between 2 and 21 bytes.

E.g. movzbd x(y(sp))[r0:d], z(w(sb))[r1:d]

with all displacements being 30 bits.

This core had a 16 byte FIFO in the first pipeline stage.
The prefetch mechanism loaded 32 bits into this FIFO each access.
The memory controller could add waitstates to this access if necessary.

...snip...

Since most instructions are 16 bits, and you read 32 bits at a time,
zero waitstate operation allows fetching almost two instructions per
cycle.
The FIFO will quite soon be filled, and if the odd 32/48 bit instruction
pops up, it won't hurt your performance.

No, the "odd" 48 bit instruction won't hurt performance, but the FIFO
already has had a negative influence anytime the instruction sequence
is not linear. It is, in terms of the negative effect, like adding
pipeline stages. The entire FIFO has to be flushed anytime you
branch.

The FIFO is implemented using flip-flops and you had a
simple three stage pipeline (fetch, decode, execute) so
your latency was not dramatic.


If you have one waitstate, you will see that the bandwidth is still high
enough that 1 MIPS/MHz can be maintained as long as you only
execute 16 bit instructions. You will be hurt by fetching a 32 bit
instruction since that takes 2 clocks.
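
The bandwidth argument above can be checked with a small cycle-count model. This is only a sketch under simplifying assumptions, not a description of the real core: the function name `simulate` and its parameters are made up for illustration, a fetched word is assumed usable in the cycle it arrives, and branches are ignored. It takes the 16 byte FIFO and 32 bit fetch from the description above.

```python
def simulate(instr_sizes, wait_states, fifo_bytes=16, fetch_bytes=4):
    """Count the cycles needed to issue a linear instruction stream.

    instr_sizes -- instruction lengths in bytes (2, 4 or 6)
    wait_states -- extra cycles added to every 32 bit fetch
    Simplified model: no branches, and a word that arrives in a cycle
    can be issued in that same cycle.
    """
    fill = 0                      # bytes currently held in the FIFO
    countdown = 1 + wait_states   # cycles left until the current fetch completes
    issued = 0
    cycles = 0
    while issued < len(instr_sizes):
        cycles += 1
        # The prefetcher only runs while there is room for a whole word.
        if fill + fetch_bytes <= fifo_bytes:
            countdown -= 1
            if countdown == 0:
                fill += fetch_bytes
                countdown = 1 + wait_states
        # Issue one instruction if all of its bytes are in the FIFO.
        if fill >= instr_sizes[issued]:
            fill -= instr_sizes[issued]
            issued += 1
    return cycles


print(simulate([2] * 100, wait_states=0))  # 100
print(simulate([2] * 100, wait_states=1))  # 101
print(simulate([4] * 100, wait_states=1))  # 200
```

In this model one waitstate costs a single start-up cycle on a stream of 16 bit instructions, while a stream of 32 bit instructions drops to one instruction every two clocks, which matches the claim above.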

Even executing 16 bit instructions takes a 1 clock cycle hit on a
branch. Instead of having the next instruction in the FIFO, you have
to wait 2 clock cycles before you can start decoding it.


Yes, but jumps are probably only 10-20% of all instructions,
so you lose only 10-20% of the performance instead of 50%.
The AVR32 loses less than 10% on average.
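
As a rough check of those numbers, the loss can be estimated from the average cycles per instruction. This is a back-of-the-envelope sketch, assuming every jump is taken and costs exactly one extra clock, and that everything else issues in one clock; the function name `relative_throughput` is just for illustration.

```python
def relative_throughput(jump_fraction, penalty_cycles=1):
    """Throughput relative to an ideal one-instruction-per-clock machine.

    Assumes each jump costs `penalty_cycles` extra clocks and every
    other instruction takes exactly one clock.
    """
    cycles_per_instruction = 1.0 + jump_fraction * penalty_cycles
    return 1.0 / cycles_per_instruction


for fraction in (0.10, 0.20):
    loss = 1.0 - relative_throughput(fraction)
    print(f"{fraction:.0%} jumps: about {loss:.0%} slower")

# Stalling every fetch by one clock instead (no FIFO to hide it):
print(f"every instruction stalled: {1.0 - relative_throughput(1.0):.0%} slower")
```

Under these assumptions the loss comes out at roughly 9% and 17%, slightly below the 10-20% quoted, against 50% when every instruction pays the waitstate.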


I have run the SAM7 at 48 MHz, zero waitstate. Does not work over the
full temp range though.
The AVR32 will support 1.2 MIPS/MHz @ 1 waitstate operation @ 66 MHz
due to its 33 MHz 2 way interleaved flash memory.
(1st access after jump is two clocks, subsequent accesses are 1 clock)
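
The "two clocks, then one clock" pattern follows from interleaving in general and can be sketched as below. This is a generic model, not the actual AVR32 flash controller: it assumes sequential words alternate between banks, each bank needs two CPU clocks per access (33 MHz flash at 66 MHz), and a new access can be started every clock as long as its bank is free. The name `sequential_ready_times` is hypothetical.

```python
def sequential_ready_times(n_words, banks=2, bank_latency=2):
    """Clock at which each of n sequential words is available after a jump.

    Word i lives in bank i % banks; a bank is busy for `bank_latency`
    clocks per access, and one new access may start per clock.
    """
    bank_free = [0] * banks   # clock at which each bank can start again
    next_start = 0            # earliest clock for the next access
    ready = []
    for i in range(n_words):
        bank = i % banks
        start = max(next_start, bank_free[bank])
        done = start + bank_latency
        bank_free[bank] = done
        ready.append(done)
        next_start = start + 1
    return ready


print(sequential_ready_times(4))           # [2, 3, 4, 5]
print(sequential_ready_times(4, banks=1))  # [2, 4, 6, 8]
```

With two banks the first word after a jump arrives after two clocks and each following word one clock later; with a single bank every word costs two clocks.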

How does that compare to the Cortex M3 running at 50 MHz with no
waitstates and no branch penalty?


The UC3000 is claimed as 80 MIPS at 66 MHz.
For the Cortex M3 to reach 80 MIPS at 50 MHz,
you have to have 80/50 = 1.6 MIPS per MHz.
I think that ARM does not claim that the Cortex is close to 1.6 MIPS per MHz.
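
The same arithmetic, spelled out for both parts. The 80 MIPS figure is the claim quoted above, not a measurement, and `mips_per_mhz` is just an illustrative helper.

```python
def mips_per_mhz(mips, clock_mhz):
    """Instructions per clock implied by a MIPS rating at a given clock."""
    return mips / clock_mhz


print(round(mips_per_mhz(80, 66), 2))  # 1.21, the UC3000 claim
print(round(mips_per_mhz(80, 50), 2))  # 1.6, what a 50 MHz part would need
```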

The AVR32 is decidedly better on DSP algorithms due to its
single cycle MAC and also it has faster access to SRAM.
Reading internal SRAM is a one clock cycle operation on the AVR32.
Bit banging will be one of the strengths of the UC3000.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB

