Re: Atmel releasing FLASH AVR32 ?




"Wilco Dijkstra" <Wilco_dot_Dijkstra@xxxxxxxxxxxx> wrote in message
news:fcUMh.24370$Lz4.2747@xxxxxxxxxxxxxxxxxxxxxxx

"Ulf Samuelsson" <ulf@xxxxxxxxxxxxx> wrote in message
news:etvukp$ioj$1@xxxxxxxxxxx

Why not use *REAL* data.

MIPS 34k core with 9 threads = 2,1 mm2 in 90 nm.
MIPS 24k core with 1 thread = 2,8 mm2 in 130 nm

It is probably fair to assume that 90 nm = 0,5 * 130 nm
so a MIPS 34k would be about 4,2 mm2 in 130 nm
or about 50 % larger with 9 threads.

130->90nm scaling is more like 55-60%, so it is more likely to be
25% larger, not 50%. However consider these are high-end embedded
cores with 32KB cache, so the actual core area more than doubles.
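The arithmetic behind the 50% and 25% figures can be checked with a short sketch. It takes the two MIPS area figures quoted above and reads "scaling" as the ratio of 90 nm area to 130 nm area, which is an assumption about what both posters meant. The function names are illustrative only.

```python
# Sketch only: the area figures are the ones quoted in the thread; the shrink
# factor is read as (area at 90 nm) / (area at 130 nm), which is an assumption.
MIPS_34K_90NM_MM2 = 2.1   # 9 threads, 90 nm
MIPS_24K_130NM_MM2 = 2.8  # 1 thread, 130 nm


def area_at_130nm(area_90nm_mm2, shrink_factor):
    """Scale a 90 nm area back up to the 130 nm node."""
    return area_90nm_mm2 / shrink_factor


def overhead_percent(shrink_factor):
    """How much larger the 34k would be than the 24k at 130 nm, in percent."""
    scaled = area_at_130nm(MIPS_34K_90NM_MM2, shrink_factor)
    return (scaled / MIPS_24K_130NM_MM2 - 1.0) * 100.0


for factor in (0.50, 0.55, 0.60):
    print(f"shrink {factor:.2f}: "
          f"{area_at_130nm(MIPS_34K_90NM_MM2, factor):.2f} mm2 at 130 nm, "
          f"{overhead_percent(factor):.0f}% larger than the 24k")
```

With a factor of 0,5 the 34k comes out 50% larger than the 24k; with 0,6 it comes out 25% larger, which matches the two estimates in the discussion.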

From MIPS homepage:
" 2.1 mm2 (core only, extracted from full layout GDSII database)"


The MIPS 34k is actually a dual core (dual VPE), so you have to deduct
for that.

Actually it is a single core. A VPE is simply a virtual CPU to make the
OS believe there are 2 cores.

I think you will find that it is more like 10% overhead for a simple core

Wrong. On a micro controller with a far simpler pipeline it would be much
worse. A while ago we discussed the size of a register file in embedded
CPUs, this is 10-15% of a typical core like ARM7. Imagine 9 copies.

I meant per thread.
You do not need much more than the register file and prefetch buffer
so 10-15% extra per thread does not seem unreasonable.
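To make the "per thread" estimate concrete, here is a rough sketch of how core area would grow under a purely linear model. The model (baseline core already contains the first thread, each extra thread adds a fixed fraction) is an assumption for illustration, not a measured figure.

```python
# Sketch only: linear area model, assumed for illustration.
# The baseline core (relative area 1.0) already holds the first thread's state.
def relative_core_area(threads, per_thread_overhead):
    """Core area relative to a single-thread core."""
    extra_threads = threads - 1
    return 1.0 + extra_threads * per_thread_overhead


for overhead in (0.10, 0.15):
    for threads in (2, 3, 9):
        area = relative_core_area(threads, overhead)
        print(f"{threads} threads at {overhead:.0%} each: {area:.2f}x core area")
```

Under this model a 2 or 3 thread core stays within 10-30% of the single-thread area, while 9 threads would roughly double it, which is where the two positions in the thread diverge.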

A dual thread 40 MHz CPU can replace two 20 MHz CPUs.
A single thread 40 MHz CPU cannot always replace two 20 MHz CPUs.
Let's take an obvious case, where one is running the OSE operating system
and the other is running Thread/X.
How are you going to do that on a single thread?
The combined GPS and Bluetooth stack is better.
A GPS company would normally not allow anyone to mess
with the code running on the ARM.
The impact on support and maintenance is too high.

Running a thread with the GPS is much more attractive
and would allow the user to run their own threads without
affecting the GPS timing enough to be a problem.


It is less overhead for a multithreaded "faster" core than it is for
a single threaded "faster" core, if you accept the limitation
that a thread can only run max 1/2 or 1/3rd of the cycles
because you get rid of feedback muxes.
Less logic in critical datapath = higher frequency.

That is certainly feasible, but you'll have a hard time getting it past
marketing types who want to show good benchmarking results...
Single threaded performance is still important and will be for a long
time.

Not for a 20 MIPS application, it ain't.
No one is interested in how many MIPS the CPU
core in a GPS chip has.


No, zero cost context switch cores exist already today.
(And have existed for 20-30 years)

Can you mention one? I've seen the Ubicom cores but they
switch at the start of the (rather long) pipeline, so it takes many
cycles to switch.

Are you sure they cannot switch every clock cycle?

Of course they can switch every clock cycle. But what matters is how
fast they can react to asynchronous events such as branch mispredicts,
cache misses, wait for event etc. If a thread is scheduled to run but it
has an unexpected idle cycle, is it possible to immediately switch to
another thread and use that cycle?

Yes, when you have a jump you would immediately make this task
non-computable, and have another computable thread enter the pipeline.
If it becomes computable the next clock, you can switch it in again.

Remember a bubble may appear
at the end of the pipeline but instruction fetch is at the beginning, so
it can take a while...

MIPS 34k.

I don't have much information how threading works on the 34k, but
from what little is available, it appears each thread maintains a
separate instruction queue. This indicates they can switch pretty
quickly. I'd be impressed if it can switch to reclaim idle cycles.

Why not? The AVR32 removes jumps from the pipeline
so the execution unit will only see arithmetic instructions.


In a simple three stage pipeline it is a piece of cake to do what I want.
Main cost is:

PC is changed from a register to an SRAM.
Register Bank becomes a register bank array.
Multiple PSRs

and then you have the scheduling which can be
an advanced timer working on a register bank.

Each thread adds a time quantum every n cycles and deducts a time
quantum every time it gets to use the pipeline, and you try to execute the
threads which have accumulated the most time quanta.
Not so hard to implement.
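A minimal software model of the time quanta scheme might look like the sketch below. The per-thread refill periods, the tie-break rule (lowest thread number wins) and the idle behaviour are all assumptions added for illustration; a real implementation would be hardware, not Python.

```python
# Sketch only: models the credit idea in software. The refill periods and the
# tie-break (lowest thread number wins) are assumptions, not from any real core.
def run_credit_scheduler(refill_period, cycles):
    """Return (issue counts per thread, idle cycles).

    refill_period maps thread id -> n, meaning the thread gains one time
    quantum every n cycles. Using the pipeline for one cycle costs one quantum.
    """
    credit = {thread: 0 for thread in refill_period}
    issued = {thread: 0 for thread in refill_period}
    idle = 0
    for cycle in range(cycles):
        # Add a quantum to every thread whose refill period has elapsed.
        for thread, period in refill_period.items():
            if cycle % period == 0:
                credit[thread] += 1
        # Pick the thread with the most accumulated quanta, if any has credit.
        ready = [t for t in credit if credit[t] > 0]
        if not ready:
            idle += 1
            continue
        chosen = max(ready, key=lambda t: (credit[t], -t))
        credit[chosen] -= 1
        issued[chosen] += 1
    return issued, idle


issued, idle = run_credit_scheduler({1: 2, 2: 5, 3: 10}, cycles=100)
print(issued, "idle cycles:", idle)
```

With refill periods of 2, 5 and 10 cycles the three threads receive one half, one fifth and one tenth of the pipeline, and the remaining cycles stay free for other work.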

The concept is simple indeed, but the details are non-trivial,
especially if you want fast thread switching to use idle cycles.

I expect that in normal operation you will switch thread EVERY clock cycle.
It becomes more complex if you want dynamic allocation of threads.

A real simple solution would be to have a circular buffer of programmable
size.
Each entry in the buffer, is a thread number.
So if you had a 10 entry circular buffer you could have

1,2,1,3,1,2,1,4,1,5

At 100 MHz, this would give you
Thread 1: 5 entries = 50 MHz
Thread 2: 2 entries = 20 MHz
Thread 3,4,5 = 1 entry each = 10 MHz

If a thread is not computable, then you can give the cycle to
one of the other threads, or to a debug thread, or to a background thread
or whatever.
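The circular buffer scheme can be sketched in a few lines. The slot table and the 100 MHz clock are taken from the example above; the fallback thread number 0 and the `computable` callback are hypothetical names introduced only for this illustration.

```python
# Sketch only: slot table and clock come from the example above; thread 0 as
# the background/fallback thread and the `computable` callback are assumptions.
from collections import Counter

SLOT_TABLE = [1, 2, 1, 3, 1, 2, 1, 4, 1, 5]
CLOCK_MHZ = 100
BACKGROUND_THREAD = 0  # hypothetical thread that soaks up unused slots


def effective_mhz(slot_table, clock_mhz):
    """Share of the clock each thread gets from its number of table entries."""
    counts = Counter(slot_table)
    return {t: clock_mhz * n / len(slot_table) for t, n in sorted(counts.items())}


def issue_order(slot_table, cycles, computable):
    """Thread issued on each cycle; a slot whose owner is not computable
    is handed to the background thread instead of being wasted."""
    order = []
    for cycle in range(cycles):
        owner = slot_table[cycle % len(slot_table)]
        order.append(owner if computable(owner, cycle) else BACKGROUND_THREAD)
    return order


print(effective_mhz(SLOT_TABLE, CLOCK_MHZ))
# Example: thread 2 is waiting for an event, so its slots go to the background.
print(issue_order(SLOT_TABLE, 10, lambda thread, cycle: thread != 2))
```

The table lookup is one read per cycle, so the scheduling decision itself adds no pipeline stage; only the computable check decides whether the owner or the fallback thread gets the slot.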


When you say zero cost context switch, can you tell me how long it
would take to execute a "wait_for_event" instruction, the thread going
to sleep followed by the event being signaled immediately afterwards
followed by resuming execution of the next instruction? On the Ubicom
core I believe it takes around 10 cycles, far from zero...

The zero context switch time is between two different threads.
If you explicitly yield the thread, then it can take time
to stop/start, but in fine grained parallelism, you
execute for one clock and then the next clock another thread
executes.

Yes. But my point is that if it takes time to start/stop threads then this
is equivalent to the interrupt latency. You can't claim that interrupt
latency is bad for performance but that thread start/stop latency isn't.
It lowers the maximum performance of that thread (in your example of 40 SPI
devices it lowers the maximum SPI frequency) and if the CPU can fill
the idle cycles with another thread they also reduce overall performance.


No, but I am saying that having latencies does not reduce the
total throughput of a CPU.
Even with latencies, you can get a higher utilization of the pipeline
as long as there is at least 1 computable thread.
No bubbles in the pipeline, no branch prediction needed.
Branch prediction will improve the performance of a single thread
but it will not allow the CPU to execute more instructions.

I believe that a thread that replaces an interrupt is started
already at initialization, and put in an event wait state.
Since there is no context to save/restore, then the thread
can react much faster than an interrupt driven device.


But let's assume you have a CPU with a zero-cost context switch. Now I
assume a CPU with zero-cost interrupt latency. Is there really any
difference?

Show me one ;-)

Any multithreaded CPU with a zero-cost context switch will do. You're
claiming those exist, right? So zero-cost interrupt latency exists too.

I am not claiming that a multithreaded CPU has zero interrupt latency.
I am claiming that once it has been decided to switch thread
you can do it without any overhead. It is still going to take time
after an event has occurred, before the decision has been made.

You were trying to prove that a single thread core is as good
as a multithreaded core, and now you are claiming that
a multithreaded core is as good as a multithreaded core, duh!

Again, show me a real CPU with zero cost interrupt latency.


You will not be able to maintain a large number of equal prioritized
threads unless you modify the concept of interrupts to be equal to
multithreading.

If I run a main thread and have a higher priority interrupt thread
servicing interrupts using 100% of CPU time, do you agree it is identical
to an interrupt-based CPU? So an interrupt driven application can be as
fast as a multithreaded one.

If you go back to the case where you are servicing 40 slave SPIs
you will NOT get the same throughput in a single thread machine
simply because you have overhead in servicing the interrupt
and the fact that you will not interrupt another task which
has the same priority level.



Wilco



Do you EVER give up a lost cause?

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB

