Re: Atmel releasing FLASH AVR32 ?




"Ulf Samuelsson" <ulf@xxxxxxxxxxxxx> wrote in message news:etqnn5$v66$1@xxxxxxxxxxx
"Wilco Dijkstra" <Wilco_dot_Dijkstra@xxxxxxxxxxxx> wrote in message
news:2y_Lh.16902$NK3.2627@xxxxxxxxxxxxxxxxxxxxxxx

"Ulf Samuelsson" <ulf@xxxxxxxxxxxxx> wrote in message news:etp769$te9$1@xxxxxxxxxxx

That's true, but function calls are common too and they would typically
branch between pages. And then you have the nasty case of a function
or a loop split between 2 pages...

Fixed by compiler pragma...

Easy to say, a bit harder in reality. If you don't care about code size you could
align big functions to 512-byte boundaries and pack small functions in the
gaps. But even that is hardly a solution, as every minor change in the code
results in a different memory layout, making performance unpredictable.
Basically it is an unsolvable problem.
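
A minimal sketch of the layout problem described above, assuming functions are simply placed back to back in flash with 512-byte pages. The function sizes and the names `layout` and `straddlers` are made up for illustration; no real linker is modelled.

```python
PAGE = 512  # page size in bytes, matching the 512-byte boundaries mentioned above

def layout(sizes, start=0):
    """Place functions back to back and return their (start, end) byte ranges."""
    ranges, addr = [], start
    for size in sizes:
        ranges.append((addr, addr + size))
        addr += size
    return ranges

def straddlers(ranges, page=PAGE):
    """Indices of functions that would fit in one page but cross a page boundary."""
    return [i for i, (lo, hi) in enumerate(ranges)
            if hi - lo <= page and lo // page != (hi - 1) // page]

sizes = [200, 300, 12, 400, 100]      # hypothetical function sizes in bytes
before = straddlers(layout(sizes))    # which functions are split across pages
sizes[0] += 4                         # a minor edit grows the first function
after = straddlers(layout(sizes))     # every later function has moved by 4 bytes
print(before, after)
```

In this made-up layout no function is split before the edit, and the 12-byte function ends up split across two pages after it, although its own code did not change.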

On an ARM7, adding a cache also adds one wait state to all non-cache accesses.

No, a cache doesn't impact other accesses to non-cacheable
memory areas. A local flash cache is something you could
just drop into an existing design without even worrying about
needing to turn it on or flush it. It's completely transparent.

Similarly, branch prediction makes a CPU go faster and so it burns less
power to do a given task. Cortex-M3 has a special branch prediction
scheme to improve performance when running from flash with wait
states, so it makes sense even in low-end CPUs.

Branch prediction cost is chasing an ever-elusive target.

Branch prediction is pretty trivial as branches are very predictable.
A small global branch predictor (for example as used in the ARM1156)
gives amazingly good prediction at a negligible hardware cost.
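
As a rough illustration of why a small global predictor can do well on regular code, here is a sketch of a generic global-history scheme: a 4-bit history register indexing 16 two-bit saturating counters. This is a textbook-style model, not the actual ARM1156 design, and the branch pattern (a loop branch taken three times, then not taken) is an assumption.

```python
def simulate(outcomes, history_bits=4):
    """Return the fraction of branches predicted correctly."""
    table = [1] * (1 << history_bits)    # 2-bit counters, start at "weakly not taken"
    mask = (1 << history_bits) - 1
    history = 0                          # last few branch outcomes, newest in bit 0
    correct = 0
    for taken in outcomes:
        predicted = table[history] >= 2  # counter values 2 and 3 predict "taken"
        correct += (predicted == taken)
        if taken:
            table[history] = min(3, table[history] + 1)
        else:
            table[history] = max(0, table[history] - 1)
        history = ((history << 1) | int(taken)) & mask
    return correct / len(outcomes)

# Hypothetical loop branch: taken three times, then not taken, repeated 250 times.
pattern = [True, True, True, False] * 250
accuracy = simulate(pattern)
print(f"accuracy: {accuracy:.3f}")
```

After a short warm-up the history uniquely identifies the position in the loop, so the only mispredictions are the ones made while the counters are still training. Real code is less regular than this pattern, so the number says nothing about measured accuracy on actual workloads.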

With multithreading you can swap in a computable process and use EVERY cycle.

So what? There are few wasted cycles on modern embedded CPUs.
Only very high-end CPUs are waiting a lot for slow memory.

Multithreading is not relevant in the embedded space, it would add a lot of complexity
and die area for hardly any gain.

Yes it is, just look at a mobile phone: lots of ~20 MIPS CPUs handling
Bluetooth, WLAN, GPS, etc., just because no one has designed
proper multithreading for embedded.

No, phones are extremely integrated and usually have only one CPU,
one DSP and perhaps a microcontroller in the flash card.

Hardware multithreading doesn't give much performance on a high
end CPU, and it gives almost no benefit on a low end one. Less than
10% of the memory bandwidth is unused in an ARM7, so running a
second thread either means it runs at 10% of the maximum speed
or it slows down the main thread.
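
The bandwidth argument above can be put as back-of-the-envelope arithmetic. The sketch below assumes a thread's speed is proportional to the memory bandwidth it gets and that both threads have the same 90% demand at full speed; the two arbitration policies and all names are illustrative only.

```python
BUS_CAPACITY = 100  # total memory bandwidth, in percent
MAIN_DEMAND = 90    # bandwidth the main thread uses at full speed (assumed, per "less than 10% unused")

def main_has_priority(capacity, demand):
    """Main thread keeps full speed; the second thread gets only the leftover."""
    leftover = capacity - demand
    return 1.0, leftover / demand        # (main speed, second speed), 1.0 = full speed

def fair_share(capacity, demand):
    """Both threads split the bus evenly, so both slow down."""
    speed = min(1.0, (capacity / 2) / demand)
    return speed, speed

print(main_has_priority(BUS_CAPACITY, MAIN_DEMAND))  # second thread at roughly a tenth of full speed
print(fair_share(BUS_CAPACITY, MAIN_DEMAND))         # both threads at a bit over half speed
```

Either the second thread crawls along on the spare tenth of the bus, or the main thread gives up speed to it; under these assumptions there is no third option.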

It really only makes sense on high-end
CPUs, but even there the gains are not that impressive.

If you believe that, you don't understand multithreading for embedded.
The purpose is not to increase performance, it is to improve real time
response so you do not have to have multiple CPUs.

You don't understand multithreading at all. Interrupt latency is completely
unaffected by multithreading. Whether you run 2 interrupts in parallel at
half the speed or one after the other at full speed is irrelevant.

You confuse multiprocessing with multithreading. A 2-core CPU can
indeed deal with 2 interrupts in parallel at full speed.
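
A toy model of the comparison above, assuming two interrupt handlers that arrive at the same time and each need 100 cycles of work. It counts execution cycles only and ignores entry latency, context save and memory stalls; the three scheduling functions are hypothetical names for this sketch.

```python
def run_sequential(work):
    """One core, handlers run one after the other at full speed."""
    finish, t = [], 0
    for w in work:
        t += w
        finish.append(t)
    return finish

def run_interleaved(work):
    """One core, hardware threads alternate cycle by cycle (each at half speed)."""
    remaining = list(work)
    finish = [0] * len(work)
    t = 0
    while any(remaining):
        for i in range(len(remaining)):
            if remaining[i] > 0:
                t += 1                   # this cycle is spent on handler i
                remaining[i] -= 1
                if remaining[i] == 0:
                    finish[i] = t
    return finish

def run_two_cores(work):
    """One core per handler, both at full speed."""
    return list(work)

work = [100, 100]                        # assumed handler lengths in cycles
seq = run_sequential(work)
inter = run_interleaved(work)
dual = run_two_cores(work)
print(seq, inter, dual)
```

In this model the last handler finishes at cycle 200 both sequentially and interleaved; only the two-core case finishes both at cycle 100. Interleaving delays the first handler compared with running it alone.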

Adding more
cachelines evens this effect out, making performance more predictable.

No, your unpredictability comes from jumping to a place where,
instead of accessing memory to fetch the page,
you have a cache hit, and then your timing is screwed.

It is impossible to run code at a predictable speed, so you're
screwed no matter whether you use a cache or not.

A cache can even reduce worst case performance since it
can introduce delays in the critical path.

So would a page cache. That is the price you have to pay when
improving performance: the best case is better but the worst
case is typically worse. Overall it is a huge win.

No it is not a win if you have to guarantee that a job completes
in a certain time.

Wrong. Code is highly repetitive, so even if you assume the cache
is invalidated at the start of a task, using a cache results in much
faster execution.
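
A sketch of the cold-cache argument, assuming a loop that fits entirely in the cache, one instruction per fetch, two flash wait states and one extra cycle of miss penalty. All figures and function names are assumptions chosen for illustration, not measurements.

```python
def cycles_uncached(body, iterations, wait_states):
    """Every instruction fetch goes to flash and pays the wait states."""
    return body * iterations * (1 + wait_states)

def cycles_cold_cache(body, iterations, wait_states, miss_penalty=1):
    """Cache is empty at task start: first pass misses, later passes hit in 1 cycle."""
    first_pass = body * (1 + wait_states + miss_penalty)
    later_passes = body * (iterations - 1)
    return first_pass + later_passes

body, wait_states = 20, 2                # hypothetical loop size and flash wait states
loop_uncached = cycles_uncached(body, 50, wait_states)
loop_cached = cycles_cold_cache(body, 50, wait_states)
once_uncached = cycles_uncached(body, 1, wait_states)
once_cached = cycles_cold_cache(body, 1, wait_states)
print(loop_uncached, loop_cached, once_uncached, once_cached)
```

With 50 iterations the cold cache still wins by a wide margin under these assumptions. Code executed exactly once is the case where the miss penalty makes the cache slower, which is the "worst case is typically worse" trade-off mentioned earlier.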

The cache in itself draws power, and you cannot simply compare
accesses to cache with accesses to flash memory.

Of course the cache burns power, but you're not using the flash.
Which uses less power is highly dependent on their size and
implementation. From what I've heard, caches are extremely
efficient for sequential accesses, i.e. code accesses.

You have to run the cached CPU at a higher clock frequency to compensate
for loss of worst case performance.

No, it would be virtually impossible to find code that actually can't
meet its deadline with a cache.

Wilco

