Re: Blue Chip Technology + MagnumX?



John Devereux wrote:
David Hearn <dave@xxxxxxxxxxxxxxxxxxxx> writes:


Has anyone got any experience of Blue Chip Technology
(http://www.bluechiptechnology.co.uk/) and especially their MagnumX
single board computers?
(http://www.bluechiptechnology.co.uk/single_board_computer.php?sub_group_id=2)

We've been investigating the ColdFire MCF5475EVB for some quite simple
5 channel GPIO capture - but at quite high data rates. We've got the
266MHz ColdFire processor to do what we want but it just isn't quite
fast enough - it's missing some of the state transitions on the
channels. We're sampling at around 2 to 2.2MHz and we need more like
3.3MHz.

We were wondering whether a faster (2x?) processor would be more
likely to do the job. The 733MHz MagnumX board sounds like it might
do this, and it has 16 channels of GPIO (we only need 5 inputs at
present, unlikely to increase significantly). I'm aware it's moving
from the ColdFire/M68K processor to an x86 processor, but unless
there's a major difference in per cycle processing, that's not a
problem.

Our app is quite simple - loop until a counter expires, each iteration
store 1 byte pin state register and 32 bit counter value. We
currently can sample up to 60MB of data.
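A minimal sketch of the kind of loop described above, assuming the pin-state and counter registers are reachable through plain memory-mapped pointers. The function name, the pointer parameters and the two separate output arrays are illustrative assumptions, not the poster's actual code, and the loop here stops when the buffer is full rather than on a counter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: on real hardware gpio_reg and timer_reg would be the
 * addresses of the GPIO pin-state register and the counter register,
 * taken from the chip's reference manual. */
size_t capture_fixed_path(volatile const uint8_t *gpio_reg,
                          volatile const uint32_t *timer_reg,
                          uint8_t *pins, uint32_t *ticks,
                          size_t max_samples)
{
    size_t n;

    /* Every iteration does the same work: two register reads, two stores. */
    for (n = 0; n < max_samples; n++) {
        pins[n]  = *gpio_reg;   /* 8-bit pin state  */
        ticks[n] = *timer_reg;  /* 32-bit counter   */
    }
    return n;
}
```

Keeping the pin bytes and counter words in separate arrays stores exactly 5 bytes per sample; a struct holding both would typically be padded to 8 bytes.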

Can you not use DMA instead? And/or perhaps an external FIFO?

Personally, I have no idea. I've never used DMA before (I'm
traditionally a desktop app guy!) - so I'm not sure where I would
start, or what I'd need to do. Essentially, we just want to move a 8
bit GPIO register containing pin status and a 32 bit register
containing a counter value into RAM as fast as possible - at around
3.3MHz (sample every 300ns).
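To put the target in perspective, here is a rough budget calculation using only figures quoted in this thread (266MHz core, 3.3MHz sample rate, 5 bytes per sample, 60MB buffer). The function names are made up for illustration, and 60MB is taken as 60e6 bytes.

```c
/* CPU clock cycles available for each sample. */
double cycles_per_sample(double cpu_hz, double sample_hz)
{
    return cpu_hz / sample_hz;
}

/* Sustained write rate into RAM needed at a given sample rate. */
double bytes_per_second(double sample_hz, unsigned bytes_per_sample)
{
    return sample_hz * (double)bytes_per_sample;
}

/* How long a buffer of the given size lasts at that write rate. */
double seconds_to_fill(double buffer_bytes, double bytes_per_sec)
{
    return buffer_bytes / bytes_per_sec;
}
```

With those inputs the loop has about 80 core cycles per sample, must write about 16.5 MB/s, and a 60MB buffer lasts roughly 3.6 seconds.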

Paul Carpenter beat me to most of the points I would make. I would
just wonder whether you actually need the counter - if the samples are
being acquired reliably then perhaps the sample number can give you the
time?

Yes, if the samples are being taken at guaranteed periodic intervals then use of the counter is not needed.

Originally to save RAM (we 'only' have 64MB on board), we only logged the pin state if it had changed, but this added a little processing into the loop (thus making it slower), and also required the use of the counter to get timing values as the time for each iteration would change depending on whether the pin state had changed. When the pin state was the same the loop was shorter, when it had changed it was longer.
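For comparison, a sketch of the change-only logging variant described above. The names and the poll limit are hypothetical; the point is only that the extra compare-and-branch makes the time per iteration depend on whether the pins changed, which is why a counter value has to be stored with each logged state.

```c
#include <stddef.h>
#include <stdint.h>

/* Log a (pin state, counter) pair only when the pin state differs from the
 * previous one. Stops after `polls` reads or when the buffers are full. */
size_t capture_on_change(volatile const uint8_t *gpio_reg,
                         volatile const uint32_t *timer_reg,
                         uint8_t *pins, uint32_t *ticks,
                         size_t max_samples, uint32_t polls)
{
    size_t n = 0;
    uint8_t last = 0;
    int have_last = 0;              /* forces the first read to be logged */

    while (polls-- > 0 && n < max_samples) {
        uint8_t now = *gpio_reg;
        if (!have_last || now != last) {
            pins[n]  = now;         /* longer path: two stores plus a timer read */
            ticks[n] = *timer_reg;
            n++;
            last = now;
            have_last = 1;
        }                           /* shorter path: nothing stored */
    }
    return n;
}
```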

To try and speed things up (ie. reduce memory accesses and conditionals), we started just sampling the pin state, but not filtering out any non-changes - therefore making the code path in each iteration of the loop identical. This, in theory, should give the same sample rate each time, once the 'fudge' factor (time taken for each loop iteration) is calculated.
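One way the 'fudge' factor could be obtained, sketched under assumptions: read the counter once before and once after the whole capture, divide the elapsed time by the number of samples, and use the sample index as the timestamp. The function names and the up-counting timer are assumptions; for a down-counting timer the two timer arguments would be swapped.

```c
#include <stddef.h>
#include <stdint.h>

/* Average time per loop iteration in nanoseconds. Unsigned subtraction
 * gives the right answer across a single wrap of an up-counter. */
double sample_period_ns(uint32_t timer_start, uint32_t timer_end,
                        double timer_hz, size_t samples)
{
    uint32_t elapsed = timer_end - timer_start;

    if (samples == 0 || timer_hz <= 0.0)
        return 0.0;                 /* nothing sensible to report */
    return (double)elapsed / timer_hz * 1e9 / (double)samples;
}

/* Time of sample number `index` relative to the start of the capture. */
double sample_time_ns(size_t index, double period_ns)
{
    return (double)index * period_ns;
}
```

This only holds if every iteration really does take the same time, so anything that stalls the loop unevenly (interrupts, memory refresh, bus contention) shows up as timing error.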

I believe this worked well - however the sample rate was still only fractionally faster than 2MHz - not close to the (approx) 3.3MHz we'd like.

Anyway, DMA uses special hardware facilities to move data around,
rather than doing it in software. Look it up in the datasheet for your
ColdFire chip. You should be able to use the output of a timer to trigger
a DMA transfer from a PIO port to memory, at a programmed rate.

I'll look into that. The issue I found yesterday whilst looking at some generic DMA stuff is that the amount of code to initiate a DMA transfer is significant, much more than what is currently occurring in the loop. Whilst the memory transfer might be faster, the actual processing to initiate the transfer will be greater.

I guess this is what Paul meant by using a deep FIFO (I'm still learning!). This can be used to buffer up the samples such that the DMA transfer can be done on a larger block, rather than the 1 (or 5) bytes I'm currently processing each time.

Am I correct in saying that using DMA to transfer 1 or 5 bytes (if including a counter) at a time is likely to have more overhead than just doing a normal array indexed write?
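A rough cost model for that question, assuming the CPU has to set up every transfer in software. The setup and per-byte figures used in the note below are invented purely for illustration and are not measurements of any ColdFire part.

```c
/* Average cost per sample when one software-initiated transfer moves
 * `samples_per_transfer` samples: the fixed setup cost is shared across
 * the block, the per-byte cost is not. */
double cost_per_sample_ns(double setup_ns, double per_byte_ns,
                          unsigned bytes_per_sample,
                          unsigned samples_per_transfer)
{
    if (samples_per_transfer == 0)
        return 0.0;                 /* no transfer, no cost */
    return setup_ns / (double)samples_per_transfer
         + per_byte_ns * (double)bytes_per_sample;
}
```

With an assumed 1000ns setup and 10ns per byte, one 5-byte sample per transfer costs 1050ns, far above the 300ns budget, while 1024 samples per transfer costs about 51ns each. A timer-triggered DMA channel that is configured once and then runs without CPU involvement would not pay the setup cost per sample at all.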

We originally tried using a timer to generate interrupts and to sample at a constant rate, however, we found that the sampling speed was far slower than using a simple while() loop - too slow for our needs. To further complicate matters, turning full optimisation on (using gcc for M68k) caused interrupts to break. Using full optimisation made a noticeable difference in speed when doing the simple while() loop.
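The post does not say why interrupts broke under full optimisation. One common cause in general (an assumption here, not a diagnosis) is a variable shared between an interrupt handler and the main loop that is not declared volatile, so the optimiser keeps a stale copy in a register. A minimal sketch with hypothetical names:

```c
#include <stdint.h>

/* Written by the ISR, read by the main loop. Without `volatile` an
 * optimising compiler may read it once and spin forever on the copy. */
static volatile uint32_t tick_count;

/* Hypothetical timer ISR; on real hardware it would be installed in the
 * interrupt vector table and would also acknowledge the timer. */
void timer_isr(void)
{
    tick_count++;
}

/* Current tick count as seen by non-interrupt code. */
uint32_t ticks_now(void)
{
    return tick_count;
}

/* Busy-wait until the ISR has advanced the counter by `ticks`. */
void wait_ticks(uint32_t ticks)
{
    uint32_t start = tick_count;

    while ((uint32_t)(tick_count - start) < ticks) {
        /* each pass re-reads tick_count from memory */
    }
}
```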

Unfortunately I have little control over modifications of the hardware (ie. adding FIFOs etc). Whilst I can advise what might be a better off the shelf choice, custom changes to boards is unlikely to happen at this stage (proof of concept and initial development).

It sounds like this application depends more on the GPIO rate rather
than the CPU core clock speed.

Yes, I think this also affects the speed of capture - however, we
found roughly the following benchmark figures (quoted from memory):

Sample 8 bit GPIO register, copy into RAM: 2.2MHz
Sample 32 bit slice timer register, copy into RAM: 2.15MHz
Sample 8 bit GPIO + 32 bit slice timer register, copy into RAM: 1.8MHz
Sample nothing (and store nothing) but otherwise same code: 4MHz
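Converting those rates to time per iteration makes the gap easier to see. This is plain arithmetic on the figures above; the helper names are made up.

```c
/* Time per loop iteration in nanoseconds for a rate given in MHz. */
double period_ns(double rate_mhz)
{
    return 1000.0 / rate_mhz;
}

/* Extra time per iteration of a measured loop compared with the empty one. */
double extra_ns(double rate_mhz, double empty_rate_mhz)
{
    return period_ns(rate_mhz) - period_ns(empty_rate_mhz);
}
```

The empty loop takes 250ns per iteration, the GPIO-only loop about 455ns, the timer-only loop about 465ns and the combined loop about 556ns, against a target of about 303ns at 3.3MHz. So the register read plus store adds roughly 205ns for the GPIO and 215ns for the timer, and the loop skeleton alone already uses most of the budget.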

Have you turned on the cache?

The branch and instruction caches are turned on; the data cache is turned off, as I've so far been unsuccessful in getting the CACR (Cache Control Register) to mask off the registers from the cache. Turning it on means we get the same values back from the counter and pin status registers (as expected if they're being cached).

Turning on the branch and instruction cache made a huge difference. However, due to the fact we're writing up to 60MB to RAM (without reading back) over a few seconds, I can't see how adding the data cache will significantly improve the performance. Most, if not all, of the local variables (array indexes etc.) that are read and written most often already seem to be held in registers.

This to me suggests that the bottleneck is either memory or reading
the registers.

We've also benchmarked how fast we can toggle the GPIO when configured
for output (rather than input) and I think each level lasted 50ns (so
every 50ns the pin output toggled). The fastest we need to sample at
is around 300ns. So unless input capture (via polling) is 6 times
slower than output, I can't see that currently the GPIO speed is a
problem.

Of course, when considering another architecture, we need to ensure
that the GPIO speed is considered.

I'm certainly no true embedded engineer - I'm traditionally a PC software engineer who hadn't had to worry about low level code before. In the (small) company, I probably have the most 'embedded' experience through doing some set top box software development (using provided SDKs not requiring low level hardware interfacing). This has been my first true embedded development project, and it has had a steep learning curve but has been fun. My knowledge of architectures and embedded hardware is very limited. The ColdFire board was selected by someone else (who's moved on now!) and if I can find a more suitable architecture/board then that would be good. I'm not in a position to design any hardware really - just select something that's already out there.

I have no experience of selecting, programming or interfacing with FPGAs, but I have had it suggested before that something like this might be better suited to this task. The problem is, I'm not a hardware design guy.

I've even had a suggestion along the lines of "just use a memory controller chip and use a clock to transfer the pin status into the RAM directly" (or something like that). Sounds a great idea - but that's not something I can do with my current skills.

I appreciate all your help - but really, at the moment, we're limited to using an off-the-shelf processor board - it's just knowing which is suitable. We're 2/3 of the speed we need at present - so not a huge amount to make up - it feels as if we should be able to do it using our current 'design' - just wondering whether other hardware would have the speed increase we're looking for.

Thanks

David