Re: Handling high UDP throughput





Bill A. wrote:

Vladimir Vassilevsky wrote:

Bill A. wrote:

Vladimir Vassilevsky wrote:


The IP/UDP stack at 40MB/s is a substantial computing load. You
need a ~GHz-class CPU with appropriate memory and DMA
subsystems.

You can do 60MB/s easily with TCP to a 500MHz PowerPC even using a
WinXP PC as the host.

Only if you just send the same packet over and over in a dummy loop
and do nothing else.

Actually, my tests that sent data and did nothing with it got me over 920 Mb/s. You can't just throw a system together and do this. You won't get that with Linux or any other OS. The product that uses this sustains 540 Mb/s while a 38 kHz interrupt is running that uses more than half the processor's power, so a lot goes on in the system but a lot of time is still available for TCP/IP. The Ethernet driver was optimized, the memory movement was optimized (just using an inline memcpy that does a DMA transfer adds 30% to the effective speed), the IP checksum was in assembly, and a zero-copy TCP/IP stack was required.
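
For readers unfamiliar with the checksum mentioned above, here is a minimal portable C sketch of the Internet checksum as described in RFC 1071. It is not the assembly routine from the post; the function name inet_checksum is hypothetical, and a tuned version would normally process wider words per iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071): one's-complement sum of 16-bit
 * big-endian words, folded to 16 bits and inverted.
 * The 32-bit accumulator is safe for inputs up to 128 KiB,
 * which covers any single IP packet. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Add up complete 16-bit words in network byte order. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }

    /* An odd trailing byte is padded with a zero low byte. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Run over an IPv4 header whose checksum field is zero, this returns the value to store in that field; run over a header with a correct checksum already in place, it returns 0.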

This was with the Freescale QUICC 8349 so I concur with the other post - this processor can do it - it's designed as a communications processor.

A lot depends on what OS and TCP/IP stack are used on
the device, what is done with the data once received, and how much
time you can put into optimizing the system.

I'm not just saying you can do this because I think you can - I've
done it.

I've done 100Mbit Tx/Rx with BlackFin at 600MHz. Even 12/12 MB/s of UDP
traffic is a considerable load. Copying between different buffers,
calculating checksums, cache thrashing and so on: none of that is free,
and all of it hogs the bus and CPU.


I didn't say it was easy. I didn't say a system like you used could do it. I'm only saying it is possible in an embedded device with a reasonable processor - you don't need ~GHz as you claimed.

What OS did you use? What stack? How many TX buffers did you have?

Our own OS, our own stack and MAC driver, 4/4 Rx/Tx buffers, 100/100 full duplex. It was found that there is generally no advantage in using more than 4 buffers; fewer than 4 buffers decreases the throughput.

How fast could the processor get the data to the MAC?

That is done by DMA. The speed depends on many factors.

Did you do zero-copy TCP/IP (it's very hard to do this with sockets)?

No, it has to copy the data. You have to do that not just because of sockets but because BlackFin has no automatic means of ensuring cache/DMA coherency.

The QUICC buffer descriptor memory makes it very easy to send lots of data without processor intervention. Oh, I forgot, the Ethernet driver I wrote wasn't even interrupt driven.

So, the driver is blocking. No multitasking.

At those interrupt rates there was no improvement over simply polling for data. This may have been because when polling, the processor cache wasn't constantly being replaced by the Ethernet interrupt service routine.
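
To make the polled approach concrete, here is a minimal sketch of draining a receive descriptor ring without interrupts. It is a software simulation, not the QUICC driver described above: the struct layout, the function names, and the mac_sim_deliver stand-in for the hardware are all hypothetical, and the ring size of 4 is borrowed from the buffer count quoted earlier in the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 4u      /* assumed: the 4-buffer figure quoted in the thread */
#define BUF_SIZE  1536u   /* assumed: room for one full Ethernet frame */

struct rx_desc {
    volatile uint16_t ready;  /* set by the MAC when a frame has landed */
    uint16_t length;          /* valid bytes in data[] */
    uint8_t  data[BUF_SIZE];
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    unsigned cpu_next;        /* next descriptor the CPU will examine */
    unsigned mac_next;        /* next descriptor the simulated MAC will fill */
};

typedef void (*frame_handler)(const uint8_t *frame, uint16_t len, void *ctx);

/* Stand-in for the hardware: returns 1 if the frame was stored,
 * 0 if it was dropped (oversized, or the ring is full). */
int mac_sim_deliver(struct rx_ring *ring, const uint8_t *frame, uint16_t len)
{
    struct rx_desc *d = &ring->desc[ring->mac_next];
    if (len > BUF_SIZE || d->ready)
        return 0;
    memcpy(d->data, frame, len);
    d->length = len;
    d->ready = 1;
    ring->mac_next = (ring->mac_next + 1u) % RING_SIZE;
    return 1;
}

/* Polled receive: handle every completed descriptor in order, hand each
 * buffer back to the MAC, and return the number of frames handled. */
unsigned rx_poll(struct rx_ring *ring, frame_handler handle, void *ctx)
{
    unsigned handled = 0;
    while (handled < RING_SIZE && ring->desc[ring->cpu_next].ready) {
        struct rx_desc *d = &ring->desc[ring->cpu_next];
        handle(d->data, d->length, ctx);
        d->ready = 0;
        ring->cpu_next = (ring->cpu_next + 1u) % RING_SIZE;
        handled++;
    }
    return handled;
}

/* Example handler: adds each frame's length to an unsigned long total. */
void count_bytes(const uint8_t *frame, uint16_t len, void *ctx)
{
    (void)frame;
    *(unsigned long *)ctx += len;
}
```

The main loop would call rx_poll once per pass alongside its other work. On real hardware the ready flag lives in descriptor memory written by the MAC, so cache and memory-ordering rules for that region apply and are not modelled here.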

Servicing interrupts at rates of up to hundreds of kHz isn't a big problem on BlackFin. The context switch overhead is only ~200ns, and the interrupt code and data are located in L1, so there is no stalling because of the cache.
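
As a rough worked example of that claim, the sketch below turns the ~200ns figure into a CPU fraction and a cycle count. The helper names are hypothetical, and the numbers cover only interrupt entry and exit, not the work done inside the handler.

```c
/* Fraction of CPU time (0..1) spent on interrupt entry/exit alone. */
double irq_overhead_fraction(double rate_hz, double overhead_s)
{
    return rate_hz * overhead_s;
}

/* The same per-interrupt overhead expressed in CPU clock cycles. */
double irq_overhead_cycles(double cpu_hz, double overhead_s)
{
    return cpu_hz * overhead_s;
}
```

With 200ns of overhead, 100 kHz of interrupts costs about 2% of the CPU and 500 kHz about 10%; at 600MHz, 200ns is about 120 cycles per interrupt.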


Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com