Re: Automatic parallelization - was Re: LISP Object Oriented?



On 2 Feb 2007 01:57:34 -0800, "Tim Bradshaw" <tfb+google@xxxxxxxx>
wrote:

On Feb 1, 11:16 pm, George Neuner <gneune...@xxxxxxxxxxx> wrote:

Ok, I agree with you that the interconnect is important - specifically
the cache coherence system. However, you seem to be assuming that the
interconnects of all these new multi-core chips are naively
implemented.

Not at all. I'm assuming (to put it simple-mindedly) that there is
only so much you can do with the number of wires you can get out of a
single package, and (critically) with the interconnect available from
that package to main memory and the design of that memory system, and
that dealing with that problem is an expensive issue which has
ramifications far beyond the design of the package alone.

That's true. But again, mid-level desktops today are sporting memory
systems whose speed differential between CPU and memory is equal to
that of 1990s systems. High-end units have relatively more memory
bandwidth than their old counterparts. I persist in bringing up the
'90s because that's when most of the publicly available SMP
performance studies were done.

Now, the situation changes as you add CPUs, but the hit each CPU takes
is critically dependent on its stall characteristics, cache
performance and on the width of its memory path. Using Intel's
numbers for the Pentium 4, a snoopy design running over PC6400 should
be able to support up to 8 2GHz CPUs. As you noted before, there are
other, more complex, coherence systems that are more performant and
other CPUs better suited to multiprocessor use - I am deliberately
picking on a relatively poor choice.
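
To put rough numbers on that claim: if "PC6400" is read as memory with a
peak bandwidth of 6400 MB/s (an assumption; the post does not spell it
out), then 8 CPUs sharing it leaves each an average budget of 800 MB/s.
The sketch below is only that division. It ignores snoop traffic and
every other overhead, and the function names are made up for
illustration.

```python
def per_cpu_budget(bus_mb_s, n_cpus):
    """Average memory bandwidth available to each CPU on a shared bus."""
    return bus_mb_s / n_cpus


def max_cpus(bus_mb_s, per_cpu_mb_s):
    """Largest CPU count whose combined average demand fits on the bus."""
    return int(bus_mb_s // per_cpu_mb_s)


# Assumed peak bandwidth for "PC6400" memory, in MB/s.
PC6400_MB_S = 6400

print(per_cpu_budget(PC6400_MB_S, 8))   # 800.0
print(max_cpus(PC6400_MB_S, 800))       # 8
```

Read the other way round: the 8-CPU figure only holds if each CPU averages
no more than about 800 MB/s of memory traffic, which is where the stall
and cache behavior mentioned above come in.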

The situation can improve with cores. With separate CPUs, the on-chip
L1 caches can't be snooped - the snoop circuitry is on (or above) the
external L2 cache, requiring that the L1 cache be write-through so it
can be (indirectly) monitored and lengthening each coherent write
cycle because of the transmission delay in going off chip. With
multiple cores, the snoop circuitry can be on-chip where it can
monitor the L1 caches directly and performance can be improved by
making the L1 caches write-back.
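
A toy count of why write-back helps, under loud assumptions: a cache big
enough that nothing is evicted before one final flush, and no other core
touching the same lines in the meantime. Under write-through every store
leaves the L1; under write-back each dirty line leaves once. This is a
simplified model, not a description of any particular chip, and the
function names are hypothetical.

```python
def writes_leaving_l1_write_through(stores):
    """Write-through: every store is propagated out of the L1."""
    return len(stores)


def writes_leaving_l1_write_back(stores):
    """Write-back (toy model): each dirty line is written out once,
    at a final flush, however many times it was stored to."""
    return len(set(stores))


# Sequence of cache-line addresses being stored to (made-up example).
stores = [0, 0, 0, 1, 1, 0]

print(writes_leaving_l1_write_through(stores))  # 6
print(writes_leaving_l1_write_back(stores))     # 2
```

Real coherence traffic sits somewhere in between, since a snoop hit on a
dirty line forces it out early, but the direction of the saving is the
same.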


Think about something like disks. If I produce a disk which can
transfer data four times as fast as my previous design, what does that
mean to you as an integrator?

It means nothing unless I have to upgrade the bus and chip set to
handle it.


Additionally, putting the interconnect
on-chip should be much cheaper and significantly faster than the
equivalent mechanism for separate CPUs.

If only it were that simple. Again think of the disk analogy: what
does a typical desktop disk system look like? What does a typical (non-
redundant, for the sake of argument) enterprise system of the same
capacity look like? Why?

Are you kidding? A typical desktop has connections for 4 SATA drives
with hardware RAID 0 and RAID 1, plus 4 ATA-133 drives and 2 floppies
- the fact that nobody uses all of it is irrelevant. Leaving aside
hot swap capability, today's desktop disk systems don't look all that
much different from high performance rack systems. Despite what you
may have heard, SOL storage systems are quite rare except in
supercomputer centers - most enterprise storage is on UltraSCSI or
IEEE 1394 (Firewire), both of which are only moderately more
performant than 3Gb/s SATA.

No drives can sustain transfers at anywhere near the speed of any of
these interfaces. Not to mention that the correcting RAID systems
used in enterprise storage are slightly less performant than
non-correcting ones.
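
As a rough illustration of the gap between interface speed and drive
speed: 3Gb/s SATA uses 8b/10b encoding, so its usable payload rate is
about 300 MB/s. The drive figure below (75 MB/s sustained) is a
placeholder picked for the arithmetic, not a measurement, and the
function names are made up.

```python
def sata_payload_mb_s(line_rate_gbit_s):
    """Payload rate in MB/s for a SATA link with 8b/10b encoding
    (8 data bits carried per 10 bits on the wire)."""
    line_rate_mbit_s = line_rate_gbit_s * 1000
    return line_rate_mbit_s * 8 / 10 / 8


def interface_utilization(drive_mb_s, interface_mb_s):
    """Fraction of the interface a single drive can keep busy."""
    return drive_mb_s / interface_mb_s


link = sata_payload_mb_s(3.0)
print(link)                                # 300.0

# Hypothetical sustained drive rate, for illustration only.
print(interface_utilization(75.0, link))   # 0.25
```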


I come from a high performance background ... I have worked on and
with multiprocessors for many years - mostly SMP, but I've also worked
with DM-SIMD (CM-2) and DM-MIMD (CM-5) and with clusters. In one of
my former lives, I was part of a team that designed and programmed a
proprietary image processing board using a DSP (used as an FP
microcontroller) to configure and sequence 4 FPGA processing elements
with symmetric point-to-point links, running over a dual ported, 320
MB/s _sustained_ throughput memory system built out of 100 MHz SDRAM.

I do know something of the subject ... I'm not just regurgitating the
hype from PC Week.

George
--
for email reply remove "/" from address


