Re: Automatic parallelization - was Re: LISP Object Oriented?
- From: George Neuner <gneuner2/@comcast.net>
- Date: Sat, 03 Feb 2007 05:04:50 -0500
On 2 Feb 2007 01:57:34 -0800, "Tim Bradshaw" <tfb+google@xxxxxxxx>
wrote:
On Feb 1, 11:16 pm, George Neuner <gneune...@xxxxxxxxxxx> wrote:
Ok, I agree with you that the interconnect is important - specifically
the cache coherence system. However, you seem to be assuming that the
interconnects of all these new multi-core chips are naively
implemented.
Not at all. I'm assuming (to put it simple-mindedly) that there is
only so much you can do with the number of wires you can get out of a
single package, and (critically) with the interconnect available from
that package to main memory and the design of that memory system, and
that dealing with that problem is an expensive issue which has
ramifications far beyond the design of the package alone.
That's true. But again, mid-level desktops today sport memory systems
whose CPU-to-memory speed differential is about what it was in the
1990s. High-end units have relatively more memory bandwidth than
their old counterparts. I persist in bringing up the '90s because
that's when most of the publicly available SMP performance studies
were done.
Now, the situation changes as you add CPUs, but the hit each CPU takes
is critically dependent on its stall characteristics, cache
performance and on the width of its memory path. Using Intel's
numbers for the Pentium 4, a snoopy design running over PC6400 should
be able to support up to eight 2GHz CPUs. As you noted before, there are
other, more complex, coherence systems that are more performant and
other CPUs better suited to multiprocessor use - I am deliberately
picking on a relatively poor choice.
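For a rough sense of scale, here is a back-of-the-envelope sketch of how
one memory channel's peak bandwidth divides among CPUs. It assumes
"PC6400" means a single PC2-6400 (DDR2-800) channel with a 64-bit data
bus. It does not reproduce Intel's Pentium 4 figures, and it ignores
stalls, cache hit rates and snoop traffic. The function names are made
up for illustration.

```python
# Peak bandwidth of one PC2-6400 channel, split evenly among N CPUs.
# Assumption: "PC6400" is read as PC2-6400, i.e. DDR2-800 on a 64-bit bus.
BUS_WIDTH_BYTES = 8          # 64-bit data bus
TRANSFERS_PER_SEC = 800e6    # DDR2-800: 800 million transfers per second


def peak_bandwidth_mb_s():
    """Theoretical peak of the channel in MB/s (1 MB = 10**6 bytes)."""
    return BUS_WIDTH_BYTES * TRANSFERS_PER_SEC / 1e6


def per_cpu_share_mb_s(n_cpus):
    """Even split of the peak among n_cpus; real sharing is never this clean."""
    return peak_bandwidth_mb_s() / n_cpus


def bytes_per_cycle(n_cpus, clock_hz=2e9):
    """Memory bytes available to each CPU per clock cycle at clock_hz."""
    return per_cpu_share_mb_s(n_cpus) * 1e6 / clock_hz


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, per_cpu_share_mb_s(n), bytes_per_cycle(n))
```

With eight 2GHz CPUs each one gets under half a byte of memory traffic
per cycle at peak, which is why the stall and cache behaviour of the
CPU decides whether that is enough.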
The situation can improve with cores. With separate CPUs, the on-chip
L1 caches can't be snooped - the snoop circuitry is on (or above) the
external L2 cache, requiring that the L1 cache be write-through so it
can be (indirectly) monitored and lengthening each coherent write
cycle because of the transmission delay in going off chip. With
multiple cores, the snoop circuitry can be on-chip where it can
monitor the L1 caches directly and performance can be improved by
making the L1 caches write-back.
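To illustrate why write-back helps, here is a toy count of the writes
that leave an L1 cache under each policy. It is only a sketch: it
models one core with a tiny direct-mapped cache, counts stores only,
and ignores line fills and the coherence protocol itself. The
functions and the trace are hypothetical.

```python
# Toy model: how many writes leave the L1 cache under each write policy.
# Assumptions: one core, direct-mapped cache, the trace contains only stores
# (given as cache-line addresses), and line fills are not counted.

def bus_writes_write_through(trace):
    """Write-through: every store is forwarded, so each costs one bus write."""
    return len(trace)


def bus_writes_write_back(trace, cache_lines=4):
    """Write-back: a dirty line goes out only when evicted or at the final flush."""
    cache = {}        # slot index -> line address held there (always dirty here)
    writebacks = 0
    for line in trace:
        slot = line % cache_lines
        resident = cache.get(slot)
        if resident is not None and resident != line:
            writebacks += 1          # evicting a different dirty line
        cache[slot] = line
    writebacks += len(cache)         # flush the lines that are still dirty
    return writebacks


if __name__ == "__main__":
    trace = [0, 0, 0, 1, 1, 0, 4, 4]   # repeated stores to a few lines
    print(bus_writes_write_through(trace), bus_writes_write_back(trace))
```

The more often a core stores to lines it already holds, the larger the
gap between the two counts.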
Think about something like disks. If I produce a disk which can
transfer data four times as fast as my previous design, what does that
mean to you as an integrator?
It means nothing unless I have to upgrade the bus and chip set to
handle it.
Additionally, putting the interconnect
on-chip should be much cheaper and significantly faster than the
equivalent mechanism for separate CPUs.
If only it were that simple. Again think of the disk analogy: what
does a typical desktop disk system look like? What does a typical (non-
redundant, for the sake of argument) enterprise system of the same
capacity look like? Why?
Are you kidding? A typical desktop has connections for 4 SATA drives
with hardware RAID 0 and RAID 1, plus 4 ATA-133 drives and 2 floppies
- the fact that nobody uses all of it is irrelevant. Leaving aside
hot swap capability, today's desktop disk systems don't look all that
much different from high performance rack systems. Despite what you
may have heard, SOL storage systems are quite rare except in
supercomputer centers - most enterprise storage is on UltraSCSI or
IEEE 1394 (Firewire), both of which are only moderately more
performant than 3Gb/s SATA.
No drives can sustain transfers at anywhere near the speed of any of
these interfaces. Not to mention that the correcting RAID systems
used in enterprise storage are slightly less performant than
non-correcting ones.
I come from a high performance background ... I have worked on and
with multiprocessors for many years - mostly SMP, but I've also worked
with DM-SIMD (CM-2) and DM-MIMD (CM-5) and with clusters. In one of
my former lives, I was part of a team that designed and programmed a
proprietary image processing board using a DSP (used as a FP
microcontroller) to configure and sequence 4 FPGA processing elements
with symmetric point-to-point links, running over a dual ported, 320
MB/s _sustained_ throughput memory system built out of 100MHz SDRAM.
I do know something of the subject ... I'm not just regurgitating the
hype from PC Week.
George
--
for email reply remove "/" from address