Re: Cell Architecture Explained (MASSIVE AMOUNT OF INFO)

From: Robert Myers (rmyers1400_at_comcast.net)


Date: Sat, 22 Jan 2005 09:53:57 -0500

On Sat, 22 Jan 2005 01:53:48 GMT, Maynard Handley <name99@name99.org>
wrote:

>In article <QvadnatFwfzw6W3cRVn-ow@comcast.com>,
> "Xenon" <xenonxbox2@xboxnext.com> wrote:
>
>> The lack of cache and virtual memory systems means the APUs operate in a
>> different way from conventional CPUs. This will likely make them harder to
> ^^^^^
>> program but they have been designed this way to reduce complexity and
>> increase performance.
>
>You don't say.
>Programming Itanic was a picnic compared to programming this thing; at
>least Itanic used a traditional computer architecture.
>And yet Intel/HP, with all the money in the world, couldn't make it fly.
>Please tell us why IBM/Sony/Toshiba can do what Intel/HP could not.
>

Itanium and Cell both offer advantages for problems that can be
formulated to exploit the architecture. In the case of Itanium, the
advantages have turned out not to be overwhelming. In the case of
stream processors, there are already off-the-shelf GPUs that can
significantly outperform any conventional microprocessor for some
kinds of problems, and the advantage of stream processors will only
grow as feature sizes decrease.
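
To make "problems that can be formulated to exploit the architecture"
concrete, here's a minimal C sketch--my own illustration, not from
any vendor toolkit. The first loop is the kind of thing a stream
processor eats: regular access, no state shared between elements,
trivially split across many execution units. The second defeats
streaming entirely, because each address depends on the data just
loaded:

    /* Streams well: regular, independent, bandwidth-bound. */
    void fir3(const float *in, float *out, int n,
              float c0, float c1, float c2)
    {
        int i;
        for (i = 2; i < n; i++)
            out[i] = c0 * in[i] + c1 * in[i - 1] + c2 * in[i - 2];
    }

    /* Streams badly: each load's address comes from the previous
       load, so deep pipelines and wide SIMD buy you nothing. */
    struct node { struct node *next; float v; };

    float chase(const struct node *p)
    {
        float s = 0.0f;
        while (p) { s += p->v; p = p->next; }
        return s;
    }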

>(Note, I am not denying that Cell may make a fine Playstation chip.
>I AM denying that it will make a fine workstation chip, will take over
>the computing world, make all other CPUs obsolete, blah blah blah.)
>

Predicting the future is really hard. Genuine paradigm shifts are
rare, but I think this one is on its way. The future of computing is
more like what happens on network processors and GPUs than what
happens on x86, PowerPC, or Itanium. The change is being driven by
physics, not marketing.

>> This may sound like an inflexible system which will be complex to program,
>> and it most likely is, but this system will deliver data to the APU registers
>
>So in return for giving up cache, your code has to manually move data
>to/from memory. That'll be easy for the compiler to figure out.
>

Of course it won't. But the same problem--how do I get the data to
where I need it, when I need it?--exists in any architecture. Cache
and registers add a set of tools for dealing with that problem; they
don't make it go away. In the case of at least some stream
processors, there is a _register_ hierarchy: a low-bandwidth stream
register file that faces memory, and local register files that act
much like a conventional vector register file.
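
In practice, that hierarchy means the data movement is spelled out in
the program. Here is a single-buffered sketch of the style, with
made-up dma_get/dma_put/dma_wait primitives standing in for whatever
the real hardware exposes (stubbed with memcpy so the fragment is
self-contained):

    #include <stddef.h>
    #include <string.h>

    enum { CHUNK = 1024 };

    /* Stand-ins for asynchronous DMA commands on a real stream
       machine; the names and signatures are mine, not any SDK's. */
    static void dma_get(void *loc, const void *mem, size_t n)
        { memcpy(loc, mem, n); }
    static void dma_put(void *mem, const void *loc, size_t n)
        { memcpy(mem, loc, n); }
    static void dma_wait(void) { }

    void scale(const float *mem_in, float *mem_out, size_t n, float k)
    {
        static float buf[CHUNK];       /* the "local store" */
        size_t i, j, m;
        for (i = 0; i < n; i += m) {
            m = (n - i < CHUNK) ? n - i : CHUNK;
            dma_get(buf, mem_in + i, m * sizeof buf[0]);
            dma_wait();        /* the stall is here, not on a miss */
            for (j = 0; j < m; j++)
                buf[j] *= k;   /* operands now sit in local registers */
            dma_put(mem_out + i, buf, m * sizeof buf[0]);
            dma_wait();
        }
    }

The obvious refinement is double buffering--fetch chunk i+1 while
computing on chunk i--which is exactly the scheduling burden the
cache used to shoulder for you.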

<snip>

>
>There's (much much, so much) more blather and ranting about how
>fantastic Cell is and how it will solve any problem you can possibly
>imagine, but for those of us in the reality-based community, I think the
>points I have extracted above are the highlights.
>
>Bottom line is that this thing doesn't resemble any traditional CPU and
>is therefore a godawful match to existing languages, compilers and
>algorithms. Unless IBM/Sony/Toshiba have, in some other pocket, kept
>an extremely good secret that solves problems many people have been
>working on for more than twenty years, you'll be programming this thing
>with an assembly language mindset, even if you are nominally using a
>high-level language --- like you program AltiVec today. Only it'll be so
>much more fun because not only will you be worrying about alignment and
>algorithm issues, you'll be trying to juggle fitting your instructions
>and data into local memory (we weren't given a size for this but if it
>is to run at L1 cache speeds, it can't be wildly far off from say 64K to
>512K bytes); none of that getting the cache to just hide the problem for
>you if you might want to load from an infrequently used table, handle a
>rare exception condition or whatever; it'll be manual segment swapping
>all over again. Not to mention the other glorious aspects. You'll be
>using some bizarro method to handle coherency. You'll have the engine
>that drives your code and handles exceptions and such running on a
>different processor from where the compute intensive code lives.
>

Maybe. Somebody must like programming these things, because people
are already doing it--just for fun, apparently.

The problems are formidable, but it is early days yet when it comes to
inventing programming models and algorithms for stream processors.
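
As a hint of where those models might go: the stream-language
proposals I've seen (Brook and its relatives) boil down to the
programmer writing a pure per-element kernel while the tool chain
owns iteration order and data movement. A deliberately trivial C
mock-up of that division of labor:

    #include <stddef.h>

    /* The programmer supplies a side-effect-free kernel... */
    typedef float (*kernel_fn)(float);

    static float square(float x) { return x * x; }

    /* ...and the runtime owns the loop. On a stream machine, this
       is where the DMA staging from the earlier sketch would be
       generated automatically. */
    void map_stream(kernel_fn k, const float *in, float *out, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++)
            out[i] = k(in[i]);
    }

    /* usage: map_stream(square, in, out, n); */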

One future I can see is that data (and instructions) will no longer be
associated with memory locations but with labelled packets.
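
Purely as a toy sketch of what I mean--nothing anyone has announced:
the unit of data becomes a self-describing packet, and consumers
match on tags instead of dereferencing addresses, much like the token
matching of the dataflow machines I mention below.

    /* A datum labelled by what it is, not by where it lives. */
    struct packet {
        unsigned tag;         /* which logical object it belongs to */
        unsigned seq;         /* its position within that object */
        float    payload[16];
    };

    /* A consumer fires on the packets it wants, wherever they
       happen to arrive from. */
    static int wants(const struct packet *p, unsigned tag)
    {
        return p->tag == tag;
    }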

Will there always be something that looks like a conventional
microprocessor? Let's wait and see what the promised workstations
look like. Weren't we supposed to have seen them last fall?

The one thing in all this that _really_ gives me pause is that making
it work in the general case seems like getting a dataflow machine to
work in the general case.

There's a really nice summary of GPU programming entering the
mainstream at

http://www.computer.org/computer/homepage/1003/entertainment/

>And all this from IBM/Sony/Toshiba, three companies traditionally known
>for their openness and willingness to share with the public. I imagine
>Intel, AMD and Microsoft are quaking in their boots.
>

They couldn't possibly be less open than the graphics card
manufacturers have been.

RM


