Re: optimation, a black art?
- From: George Neuner <gneuner2/@/comcast.net>
- Date: Tue, 15 Jul 2008 01:53:24 -0400
On Mon, 14 Jul 2008 03:20:30 -0700 (PDT), Vend <vend82@xxxxxxxxxxx>
wrote:

> On 14 Jul, 09:00, George Neuner <gneuner2/@/comcast.net> wrote:
>> On Sat, 12 Jul 2008 01:06:32 -0700 (PDT), Vend <ven...@xxxxxxxxxxx>
>> wrote:
>>
>>> Most of the complication in modern x86 processors was due to the
>>> binary backwards compatibility constraint: they expose a CISC
>>> instruction set implemented on a RISC-like internal architecture.
>>
>> RISC is no less problematic - the MIPS architecture almost failed
>> initially because it depended too much on the compiler for scheduling
>> and did not include pipeline interlocks to delay operations when
>> operands were not ready. Fortunately the designers realized their
>> mistake and included interlocking pipelines in subsequent generations
>> of the chip.
>
> If I remember correctly it still requires NOPs to fill the delays in
> jumps, doesn't it?

AFAIK, no RISC ever _required_ NOPs, but it was left up to the
compiler to find useful instructions that could be executed in the
delay slot.
The real problem was that many early designs executed the delay slot
instructions regardless of whether the branch was taken. Finding
useful instructions that would be valid in both cases proved to be
very difficult in general. RISC designers responded by introducing
speculative execution and by allowing compilers to tag delay slot
instructions as being conditional or not. Conditional instructions
would be executed but their results would be held until the branch
target was resolved and thrown away if the branch went against them.
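
To make the delay-slot behaviour above concrete, here is a minimal Python sketch of a toy model. It is only an illustration under stated assumptions: the names `DelaySlot`, `run_branch` and `commit_if_taken` are hypothetical, and the model does not correspond to any particular RISC instruction set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Regs = Dict[str, int]

@dataclass
class DelaySlot:
    dest: str                        # register written by the slot instruction
    compute: Callable[[Regs], int]   # value the instruction would produce
    # None  = untagged slot, result always kept (the early designs)
    # True  = keep the result only if the branch is taken
    # False = keep the result only if the branch is not taken
    commit_if_taken: Optional[bool]

def run_branch(regs: Regs, taken: bool, slot: DelaySlot) -> Regs:
    """Execute one branch plus its delay slot on a copy of `regs`."""
    out = dict(regs)
    # The slot instruction always executes; its result is held aside.
    pending = slot.compute(out)
    # Commit if the slot is untagged, or if the branch went the way
    # the compiler tagged it for. Otherwise the result is thrown away.
    if slot.commit_if_taken is None or slot.commit_if_taken == taken:
        out[slot.dest] = pending
    return out

regs = {"r1": 10, "r2": 0}
always = DelaySlot("r2", lambda r: r["r1"] + 1, None)
tagged = DelaySlot("r2", lambda r: r["r1"] + 1, True)

print(run_branch(regs, taken=False, slot=always))  # slot result kept
print(run_branch(regs, taken=False, slot=tagged))  # slot result discarded
print(run_branch(regs, taken=True, slot=tagged))   # slot result committed
```

With an untagged slot the compiler has to find an instruction that is harmless on both paths; with a tagged slot it only needs one that is useful on the predicted path.
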
Intel made a similar mistake with the i860. The i860 was VLIW with 3
pipelines and depended entirely on the compiler to find and pack
together non-conflicting instructions to be issued simultaneously.
The i860 was a failure. The i960 was an integer-only version which
found a niche in embedded processing - removing the FPU pipeline made
it much easier to program than the i860.
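
As a rough illustration of what "find and pack together non-conflicting instructions" asks of a compiler, here is a small Python sketch of greedy in-order packing. Everything in it is an assumption made for the example: the `Instr` fields, the unit names ("int", "fadd", "fmul") and the conflict rules are hypothetical and are not the i860's actual pipelines or issue rules.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Instr:
    name: str
    unit: str                # hypothetical functional unit
    dest: str                # register written
    srcs: Tuple[str, ...]    # registers read

def conflicts(a: Instr, b: Instr) -> bool:
    """True if b (later in program order) cannot issue alongside a.

    Assumes every instruction in a bundle reads its operands at issue,
    so write-after-read inside a bundle is treated as harmless.
    """
    return (
        a.unit == b.unit        # structural hazard: same pipeline
        or a.dest in b.srcs     # read after write
        or a.dest == b.dest     # write after write
    )

def pack(program: List[Instr]) -> List[List[Instr]]:
    """Greedy in-order packing: extend the current bundle until a conflict."""
    bundles: List[List[Instr]] = []
    for ins in program:
        if bundles and not any(conflicts(prev, ins) for prev in bundles[-1]):
            bundles[-1].append(ins)
        else:
            bundles.append([ins])
    return bundles

program = [
    Instr("add",  "int",  "r1", ("r2", "r3")),
    Instr("fmul", "fmul", "f1", ("f2", "f3")),
    Instr("fadd", "fadd", "f4", ("f1", "f5")),   # needs f1 from fmul
    Instr("sub",  "int",  "r4", ("r1", "r2")),
]
for i, bundle in enumerate(pack(program)):
    print(i, [ins.name for ins in bundle])
```

This greedy pass never reorders instructions, so it leaves slots empty that a real scheduler would try to fill by moving independent work earlier. That reordering is where most of the difficulty lies.
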
Multiple issue RISCs like the m88K and PPC are quite difficult to
generate high quality code for. They are somewhat easier than the x86
to generate adequate code for ... but when highly optimized code is
needed, all platforms are a bitch and the differences are only in the
degree.
> An internal FPGA exposed to the programmer would be the optimal
> solution maybe.

Long term that might be a good suggestion, but IMO in the short term
it would probably make things worse because few programmers can work
effectively in VHDL, and current compilers from high level languages
into VHDL, or directly to configuration binaries, are still pretty bad
efficiency-wise compared to a hand design.
Years ago I worked on a compiler for a board level processor that
consisted of a DSP with attached FPGAs. Knowing that compiling HL
code into VHDL was a losing proposition, we took a different approach.
We provided direct support for parallel array, matrix and convolution
ops (1D, 2D and 3D) in the compiler and implemented them with a library
of hand optimized configurations. The programmer hardly needed to be
aware of the FPGAs - the compiler constructed configuration parameter
blocks and scheduled execution. The FPGAs were interrupt devices so
the DSP program continued to run in parallel (we used the DSP for
control, serial integer code, and some floating point support because
the FPGAs at the time were too small for complicated FP codes). FPGA
ops could be executed single step with return to the DSP code each
time, or could be chained together for serial execution. In either
case, the compiler handled almost all the details behind the scenes.
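
The general shape of that scheme (parameter blocks, a library of prebuilt configurations, single-step versus chained execution) can be sketched in Python as below. This is purely a hypothetical illustration and not the original system, whose particulars are confidential: `ParamBlock`, `LIBRARY`, `run_single` and `run_chain` are invented names, plain functions stand in for FPGA configurations, and a callback stands in for the completion interrupt. On the real hardware the ops ran asynchronously alongside the DSP code; here everything is synchronous for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Buffer = List[float]

@dataclass
class ParamBlock:
    config: str                  # which prebuilt configuration to use
    src: Buffer                  # input buffer
    dst: Buffer                  # output buffer (overwritten)
    args: Sequence[float] = ()   # op-specific parameters

# Stand-ins for hand-optimized configurations.
def _scale(pb: ParamBlock) -> None:
    k = pb.args[0]
    pb.dst[:] = [k * x for x in pb.src]

def _conv1d(pb: ParamBlock) -> None:
    # "Valid" 1D convolution: flip the kernel, then slide a dot product.
    kernel = list(pb.args)[::-1]
    n = len(pb.src) - len(kernel) + 1
    pb.dst[:] = [sum(kernel[j] * pb.src[i + j] for j in range(len(kernel)))
                 for i in range(n)]

LIBRARY: Dict[str, Callable[[ParamBlock], None]] = {
    "scale": _scale,
    "conv1d": _conv1d,
}

def run_single(pb: ParamBlock, on_done: Callable[[ParamBlock], None]) -> None:
    """Single step: run one op, then hand control back via the callback."""
    LIBRARY[pb.config](pb)
    on_done(pb)

def run_chain(chain: List[ParamBlock],
              on_done: Callable[[ParamBlock], None]) -> None:
    """Chained: run the ops back to back, notify once at the end."""
    for pb in chain:
        LIBRARY[pb.config](pb)
    on_done(chain[-1])

a = [1.0, 2.0, 3.0, 4.0]
b: Buffer = []
c: Buffer = []
events: List[str] = []
chain = [ParamBlock("scale", a, b, (2.0,)),
         ParamBlock("conv1d", b, c, (1.0, 1.0))]
run_chain(chain, lambda pb: events.append(pb.config))
print(c, events)
```

In this sketch the compiler's job would be to emit the `ParamBlock` list and the `run_chain` call from ordinary array expressions, so the programmer never writes either by hand.
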
There was only one issue for chained operation that we could not make
transparent and that was where the data buffers were allocated - all
buffers for an operation had to be in separate memory banks for
maximum speed and sometimes the programmer had to intervene to achieve
this. We probably could have solved that, but we never got around to
it.
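
One way to view the bank constraint is as graph colouring: buffers are nodes, and two buffers conflict when the same operation uses both. The Python sketch below is a hypothetical illustration of that view only, not the approach the original compiler took; `assign_banks` and the buffer names are invented for the example.

```python
from typing import Dict, Sequence, Set

def assign_banks(ops: Sequence[Sequence[str]], num_banks: int) -> Dict[str, int]:
    """Give each buffer a bank so that no op has two buffers in one bank.

    Greedy colouring in order of first use. Raises ValueError when the
    greedy pass runs out of banks; that is the point at which someone
    would have to rearrange the buffers by hand.
    """
    conflicts: Dict[str, Set[str]] = {}
    for bufs in ops:
        for buf in bufs:
            conflicts.setdefault(buf, set()).update(x for x in bufs if x != buf)

    banks: Dict[str, int] = {}
    for buf in conflicts:   # dicts keep insertion order, i.e. first use
        used = {banks[n] for n in conflicts[buf] if n in banks}
        free = [k for k in range(num_banks) if k not in used]
        if not free:
            raise ValueError(f"no free bank for buffer {buf!r}")
        banks[buf] = free[0]
    return banks

# Each tuple lists the buffers one chained operation touches.
chain = [("a", "b"), ("b", "k", "c"), ("c", "d")]
print(assign_banks(chain, num_banks=3))
```

A greedy pass like this can fail even when a valid assignment exists, which is one reason the problem is harder to make fully transparent than it first looks.
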
The project was canceled when MMX Pentiums became fast enough to do
real time vision applications with OTS hardware (around 2001). Our
DSP/FPGA board was 10..50 times faster than Pentiums of that era, but
most of our applications didn't need such blinding speed and the
customers would not pay a premium for custom hardware that would easily
outpace their requirements. So the whole thing died.
AFAIK, only one other project has taken a similar approach - a
university effort from Stanford. But IMNSHO, our system was better -
not to mention 10 years earlier and commercial.
George
[btw: I'm permitted to discuss general aspects of the system software
and compiler, but many particulars remain confidential and I have no
code available.]
--
for email reply remove "/" from address