Re: optimation, a black art?



On 14 Lug, 09:00, George Neuner <gneuner2/@/comcast.net> wrote:
On Sat, 12 Jul 2008 01:06:32 -0700 (PDT), Vend <ven...@xxxxxxxxxxx>
wrote:



On 12 Lug, 06:41, George Neuner <gneuner2/@/comcast.net> wrote:
On Fri, 11 Jul 2008 10:57:52 -0700,
jaycx2.3.calrob...@xxxxxxxxxxxxxxxxxxxxxx (Robert Maas,

http://tinyurl.com/uh3t) wrote:
I've never done (at a practical level) any of the kinds of analysis
you describe below, but I believe I have a basic understanding of
the logic involved, so I'll offer some guesses:
From: George Neuner <gneuner2/@/comcast.net>
I've done a lot of low level optimization for real time software.
Optimization at the assembler level has certainly gotten harder.
Most architectures today have multiple pipelines and fast running
instructions may be completely subsumed by slower running ones.

So presumably for each block of code, such as the body of a
subroutine or a clause of a conditional or loop, you have a matrix
showing which operations need to be completed before which other
operations can start, the essential nature of a PERT chart, right?

Only partially. Trace [*] scheduling (also called Basic Block
scheduling or List scheduling) is analogous to PERTing - the problem
for the compiler is figuring out what instructions to include in each
of the traces. Remember that the register set is finite and a fetch
from memory may take 50..10,000 times longer than fetching from a
register. A fetch of a value not in a register should be placed far
enough in advance of the value's use so as not to stall the pipeline.
To do this the compiler has to estimate where the value lies in the
memory/cache hierarchy.

The end result is that what appears to be a simple logical instruction
sequence may really end up as many small sequences separated by
dozens, or even hundreds, of other instructions which are competing
for registers, functional units and memory bandwidth.

[*] A trace or basic block is a list of instructions to be executed
sequentially, with no internal branches and, at most, one branch or
jump at the end.

Many chip manufacturers no longer document the cycle count for
instructions because it is variable depending on the sequencing,
so peephole optimization becomes a cycle of trial and error.

Computers are so fast nowadays that I imagine it's trivial, given
the dependency matrix, to generate all permutations of the
operations that are consistent with that matrix, then for each such
permutation generate the actual sequence of instructions following
that permutation in a block of memory, then for each such block
make a million copies and put a system-clock-read before the first
copy and after the last copy, then execute each such mega-block of
code and collect the clock-time-difference, repeat ten or twenty
times to get a good statistical sample, compare the statistics from
those various permutations, and report which sequences vie for
fastest, and the whole analysis and report generation can be
completed within a few seconds, with no operator intervention
needed from start to finish, right? So from your viewpoint, you
edit the matrix, click the permute-analyze button, wait a few
seconds as the progress bar scrolls from 0% to 100%, and up comes
the report showing you which permutation was selected and how it
compares with the other, and most of the time you don't even look
at the details of the report, you just click on the OK button to
automatically copy that permutation to your peephole optimization
database, and you can check off one more micro-algorithm-pattern
that has been optimized, right?

No it's not trivial. Depending on the instruction set, for any
particular high level construct, there may be several templates for
implementing it. When high level constructs are combined, the
combinatorial explosion quickly makes trying them all impractical.
Add a finite register set and memory fetch sequencing and the number
of interacting traces quickly becomes unmanageable. Add cache
pollution from a multitasking OS and the problem becomes NP hard - ie.
essentially unsolvable.

Scheduling for a modern, multiple issue processor is, in fact, so
complicated that compilers usually cannot do an optimal job. Almost
all modern CPUs support out-of-order execution - the CPU scans ahead
in the instruction stream looking for things to do to fill latency
holes left by the compiler. The hardware figures out what can be done
and when based on the actual state of the pipelines at the time -
something the compiler can only estimate and then only for sequential
programming - not for multitasking.

I suggest you read up on modern chip architectures and get yourself a
good book on compilers.

Yet the IA-64 architecture requires the compiler to explicitely
schedule the instructions, IIRC.

All chips require some scheduling - you can't just throw in a bag of
unordered instructions and expect a program to emerge.

Yes. The point is that most modern superscalar processors expose an
interface of a sequential processor and internally schedule the
sequential instructions to multiple parallel execution units. Most of
the complications come form the need to ensure that the semantics of
the instruction sequence is preserved.

The IA-64 architecture does kind of the contrary: it exposes a fairly
large number of logical parallel execution units, which may be greater
than the number of physical execution units, and lets the compiler
explicit schedule instructions for these logical units and take care
of any dependancy constraints.

The point is that what the compiler does amounts to a "suggested"
scheduling - the actual final execution order is left up to the chip
and may be different for each run of the program.

Most of the complication in modern x86 processors was due to the
binary backwards compatibility constraint: they expose a CISC
instruction set implemented on a RISC-like internal architecture.

RISC is no less problematic - the MIPS architecture almost failed
initially because it depended too much on the compiler for scheduling
and did not include pipeline interlocks to delay operations when
operands were not ready. Fortunately the designers realized their
mistake and included interlocking pipelines in subsequent generations
of the chip.

If I remember correctly it still requires NOPs to fill the delays in
jumps, doesn't it?

Intel made a similar mistake with the i860. The i860 was VLIW with 3
pipelines and depended entirely on the compiler to find and pack
together non-conflicting instructions to be issued simultaneously.
The i860 was a failure. The i960 was an integer-only version which
found a niche in embedded processing - removing the FPU pipeline made
it much easier to program than the i860.

Multiple issue RISCs like the m88K and PPC are quite difficult to
generate high quality code for. They are somewhat easier than the x86
to generate adequate code for ... but when highly optimized code is
needed, all platforms are a bitch and the differences are only in the
degree.

An internal FPGA exposed to the programmer would be the optimal
solution maybe.

.


Quantcast