Re: Optimization

From: Paul Hsieh (qed_at_pobox.com)
Date: 02/11/04


Date: 10 Feb 2004 22:05:41 -0800

Martin Eisenberg wrote:
> Paul Hsieh wrote:
> > So the computation, believe it or not, takes either 1 clock, or
> > 16 clocks depending on the success or failure of the branch
> > prediction. Assuming a 50% prediction rate this works out as (1
> > + 16)/2 = 8.5 clocks.
>
> Is the 50% assumption the best we can do without diving deep into the
> specifics of any particular call site?

No, 50% is in fact the statistically worst performance of the
predictor. I just picked it as an example. For the predictor to
perform well, the sequence of branch directions has to either follow a
short pattern or have a probabilistic bias (e.g., 90% taken versus
10% not taken will tend to be predicted fairly well regardless of the
pattern the branches come in).

Very often the branch is *very* predictable. For example, if you want
to find the minimum element of a very large array that's randomly
ordered, then the predictor will quickly lean very heavily toward
assuming that each successor is not the new minimum, and after roughly
(n / e) (where e = exp(1) = Napier's constant) elements on average the
predictor will lock correctly.
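
As a rough sketch of the minimum scan described above (the function
name and signature are hypothetical, not from the thread), the
following counts how often the "new minimum" branch is actually taken.
For a uniformly random ordering, the element at 1-based position i is
a new minimum with probability 1/i, so the branch is taken only on the
order of ln(n) times over the whole scan and "not taken" quickly
becomes the safe guess.

```c
#include <assert.h>
#include <stddef.h>

/* Scan for the minimum and count how often the "new minimum" branch
 * is taken. Hypothetical helper, written only to illustrate the scan. */
size_t count_min_updates(const int *a, size_t n, int *min_out)
{
    assert(a != NULL && min_out != NULL && n > 0);

    int min = a[0];
    size_t taken = 0;

    for (size_t i = 1; i < n; i++) {
        if (a[i] < min) {   /* the branch the predictor has to guess */
            min = a[i];
            taken++;
        }
    }

    *min_out = min;
    return taken;
}
```

Calling it on {5, 3, 8, 1, 9, 2} reports two taken branches (at 3 and
at 1) and a minimum of 1.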

In those cases you can just weight the two possibilities (well
predicted versus not) according to the probability of your branch.
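
The weighting can be sketched as a one-line expected value. The
function name is made up for illustration; the 1 and 16 clock figures
are the ones quoted earlier in this thread, and the prediction rate is
whatever you estimate for your own branch.

```c
#include <assert.h>

/* Expected cost of a branch that takes `hit` clocks when predicted
 * correctly and `miss` clocks otherwise. `p_predicted` is the assumed
 * fraction of correct predictions, in [0, 1]. */
double expected_clocks(double p_predicted, double hit, double miss)
{
    assert(p_predicted >= 0.0 && p_predicted <= 1.0);
    return p_predicted * hit + (1.0 - p_predicted) * miss;
}
```

With the numbers above, expected_clocks(0.5, 1.0, 16.0) gives the 8.5
clocks from the earlier example, while a 90% prediction rate would
work out to 0.9 * 1 + 0.1 * 16 = 2.5 clocks.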
 
> > Assuming a previous generation compiler (like MSVC):
>
> Is VC 7.x "previous generation" as well? I hear its optimizer is much
> improved over version 6.

Who did you "hear" this from? Microsoft marketing perhaps? Look,
Intel has *EMBARRASSED* Microsoft with its truly amazing compiler. MS
is also starting to feel pressure from gcc, which has also improved by
leaps and bounds in the past 5 years. I'm sure they have been working
on their compiler and have convinced themselves that it is the
greatest thing since sliced bread, but Intel has left them (and
everyone else) so far behind it's not funny. (Intel has spent a small
fortune in hiring the absolute best compiler creators in the industry.
 Remember that Intel doesn't rely on the revenues from their compiler
to stay afloat. So Microsoft cannot apply any kind of competitive
pressure to make Intel stop.)

I have not used VC 7.x personally, so I cannot say anything
authoritative about that compiler. But previous versions did not emit
cmovCC or any of the other post-Pentium instructions outside of inline
assembly.
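
For reference, here is a minimal sketch of the kind of source this
matters for. The function names are hypothetical. Whether a given
compiler turns the ?: form into cmp + cmovl instead of a conditional
jump depends on the compiler, its target flags, and its heuristics, so
treat the comments as what *may* happen, not a guarantee; check the
disassembly.

```c
#include <assert.h>

/* Branchy form: a compiler that does not emit cmovCC has to use a
 * compare followed by a conditional jump here. */
int min_branch(int a, int b)
{
    if (a < b)
        return a;
    return b;
}

/* Select form: a compiler targeting P6/Athlon or later MAY emit
 * cmp + cmovl for this, avoiding the branch entirely. */
int min_select(int a, int b)
{
    return (a < b) ? a : b;
}

/* Both forms must agree; only the generated code may differ. */
void check_min_forms(int a, int b)
{
    assert(min_branch(a, b) == min_select(a, b));
}
```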
 
> > The P6/Athlon CPUs support conditional move instructions like
> > "cmovl" which will directly translate flag results to a kind of
> > ?: operation.
>
> Ah, so I've actually misremembered my processor's age. I don't see a
> full instruction manual at AMD's documentation site,

Oh they've got one somewhere around there.

> [...] but I dare infer
> from your comment in conjunction with Intel's reference that the
> Athlon also has FP conditional moves.

They have FCOMI, but I don't remember about other new FP instructions.
 AMD has some kind of capabilities bits program and associated
documentation somewhere that you can use to test for their presence.
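
A hedged sketch of such a test, assuming the "capabilities bits" are
the standard CPUID feature flags and that you are building with GCC or
clang on x86 (it uses the <cpuid.h> header those compilers ship). The
function names are made up. In CPUID leaf 1, EDX bit 15 reports CMOV
and EDX bit 0 reports the x87 FPU; FCOMI and FCMOVcc are available
when both are set.

```c
#include <assert.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define CAN_QUERY_CPUID 1
#else
#define CAN_QUERY_CPUID 0
#endif

/* Returns bit `bit` of EDX from CPUID leaf 1, or -1 if CPUID cannot
 * be queried in this build (other compiler or non-x86 target). */
static int cpuid_leaf1_edx_bit(unsigned int bit)
{
    assert(bit < 32);
#if CAN_QUERY_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return (int)((edx >> bit) & 1u);
#else
    return -1;
#endif
}

/* 1 if CMOVcc is reported, 0 if not, -1 if unknown. */
int cpu_has_cmov(void)
{
    return cpuid_leaf1_edx_bit(15);
}

/* 1 if FCOMI/FCMOVcc are reported (FPU and CMOV bits both set),
 * 0 if not, -1 if unknown. */
int cpu_has_fcomi(void)
{
    int fpu = cpuid_leaf1_edx_bit(0);
    int cmov = cpuid_leaf1_edx_bit(15);
    if (fpu < 0 || cmov < 0)
        return -1;
    return fpu && cmov;
}
```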
 
> > When you are in the floating point world, and using a processor
> > like the Athlon which takes time to communication between the
> > integer and FP parts of the CPU, then the situation is just a
> > little more murky.
>
> I guess that's "a little" as in the "little bit" you know about
> optimization, the extent of which your quite interesting site
> reveals ;) By the way, how do FCOMI and relatives impact that
> situation?
 
An AMD insider informed me that they put some significant work into
FCOMI and that it's supposed to be fairly fast. From looking at the
disassembly that you showed in another post it looks like one of your
solutions uses such instructions to avoid the transition to the
integer-side of the CPU entirely. If that's the case, then that would
be yet another case that would need to be looked at.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
