Re: Opteron versus P4

From: Dennis (marianndkc_at_home3.gvdnet.dk)
Date: 04/22/04

  • Next message: Dennis: "Re: Opteron versus P4"
    Date: Thu, 22 Apr 2004 08:58:14 +0200
    
    

    Hi Cleber

    We can take two approaches to investigate matters. First, we can trust the
    official AMD and Intel documentation. I believe this is OK because I have
    never found any errors in it.

    I wrote in an earlier post that the P4 has 4 FP execution units + 1 FP Move.
    This is shown on page 46 and described on page 443 of

    ftp://download.intel.com/design/Pentium4/manuals/24896610.pdf

    Athlon 64 has 3 execution units for FP instructions, called FADD, FMUL and
    FMISC. See page 274 of

    http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

    This answers the following question:

    >For me, Athlon has a three-way fully pipelined FPU. Paul said that P4 has
    > only one pipeline for FPU. What's wrong?

    Paul is wrong regarding the P4. The reason is probably that the Intel
    drawing on page 46 shows only one FP execution unit. He is not the first
    one to be misled by this.

    The second approach is to write dedicated code to investigate matters. We
    have partly done this. The big question was: "Does the Athlon have 3 full FP
    units running in parallel?" Many believed this. If it were so, we would
    measure that this CPU could execute three FADD instructions in parallel,
    but it cannot. It has a throughput of 1 FADD per cycle, which means that
    there is one pipeline for FADD. This is the same for P3, PM, P4 and
    Opteron. We can repeat this for FSUB, FMUL and FDIV. The next question is
    whether these pipelines are fully independent. If, for example, FSUB and
    FADD share a pipe, then we will measure a throughput of 1 per cycle on code
    that blends these instructions. If there are two pipes, we will measure a
    throughput of 2 per cycle. This throughput could, however, be limited by
    port bandwidth (or by decode/trace cache bandwidth, scheduler bandwidth or
    reorder bandwidth). I will write some more code and release it later today.

    Real-world code contains some amount of parallelism, and the CPU core's job
    is to extract it. If there are a lot of FADD instructions and they depend
    on data from previous instructions, then performance will be limited by
    latency. If there are enough independent instructions to fill the
    pipeline, then performance will be limited by throughput. Fully pipelined
    means that throughput is one instruction per cycle. FADD is fully pipelined
    on all modern processors, so highly parallel code will run equally well,
    clock for clock, on all these processors. The only thing that then
    differentiates performance is clock speed, and the P4 is the winner. It is,
    however, more common that code is not very parallel. Here latency matters,
    and the P4 is a clear loser clock for clock and does not always have enough
    extra clock speed to keep up.
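
    The difference between latency-bound and throughput-bound code can be
    sketched with a small Python model. This is a simplification under stated
    assumptions: a single fully pipelined FADD unit, no other bottlenecks, and
    a hypothetical helper name chosen for this example.

```python
def cycles_for_fadds(n, latency, dependent):
    """Cycles to complete n FADDs on one fully pipelined unit.

    dependent=True  -- each FADD needs the previous result, so the next one
                       cannot start until the previous one has finished.
    dependent=False -- all FADDs are independent, so one starts every cycle
                       and the last one finishes `latency` cycles after it
                       was started.
    """
    if n == 0:
        return 0
    if dependent:
        return n * latency
    return latency + (n - 1)


# Example with a 5-cycle FADD latency and 1000 instructions.
serial = cycles_for_fadds(1000, latency=5, dependent=True)
parallel = cycles_for_fadds(1000, latency=5, dependent=False)
print(serial, parallel)
```

    In this model the dependent chain costs the full latency for every
    instruction, while the independent stream approaches one instruction per
    cycle regardless of how deep the pipeline is.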

    The FADD pipeline is 3 stages in P3, 5 stages in Northwood, 6 in Prescott
    and 4 in Opteron. It is fully pipelined on all of them. P3 comes at
    1400 MHz, Northwood at 3400 MHz, Prescott at 3400 MHz and Opteron at
    2400 MHz. For fully parallel code Prescott and Northwood are the winners,
    followed by Opteron and P3.
    For serial code you can calculate the number of instructions per second
    (in millions, with the clock rate in MHz) as the clock rate divided by the
    number of cycles it takes to execute one instruction.

    Northwood:  3400 / 5 = 680
    Opteron:    2400 / 4 = 600
    Prescott:   3400 / 6 = 567
    P3:         1400 / 3 = 467

    P4 Northwood is the winner, followed by Opteron, Prescott and P3. Intel has
    chosen to keep the numbers for Pentium M secret.
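
    The figures above can be reproduced with a few lines of Python. The clock
    rates and FADD latencies are the ones quoted in this post; the variable
    and function names are only illustrative.

```python
# (clock rate in MHz, FADD latency in cycles), as quoted in the post
cpus = {
    "P3": (1400, 3),
    "Northwood": (3400, 5),
    "Prescott": (3400, 6),
    "Opteron": (2400, 4),
}


def serial_rate(mhz, latency):
    """Millions of dependent FADDs per second: one result every `latency` cycles."""
    return mhz / latency


def parallel_rate(mhz, latency):
    """Millions of independent FADDs per second: fully pipelined, one per cycle."""
    return float(mhz)


serial_ranking = sorted(cpus, key=lambda name: serial_rate(*cpus[name]), reverse=True)
for name in serial_ranking:
    mhz, latency = cpus[name]
    print(f"{name:10s} {mhz} / {latency} = {serial_rate(mhz, latency):.0f}")
```

    For fully parallel code the same table sorted by parallel_rate puts
    Northwood and Prescott level at the top, because the pipeline depth drops
    out of the calculation.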

    Regards
    Dennis

