Re: Opteron versus P4
From: Dennis (marianndkc_at_home3.gvdnet.dk)
Date: 04/22/04
- Previous message: Cleber: "Re: Opteron versus P4"
- In reply to: Cleber: "Re: Opteron versus P4"
- Next in thread: Dennis: "Re: Opteron versus P4"
- Reply: Dennis: "Re: Opteron versus P4"
- Reply: Cleber: "Re: Opteron versus P4"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Thu, 22 Apr 2004 08:58:14 +0200
Hi Cleber
We can take two approaches to investigate matters. First, we can trust the
official AMD and Intel documentation. I believe this is OK because I have
never found any errors in it.
I wrote in an earlier post that the P4 has 4 FP execution units + 1 FP Move.
This is documented on pages 46 and 443 of
ftp://download.intel.com/design/Pentium4/manuals/24896610.pdf
Athlon 64 has 3 execution units for FP instructions, called FADD, FMUL and
FMISC. See page 274 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
This answers the question:
>For me, Athlon has a three-way fully pipelined FPU. Paul said that P4 has
> only one pipeline for FPU. What's wrong?
Paul is wrong regarding the P4. The reason is probably that the Intel
drawing on page 46 shows only one FP execution unit. He is not the first one
to be misled by this.
The second approach is to write dedicated code to investigate matters. We
have partly done this. The big question was: "Does the Athlon have 3 full FP
units running in parallel?" Many believed this. If it were so, we would
measure that this CPU could execute three FADD instructions in parallel, but
it cannot. It has a throughput of 1 FADD per cycle, which means that there is
one pipeline for FADD. This is the same for P3, PM, P4 and Opteron. We can
repeat this for FSUB, FMUL and FDIV. The next question is whether these
pipelines are fully independent. If, for example, FSUB and FADD share a pipe,
then we will measure a throughput of 1 per cycle on code that blends these
instructions. If there are two pipes, we will measure a throughput of 2 per
cycle. This throughput could, however, be limited by port bandwidth (or by
decode/trace cache bandwidth, scheduler bandwidth or reorder bandwidth). I
will write some more code and release it later today.
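
To make the shared-pipe test concrete, here is a toy model in Python. It is
not the measurement code mentioned above, only a sketch of the reasoning. It
assumes all instructions are independent, that each pipe accepts at most one
instruction per cycle, and that nothing else (ports, decode, scheduler) limits
issue. The pipe names and the cycles_to_issue helper are made up for
illustration.

```python
def cycles_to_issue(instructions, pipe_of):
    """Cycles needed to issue a stream of independent instructions when
    each pipe accepts at most one instruction per cycle (toy model)."""
    pending = list(instructions)
    cycles = 0
    while pending:
        busy = set()          # pipes already used in this cycle
        remaining = []        # instructions that must wait for a later cycle
        for op in pending:
            pipe = pipe_of[op]
            if pipe in busy:
                remaining.append(op)
            else:
                busy.add(pipe)
        pending = remaining
        cycles += 1
    return cycles


# A blend of FADD and FSUB, as in the proposed test.
stream = ["FADD", "FSUB"] * 100

# Hypothetical pipe assignments: one shared pipe versus two separate pipes.
shared = {"FADD": "pipe0", "FSUB": "pipe0"}
separate = {"FADD": "pipe0", "FSUB": "pipe1"}

shared_cycles = cycles_to_issue(stream, shared)
separate_cycles = cycles_to_issue(stream, separate)

print("shared pipe:    ", len(stream) / shared_cycles, "per cycle")
print("separate pipes: ", len(stream) / separate_cycles, "per cycle")
```

In this model the shared assignment gives 1 instruction per cycle and the
separate assignment gives 2 per cycle, which is the difference the real
measurement would look for.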
Real-world code will contain some amount of parallelism, and the CPU core's
job is to extract it. If there are a lot of FADD instructions and they depend
on data from previous instructions, then performance will be limited by
latency. If there are enough independent instructions to fill the pipeline,
then performance will be limited by throughput. Fully pipelined means that a
new instruction can start every cycle, i.e. a throughput of 1 per cycle. FADD
is fully pipelined on all modern processors, so highly parallel code will run
equally well per clock on all these processors. The only thing that then
differentiates performance is clock speed, and the P4 is the winner. It is,
however, more common for code not to be very parallel. There latency matters,
and the P4 is a clear loser clock for clock and does not always have enough
extra clock speed to keep up.
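
The latency-bound versus throughput-bound distinction can be sketched with a
small Python model. It assumes a single fully pipelined FADD unit that issues
in order, one instruction per cycle at most, and FADDs dealt round-robin over
a number of independent dependency chains. The function name and the chain
counts are my own choices for illustration; only the latency of 5 cycles
(Northwood FADD) comes from this post.

```python
def fadd_cycles(n_instructions, n_chains, latency):
    """Cycles until the last of n FADDs completes on one fully pipelined
    unit, with the FADDs spread round-robin over independent chains."""
    ready_at = [0] * n_chains   # cycle at which each chain's last result is ready
    next_slot = 0               # next free issue slot of the FADD pipe
    finish = 0
    for i in range(n_instructions):
        chain = i % n_chains
        # An FADD issues when the pipe is free AND its input is ready.
        issue = max(next_slot, ready_at[chain])
        ready_at[chain] = issue + latency
        next_slot = issue + 1
        finish = ready_at[chain]
    return finish


latency = 5  # Northwood FADD latency quoted in this post
serial_cycles = fadd_cycles(1000, 1, latency)     # one long dependency chain
parallel_cycles = fadd_cycles(1000, 5, latency)   # enough chains to fill the pipe

print("serial:  ", serial_cycles, "cycles")
print("parallel:", parallel_cycles, "cycles")
```

With one chain the model needs about latency cycles per FADD. With at least
as many independent chains as pipeline stages it needs about one cycle per
FADD, regardless of the latency.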
The FADD pipeline is 3 stages in P3, 5 stages in Northwood, 6 in Prescott
and 4 in Opteron. It is fully pipelined on all of them. P3 comes at 1400 MHz,
Northwood at 3400 MHz, Prescott at 3400 MHz and Opteron at 2400 MHz. For
fully parallel code Prescott and Northwood are the winners, followed by
Opteron and P3.
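For serial code you can calculate the number of instructions per second as
the clock rate divided by the number of cycles it takes to execute one
instruction. With the clock in MHz, the result is in millions of dependent
FADD instructions per second:
Northwood: 3400 / 5 = 680
Opteron:   2400 / 4 = 600
Prescott:  3400 / 6 = 567
P3:        1400 / 3 = 467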
P4 Northwood is the winner, followed by Opteron, Prescott and P3. Intel has
chosen to keep the numbers for Pentium M secret.
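
The serial-code calculation above can be reproduced with a few lines of
Python. The clock rates and FADD latencies are the ones quoted in this post;
the variable and function names are only for illustration.

```python
# (clock in MHz, FADD latency in cycles), as quoted in this post
cpus = {
    "P3": (1400, 3),
    "Northwood": (3400, 5),
    "Prescott": (3400, 6),
    "Opteron": (2400, 4),
}


def serial_rate(clock_mhz, latency_cycles):
    """Millions of dependent FADDs per second for fully serial code."""
    return clock_mhz / latency_cycles


rates = {name: serial_rate(clock, lat) for name, (clock, lat) in cpus.items()}

# Fastest first.
ranking = sorted(rates, key=rates.get, reverse=True)

for name in ranking:
    print(f"{name:10s} {rates[name]:6.0f}")
```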
Regards
Dennis