Prescott Disappoints

From: Matt Taylor (para_at_tampabay.rr.com)
Date: 02/17/04


Date: Tue, 17 Feb 2004 07:07:35 +0000 (UTC)

We all know that Prescott got a shifter and multiplier, and there were a
number of overhauls to the core. I was poking through the latest Intel
Optimization Manuals, and I am absolutely shocked. Here are a number of
downsides:

1. Multiplies still suck. They were 14-18 cycles before; now they're 10
cycles, and in some cases they aren't any faster. Since shifts are 1 cycle
now, you might as well emulate most constant multiplies.
2. No more double-speed ALU ops. I can't get a straight answer here, but all
of the ALU ops went from 0.5 clk latency/throughput to 1 clk. This means
that one of the P4's key advantages was cut. I don't know whether the ALUs
are still running double-speed, though I have heard it claimed that they
aren't.
3. New horizontal add instructions are useless. With a throughput of 4
cycles and latency of 13, it is faster to use shuffle + add.
4. Increased latencies of floating-point instructions. All adds/multiplies
are now 1 cycle (20%) slower. Likewise some of the mux instructions
(minps/maxps) and the conversion instructions are a cycle slower.
5. Slower L1 dcache (50%!)
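
To illustrate point 1, here is a minimal sketch in C of emulating constant
multiplies with shifts and adds. The function names (mul10, mul7, mul100,
check_shift_mul) are made up for this example and are not taken from the
manual. Whether a given sequence actually beats the multiplier depends on how
many shifts and adds it needs compared with the roughly 10-cycle multiply
quoted above, and a compiler may already make this substitution on its own.

```c
#include <assert.h>
#include <stdint.h>

/* x * 10 = x*8 + x*2: two shifts and one add. */
static uint32_t mul10(uint32_t x) { return (x << 3) + (x << 1); }

/* x * 7 = x*8 - x: one shift and one subtract. */
static uint32_t mul7(uint32_t x) { return (x << 3) - x; }

/* x * 100 = x*64 + x*32 + x*4: three shifts and two adds. */
static uint32_t mul100(uint32_t x) { return (x << 6) + (x << 5) + (x << 2); }

/* Hypothetical self-check: compare each shift/add sequence against the
   plain multiply. Unsigned arithmetic wraps modulo 2^32 in both forms,
   so the results agree even when the product overflows. */
static void check_shift_mul(void)
{
    uint32_t samples[] = { 0u, 1u, 3u, 12345u, 0x7FFFFFFFu, 0xFFFFFFFFu };
    unsigned i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t x = samples[i];
        assert(mul10(x)  == x * 10u);
        assert(mul7(x)   == x * 7u);
        assert(mul100(x) == x * 100u);
    }
}
```

Calling check_shift_mul() from a test harness confirms that the sequences
match the ordinary multiply. Constants with many set bits need longer
sequences, which is why this only pays off for "most" constants, not all.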

What advantages does Prescott have? Well, they made a few less-common
instructions encode in the u-op cache, and shifts/rotates are now 1 cycle.
All of the advantages are minor tweaks; it seems to me they broke more
important stuff in the process.

-Matt


