Re: AMD CodeAnalyst MASM only?

From: Matt Taylor (para_at_tampabay.rr.com)
Date: 03/12/04


Date: Fri, 12 Mar 2004 07:45:43 +0000 (UTC)


"marg" <mapierSpamMustDie@juno.com.net> wrote in message
news:opr4nb83qdjpv40k@news.qwest.net...
> Thanks Matt! I think you saved me a bunch of pointless dinking around.
> So the dependency chain gives you a worst-case to compare against? Thats
> cool.
>
> Also I think I've been timing things wrong because I havent put the CPUIDs
> in there.
>
> I want to try to make an animated model in OpenGL of the pipeline, if it
> turns out the rules are simple enough. Cause it seems like by seeing it
> operate, a person could understand the issues instantly.
<snip>

The rules aren't that simple, unfortunately. Gantt charts are the usual tool
for this sort of thing, but the analysis is done by hand because a processor
pipeline is a very complex thing to simulate.

When writing optimized code, it is helpful to document cycles in the margin.
For instance, the original Pentium had 2 pipelines (U & V) and a few rules
that limited when instructions could dispatch together. Here's an example of
Pentium-optimized code right out of the C runtime for Microsoft C++
(memcpy.asm):

        mov     al,[esi+3]      ;U - get first byte from source
                                ;V - spare
        mov     [edi+3],al      ;U - put first byte into destination
        mov     al,[esi+2]      ;V - get second byte from source
        mov     [edi+2],al      ;U - put second byte into destination
        mov     eax,[dst]       ;V - return pointer to destination

The rules of the game have changed somewhat since those days. Modern CPUs
can execute instructions out of order, so it is harder to predict when an
instruction will execute: that depends on what was executing beforehand.

Usually I'm lazy with my Athlon-optimized code and just group instructions
together by which cycle they retire in. To be thorough, I would document the
unit(s) used, decode cycle, execute cycle, and retire/writeback cycle.
Here's an example, with the margin giving unit(s) and then
decode/execute/retire cycles:

 mov eax, [edx]         ; LS         0/ 0/ 3
 mov ebx, [edx+4]       ; LS+AGU     0/ 1/ 3
 mov ecx, [edx+8]       ; LS+AGU     0/ 1/ 4
 or  eax, [edx]         ; LS+ALU     1/ 2/ 5
 add eax, ebx           ; ALU        1/ 5/ 6
 add eax, ecx           ; ALU        1/ 6/ 7
 xor ebx, ecx           ; ALU        2/ 4/ 5
 lea eax, [eax+ebx]     ; ALU+AGU    2/ 7/ 9
 lea ecx, [ecx*2+4]     ; ALU+AGU    2/ 4/ 6
 mul eax                ; MUL        3/ 7/13

First I name which units are used. I mostly do this for Pentium 4; on Athlon
the limitations are pretty straightforward. Athlon can execute up to 3 ALU +
3 AGU ops per cycle, so these will never cause a resource contention
problem except when ops are stalled waiting on data. The cache can only
service 2 requests per cycle, but this is rarely a problem.

Next I have the decode field. Athlon can decode up to 16 bytes per cycle
(rarely a problem) and up to 3 DirectPath instructions per cycle. VectorPath
instructions decode at a rate of at most 1 per cycle. I believe some of the
longer ones like bsf take multiple cycles. (Is this proportional to the
number of macro-ops emitted?)
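
The decode rule can be sketched in a few lines of Python. This is only a
rough model under stated assumptions: the caller already knows which
instructions are DirectPath ('D') and which are VectorPath ('V'), the
16-byte limit is ignored, and a VectorPath instruction is assumed to take
exactly one decode cycle by itself (the open question about multi-cycle
decodes is not modelled). The name decode_cycles is made up for this
illustration.

```python
def decode_cycles(kinds, width=3):
    """Assign a decode cycle to each instruction, in program order.

    kinds: sequence of 'D' (DirectPath) or 'V' (VectorPath).
    width: DirectPath instructions decoded per cycle (3 on Athlon).
    Assumption: a VectorPath instruction decodes alone and takes one cycle.
    """
    cycles = []
    cycle = 0
    used = 0  # DirectPath slots already filled in the current cycle
    for kind in kinds:
        if kind == 'V':
            if used > 0:
                cycle += 1      # finish the partly filled DirectPath cycle
            cycles.append(cycle)
            cycle += 1          # the VectorPath op owns this cycle
            used = 0
        elif kind == 'D':
            if used == width:
                cycle += 1      # current cycle is full, start the next one
                used = 0
            cycles.append(cycle)
            used += 1
        else:
            raise ValueError("kind must be 'D' or 'V'")
    return cycles


# Nine DirectPath instructions followed by one VectorPath instruction.
print(decode_cycles(['D'] * 9 + ['V']))
```

If the first nine instructions in the listing above are taken to be
DirectPath and mul to be VectorPath (an assumption, not something the
listing states), this gives the same decode column as the margin notes.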

After you have all the decode fields complete, execute and retire fields
come next. An instruction executes when (1) it is decoded, (2) all of its
data is ready, and (3) it can access the units it needs. Once you know when
it executes, you add the latency to compute when it retires. The retire
cycle of the last instruction is the total latency for your block of code.
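
The execute/retire bookkeeping can be sketched the same way. This is a
minimal model, not a pipeline simulator: it applies conditions (1) and (2)
only and ignores (3), unit contention, entirely. The Op class, the schedule
function, and the latencies in the example are all made up for
illustration; real latencies have to come from the optimization manual.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    decode: int     # cycle in which the instruction is decoded
    latency: int    # cycles from execute to retire (illustrative values)
    deps: list = field(default_factory=list)  # names of earlier ops it reads


def schedule(ops):
    """Return {name: (execute, retire)} for ops given in program order.

    An op executes at the later of its decode cycle and the retire cycle
    of the ops it depends on. Unit contention is NOT modelled.
    A dependency must appear earlier in the list, or a KeyError is raised.
    """
    times = {}
    for op in ops:
        ready = max((times[d][1] for d in op.deps), default=0)
        execute = max(op.decode, ready)
        times[op.name] = (execute, execute + op.latency)
    return times


def total_latency(times):
    """Latest retire cycle in the block."""
    return max(retire for _, retire in times.values())


# Hypothetical block: two loads, an add of both, then a multiply.
block = [
    Op("load_a", decode=0, latency=3),
    Op("load_b", decode=0, latency=3),
    Op("add", decode=0, latency=1, deps=["load_a", "load_b"]),
    Op("mul", decode=1, latency=6, deps=["add"]),
]
times = schedule(block)
for name, (execute, retire) in times.items():
    print(f"{name:8s} execute={execute:2d} retire={retire:2d}")
print("total latency:", total_latency(times))
```

Adding condition (3) would mean tracking how many ops each unit has
accepted in a given cycle and pushing the execute cycle back when a unit is
full, which is where doing this by hand stops being fun.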

-Matt


