Re: AMD CodeAnalyst MASM only?
From: Matt Taylor (para_at_tampabay.rr.com)
Date: 03/12/04
- Next message: Scott Moore: "Re: Software Protection and Anti Crack code"
- Previous message: Jean Dupont: "multi segment with masm"
- In reply to: marg: "Re: AMD CodeAnalyst MASM only?"
- Next in thread: marg: "Re: AMD CodeAnalyst MASM only?"
- Reply: marg: "Re: AMD CodeAnalyst MASM only?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Fri, 12 Mar 2004 07:45:43 +0000 (UTC)
"marg" <mapierSpamMustDie@juno.com.net> wrote in message
news:opr4nb83qdjpv40k@news.qwest.net...
> Thanks Matt! I think you saved me a bunch of pointless dinking around.
> So the dependency chain gives you a worst-case to compare against? That's
> cool.
>
> Also I think I've been timing things wrong because I haven't put the CPUIDs
> in there.
>
> I want to try to make an animated model in OpenGL of the pipeline, if it
> turns out the rules are simple enough. Cause it seems like by seeing it
> operate, a person could understand the issues instantly.
<snip>
The rules aren't that simple. Gantt charts are the usual tool for this sort
of thing, but the analysis is done by hand because a processor pipeline is
very complex to simulate.
When writing optimized code, it is helpful to document cycles in the margin.
For instance, the original Pentium had 2 pipelines (U & V) and a few rules
that limited when instructions could dispatch together. Here's an example of
Pentium-optimized code right out of the C runtime for Microsoft C++
(memcpy.asm):
    mov     al,[esi+3]      ;U - get first byte from source
                            ;V - spare
    mov     [edi+3],al      ;U - put first byte into destination
    mov     al,[esi+2]      ;V - get second byte from source
    mov     [edi+2],al      ;U - put second byte into destination
    mov     eax,[dst]       ;V - return pointer to destination
The rules of the game have changed somewhat since those days. Modern CPUs
can execute instructions out of order, so it is harder to predict when an
instruction will execute: that depends on what was executing beforehand.
Usually I'm lazy with my Athlon-optimized code and I group instructions
together by which cycle they retire in. To be thorough, I would document the
unit(s) used, decode cycle, execute cycle, and retire/writeback cycle.
Here's an example:
    ;                         unit(s)   decode/execute/retire
    mov  eax, [edx]         ; LS         0/ 0/ 3
    mov  ebx, [edx+4]       ; LS+AGU     0/ 1/ 3
    mov  ecx, [edx+8]       ; LS+AGU     0/ 1/ 4
    or   eax, [edx]         ; LS+ALU     1/ 2/ 5
    add  eax, ebx           ; ALU        1/ 5/ 6
    add  eax, ecx           ; ALU        1/ 6/ 7
    xor  ebx, ecx           ; ALU        2/ 4/ 5
    lea  eax, [eax+ebx]     ; ALU+AGU    2/ 7/ 9
    lea  ecx, [ecx*2+4]     ; ALU+AGU    2/ 4/ 6
    mul  eax                ; MUL        3/ 7/13
First I name which units are used. I mostly do this for Pentium 4; on Athlon
the limitations are pretty straightforward. Athlon can execute up to 3 ALU +
3 AGU ops per cycle, so these units never cause resource contention except
when ops are stalled waiting on data. The cache can only service 2 requests
per cycle, but this is rarely a problem.
Next I have the decode field. Athlon can decode up to 16 bytes per cycle
(rarely a problem) and up to 3 DirectPath instructions per cycle. VectorPath
instructions decode at a rate of at most 1 per cycle. I believe some of the
longer ones like bsf take multiple cycles. (Is this proportional to the
number of macro-ops emitted?)
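
As a rough illustration of filling in the decode column, here is a minimal
Python sketch. It assumes a VectorPath instruction takes a whole decode cycle
to itself (one reading of "at most 1 per cycle"), and it ignores both the
16-byte limit and any multi-cycle decodes. The function name and the
"direct"/"vector" labels are made up for this example. The last entry is
marked "vector" purely as an assumption; with nine DirectPath instructions
ahead of it, it lands in cycle 3 either way.

```python
def decode_cycles(kinds, width=3):
    """Assign a decode cycle to each instruction, in program order.

    kinds: list of "direct" or "vector" strings (hypothetical labels).
    width: DirectPath instructions decoded per cycle (3 on Athlon).
    Assumption: a VectorPath instruction decodes alone in its cycle.
    """
    cycles = []
    cycle = 0   # decode cycle currently being filled
    used = 0    # DirectPath slots already used in that cycle
    for kind in kinds:
        if kind == "vector":
            if used > 0:
                cycle += 1          # finish the partly filled cycle first
            cycles.append(cycle)
            cycle += 1              # nothing else shares this cycle
            used = 0
        else:
            if used == width:
                cycle += 1          # current cycle is full
                used = 0
            cycles.append(cycle)
            used += 1
    return cycles


example = ["direct"] * 9 + ["vector"]
print(decode_cycles(example))
```

For ten instructions laid out like this, the sketch gives three per cycle for
the first nine and a cycle of its own for the last.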
After you have all the decode fields complete, execute and retire fields
come next. An instruction executes when (1) it is decoded, (2) all of its
data is ready, and (3) it can access the units it needs. Once you know when
it executes, you add the latency to compute when it retires. The retire
cycle of the last instruction is the total latency for your block of code.
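
The bookkeeping above can be sketched in a few lines of Python. This is only
a simplified model: it covers conditions (1) and (2) and skips (3), assuming
the units are always free. The dict layout, the register lists and the
latencies (3 for a load, 1 for an ALU op) are placeholders for illustration,
not measured values.

```python
def schedule(instrs):
    """Compute (name, execute, retire) for each instruction in order.

    Each instruction is a dict with hypothetical keys:
      name, decode (cycle), latency (cycles), reads, writes (register names).
    Unit contention is ignored; registers never written count as ready at 0.
    """
    ready = {}   # register -> cycle at which its newest value is available
    out = []
    for ins in instrs:
        data_ready = max((ready.get(r, 0) for r in ins["reads"]), default=0)
        execute = max(ins["decode"], data_ready)   # decoded AND data ready
        retire = execute + ins["latency"]          # add the latency
        for r in ins["writes"]:
            ready[r] = retire
        out.append((ins["name"], execute, retire))
    return out


block = [
    {"name": "mov eax, [edx]",   "decode": 0, "latency": 3,
     "reads": ["edx"], "writes": ["eax"]},
    {"name": "mov ebx, [edx+4]", "decode": 0, "latency": 3,
     "reads": ["edx"], "writes": ["ebx"]},
    {"name": "add eax, ebx",     "decode": 0, "latency": 1,
     "reads": ["eax", "ebx"], "writes": ["eax"]},
    {"name": "xor ebx, ecx",     "decode": 1, "latency": 1,
     "reads": ["ebx", "ecx"], "writes": ["ebx"]},
]

result = schedule(block)
for name, execute, retire in result:
    print(f"{name:<18} execute={execute} retire={retire}")

# Total latency of the block: the latest retire cycle.
total = max(retire for _, _, retire in result)
print("total latency:", total)
```

With these placeholder latencies both loads finish at cycle 3, so the add and
the xor both have to wait until cycle 3 even though they were decoded earlier.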
-Matt