Re: pushf v rcl, save, restore carry flag




stork wrote:

> I'm writing my own large integer library in x86-64 assembly, and, to
> start with, I'm working on addition. My basic approach is to loop
> through each pair of 64-bit longs and adc them. The one thing I've
> noticed is that for this to work I need to save and restore the carry
> flag, as my loop counter sets it too. What's the fastest way to do
> that these days? I'm looking at pushf/popf, but some sites that
> I've looked at claim rcr/rcl ought to be faster as of the 486 and
> Pentium. Is this still true?

I can only speak for AMD CPUs; Intel CPUs may behave differently here.

Yes, PUSHF/POPF seem to have the worst timing (vectored, 1+?/16).
RCL reg,1 (vectored, 7) will beat PUSHF/POPF by far, but ...
LAHF (vectored, 3) and SAHF (direct, 1) pairs are faster, but must use AH.
SETcc reg (direct, 1) and TEST reg,imm (direct, 1) may be the fastest
and can use any available byte register; the memory forms,
SETcc [mem]/TEST [mem] pairs, time almost even with LAHF/SAHF.

At the moment I use LAHF/SAHF for calculations of up to 512 bits,
and I plan to use SETcc AL/TEST AL,1 pairs in the next upgrade.
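
As a rough illustration of the two save/restore pairs discussed above, here is a minimal sketch in C with GCC/Clang extended inline asm (AT&T syntax), assuming an x86-64 target where LAHF/SAHF are available in 64-bit mode. The function names add_limbs_setc and add_limbs_lahf are hypothetical, and nothing here has been timed. One deliberate departure from the SETcc/TEST pair: TEST always clears CF, so it reproduces the saved state in ZF only. Since ADC consumes CF, the sketch reloads the carry with ADD reg,0xFF instead, which sets CF exactly when the saved byte is nonzero.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: assumes x86-64 and GCC/Clang extended inline asm.
 * Both functions compute dst[i] += src[i] over n 64-bit limbs (least
 * significant limb first) and return the final carry. The C loop's own
 * counter arithmetic may overwrite CF between iterations, which is the
 * situation described in the question. */

/* Variant 1: park the carry in a byte register with SETC. */
static unsigned add_limbs_setc(uint64_t *dst, const uint64_t *src, size_t n)
{
    uint8_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        __asm__ volatile(
            "addb $0xff, %0\n\t"  /* CF = (carry != 0): 1 + 255 overflows a byte */
            "adcq %2, %1\n\t"     /* dst[i] += src[i] + CF */
            "setc %0"             /* save the new CF as 0 or 1 */
            : "+&q"(carry), "+m"(dst[i])
            : "r"(src[i])
            : "cc");
    }
    return carry;
}

/* Variant 2: keep the flags in AH with LAHF/SAHF. */
static unsigned add_limbs_lahf(uint64_t *dst, const uint64_t *src, size_t n)
{
    uint16_t ax = 0;              /* AH = bits 8..15; CF is bit 0 of AH */
    for (size_t i = 0; i < n; i++) {
        __asm__ volatile(
            "sahf\n\t"            /* reload SF, ZF, AF, PF, CF from AH */
            "adcq %2, %1\n\t"     /* dst[i] += src[i] + CF */
            "lahf"                /* store the new flags back into AH */
            : "+&a"(ax), "+m"(dst[i])
            : "r"(src[i])
            : "cc");
    }
    return (ax >> 8) & 1u;
}
```

Both variants take the same arguments, e.g. add_limbs_setc(a, b, 8) for a 512-bit addition. In hand-written assembly the same instruction pairs would simply bracket the loop-counter update.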

> I'm looking at the AMD64 Architecture guide, Vol 3, and it doesn't
> give much about timings at all, although it describes the instructions
> reasonably well. Is there a document out there that gives some sort
> of an idea of clock ticks per instruction (like the old days), or is
> it that today's processors are so massively pipelined that going by
> ticks per instruction isn't going to cut it and you really need to
> think in terms of everything else you have going on?

Yes, there are many things to keep in mind, and we are almost lost
when an exact timing calculation for x86 code is required ...

But for a rough (worst-case) estimate of code duration,
I use the latency/throughput figures from AMD 40546.pdf:
'Software Optimization Guide for AMD Family 10h Processors'
and AMD 25112.pdf:
'Software Optimization Guide for AMD64 Processors'

__
wolfgang


