Re: Float/SSE optimization on Athlon/P4

From: Matt Taylor (para_at_tampabay.rr.com)
Date: 01/16/04

    Date: Fri, 16 Jan 2004 14:19:01 +0000 (UTC)
    
    

    "Gian-Carlo Pascutto" <natrese@hotmail.com> wrote in message
    news:tiNNb.14234$Jm6.3574884@phobos.telenet-ops.be...
    > "Matt Taylor" <para@tampabay.rr.com> wrote in message
    > news:PkJNb.1941$Bj.1744@twister.tampabay.rr.com...
    >
    <snip>
    > > > l2: sar edx, 16
    > > > movss xmm0, [edi + edx*4]
    > > > add eax, iadd
    > > > mulss xmm0, [esi]
    > > > mov edx, eax
    > > > add esi, 4
    > > > addss xmm1, xmm0
    > > > dec ecx
    > > > jnz l2
    > > >
    > > > Any improvement possible here?
    > >
    > > The theoretical maximum is 8 bytes per 2 cycles (2 fadd + 2 fmul per cycle /
    > > 1 load + 1 store per cycle). Your loop is doing about 4 bytes in 10 cycles.
    > > I'd like to see more details both on iadd & ecx. That's key in unrolling the
    > > loop.
    >
    > iadd is a constant which gets computed outside of the time critical code
    > (Which is actually a 16.16 fixed point number. It was a float originally
    > but doing it this way allowed me to use SAR instead of more float to int
    > mess.)
    >
    > ecx is computed by the setup code in the original post. Expected typical
    > range is 6, 7 or 8.
    <snip>

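    One thing I would try is to keep only the fractional part of the 16.16 number
    in ax. Then edx can hold the integer part directly, and adc carries from the
    fraction into edx. With iadd split into iadd_lo (its low 16 bits) and iadd_hi
    (iadd sar 16), that piece of code becomes:
        add ax, iadd_lo     ; fractional part; carry out lands in CF
        adc edx, iadd_hi    ; integer part plus the carry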

    This is smaller/faster than what you had before:
        add eax, iadd       ; advance the 16.16 position
        mov edx, eax
        sar edx, 16         ; extract the integer part
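
    To check that the two sequences really produce the same index, here is a
    minimal C model of both. It is only a sketch: the function and type names
    (sar16, step_combined, step_split, split_pos, schemes_agree) are made up for
    illustration, and it assumes the position never overflows 32 bits.

```c
#include <stdint.h>

/* Arithmetic shift right by 16 (what SAR does), written portably for negatives. */
static int32_t sar16(int32_t x) {
    return x < 0 ? ~(~x >> 16) : x >> 16;
}

/* Original scheme: one 16.16 value in eax, integer part extracted each step. */
static int32_t step_combined(int32_t *pos, int32_t iadd) {
    *pos += iadd;            /* add eax, iadd              */
    return sar16(*pos);      /* mov edx, eax / sar edx, 16 */
}

/* Split scheme: fraction in ax, integer part kept in edx. */
typedef struct {
    uint16_t frac;           /* models ax  */
    int32_t  whole;          /* models edx */
} split_pos;

static int32_t step_split(split_pos *p, uint16_t iadd_lo, int32_t iadd_hi) {
    uint32_t sum   = (uint32_t)p->frac + iadd_lo;   /* add ax, iadd_lo  */
    uint32_t carry = sum >> 16;                     /* the carry flag   */
    p->frac   = (uint16_t)sum;
    p->whole += iadd_hi + (int32_t)carry;           /* adc edx, iadd_hi */
    return p->whole;
}

/* Returns 1 if both schemes yield the same index on every one of `steps` steps. */
static int schemes_agree(int32_t start, int32_t iadd, int steps) {
    int32_t   pos     = start;
    split_pos sp      = { (uint16_t)(start & 0xFFFF), sar16(start) };
    uint16_t  iadd_lo = (uint16_t)(iadd & 0xFFFF);
    int32_t   iadd_hi = sar16(iadd);
    for (int i = 0; i < steps; i++) {
        if (step_combined(&pos, iadd) != step_split(&sp, iadd_lo, iadd_hi))
            return 0;
    }
    return 1;
}
```

    Calling schemes_agree(0, 0x18000, 8) steps by 1.5 eight times and compares the
    index after each step. A negative iadd works too, because iadd_hi is taken
    with an arithmetic shift.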

    With ecx only 6 to 8, I would completely unroll this on an AthlonXP because of
    the large icache. Once unrolled, you can schedule appropriately. The
    mov/mul/accumulate chain is your critical path (10 clk), so schedule the other
    instructions around it to hide its latency. You should be able to get the cost
    of an iteration down to 2-3 cycles, which is 3-5 times faster than what you
    had before.
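
    As a rough illustration of scheduling around the accumulate chain, the C
    sketch below unrolls the loop by two and gives each half its own accumulator,
    so consecutive adds no longer wait on each other. This is one reading of the
    advice, not code from the thread: the names (dot_rolled, dot_two_acc, indexed,
    linear) are hypothetical, it unrolls by two rather than completely, and it
    assumes pos stays non-negative. Splitting the sum also changes the order of
    the float additions, so results can differ in the last bits.

```c
#include <stdint.h>

/* Mirrors the original loop: one multiply-accumulate per iteration,
   every add depending on the previous one. pos and iadd are 16.16. */
static float dot_rolled(const float *indexed, const float *linear,
                        int32_t pos, int32_t iadd, int count) {
    float acc = 0.0f;
    for (int i = 0; i < count; i++) {
        acc += indexed[pos >> 16] * linear[i];
        pos += iadd;
    }
    return acc;
}

/* Unrolled by two with independent accumulators, so the two
   multiply-add chains can overlap instead of serializing. */
static float dot_two_acc(const float *indexed, const float *linear,
                         int32_t pos, int32_t iadd, int count) {
    float acc0 = 0.0f, acc1 = 0.0f;
    int i = 0;
    for (; i + 1 < count; i += 2) {
        acc0 += indexed[pos >> 16] * linear[i];
        pos += iadd;
        acc1 += indexed[pos >> 16] * linear[i + 1];
        pos += iadd;
    }
    if (i < count)                      /* odd count: one element left over */
        acc0 += indexed[pos >> 16] * linear[i];
    return acc0 + acc1;
}
```

    For a count known to be 6, 7 or 8, the same idea extends to a fully unrolled
    body with a jump to the right entry point.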

    -Matt

