Re: Float/SSE optimization on Athlon/P4
From: Matt Taylor (para_at_tampabay.rr.com)
Date: 01/16/04
- In reply to: Gian-Carlo Pascutto: "Re: Float/SSE optimization on Athlon/P4"
Date: Fri, 16 Jan 2004 14:19:01 +0000 (UTC)
"Gian-Carlo Pascutto" <natrese@hotmail.com> wrote in message
news:tiNNb.14234$Jm6.3574884@phobos.telenet-ops.be...
> "Matt Taylor" <para@tampabay.rr.com> wrote in message
> news:PkJNb.1941$Bj.1744@twister.tampabay.rr.com...
>
<snip>
> > > l2: sar edx, 16
> > > movss xmm0, [edi + edx*4]
> > > add eax, iadd
> > > mulss xmm0, [esi]
> > > mov edx, eax
> > > add esi, 4
> > > addss xmm1, xmm0
> > > dec ecx
> > > jnz l2
> > >
> > > Any improvement possible here?
> >
> > The theoretical maximum is 8 bytes per 2 cycles (2 fadd + 2 fmul per cycle /
> > 1 load + 1 store per cycle). Your loop is doing about 4 bytes in 10 cycles.
> > I'd like to see more details both on iadd & ecx. That's key in unrolling the
> > loop.
>
> iadd is a constant which gets computed outside of the time critical code
> (Which is actually a 16.16 fixed point number. It was a float originally
> but doing it this way allowed me to use SAR instead of more float to int
> mess.)
>
> ecx is computed by the setup code in the original post. Expected typical
> range is 6, 7 or 8.
<snip>
One thing I would try is to use ax for only the fractional part of the 16.16
number. Then you can keep edx as the integer part and use adc to carry into
edx. With iadd split into its low 16 bits (iadd_lo) and its high 16 bits
(iadd_hi), that piece of code now becomes:
    add ax, iadd_lo
    adc edx, iadd_hi
This is smaller/faster than what you had before:
    add eax, iadd
    mov edx, eax
    sar edx, 16
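
To see why the two forms track the same index, here is a small C model of both stepping schemes. It is only a sketch: the names SplitPos, step_combined, step_split and schemes_agree are made up for illustration, and it assumes non-negative positions and a step that never overflows the 16 integer bits of the combined form.

```c
#include <stdint.h>

/* Hypothetical model of the split form: ax holds the fraction, edx the index. */
typedef struct {
    uint16_t frac;  /* low 16 bits of the 16.16 position (ax)  */
    uint32_t idx;   /* integer part of the position (edx)      */
} SplitPos;

/* Original form: add eax, iadd / mov edx, eax / sar edx, 16 */
static uint32_t step_combined(uint32_t *pos, uint32_t iadd)
{
    *pos += iadd;
    return *pos >> 16;  /* assumption: position is non-negative */
}

/* Split form: add ax, iadd_lo / adc edx, iadd_hi */
static uint32_t step_split(SplitPos *p, uint16_t iadd_lo, uint32_t iadd_hi)
{
    uint32_t sum = (uint32_t)p->frac + iadd_lo;  /* 16-bit add ...            */
    uint32_t carry = sum >> 16;                  /* ... whose carry-out is CF */
    p->frac = (uint16_t)sum;
    p->idx += iadd_hi + carry;                   /* adc folds the carry in    */
    return p->idx;
}

/* Returns 1 if both schemes produce the same index for every step. */
static int schemes_agree(uint32_t start, uint32_t iadd, int steps)
{
    uint32_t pos = start;
    SplitPos sp = { (uint16_t)(start & 0xFFFFu), start >> 16 };
    uint16_t iadd_lo = (uint16_t)(iadd & 0xFFFFu);
    uint32_t iadd_hi = iadd >> 16;

    for (int i = 0; i < steps; ++i) {
        if (step_combined(&pos, iadd) != step_split(&sp, iadd_lo, iadd_hi))
            return 0;
    }
    return 1;
}
```

For example, schemes_agree(0, 0x18000, 8) steps by 1.5 eight times and compares the index after every step. The split form never needs the shift because edx already holds the integer part.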
I would completely unroll this on an Athlon XP because of its large icache.
Once unrolled, you can schedule appropriately. The movss/mulss/addss chain is
your critical path (10 clk), so schedule around that to minimize its
latency. You should be able to get the latency of an iteration down to 2-3
cycles, which is 3-5 times faster than what you had before.
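
As a rough illustration of what unrolling buys, here is a C sketch of the loop with one accumulator and with two independent ones. Splitting the accumulator is one common way to shorten the dependent addss chain; it is not something spelled out above. The function and parameter names are hypothetical, and the position is treated as an unsigned 16.16 value for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add has to wait for the previous add to finish. */
static float dot_1acc(const float *table, const float *coef,
                      uint32_t pos, uint32_t iadd, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += table[pos >> 16] * coef[i];  /* movss / mulss / addss */
        pos += iadd;                        /* 16.16 fixed-point step */
    }
    return acc;
}

/* Unrolled by two with separate accumulators, so the two multiply/add
   chains are independent and can overlap in the pipeline. */
static float dot_2acc(const float *table, const float *coef,
                      uint32_t pos, uint32_t iadd, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint32_t pos1 = pos + iadd;
        acc0 += table[pos >> 16] * coef[i];
        acc1 += table[pos1 >> 16] * coef[i + 1];
        pos = pos1 + iadd;
    }
    if (i < n)                              /* odd trip count: one leftover */
        acc0 += table[pos >> 16] * coef[i];
    return acc0 + acc1;
}
```

Summing in a different order can change the last bits of a float result, so the two versions are only guaranteed to match exactly when every intermediate value is exactly representable. With a trip count of 6 to 8, the assembly version could be unrolled completely instead of by two.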
-Matt