Re: optimizing code segment
- From: "Matt" <spamtrap@xxxxxxxxxx>
- Date: Fri, 13 May 2005 19:29:03 +0000 (UTC)
spamtrap@xxxxxxxxxx wrote:
> I am new to MMX/SSE/SSE2. I have the following code to optimize:
>
> int a=
> _coeffs[0]*lookup[offsets[0]]+
> _coeffs[1]*lookup[offsets[1]]+
> _coeffs[2]*lookup[offsets[2]]+
> _coeffs[3]*lookup[offsets[3]]+
> _coeffs[4]*lookup[offsets[4]]+
> _coeffs[5]*lookup[offsets[5]]+
> _coeffs[6]*lookup[offsets[6]]+
> _coeffs[7]*lookup[offsets[7]];
>
> where
> _coeffs is an array of ints from -4 to 4
> lookup is an array of unsigned ints
> offsets is an array of unsigned ints which I have ordered in
> increasing order with the hope that it will improve performance, but
> they vary a lot (from 1 to 2^20)
Are any of these arrays constant? The problem changes significantly if
offsets[] is immutable.
> The target machines are today's mid/high range PCs.
>
> Questions:
> - what is the best way to optimize this?
> - which extension is the best given the problem and target machines? I
> understand MMX is getting obsolete but SSE2 is not widespread on AMD
> yet.
There is no extension that will solve your problem. First, memory can only
be accessed from general registers. This means that the values offsets[0] ..
offsets[7] will need to be loaded into general purpose registers. Next,
neither MMX nor SSE/SSE2/SSE3 allow you to load a full register from
multiple locations. You can use multiple movd instructions and then shuffles
(or punpckldq in MMX) to accomplish this, but it is slower. The next problem
is that neither MMX nor SSE/SSE2/SSE3 support parallel 32-bit multiplies.
Oops. Finally, it is slow to take the accumulated result and move it back
into a general-purpose register.
In short, it will probably be significantly faster to not use MMX, and MMX
will be faster than using SSE, SSE2, or SSE3.
> - is prefetching going to help? Is this code memory access bound?
Probably no, probably yes. Prefetching will help if you can predict the
memory addresses that will be needed before you need to load them. It
doesn't look to me like you can do this, and if you can't predict them, then
prefetching won't help. Yes, this code looks depedent on the latency of
memory.
> - is it better to replace the multiplications with additions and
> subtractions, given my multiplier range of -4 to 4?
On a Pentium-4, maybe. A multiply takes 14 cycles at a minimum on P4, and 10
on P4E. The 4 additions take 2 cycles on a P4 and 4 cycles on a P4E. On an
Opteron, things are different. On an Opteron, a 32-bit multiply takes 3
cycles to complete. On an Athlon, it takes 5 cycles, and this is still
probably the best you can do.
-Matt
.
- References:
- optimizing code segment
- From: spamtrap
- optimizing code segment
- Prev by Date: Re: passing arguments
- Next by Date: Re: Access violation is confusing me.
- Previous by thread: optimizing code segment
- Next by thread: Re: optimizing code segment
- Index(es):
Relevant Pages
|