Re: optimizing code segment



spamtrap@xxxxxxxxxx wrote:
> I am new to MMX/SSE/SSE2. I have the following code to optimize:
>
> int a=
> _coeffs[0]*lookup[offsets[0]]+
> _coeffs[1]*lookup[offsets[1]]+
> _coeffs[2]*lookup[offsets[2]]+
> _coeffs[3]*lookup[offsets[3]]+
> _coeffs[4]*lookup[offsets[4]]+
> _coeffs[5]*lookup[offsets[5]]+
> _coeffs[6]*lookup[offsets[6]]+
> _coeffs[7]*lookup[offsets[7]];
>
> where
> _coeffs is an array of ints from -4 to 4
> lookup is an array of unsigned ints
> offsets is an array of unsigned ints which I have ordered in
> increasing order with the hope that it will improve performance, but
> they vary a lot (from 1 to 2^20)

Are any of these arrays constant? The problem changes significantly if
offsets[] is immutable.

> The target machines are today's mid/high range PCs.
>
> Questions:
> - what is the best way to optimize this?
> - which extension is the best given the problem and target machines? I
> understand MMX is getting obsolete but SSE2 is not widespread on AMD
> yet.

There is no extension that will solve your problem. First, memory can only
be accessed from general registers. This means that the values offsets[0] ..
offsets[7] will need to be loaded into general purpose registers. Next,
neither MMX nor SSE/SSE2/SSE3 allow you to load a full register from
multiple locations. You can use multiple movd instructions and then shuffles
(or punpckldq in MMX) to accomplish this, but it is slower. The next problem
is that neither MMX nor SSE/SSE2/SSE3 support parallel 32-bit multiplies.
Oops. Finally, it is slow to take the accumulated result and move it back
into a general-purpose register.

In short, it will probably be significantly faster to not use MMX, and MMX
will be faster than using SSE, SSE2, or SSE3.

> - is prefetching going to help? Is this code memory access bound?

Probably no, probably yes. Prefetching will help if you can predict the
memory addresses that will be needed before you need to load them. It
doesn't look to me like you can do this, and if you can't predict them, then
prefetching won't help. Yes, this code looks depedent on the latency of
memory.

> - is it better to replace the multiplications with additions and
> subtractions, given my multiplier range of -4 to 4?

On a Pentium-4, maybe. A multiply takes 14 cycles at a minimum on P4, and 10
on P4E. The 4 additions take 2 cycles on a P4 and 4 cycles on a P4E. On an
Opteron, things are different. On an Opteron, a 32-bit multiply takes 3
cycles to complete. On an Athlon, it takes 5 cycles, and this is still
probably the best you can do.

-Matt

.



Relevant Pages

  • Re: optimizing code segment
    ... >_coeffs is an array of ints from -4 to 4 ... >offsets is an array of unsigned ints which I have ordered in increasing ... >understand MMX is getting obsolete but SSE2 is not widespread on AMD ... applications such as this, MMX vector component multiplications ...
    (comp.lang.asm.x86)
  • Bug+fix: PDC20271 RAID detection fails
    ... My array was not detected by my kernel. ... the PDC RAID superblock, that is located at the start ... of the last track on the disk. ... is a multiple of track size and if not, ...
    (comp.os.linux.hardware)
  • Re: Another Filegroup question
    ... I would suggest multiple filegroups only so it would be easier to ... small array as this I don't think multiple files will buy you anything. ... don't think you have too much of a choice if you only have 4 drives that are ...
    (microsoft.public.sqlserver.server)
  • RE: Using Match function with duplicate values in an array
    ... > set up a helper column ... >> function to find the row reference in the array. ... >> are multiple equal values in the column. ...
    (microsoft.public.excel.worksheet.functions)
  • ADO and datasets in C# versus traditional OO class approach
    ... The application controls multiple physical devices in real time. ... need something inbetween to sort it out type table). ... info out of the business layer, which seems a bit time consuming. ... Device objects, which would contain an array of working set objects, which ...
    (microsoft.public.dotnet.framework.windowsforms.databinding)