Re: Optimization Questions
- From: "Mark_Larson" <spamtrap@xxxxxxxxxx>
- Date: 20 Mar 2007 12:00:26 -0700
On Mar 19, 4:45 pm, Randy Yates <spamt...@xxxxxxxxxx> wrote:
"Mark_Larson" <spamt...@xxxxxxxxxx> writes:
couple quick tips
Hi Mark,
Thanks for your comprehensive response. Let me respond below.
Get rid of the MOVDQU. [...]
Love to, but can't. It's impossible. Actually, it would be
possible, but I doubt if it would help since you'd have to
move a lot of data around.
The problem is two-fold:
1) The input is a wrap-around buffer (a circular buffer), so there
is potentially a discontinuity at the buffer boundary depending on
where the index is.
2) The index into the data changes 7 samples at a time, so no matter
what the buffer alignment, 7 out of 8 calls to the routine will have
unaligned indexes. Sure you could move the data around, but the
cycles you'd save would be more than offset by the cycles you'd burn
moving the data.
pshufd trick. there are 4 ports on the P4. ALL SSE/SSE2 go through
port 1. The one exception is any kind of move instruction whether it
is floating point, mmx, sse, or sse2 all go through port 0. Integer
instructions go through port 0 and port 1. So here is the rub.
Pshufd is 2 cycles faster on a P4 ( 4 cycles compared to 6 cycles on
the P4). But it goes down port 1. So in some cases it is faster to
replace all your movdqas with pshufd. The way I do it is I try all
combinations of replacing them. In general doing this if you have two
back to back movdqa's is gonna be faster.
Nice trick to know, but since the above constraint holds, I won't have
back-to-back movdqa's.
PREVIOUS CODE
; Get 8 words of data:
;
movdqu xmm0, [esi+2*edi]
;
MACLoopResume:
;
; Get 8 words filter coefficients:
;
movdqa xmm1, [decimatorFilterCoef+2*edx]
REVISED code, changing movdqu to movdqa and adding a pshufd to second
movdqa, so they can execute in parallel
; Get 8 words of data:
;
movdqa xmm0, [esi+2*edi]
;
MACLoopResume:
;
; Get 8 words filter coefficients:
;
pshufd xmm1, [decimatorFilterCoef+2*edx],0E4h
16-bit code. don't use it. Switch to using movzx. 16 bit code on a
p4 runs horribly. Movzx is for unsigned data. Movsx is for signed
data, so make sure you use the right one based upon the data you are
using. You only need to change the code where you read the data into
a 16-bit register, writing one afterwards will be fast. The issue has
to do without partial register stalls. this is also an example of a
write-after-read, which I talk about below. Pre-read the value in EAX
with movzx or movsx 4 or 5 lines ahead if possible.
mov ax, [esi+2*edi]
mov [ebp], ax
;move me up 4 or 5 lines
movzx eax, dword ptr [esi+2*edi] ;gets rid of
one 16-bit access
mov [ebp], ah ;this avoids
a partial register stall because all of EAX was updated on the
previous instruction.
Good to know, but as this is outside the inner loop, it's in the noise.
memory optimizations. Try and pre-read data into a register before
you use it. Memory access instructions usually take a number of
cycles longer since they have to get the data from memory first. So
pre-reading into a register will give a big speed up. I saw lots of
places in your code where you read data and then immediately use it.
Pre-read it a number of instructions ahead( 4 or 5 is a good rule of
thumb). For SSE2/SSE instructions you want more than several
instructions ahead. About twice of that.
second memory optimization trick. You can pre-read the data for SSE/
SSE2 registers on the previous loop. You move it to the end. You
aren't using that many SSE2 registers. You are using XMM0, XMM1,
XMM3. So you can use the rest to pre-read the data. For that to work
right, you have to add a read of them into the other SSE2 registers
above the first label in your loop
;add setting up XMM registers here outside the loop to pre-read
them for the first loop execution.
mov eax, ebx
add eax, 1
sub eax, edi
shr eax, 3
mov [k0], eax
sub eax, ecx
jna k0_OK
;
mov [k0], ecx
k0_OK:
redundant XMM code. the MOVQ and the MOVHPS do the same thing as the
single MOVDQU. get rid of them. Also align the memory you are using
and switch to MOVDQA.
MACLoop:
;
; Get 8 words of data:
;
movdqa xmm0, [esi+2*edi]
; movdqu xmm0, [esi+2*edi]
; movq xmm0, [esi+2*edi]
; movhps xmm0, [esi+2*edi+8]
;
if you read a register after just having written it, you get a stall.
Because the previous instruction has to finish executing before the
next instruction can execute. That's why I wanted you to pre-read all
your data before you need it. Otherwise it is going to stall the
bus. This is called a write-after-read. A read-after-a-write also
stalls. So if you just wrote data to memory or a register, and then
on the next instruction you use it as the source, it will stall. A
write means that the register or memory was the destination of the
previous instruction. And a read means the source is being used for
the next instruction. Memory reads can take a while, so you'd wanna
space these out 4 or so instructions, maybe more.
mov eax,[memory]
add ecx,eax
the p4 can theoritically execute 4 instructions in ALU instructions in
parallel ( the trace cache keeps that from happening since it only
issues 3 instructions at a time). Anyways, because of that you can
execute two ALU pieces of code by interleaving the instructions. This
usually gives a good speed up.
piece1_line1
piece2_line1
piece1_line2
piece2_line2
piece1_line3
piece2_line3
piece1_line4
piece2_line4
piece1_line5
piece2_line5
there are a few exceptions to that, for example, sometimes you have to
do a slow instruction in one of the pieces of code. If so you would
want to pad with more fast instructions from the first loop. Slow
instructions are like MUL and DIV. anything that takes more than a
few cycles to execute. Get a copy of the Intel optmization book. I
am assuming you have an Intel, since most people do.
Yes, I knew loop unrolling would be another level of optimization, but
working out the details of what happens across the buffer boundary would
be a nightmare.
This code is targeted for the Intel Core Duo 1.83 GHz. The goal was to
get <25% CPU time (on a single CPU). (Note that there will be four
instances of this running simultaneously. With the filter size of 2248
samples and a decimation ratio of 7:1, the current cycle count works
out to be about 6% per filter).
In reality I'm getting about 10%. I think it may be due to the cache
line misalignment of the movqu, but as I stated there's no way around
this.
Here's the latest, which I think is identical execution-wise, but has
some cycle count info in the header comments and inner loop comments:
Thanks again for your pointers, Mark.
--Randy
;
;***********************************************************************************************************************************
;
; Module: DecimatorMAC (Decimator Multiply-Accumulate)
;
; Programmer: Randy Yates
;
; Function:
;
; This function performs an inner product between the coefficients and the input data for use in the
; BTSC Decoder decimator routine. It computes a single output sample of an FIR filter.
;
; Register State Upon Entry: standard C calling convention
;
; Register State Upon Exit:
;
; EAX = inner product result in lower 16 bits
;
; C Prototype:
;
; int16_t DecimatorMAC(int16_t* inpPtr, uint32_t inpIdx, uint32_t inpMsk)
;
; Comments, Limitations, Assumptions:
;
; 1. Assumes processor has SSE2 extensions.
;
; 2. Assumes the coefficient length is a multiple of 8.
;
; 3. Assumes the coefficient order is reversed. Note that, for a symmetric filter, this is redundant
;
; 4. Assumes gcc C calling conventions (is this different than standard C calling conventions?)
;
; CPU Utilization:
;
; 281 vector multiplies per output: ; [6.15*281] = [1728.15]
; 64000 outputs/second: ; [1728.15*64000] = [110601600] == 111 MIPS == 6.06% CPU time
;
;***********************************************************************************************************************************
;
GLOBAL DecimatorMAC
;
EXTERN decimatorFilterCoef,decimatorFilterLength
;
; Passed parameter stack offset definitions
;
prmPtr EQU 20
prmIdx EQU 24
prmMsk EQU 28
;
SECTION .data
;
; temporary storage for 8 samples (16 bytes) when the circular buffer boundary is crossed
;
boundarySamples:
;
DW 0, 0, 0, 0, 0, 0, 0, 0
;
k0: DD 0 ; intermediate loop counts - see description below
k1: DD 0
k2: DD 0
;
j DW 0
;
SECTION .text
;
DecimatorMAC:
;
; save registers which the caller needs preserved (gcc C calling conventions)
;
push ebp
push esi
push edi
push ebx
;
mov ebp, esp
;
mov esi, [ebp+prmPtr] ; esi is input data pointer
mov ebx, [ebp+prmMsk] ; ebx is mask
mov edi, [ebp+prmIdx] ; edi is data index
mov ecx, [decimatorFilterLength] ; ecx is decrementing loop counter
xor edx, edx ; edx is incrementing loop counter
;
; The standard convolution kernel is
;
; y[n] = sum_{m=0}^{M-1} h[m] * x[n-m].
;
; We could just as easily (and equivalently) let m run backwards:
;
; y[n] = sum_{m=M-1}^{0} h[m] * x[n-m].
;
; If we let k = M-1-m ==> m = M-1-k, this becomes
;
; y[n] =...
read more »
I don't have time to read it right now, I am busy at work. However
two things. You can still pair movdqu and pshufd, they both go down
different ports. The pshufd will just finish in less than half the
time the movdqa.
the particular place I am talking about is.
movq xmm0, [esi+2*edi] ; [0.33]
movhps xmm0, [esi+2*edi+8] ; [1]
movdqa xmm1, [decimatorFilterCoef+2*edx] ; [1]
you can change the movdqa to a pshufd using 0E4h for the mask. On a
P4 pshufd and MOVhps both take 4 cycles on my P4, so we need to do
them first in parallel and then the slower one last. just add this
and see if it is faster, if not, remove it. I am not sure about the
timings on a Core 2 duo. If the first two instructions are the same
cycles on your Core 2 Duo, then they will perfectly pair. 4 cycles a
piece with different ports.
movhps xmm0, [esi+2*edi+8] ; [1]
pshufd xmm1, [decimatorFilterCoef+2*edx] ; [1]
movq xmm0, [esi+2*edi] ; [0.33]
also try this version. with them after the movq. Since I don't know
the timings for the Core 2 Duo this might be faster.
movq xmm0, [esi+2*edi] ; [0.33]
movhps xmm0, [esi+2*edi+8] ; [1]
pshufd xmm1, [decimatorFilterCoef+2*edx] ; [1]
after trying both of those and seeing which is faster ( original,
version or version 2), leave the change in, and then do this part.
I am not sure if the core 2 duo fixed this or not, but there are
partial register stalls on P4s when you access a 16-bit register. you
have quite a few 16-bit accesses. Change them all to movsx or movzx,
and re-time, see if it is faster doing that. Again you only have to
do the initial move from memory to the register.
movsx eax,word [memory] ;tells the
processor to read a word value and sign extend it into eax
movzx eax,word [memory] ;tells the
processor to read a word value and zero extend it into eax
do you have a comprehensive description of the algorithm?
Mark
.
- References:
- Optimization Questions
- From: Randy Yates
- Re: Optimization Questions
- From: Mark_Larson
- Re: Optimization Questions
- From: Randy Yates
- Re: Optimization Questions
- From: Mark_Larson
- Re: Optimization Questions
- From: Randy Yates
- Re: Optimization Questions
- From: Mark_Larson
- Re: Optimization Questions
- From: Randy Yates
- Optimization Questions
- Prev by Date: Re: Harleyesque twiddle required.
- Next by Date: Re: Instruction Cycle of a 8086 statement
- Previous by thread: Re: Optimization Questions
- Next by thread: Re: Optimization Questions
- Index(es):
Relevant Pages
|