Re: Optimization Questions
- From: Randy Yates <spamtrap@xxxxxxxxxx>
- Date: Thu, 15 Mar 2007 15:46:40 -0400
"Mark_Larson" <spamtrap@xxxxxxxxxx> writes:
On Mar 14, 4:13 pm, Randy Yates <spamt...@xxxxxxxxxx> wrote:
"Mark_Larson" <spamt...@xxxxxxxxxx> writes:
I also have an optimization web page that is broken up into Beginner,
Intermediate and Advanced. I have quite a few SSE/SSE2 examples.
Hutch hosts it on his masm32 website ( thanks Hutch!).
http://www.mark.masmcode.com/
Thanks for the pointer, AND for CREATING the pointer, Mark!
--
% Randy Yates % "So now it's getting late,
%% Fuquay-Varina, NC % and those who hesitate
%%% 919-577-9882 % got no one..."
%%%% <y...@xxxxxxxx> % 'Waterfall', *Face The Music*, ELOhttp://home.earthlink.net/~yatescr
Not a problem. I really love optimizing using SSE/SSE2, you can do so
many different things, and on Intel processors it runs on average
twice as fast as AMD. I actually have several SSE/SSE2 tutorials I
posted on masm32. I forgot about them when I posted yesterday.
I've done a lot of SSE and SSE2 programming over the years. I have
an optimization website that goes over some basic tricks to speed up
code with SSE/SSE2 ( along with other tricks).
http://www.mark.masmcode.com/
P4's and up on the Intel side really run SSE/SSE2 code very fast. So
I've used that advantage a lot to make code run extremely fast.
converting a string to a qword using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=4253.msg28940#m...
SSE2 quaternion multiply
http://www.oldboard.assemblercode.com/index.php?topic=3469.0
Mersenne Twister Random Number Generator in SSE2
http://www.oldboard.assemblercode.com/index.php?topic=3565.0
my account on masmforum got messed up ( all these links are for
masmforum). So some messages will say they are from hutch- instead of
marklarson. The way you tell it's the real me, is it'll say "guest"
under "hutch--".
Counting the number of lines in a file using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=2692.msg18800#m...
string copy using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=2632.msg18047#m...
Computing MD5 using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=2921.0
I am working on a raytracer that I haven't finished yet. You can use
scalar SSE code just like FP code ( you don't do stuff in parallel,
it's a single floating point value you are doing an operation on).
Scalar code is faster on a P4. ( not sure about AMD).
http://www.masm32.com/board/index.php?topic=1140.0
line counting again. But I actually have 2 different versions using
2 different algorithms. If you scroll down the second posted one is
done in a non-intuitive manner.
http://www.masm32.com/board/index.php?topic=5434.msg40666#msg40666
Thanks for all the references and examples, Mark.
FYI, here's my convolution routine. It's designed for use in a very
long FIR filter (>2000 taps), so instead of coding both the inner
and outer loops in assembly, I just coded the inner one, which is
where 99% of the time is.
I know I could unroll the loop, but basically I'm lazy. If anyone
sees any further optimizations that could be done with a single
loop, then please tell. By the way, the data can't be aligned on
8-word boundaries, hence I can't use MOVDQA's to load the data. The
coefficients are aligned.
--Randy
;
;***********************************************************************************************************************************
;
; Module: DecimatorMAC (Decimator Multiply-Accumulate)
;
; Programmer: Randy Yates
;
; Function:
;
; This function performs an inner product between the coefficients and the input data for use in the
; BTSC Decoder decimator routine. It computes a single output sample of an FIR filter.
;
; Register State Upon Entry: standard C calling convention
;
; Register State Upon Exit:
;
; EAX = inner product result in lower 16 bits
;
; C Prototype:
;
; int16_t DecimatorMAC(int16_t* inpPtr, uint32_t inpIdx, uint32_t inpMsk)
;
; Comments, Limitations, Assumptions:
;
; 1. Assumes processor has SSE2 extensions.
;
; 2. Assumes the coefficient length is a multiple of 8.
;
; 3. Assumes the coefficient order is reversed. Note that, for a symmetric filter, this is redundant
;
; 4. Assumes gcc C calling conventions (is this different than standard C calling conventions?)
;
;***********************************************************************************************************************************
;
GLOBAL DecimatorMAC
;
EXTERN decimatorFilterCoef,decimatorFilterLength
;
; Passed parameter stack offset definitions
;
prmPtr EQU 20
prmIdx EQU 24
prmMsk EQU 28
;
SECTION .data
;
; temporary storage for 8 samples (16 bytes) when the circular buffer boundary is crossed
;
boundarySamples:
;
DW 0, 0, 0, 0, 0, 0, 0, 0
;
k0: DD 0 ; intermediate loop counts - see description below
k1: DD 0
k2: DD 0
;
j DW 0
;
SECTION .text
;
DecimatorMAC:
;
; save registers which the caller needs preserved (gcc C calling conventions)
;
push ebp
push esi
push edi
push ebx
;
mov ebp, esp
;
mov esi, [ebp+prmPtr] ; esi is input data pointer
mov ebx, [ebp+prmMsk] ; ebx is mask
mov edi, [ebp+prmIdx] ; edi is data index
mov ecx, [decimatorFilterLength] ; ecx is decrementing loop counter
xor edx, edx ; edx is incrementing loop counter
;
; The standard convolution kernel is
;
; y[n] = sum_{m=0}^{M-1} h[m] * x[n-m].
;
; We could just as easily (and equivalently) let m run backwards:
;
; y[n] = sum_{m=M-1}^{0} h[m] * x[n-m].
;
; If we let k = M-1-m ==> m = M-1-k, this becomes
;
; y[n] = sum_{k=0}^{M-1} h[M-1-k] * x[n-M+1+k].
;
; The coefficients are in reverse order
;
; b[0] = h[M-1]
; b[1] = h[M-2]
; .
; .
; .
; b[M-1] = h[0]
;
; which implies b[k] = h[M-1-k] - precisely the expression above.
; Using k, the index into the data runs positive. Thus we need to
; compute n-M+1 as the starting point in the data, where n = inpIdx
; and M = decimatorFilterLength:
;
sub edi, ecx
add edi, 1
;
; use the mask to 'circularize" the index
;
and edi, ebx
;
; Compute the total SSE2 loop count k as the decimator filter length divided by 8 (we assume the
; decimator filter length is a multiple of eight) since we're operating on 8 16-bit samples at a time:
;
shr ecx, 3 ; divide ecx by 3 since we process 8 samples at a time
;
;********************
; Processing Approach
;********************
;
; Due to the possible disconuity in the input data across the circular buffer boundary,
; the processing is partitioned into three sections:
;
; 1. From n0, the starting point and the current value of [edi], to the last value n0 + 8*k0 - 1 such that
;
; n0 + 8*k0 - 1 <= M
;
; and
;
; n0 + 8*(k0+1) - 1 > M,
;
; where M = inpMsk. k0 is easily determined as
;
; k0 = min(floor((M + 1 - n0)/8), k),
;
; The SSE2 parallel computations are performed k0 times in this first section.
;
mov eax, ebx
add eax, 1
sub eax, edi
shr eax, 3
mov [k0], eax
sub eax, ecx
jna k0_OK
;
mov [k0], ecx
;
k0_OK:
;
; 2. In this section, k1 SSE2 parallel computations are performed using the temporary storage area as the
; data address, where
;
; k1 = min(k-k0, 1).
;
; Thus 0 or 8 samples are processed. The samples (if any) in this section may straddle the boundary of the
; circular buffer, so they are copied to the temporary storage area boundarySamples.
;
mov eax, ecx
sub eax, [k0]
je k1_OK
;
mov eax, 1
;
k1_OK:
;
mov [k1], eax
;
; 3. In this section, k2 SSE2 parallel computations are performed, where
;
; k2 = k - k0 - k1
;
mov eax, ecx
sub eax, [k0]
sub eax, [k1]
mov [k2], eax
;
; Initialize j, the section number (0, 1, 2) (kj)
;
mov word [j], 2
;
; Debugging label
;
kComplete:
;
; SSE2 register usage definition:
;
; xmm0 = input data and intermediate MAC (4 32-bit words)
; xmm1 = coef data
; xmm3 = quad 32-bit accumulator
;
; zero accumulator:
;
pxor xmm3, xmm3
;
; Get k0 in ecx for first section
;
mov ecx, [k0]
or ecx, ecx
je Checkj ; it's possible k0 = 0
;
; OK, we're all set: here's the main loop
;
MACLoop:
;
; Get 8 words of data:
;
; movdqu xmm0, [esi+2*edi]
movq xmm0, [esi+2*edi]
movhps xmm0, [esi+2*edi+8]
;
MACLoopResume:
;
; Get 8 words filter coefficients:
;
movdqa xmm1, [decimatorFilterCoef+2*edx]
;
; Do the double-quad intermediate MAC (xmm0):
;
pmaddwd xmm0, xmm1
;
; increment and circularize the data index
;
add edi, 8
and edi, ebx
;
; shift right arithmetically three to scale
;
psrad xmm0, 3
;
; decrement loop counter
;
add edx, 8
sub ecx, 1
;
; accumulate into xmm3
;
paddd xmm3, xmm0
jnz MACLoop
;
; check j
;
Checkj:
;
sub word [j], 1
je Sectionk2
jl Finished
;
Sectionk1
;
mov ecx, [k1]
or ecx, ecx
je Finished
;
; move 8 input samples into boundarySamples
;
mov ecx, 8
mov ebp, boundarySamples
;
WordMove:
;
mov ax, [esi+2*edi]
mov [ebp], ax
;
; increment and circularize the data index
;
add edi, 1
and edi, ebx
;
; increment the ebp index into the boundarySamples buffer
;
add ebp, 2
;
; decrement loop counter
;
sub ecx, 1
jnz WordMove
;
movdqu xmm0, [boundarySamples]
add ecx, 1
;
; radjust the data index
;
add edi, -8
and edi, ebx
jmp MACLoopResume
;
Sectionk2:
;
mov ecx, [k2]
or ecx, ecx
jne MACLoop
;
Finished:
;
; OK, we've got 4 intermediate 32-bit results packed into xmm3. Sum those
; and take the MSW and we've got our result.
;
; move result onto stack
;
movdqu [esp-16], xmm3
;
; move the first result into the accumulatr,
; then sum the other three
;
mov eax, dword [esp-16]
add eax, dword [esp-12]
add eax, dword [esp-08]
add eax, dword [esp-04]
;
; Finally, perform an arithmetic right shift by 16
;
shr eax, 16
;
; restore registers
;
RegisterRestore:
;
pop ebx
pop edi
pop esi
pop ebp
;
ret
;
END
--
% Randy Yates % "Remember the good old 1980's, when
%% Fuquay-Varina, NC % things were so uncomplicated?"
%%% 919-577-9882 % 'Ticket To The Moon'
%%%% <yates@xxxxxxxx> % *Time*, Electric Light Orchestra
http://home.earthlink.net/~yatescr
.
- Follow-Ups:
- Re: Optimization Questions
- From: Mark_Larson
- Re: Optimization Questions
- References:
- Optimization Questions
- From: Randy Yates
- Re: Optimization Questions
- From: Mark_Larson
- Re: Optimization Questions
- From: Randy Yates
- Re: Optimization Questions
- From: Mark_Larson
- Optimization Questions
- Prev by Date: Re: uses of xor
- Next by Date: Re: Optimization Questions
- Previous by thread: Re: Optimization Questions
- Next by thread: Re: Optimization Questions
- Index(es):
Relevant Pages
|