SSE2 Gem - bit[64]*byte[64]

From: Gerd Isenberg (spamtrap_at_crayne.org)
Date: 01/10/05


Date: Mon, 10 Jan 2005 20:45:25 +0000 (UTC)

Hi all,

this routine is another small SSE2 Gem, i wan't to share and probably
to discuss here. A kind of dot-product with two 64 element vectors -
one boolean and a byte vector. Like this C-code:

for (sum=0, i=0; i < 64; i++)
  if (boolx[i]) sum += weight[i];

In bitboard based chess programs as well as other 8*8 board games it
might be usefull to calculate some square weighted mobility with some
attack bitboards and appropriate weight arrays.

The routine expands or "sign extends" each bit of a bitboard, an
unsigned __int64 with msc, to one byte each, 64*8 == 512 bit, 4 xxm or
8 mmx registers. This is done by a shuffle sequence with punpckxxx
instructions to make eight copies of each byte in the right order.
Eight consecutive eight byte copies are masked with 0x8040201008040201
to get the individual bits. A parallel byte compare not equal zero
expands the bits to bytes.

The boolean multiply is now performed by anding with the 64 bytes of
the weight vector. Last step is vertical and horizontal add (or better
vice versa). Since in my application single weights are usually in a
0..32 range and rarely hit 63, it safes some cycles to perform the
vertical add before the horizontal one with only one psadbw instead of
four.

About 5/6 cycles faster is a 90 degree rotated vector product. Only
one punpcklbw and three movdqa to make the copies, but (not) anding
the bits a bit different.

For a let say kind of "interactive" forum developing i want to mention
few nemes from Computer Chess Club: Tony Werten asked the initial
question, Bob Hyatt told something about vector instructions on a
Cray. I did the initial SSE2 implementation with 90 degree rotated
weights and Anthony Cozzy came up with the idea of using the unpack
and interleave instructions.

Probably Matt or others are able to perform better with GP-registers
than the 38 cycles i got on my amd64 box. There is a faster
MMX-Version - but MMX as well as x87 is taboo unter windows for amd64
- and GP. Is it so expensive to save/restore the additional few
registers during context switch?

Gerd

/*
  dotProduct for 0 <= weights <= 63

  paddusb xmm0, xmm1 ; add all bytes (with saturation)
  paddusb xmm0, xmm2 ; saturation is used
  paddusb xmm0, xmm3 ; to avoid "accidental" overflows
  psadbw xmm0, xmm4 ; horizontal add 2 * 8 bytes

  for 0 <= weights <= 255
  use seven instead four add instructions

  psadbw xmm0, xmm4
  psadbw xmm1, xmm4
  psadbw xmm2, xmm4
  psadbw xmm3, xmm4
  paddw xmm0, xmm1
  paddw xmm0, xmm2
  paddw xmm0, xmm3

*/

typedef unsigned __int64 BitBoard;
typedef unsigned char BYTE;
#define XMM_ALIGN __declspec(align(16))

int dotProduct(BitBoard bb, BYTE weights[] /* XMM_ALIGN */)
{
 static const BitBoard XMM_ALIGN bits[2] =
    {0x8040201008040201,0x8040201008040201};
 __asm
 {
  movq xmm0, [bb] ; 00000000000000008040201008040201
  movdqa xmm5, [bits]
  punpcklbw xmm0, xmm0 ; 80804040202010100808040402020101
  pxor xmm4, xmm4 ; zero
  mov eax, [weights]
  movdqa xmm2, xmm0
  punpcklwd xmm0, xmm0 ; 08080808040404040202020201010101
  punpckhwd xmm2, xmm2 ; 80808080404040402020202010101010
  movdqa xmm1, xmm0
  movdqa xmm3, xmm2
  punpckldq xmm0, xmm0 ; 02020202020202020101010101010101
  punpckhdq xmm1, xmm1 ; 08080808080808080404040404040404
  punpckldq xmm2, xmm2 ; 20202020202020201010101010101010
  punpckhdq xmm3, xmm3 ; 80808080808080804040404040404040
  pandn xmm0, xmm5 ; mask the bits
  pandn xmm1, xmm5
  pandn xmm2, xmm5
  pandn xmm3, xmm5
  pcmpeqb xmm0, xmm4 ; extend bits to bytes
  pcmpeqb xmm1, xmm4
  pcmpeqb xmm2, xmm4
  pcmpeqb xmm3, xmm4
  pand xmm0, [eax+0*16] ; multiply by "and" with -1 or 0
  pand xmm1, [eax+1*16]
  pand xmm2, [eax+2*16]
  pand xmm3, [eax+3*16]
  paddusb xmm0, xmm1 ; add all bytes (with saturation)
  paddusb xmm0, xmm2
  paddusb xmm0, xmm3
  psadbw xmm0, xmm4 ; horizontal add 2 * 8 byte
  pextrw edx, xmm0, 4 ; extract both intermediate sums to gp
  pextrw eax, xmm0, 0
  add eax, edx ; final add
 }
}