Re: SSE2-Sort within a register
From: Gerd Isenberg (spamtrap_at_crayne.org)
Date: Tue, 11 Jan 2005 18:54:30 +0000 (UTC)
> "Gerd Isenberg" <firstname.lastname@example.org> wrote in message
> > Hi Matt,
> > thanks for pointing out the gp-code. Yes, SSE2 is almost too slow -
> > OTOH one saves the gp-resources by using xmm registers - and there
> > is some hope with future 128-bit ALUs.
> Yes, but once those come along we will be writing x86-64 code. Not
> only will it have twice as many regs, but the sort will be even faster
> since we can manipulate 4 shorts at a time.
Agreed. I have a 128-bit SIMD or bitboard base class for the memory
layout and two derivations, one with SSE2 intrinsics and one in plain
C++. Both derivations have most operators overloaded.
For instance I can write Kogge-Stone parallel prefix fills in C++ with
template type <XMM> or <GP>, controlling the register incarnations of
local variables. Parameter passing is done via source and target
pointers. One can mix disjoint direction fills with different register
files. The MSC beta for AMD64, and probably other compilers as well,
has very nice scheduling abilities to mix instructions from disjoint
inlined routines. So one gets something like one double-dispatch XMM
instruction per 2-6 GP instructions. Of course this is all branchless
stuff.
> > Even if there is no practical application to sort e.g. moves inside
> > a chess program, there are probably some other practical applications
> > that use SSE2 and are harder to beat with gp.
> Yes, but very few. :-(
> Even the various 3DNow! and MMX bitboard scanning routines that you
> proposed, though clever, ended up being slower. This is of course in
> part because it is slow getting data in/out of the SIMD registers, and
> SIMD registers are particularly difficult to work with when you may need
> to branch on the data. The biggest factor, I think, is that timings on
> the GP instruction set are very tight: I can execute 3 ops/cycle with a
> latency of 1 for nearly all important ops (multiplies being the only
> exception, and even they are quite fast). P4 makes most MMX/SSE2
> instructions expensive; Athlon is much better, but still the latency is
> at best 2. Worse still, the MMX/SSE2 instruction sets are so
> unorthogonal as to exaggerate these latencies: where are pminsd and
> pmaxsd? What about pminus and pmaxus?
Yes, PSLLW, PSLLD and PSLLQ work bitwise with a possible second register
operand; PSLLDQ works bytewise and only with an immediate operand - so
I guess the G5's AltiVec is more orthogonal.
> > One is to pick the best move out of a move list with up to 64 moves.
> > Moves and their scores are kept in separate arrays. Scores are in
> > [-512..+511] with the 6 lsb containing the index itself.
> Wouldn't this be the purpose of your bubble sort?
No - since in a lot of nodes you can expect a beta cutoff after picking
and trying one move, presorting all moves is a waste of time.
> I suppose MMX would win this since it can perform 8 max(a,b)
> computations per cycle. In 18 cycles you have it narrowed down to 4
> choices. At that point it is probably better to transfer the results to
> GP regs for the final computation. GP regs do 1 max(a,b) in 2 cycles,
> so the remainder can be computed in less time than the 8 cycles MMX
> would require for it.
> > Other samples are dot-products of bit/bool*byte or bit/bool*short.
> > It is very neat to expand or sign extend 64 bits, a bitboard, to 64
> > bytes in four xmm registers. Multiplication with -1|0 bytes is then
> > performed by a simple "and" followed by some vertical and horizontal
> > (PSADBW) adds.
> Interesting. This would be more difficult to replicate with GP regs and
> would obviously require a significant amount of memory access. The
> throughput would be just under 1.5 bit->byte expansions per cycle
> (shl/setc/sets/store for 2 bits at a time -- all fully overlappable)
> with a total latency of about 50 cycles. A single sadbw is somewhere in
> the range of 5-8 cycles, so GP regs would at least give MMX/SSE2 a run
> for their money.
The strange thing is that often using twice as many MMX instructions
and MMX registers, with much more code (not exactly double the size,
due to one prefix byte less per instruction), is faster than SSE2 on
AMD64. That indicates there is something to improve with SSE in the
future ;-)
Btw. in Hans de Vries' paper "Understanding the detailed Architecture
of AMD's 64 bit Core" it sounds a bit more optimistic:
1.4 128 bit SSE(2) instructions are split into Doubles
Appendix C of Opteron's Optimization Guide specifies to which class
each and every instruction belongs. Most 128 bit SSE and SSE2
instructions are implemented as double dispatch instructions. Only
those that can not be split into two independent 64 bit operations are
handled as Vector Path (Micro Code) instructions. Those SSE2
instructions that operate on only one half of a 128 bit register are
implemented as a single (Direct Path) instruction.
There are both advantages and disadvantages performance-wise here. A
disadvantage may be that the decode rate of 128 bit SSE2 instructions
is limited to 1.5 per cycle. In general, however, this is not a
performance
limiter because the maximum throughput is limited by the FP units and
the retirement hardware to a single 128 bit SSE instruction per cycle.
More importantly, the extra cycle of latency that a Pentium 4 style
implementation would bring is avoided.