Re: SSE2-Sort within a register

From: Gerd Isenberg (spamtrap_at_crayne.org)
Date: 01/11/05


Date: Tue, 11 Jan 2005 18:54:30 +0000 (UTC)


Matt wrote:
> "Gerd Isenberg" <spamtrap@crayne.org> wrote in message
> news:1105348184.233831.145150@c13g2000cwb.googlegroups.com...
> > Hi Matt,
> >
> > thanks for pointing out the gp-code. Yes, SSE2 is almost too slow -
> > OTOH one saves the gp-ressources by using xmm registers - and there
is
> > some hope with future 128-bit ALUS.
>
> Yes, but once those come along we will be writing x86-64 code. Not
only does
> it have twice as many regs, but the sort will be even faster since
one can
> manipulate 4 shorts at a time.

Agreed. I have a 128-bit SIMD or bitboard[2] base class for the memory
layout and two derivations, one with SSE2-intrinsics and one in plain
C++. Both derivations have most operators overloaded.
For instance i can write kogge stone parallel prefix fills in C++ with
template type <XMM> or <GP>, controlling the register incarnations of
local variables. Parameter passing is done via source and target
pointers. One can mix different disjoint direction fills with different
register files. The msc beta for AMD64 and probably other compilers as
well have very nice scheduling abilities to mix instructions from
disjoint inlined routines. So one has something like one double
dispatch XMM- instruction each 2-6 GP-instructions. Of course this is
all branchless stuff.

>
> > Even if there is no practical application to sort eg. moves inside
my
> > chess program, there are probably some other practical applications
to
> > use SSE2, harder to beat with gp.
>
> Yes, but very few. :-(
>
> Even the various 3DNow! and MMX bitboard scanning routines that you
had
> proposed, though clever, ended up being slower. This is of course in
part
> because it is slow getting data in/out of the SIMD registers, and
SIMD
> registers are particularly difficult to work with when you may need
to
> branch on the data. The biggest factor, I think, is that timings on
the GP
> set are very tight: I can execute 3 ops/cycle with a latency of 1
cycle for
> nearly all important ops (multiplies being the only exception, and
still
> they are quite fast). P4 makes most MMX/SSE2 instructions
ridiculously
> expensive; Athlon is much better, but still the latency is at best 2
cycles.
> Worse still, the MMX/SSE2 instruction sets are so unorthogonal as to
> exaggurate these latencies: where are pminsd and pmaxsd? What
happened to
> pminus and pmaxus?

Yes, PSLLW, PSLLD and PSLLQ work bitwise with possible second register
operand. PSLLDQ works bytewise and only with immediate operand - and so
on...

I guess G5 Altivec is more orthogonal.

>
> > One is to pick the best move out of a move list with up to 64
scored
> > moves. Moves and their scores are kept in separate arrays. Scores
are
> > [-512..+511] with 6 lsb containing the index itself.
>
> Wouldn't this be the purpose of your bubble sort?

No - since you can expect a beta cut after picking and trying a move,
in a lot of nodes, presorting all moves is a waste of time.

>I suppose MMX would win
> this since it can perform 8 max(a,b) computations per cycle. In 18
cycles
> you have it narrowed down to 4 choices. At that point it is probably
cheaper
> to transfer the results to GP regs for the final computation. GP regs
can do
> 1 max(a,b) in 2 cycles, so the remainder can be computed in less than
6
> cycles. MMX would require 8 cycles for the remainder.
>
> > Other samples are dot-products of bit/bool[64]*byte[64] or
> > bit/bool[64]*short[64]. It is very neat to expand or sign extend 64
> > bits, a bitboard, to 64-Bytes in four xmm-registers.
> > Multiplication with -1|0 Bytes is then performed by simple "and"
mask -
> > followed by some vertical and horizontal (PSADBW) adds.
>
> Interesting. This would be more difficult to replicate with GP regs
and
> would obviously require a significant amount of memory access. The
> throughput would be just under 1.5 bit->byte expansions per cycle
> (shl/setc/sets/store for 2 bits at a time -- all fully overlappable)
for a
> total latency of about 50 cycles. A single sadbw is somewhere in the
range
> of 5-8 cycles, so GP regs would at least give MMX/SSE2 a run for
their
> money.
>
> -Matt

The strange thing is that often using twice as much MMX-Instructions
and MMX-registers with much more code (not exactly the doubled size,
due to one prefix byte less) is faster than SSE2 on amd64. That
indicates that there is something to improve with SSE in the future ;-)

Btw. in Hans de Vries paper "Understanding the detailed Architecture
of AMD's 64 bit Core" it sounds a bit more optimistic:

-----------------
http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

1.4 128 bit SSE(2) instructions are split into Doubles
Appendix C of Opteron's Optimization Guide specifies to which class
each and every instruction belongs. Most 128 bit SSE and SSE2
instructions are implemented as double dispatch instructions. Only
those that can not be split into two independent 64 bit operations are
handled as Vector Path (Micro Code) instructions. Those SSE2
instructions that operate on only one half of a 128 bit register are
implemented as a single (Direct Path) instruction.

There are both advantages and disadvantages performance-wise here. A
disadvantage may be that the decode rate of 128 bit SSE2 instructions
is limited to 1.5 per cycle. In general however this not a performance
limiter because the maximum throughput is limited by the FP units and
the retirement hardware to a single 128 bit SSE instruction per cycle.
More important is the extra cycle latency that a Pentium 4 style
implementation would bring is avoided.

-----------------

Gerd



Relevant Pages

  • Re: Optimization Questions
    ... cycles you'd save would be more than offset by the cycles you'd burn ... instructions go through port 0 and port 1. ... a 16-bit register, writing one afterwards will be fast. ... Pre-read the value in EAX ...
    (comp.lang.asm.x86)
  • Re: Lies, damn lies and benchmarks
    ... When running using just the 16-bit registers, ... extra cycles when run on the 386 over the 286 (these were mostly system ... instructions which didn't get run too often anyways), ... The FPU was another story, the 287 FPU was usually run at an asynchronous ...
    (comp.security.misc)
  • Re: Microcontroller ... which one ??
    ... > RISC is not a language. ... > instructions on the AVR are one cycle. ... Also, the PIC has a single working register, ... the PIC needs to load them into the W register (four cycles) ...
    (sci.electronics.design)
  • Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes
    ... cycles to the bus. ... LOCK slowness is not because of the bus. ... maybe 150-200 regular pipelined, superscalar instructions. ...
    (Linux-Kernel)
  • Re: Adjusting PC Hyperthreading for Spice Simulation
    ... ago), 350 CPU cycles for a code cache miss was not atypical, but RAM ... delay in which a sequence of instructions totalling 100 cycles could be ... and others) support speculative execution and out of order execution ...
    (sci.electronics.design)