Branch-less, loop-less "Move" implementations for MMs



Anybody interested in starting a sub-challenge for a "Move" implementation
dedicated to memory managers, that would have neither branches nor loops?

I'm thinking of having code generated at runtime (depending on per-CPU
and per-size-range templates) to assemble procedures that would be
specialized in copying exactly 16, 32, 48, etc. bytes, kinda like

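; eax = source, edx = destination, both 16-byte aligned (movaps faults otherwise)
Move16:
	movaps xmm0, [eax+0]	; load 16 bytes
	movaps [edx+0], xmm0	; store 16 bytes
	ret

Move32:
	movaps xmm0, [eax+0]	; issue all loads first...
	movaps xmm1, [eax+16]
	movaps [edx+0], xmm0	; ...then all stores
	movaps [edx+16], xmm1
	ret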
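To make the "generated at runtime" part a bit more concrete, here is a rough sketch in C of a generator that emits the machine code bytes for such a routine into a buffer. It is only an illustration under assumptions: 32-bit x86, source in eax, destination in edx, sizes that are multiples of 16, and the plain movaps pattern from above (no per-CPU templates). The names emit_modrm and emit_move are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { REG_EAX = 0, REG_EDX = 2 };  /* 32-bit register numbers used in ModRM */

/* Emit ModRM (+ displacement) for "xmm<reg>, [base + off]".
 * Only valid for base = eax or edx (no SIB byte, no ebp special case). */
static size_t emit_modrm(uint8_t *p, unsigned reg, unsigned base, uint32_t off)
{
    size_t n = 0;
    if (off == 0) {                       /* mod=00: no displacement */
        p[n++] = (uint8_t)(0x00 | (reg << 3) | base);
    } else if (off < 128) {               /* mod=01: 8-bit displacement */
        p[n++] = (uint8_t)(0x40 | (reg << 3) | base);
        p[n++] = (uint8_t)off;
    } else {                              /* mod=10: 32-bit displacement */
        p[n++] = (uint8_t)(0x80 | (reg << 3) | base);
        for (int i = 0; i < 4; i++)
            p[n++] = (uint8_t)(off >> (8 * i));
    }
    return n;
}

/* Emit a loop-less, branch-less routine copying exactly `size` bytes.
 * Returns the number of code bytes written, or 0 if `cap` is too small. */
static size_t emit_move(uint8_t *buf, size_t cap, uint32_t size)
{
    assert(size > 0 && size % 16 == 0);

    /* Worst case: 7 bytes per movaps (load + store per 16 bytes) plus ret. */
    if ((size_t)(size / 16) * 2 * 7 + 1 > cap)
        return 0;

    size_t n = 0;
    /* Work in batches of up to 8 registers (xmm0..xmm7): loads, then stores. */
    for (uint32_t batch = 0; batch < size; batch += 128) {
        unsigned regs = (size - batch) / 16;
        if (regs > 8)
            regs = 8;
        for (unsigned r = 0; r < regs; r++) {   /* movaps xmm<r>, [eax+off] */
            buf[n++] = 0x0F;
            buf[n++] = 0x28;
            n += emit_modrm(buf + n, r, REG_EAX, batch + 16 * r);
        }
        for (unsigned r = 0; r < regs; r++) {   /* movaps [edx+off], xmm<r> */
            buf[n++] = 0x0F;
            buf[n++] = 0x29;
            n += emit_modrm(buf + n, r, REG_EDX, batch + 16 * r);
        }
    }
    buf[n++] = 0xC3;                            /* ret */
    return n;
}
```

Calling emit_move(buf, sizeof buf, 32) should produce the same instruction sequence as Move32 above. To actually run the result, the bytes would still have to be copied to executable memory at a suitably aligned address, which is not shown here.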

For all the MMs with fixed-size blocks arranged in pages/sheets,
the relevant Move could be assembled (at an adequately aligned address)
and then referenced directly from the page management record (via an
indirect call, which AFAIK is correctly pipelined).
The templates need not be limited to 16-byte alignment; they could also
cover 8- or 4-byte alignment for the MMs that use these, and would
essentially be targeted at small transfers (loop overhead being
negligible when you copy around thousands of kB).
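As a sketch of the page-record idea, here is roughly what the dispatch could look like in C. This is an illustration only: the fixed-size memcpy routines are stand-ins for the runtime-generated procedures, and page_record, pick_move, make_page and relocate_block are hypothetical names, not taken from any existing MM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Signature of a size-specialized move: the size is implied by the routine. */
typedef void (*move_fn)(const void *src, void *dst);

/* Stand-ins for generated code; a constant-size memcpy has no runtime loop
 * count to test, and compilers typically expand it inline. */
static void move16(const void *src, void *dst) { memcpy(dst, src, 16); }
static void move32(const void *src, void *dst) { memcpy(dst, src, 32); }
static void move48(const void *src, void *dst) { memcpy(dst, src, 48); }

/* Hypothetical page management record for a page of fixed-size blocks. */
typedef struct {
    size_t  block_size;  /* size of every block in this page */
    move_fn move;        /* routine specialized for block_size */
} page_record;

/* Chosen once, when the page is set up; never on the copy path. */
static move_fn pick_move(size_t block_size)
{
    switch (block_size) {
    case 16: return move16;
    case 32: return move32;
    case 48: return move48;
    default: return NULL;  /* no specialized routine for this size */
    }
}

static page_record make_page(size_t block_size)
{
    page_record p = { block_size, pick_move(block_size) };
    return p;
}

/* Copy path: a single indirect call, no size test and no loop. */
static void relocate_block(const page_record *page, const void *src, void *dst)
{
    assert(page->move != NULL);
    page->move(src, dst);
}
```

The only size-dependent decision is made in make_page; after that, relocate_block just calls through the pointer stored in the record.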

The benefit would be "optimal" moves with no loop/branch overhead,
a smaller codebase (compared to manually unrolling the moves),
and call points with guaranteed alignment.
The various segregated-block MMs each have a rather limited variety of
block sizes, yet those sizes vary across MMs and change when tweaking an MM,
so generating the code automatically could be helpful.

Work would thus essentially focus on identifying the most efficient
instruction patterns for each transfer size/CPU combination.

Eric