Re: greatest multiple algorithm




"Bronson" <pshemu@xxxxxxxxx> wrote in message
news:1183066874.218662.298320@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hi,

i'm looking for a fast way to find greatest multiple of a, which is
smaller than b.
Recently i've used b-b%a, but maybe it could be done faster.


The answer to your question doesn't just depend on the algorithm. If the
cpu doesn't support modulo, division might be faster. For an x86 cpu, you
can use integer instructions to do division or modulo and/or floating point
instructions to do floating and integer division. It's possible that you
might find a faster algorithm using shifts: shr,shl,shrd,shrl. You might
also be able to code them using other instructions: mmx, xmm, sse, sse2,
etc. If you just use the integer or floating point instructions for x86,
there are at least three basic ways in x86 assembly to do (b/a)*a and one
for b-b%a. I've listed most of the required instructions needed to code a
routine for each method. The timings are for the pentium.

1) floating point floor(b/a)*a
FLD 1 cycle
FLD 1 cycle
FDIV 39 cycles
FRNDINT 9-20 cycles
FST 1-2 cycles
- at least 51 to 63 cycles

2) integer (b/a)*a - using x87 floating point unit
FILD 3/1 cycles
FILD 3/1 cycles
FIDIV 42 cycles
FIMUL 7/4 cycles
FIST 6 cycles
- at least 54 to 61 cycles

3) integer (b/a)*a - implicit truncation
MOV 1 cycle
MOV 1 cycle
DIV 40 cycles 32-bit
MUL 11 cycles
- at least 53 cycles

4) integer b-b%a - DIV performs modulo
MOV 1 cycle
MOV 1 cycle
DIV 40 cycles 32-bit
SUB 1 cycle
- at least 44 cycles


For these examples, the b-b%a is the faster on an x86 than (b/a)*a or
floor(b/a)*a just due to the instruction timings of the needed instructions.


Rod Pemberton

.



Relevant Pages

  • Re: Lies, damn lies and benchmarks
    ... When running using just the 16-bit registers, ... extra cycles when run on the 386 over the 286 (these were mostly system ... instructions which didn't get run too often anyways), ... The FPU was another story, the 287 FPU was usually run at an asynchronous ...
    (comp.security.misc)
  • Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes
    ... cycles to the bus. ... LOCK slowness is not because of the bus. ... maybe 150-200 regular pipelined, superscalar instructions. ...
    (Linux-Kernel)
  • Re: SSE2-Sort within a register
    ... register files. ... cycles. ... 128 bit SSEinstructions are split into Doubles ... Most 128 bit SSE and SSE2 ...
    (comp.lang.asm.x86)
  • Re: Adjusting PC Hyperthreading for Spice Simulation
    ... ago), 350 CPU cycles for a code cache miss was not atypical, but RAM ... delay in which a sequence of instructions totalling 100 cycles could be ... and others) support speculative execution and out of order execution ...
    (sci.electronics.design)
  • Re: hobby project - 16 bit digital audio mixer using m68k
    ... how many clock cycles are required by average instructions. ... I would suggest using some more modern processor requiring less ...
    (comp.arch.embedded)