Re: COMPARE HLL/ASM




Wannabee wrote:

OK, here is the whole story now:

It's not optimised at all and there might be faster algos too,
but it contains not a single branch and uses only four GP-regs,
so its timing isn't value dependent.

I got ~38 cycles on AMD K7 under KESYS (cache aligned, prefetched),
and an average of ~150 cycles on AMD64 with XP-home.
It may be a bit slower on Intel CPUs because of the shifts.

As I just saw in the other post, you meant a 64-bit result,
so I expanded my first version to 128->64 bits; it now shows
85 cycles on the KESYS-K7 and ~300 with windoze.
Seems this M$-stuff has heavy cache issues ;)

So I also tried a short code version and, what a surprise, it
again takes ~150 cycles per pass with windoze (~95 with mine).
I think we are mostly measuring cache miss penalties, and the time
our code takes is a minor factor on windoze.
Hope you can get better figures with Linux.

I timed the first one to 152 cycles on win2000.
The second one is 500+ cycles here... (K7)
Is there a faster way to "pack" those BCD
numbers? Are there MMX or SSE instructions that
do it any faster? Like, for instance, convert the whole
thing at once? (I was looking, but so far I can't find the ones I want,
though it seems there are instructions dealing with "(un)packed" BCD..?)

I've seen combinations of:
MASKMOVDQU PACKUSBW PMINUB/PMAXUB PSADBW (a bit of a detour, so no gain)

What I found really usable from SSE/XMM was the PADD..PXOR group,
but I too miss a PSHUFUB instruction.
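
For the plain GP-register case, the shift-and-mask way of packing BCD
can be sketched in C as below. This is only an illustration, not the
code timed in this thread: the names pack_bcd8, unpack_bcd8 and
bcd_selfcheck, and the layout (8 digits, one per byte, least
significant digit in byte 0), are assumptions. Like the asm version it
has no branches, so the work done doesn't depend on the digit values.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 8 unpacked BCD digits, one per byte, digit 0 in byte 0. */
uint32_t pack_bcd8(uint64_t x)
{
    x &= 0x0F0F0F0F0F0F0F0FULL;                    /* keep the low nibble of each byte */
    x = (x | (x >> 4))  & 0x00FF00FF00FF00FFULL;   /* 2 digits -> 1 byte   */
    x = (x | (x >> 8))  & 0x0000FFFF0000FFFFULL;   /* 2 bytes  -> 1 word   */
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFULL;   /* 2 words  -> 1 dword  */
    return (uint32_t)x;
}

/* The reverse direction: spread 8 packed nibbles out to one digit per byte. */
uint64_t unpack_bcd8(uint32_t p)
{
    uint64_t x = p;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    return x;
}

/* Small self-check: digits 1..8 pack to 0x12345678 and round-trip. */
int bcd_selfcheck(void)
{
    assert(pack_bcd8(0x0102030405060708ULL) == 0x12345678u);
    assert(unpack_bcd8(0x12345678u) == 0x0102030405060708ULL);
    return 1;
}
```

Each direction is three shift/or/mask steps on one 64-bit value, so a
compiler should turn it into straight-line code; how that compares in
cycles with the figures above is untested.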

Anyway, thanks for the code, I'll keep it and try to learn from it.

:) be aware that these code examples are just quickly typed hacks ...
I actually just reversed the functionality of 'our' 48 Cycle Test.

__
wolfgang


