Re: from elsewhere, an assembler




"/\\o//\annabee" <Wannabee@xxxxxxxxxxxxxxx> wrote in news post
news:op.tqlfvllpmjj8u8@xxxxxxxxxxxxxxx
On Tue, 10 Apr 2007 23:44:37 +0200, Frank Kotler
<fbkotler@xxxxxxxxxxx> wrote:

This needs about 45 cycles (incl BSWAP) on AMD,
but is quite long (128 bytes).
I'm curious how long it takes on an Intel.

P4, in particular, has a reputation of being "really bad" on shifts. I
think of myself as an "AMD guy", but I'm running a P4 right now. I
haven't done any "timing" on it - haven't even confirmed the weird
results Herbert reported. I'll try to "get to it" (if the spirit moves
me). I have an idea it won't be good. May need a conditional jump - "if
Intel, call the other function"...

RosAsm hexprint is 5 times faster than Wolfgang's code :) ?

I clocked Wolfgang's at between 666 and 777 cycles, with variations (earlier
today)

(>800) now.

Hexprint at somewhat above 100 cycles. 145 or thereabouts.

I called hexprint like this:

And I tried this a few minutes ago:
___________________________________
[STDH: 0]
[Time: 0 0]
[HexPrintString: B$ ' ']

main:
_____
;cli ;wont do any good on NT
CPUID |RDTSC |mov D$time eax |mov D$time+4 edx

____________;TEST-AREA insert your code under test here:
;best avoid calls in here or
Betov_Hex2:
mov eax 012345678
mov ebx eax
mov ecx 8 |mov edi HexPrintString | add edi 7
std
L0: mov al bl | and al 0F | add al 030
cmp al 03a | jc L1> |add al 7
L1: stosb | shr ebx 4
Loop L0<
cld
___________
push edx |push eax
RDTSC |sub eax D$time |sbb edx D$time+4 |mov D$time eax |mov D$time+4 edx
pop eax |pop edx
;sti
___________
int3
push 0 |jmp 'KERNEL32.ExitProcess'
______________________________________

This reproducibly needs 124 cycles here.
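
For anyone who doesn't read RosAsm syntax, the nibble loop under test can be sketched in C roughly like this. The name `hex8_loop` is made up for illustration, and the C version says nothing about cycle counts. It only mirrors the logic of the listing: mask the low nibble, add 030 hex, add 7 more for A..F, store backwards, shift right by 4.

```c
#include <stdint.h>

/* Hypothetical C rendering of the STOSB loop: writes 8 uppercase hex
   digits plus a terminating NUL into out. */
void hex8_loop(uint32_t value, char out[9]) {
    for (int i = 7; i >= 0; --i) {          /* fill from the last digit backwards, like STD + STOSB */
        unsigned char digit = (unsigned char)(value & 0x0F);
        digit += 0x30;                      /* '0'..'9' */
        if (digit >= 0x3A) {
            digit += 7;                     /* skip ahead to 'A'..'F' */
        }
        out[i] = (char)digit;
        value >>= 4;                        /* next nibble */
    }
    out[8] = '\0';
}
```

Called with 0x12345678 (the value loaded into eax in the listing) it fills the buffer with "12345678".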

Looks like you are just measuring windoze background noise.


Now for the disturbing news. (to me at least)
If I put Wolfgang's code in front of my test code, a few bytes ahead, it
clocks 272 cycles, but if I place it in another TITLE, at an address many,
many bytes lower, then it clocks in at 800+ cycles.

First, (cold) caches and misalignment may spoil the test.

I also tested your way, 'calling' the routines,
and surprise, surprise, I also got weird results, from 250 to 10000 cycles.
These are typical stack fetch penalties (and/or page-fault recovery).

So I added in front of the first RDTSC:
_________
[STDH: 0]
push 0-11 |call 'KERNEL32.GetStdHandle' |mov D$StdH eax
pushad
popad
_________
just to have some stack already 'in use'.

A more reliable comparison always comes from checking the code parts
directly, with windoze noise reduced to a minimum.
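
One common way to cut the noise, sketched here in C and not taken from the measurements in this thread: run the code once untimed so caches and stack are already touched, then time it several times and keep the smallest delta. All names (`min_ticks`, `clock_ticks`, `count_call`) are made up for illustration. The counter is passed in as a function pointer; `clock()` is only a coarse portable stand-in, and on x86 with GCC or Clang one would more likely pass a small wrapper around the `__rdtsc()` intrinsic from `<x86intrin.h>`.

```c
#include <stdint.h>
#include <time.h>

typedef uint64_t (*tick_fn)(void);   /* hypothetical counter source */
typedef void (*code_fn)(void);       /* code under test */

/* Portable stand-in counter; far coarser than RDTSC. */
uint64_t clock_ticks(void) {
    return (uint64_t)clock();
}

/* Runs code once untimed as a warm-up, then runs it `runs` more times
   and returns the smallest elapsed count seen. */
uint64_t min_ticks(tick_fn now, code_fn code, int runs) {
    uint64_t best = UINT64_MAX;
    code();                                  /* warm-up: fill caches, touch the stack */
    for (int i = 0; i < runs; ++i) {
        uint64_t start = now();
        code();
        uint64_t elapsed = now() - start;
        if (elapsed < best) {
            best = elapsed;                  /* keep the least disturbed run */
        }
    }
    return best;
}

/* Trivial example workload: just counts how often it was called. */
int call_count = 0;
void count_call(void) {
    ++call_count;
}
```

For example, `min_ticks(clock_ticks, count_call, 10)` calls the workload eleven times (one warm-up plus ten timed runs). The minimum is used on the assumption that interrupts and cache misses can only add cycles, never remove them.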

If I do the same with Betov_Hex I get 598 cycles if I place it at the
very much lower address, and 145 cycles if immediately ahead in the code.

I guess this is because of cache?

Yes.

Anyways, the Betov hexprint is :) faster.

No, this STOSB-loop takes 124 cycles (136 with call).
My solution needs 45 cycles (58 with call).

And that one I can read and understand and reuse in two seconds,
whereas Wolfgang's I had to step through in the debugger several times,
and I am still not sure I get it.

:)
The algo is easy (done for all 8 bytes):
add 06 ;the upper four bits are clear after the expansion anyway
and 010 ;this bit is set "if >=0a"
shr 4 ;move this bit to bit 0
mul 7 ;now we get either zero or seven
add ;previously saved value + "0 or 7" + '30'
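
A minimal C sketch of these five steps, assuming all 8 digits sit in one 64-bit value (the original presumably works on 32-bit registers, and its code is not shown in this post). The function name `hex8_swar`, the nibble-expansion masks and the output order are my own choices for illustration; only the add/and/shr/mul/add sequence comes from the description above.

```c
#include <stdint.h>

/* Hypothetical sketch: converts a 32-bit value to 8 uppercase hex digits
   by processing all 8 nibbles in parallel inside one 64-bit word. */
void hex8_swar(uint32_t value, char out[9]) {
    uint64_t x = value;

    /* Expansion (assumed): spread the 8 nibbles into 8 separate bytes. */
    x = ((x & 0x00000000FFFF0000ull) << 16) | (x & 0x000000000000FFFFull);
    x = ((x & 0x0000FF000000FF00ull) << 8)  | (x & 0x000000FF000000FFull);
    x = ((x & 0x00F000F000F000F0ull) << 4)  | (x & 0x000F000F000F000Full);

    uint64_t t = x + 0x0606060606060606ull;  /* add 06: nibbles >= 0A now carry into bit 4 */
    t &= 0x1010101010101010ull;              /* and 010: keep only that bit */
    t >>= 4;                                 /* shr 4: each byte is now 0 or 1 */
    t *= 7;                                  /* mul 7: each byte is now 0 or 7 */
    x = x + t + 0x3030303030303030ull;       /* add: saved nibbles + "0 or 7" + '30' */

    /* Most significant digit first; this plays the role of the BSWAP. */
    for (int i = 0; i < 8; ++i) {
        out[i] = (char)((x >> (56 - 8 * i)) & 0xFF);
    }
    out[8] = '\0';
}
```

No byte can overflow into its neighbour here: the largest intermediate value per byte is 0F + 7 + 30 = 46 hex ('F'), so the eight lanes stay independent. For 0xDEADBEEF the buffer ends up holding "DEADBEEF".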


The same thing happens when I place it at lower addresses, just before the
test code (posted code below), but to a lesser degree. I now get 602 cycles
for Wolfgang's code
and 374 for Betov's hexprint.

As above. Avoid measuring noise ;)

__
wolfgang


