Re: from elsewhere, an assembler



On Tue, 10 Apr 2007 23:44:37 +0200, Frank Kotler <fbkotler@xxxxxxxxxxx> wrote:

>> This needs about 45 cycles (incl BSWAP) on AMD,
>> but is quite long (128 bytes).
>> I'm curious how long it takes an Intel for it.
>
> P4, in particular, has a reputation of being "really bad" on shifts. I think of myself as an "AMD guy", but I'm running a P4 right now. I haven't done any "timing" on it - haven't even confirmed the weird results Herbert reported. I'll try to "get to it" (if the spirit moves me). I have an idea it won't be good. May need a conditional jump - "if Intel, call the other function"...

Is the RosAsm hexprint 5 times faster than Wolfgang's code? :)

I clocked Wolfgang's code at between 666 and 777 cycles, with some variation, earlier today (over 800 now).

Hexprint comes in at somewhat above 100 cycles, 145 or thereabouts.

I called hexprint like this:

Betov_Hex:
mov ebx eax
mov ecx 8, edi HexPrintString | add edi 7
std
Do
mov al bl | and al 0F | add al '0'
On al > '9', add al 7
stosb | shr ebx 4
Do_Loop
cld
ret

Note that this one addresses memory, whereas Wolfgang's routine uses only registers.
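For readers who do not use the RosAsm macros, here is a rough C model of what the loop above appears to do. It is only a sketch: the name hex32_loop and the caller-supplied buffer are my own assumptions (the original writes into HexPrintString), and I am assuming the Do / Do_Loop pair runs the body 8 times.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the per-nibble loop: fills buf[0..7] with the
   uppercase hex digits of value, writing from the last position backwards
   (the assembly does this with std + stosb). No terminator is written. */
static void hex32_loop(uint32_t value, char buf[8])
{
    assert(buf != NULL);
    for (int i = 7; i >= 0; i--) {
        char c = (char)((value & 0x0Fu) + '0'); /* and al 0F | add al '0' */
        if (c > '9')
            c += 7;                             /* skip ':'..'@' to reach 'A' */
        buf[i] = c;                             /* stosb, direction flag set */
        value >>= 4;                            /* shr ebx 4 */
    }
}
```

One nibble is handled per iteration, so there are eight dependent byte stores and eight conditional branches per call, which is the part a parallel version tries to avoid.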

Now for the disturbing news (to me at least).
If I put Wolfgang's code in front of my test code, a few bytes ahead, it clocks 272 cycles, but if I place it in another TITLE, at a much, much lower address, it clocks in at 800+ cycles.

If I do the same with Betov_Hex, I get 598 cycles at the much lower address and 145 cycles if it is immediately ahead in the code.

I guess this is because of the cache?
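One way to check the cache guess is to time the same routine several times in a row and compare the first (possibly cold) call with the best of the later ones. The following C sketch shows the idea. Everything in it is an assumption for illustration: the names ticks, time_calls and sample_work are made up, it relies on the GCC/Clang __rdtsc() intrinsic on x86 (with a coarse timespec_get fallback elsewhere), and unlike the test code in this post it does not serialize with CPUID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* Read the time stamp counter (not serializing, so small jitter remains). */
static uint64_t ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for non-x86 builds: nanoseconds, far coarser than rdtsc. */
static uint64_t ticks(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

struct timing {
    uint64_t first; /* first call: code and data may not be cached yet */
    uint64_t best;  /* minimum over all calls: the warm figure */
};

/* Call fn() runs times and record the first and the smallest tick count. */
static struct timing time_calls(void (*fn)(void), int runs)
{
    assert(fn != NULL && runs > 0);
    struct timing t = { 0, UINT64_MAX };
    for (int i = 0; i < runs; i++) {
        uint64_t start = ticks();
        fn();
        uint64_t elapsed = ticks() - start;
        if (i == 0)
            t.first = elapsed;
        if (elapsed < t.best)
            t.best = elapsed;
    }
    return t;
}

/* Placeholder workload; substitute the routine being measured. */
static volatile uint32_t sink;
static void sample_work(void) { sink = sink * 33u + 7u; }
```

If the first figure is large and the best figure is about the same wherever the routine is placed, the difference is most likely a cold cache or TLB on the first call and not the code itself.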

Anyway, the Betov hexprint is faster :)
And that one I can read, understand and reuse in two seconds,
whereas Wolfgang's I had to step through in the debugger several times,
and I am still not sure I get it.
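For anyone else stepping through it, the arithmetic in Wolfgang's routine can be modelled in C roughly as below. This is my reading of the listing and not Wolfgang's own description; the function names are invented, and the nibble spreading is written as a simple loop, where the original uses an unrolled shld/shl chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Spread the four nibbles of v into four bytes: 0xABCD -> 0x0A0B0C0D.
   The assembly does this per half with the shld/shl sequence. */
static uint32_t spread_nibbles(uint16_t v)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        out = (out << 8) | ((uint32_t)(v >> 12) & 0xFu); /* take top nibble */
        v = (uint16_t)(v << 4);
    }
    return out;
}

/* Convert four packed nibbles (one per byte, 0..15) to four ASCII digits. */
static uint32_t nibbles_to_ascii(uint32_t n)
{
    uint32_t t = n + 0x06060606u; /* bytes holding 10..15 now have bit 4 set */
    t &= 0x10101010u;             /* keep only that flag */
    t >>= 4;                      /* 1 in every byte that needs a letter */
    t *= 7u;                      /* 7 = 'A' - '9' - 1 */
    return t + n + 0x30303030u;   /* add '0' to every byte */
}

/* Write the 8 uppercase hex digits of value into out (no terminator). */
static void hex32_parallel(uint32_t value, char out[8])
{
    assert(out != NULL);
    uint32_t hi = nibbles_to_ascii(spread_nibbles((uint16_t)(value >> 16)));
    uint32_t lo = nibbles_to_ascii(spread_nibbles((uint16_t)(value & 0xFFFFu)));
    /* The most significant digit sits in the top byte, so emit bytes from
       high to low; the assembly gets the same memory order with bswap. */
    for (int i = 0; i < 4; i++) {
        out[i]     = (char)(hi >> (24 - 8 * i));
        out[4 + i] = (char)(lo >> (24 - 8 * i));
    }
}
```

The trick is that no byte can carry into its neighbour: the largest intermediate value per byte is 0x0F + 0x06 = 0x15, and the final sum is at most 0x46 ('F'), so four digits are converted with a handful of 32-bit operations and no branches.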

The same thing happens when I place the routines at lower addresses just before the test code (code posted below), but to a lesser degree: I now get 602 cycles for Wolfgang's code
and 374 for Betov's hexprint.


Below is the complete code used in the timings, except for the GUI code.
This code runs in user mode at realtime priority, as the result of
clicking a menu item.


Listed first are the two routines at lower addresses,
then the test routine,
then the same two routines at higher addresses.

For the 800+ cycle results, remember they use _much_ lower addresses.


Betov_Hex2:
mov ebx eax
mov ecx 8, edi HexPrintString | add edi 7
std
Do
mov al bl | and al 0F | add al '0'
On al > '9', add al 7
stosb | shr ebx 4
Do_Loop
cld
ret

WolfGang_BinToAscci2:
______________________________
;eax [bin] to edx:eax [HEX-ascii]:
; it uses only four registers and no memory
xor edx,edx
xor ebx,ebx
; expand nibbles to bytes:
shld edx,eax,4
shl eax,4
shl edx,4
shld edx,eax,4
shl eax,4
shl edx,4
shld edx,eax,4
shl eax,4
shl edx,4
shld edx,eax,4
shl eax,4
shld ebx,eax,4
shl eax,4
shl ebx,4
shld ebx,eax,4
shl eax,4
shl ebx,4
shld ebx,eax,4
shl eax,4
shl ebx,4
shld ebx,eax,4
shl eax,4
;copy:
mov ecx,ebx
mov eax,edx
;the algo:
add eax,06060606h
add ecx,06060606h
and eax,10101010h
and ecx,10101010h
shr eax,4
shr ecx,4
imul eax,07h
imul ecx,07h
lea eax,D$eax+edx+30303030h
lea edx,D$ecx+ebx+30303030h
;done, but the byte order may be wrong for you, so I add:
bswap eax
bswap edx
ret

;;
This is the test/timing code
;;

[TestVariable: ? ? ?]
TestCode:
push edi
CPUID | rdtsc | push eax edx
mov eax 0-1
;call WolfGang_BinToAscci
;call WolfGang_BinToAscci2
call Betov_Hex
;call Betov_Hex2
rdtsc | pop ecx ebx
sub eax ebx
sbb edx ecx
int 3
pop edi
ret

Betov_Hex:
mov ebx eax
mov ecx 8, edi HexPrintString | add edi 7
std
Do
mov al bl | and al 0F | add al '0'
On al > '9', add al 7
stosb | shr ebx 4
Do_Loop
cld
ret

WolfGang_BinToAscci:
______________________________
;eax [bin] to edx:eax [HEX-ascii]:
; it uses only four registers and no memory
xor edx,edx
xor ebx,ebx
; expand nibbles to bytes:
shld edx,eax,4
shl eax,4
shl edx,4
shld edx,eax,4
shl eax,4
shl edx,4
shld edx,eax,4
shl eax,4
shl edx,4
shld edx,eax,4
shl eax,4
shld ebx,eax,4
shl eax,4
shl ebx,4
shld ebx,eax,4
shl eax,4
shl ebx,4
shld ebx,eax,4
shl eax,4
shl ebx,4
shld ebx,eax,4
shl eax,4
;copy:
mov ecx,ebx
mov eax,edx
;the algo:
add eax,06060606h
add ecx,06060606h
and eax,10101010h
and ecx,10101010h
shr eax,4
shr ecx,4
imul eax,07h
imul ecx,07h
lea eax,D$eax+edx+30303030h
lea edx,D$ecx+ebx+30303030h
;done, but the byte order may be wrong for you, so I add:
bswap eax
bswap edx
ret



Best,
Frank


