Re: from elsewhere, an assembler
- From: "Wolfgang Kern" <nowhere@xxxxxxxxxxx>
- Date: Tue, 10 Apr 2007 14:53:24 +0200
Frank wrote:
...
[convert nibble to hex-ascii]
>> cmp al,10
>> jc +2
>> add al,7
>> add al,48
> cmp al, 10
> sbb al, 69h
> das
> Shorter, and eliminates the conditional jump... but "das" is so slow
> (how slow *is* it?), I don't think it's a "win"...
DAS latency is reported as 8 cycles in AMD-docs
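Intel docs describe what it does:
old_AL <- AL; old_CF <- CF; CF <- 0;
IF ((AL AND 0FH) > 9) OR (AF = 1)
THEN
AL <- AL - 6;
CF <- old_CF OR (borrow from AL <- AL - 6);
AF <- 1;
ELSE
AF <- 0;
FI;
IF (old_AL > 99H) OR (old_CF = 1)
THEN
AL <- AL - 60H;
CF <- 1;
FI;
You see it changes only AL and the flags; AH is left alone
(it is AAS that also decrements AH, which would spoil the game).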
DAS is an invalid instruction in 64-bit mode.
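A rough C model of the quoted cmp/sbb/das sequence may help to see why it lands on the right ASCII code. It assumes the DAS semantics documented by Intel and tracks only AL, CF and AF; the `Cpu` struct and all function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: only the state the three instructions touch. */
typedef struct {
    uint8_t al;
    int cf;   /* carry flag */
    int af;   /* auxiliary (half) carry flag */
} Cpu;

/* cmp al, imm: only CF matters here (set when al < imm, unsigned). */
static void cmp_al(Cpu *c, uint8_t imm) {
    c->cf = c->al < imm;
}

/* sbb al, imm: al = al - imm - CF, borrows reported in CF and AF. */
static void sbb_al(Cpu *c, uint8_t imm) {
    unsigned borrow = (unsigned)c->cf;
    unsigned al = c->al;
    c->af = (al & 0x0Fu) < (imm & 0x0Fu) + borrow;
    c->cf = al < imm + borrow;
    c->al = (uint8_t)(al - imm - borrow);
}

/* das: decimal adjust AL after subtraction. */
static void das(Cpu *c) {
    uint8_t old_al = c->al;
    int old_cf = c->cf;
    c->cf = 0;
    if ((c->al & 0x0F) > 9 || c->af) {
        c->cf = old_cf || (c->al < 6);
        c->al = (uint8_t)(c->al - 6);
        c->af = 1;
    } else {
        c->af = 0;
    }
    if (old_al > 0x99 || old_cf) {
        c->al = (uint8_t)(c->al - 0x60);
        c->cf = 1;
    }
}

/* cmp al,10 / sbb al,69h / das applied to a value 0..15. */
static char nibble_to_hex_das(uint8_t nibble) {
    assert(nibble < 16);
    Cpu c = { nibble, 0, 0 };
    cmp_al(&c, 10);
    sbb_al(&c, 0x69);
    das(&c);
    return (char)c.al;
}
```

For 0..9 the sbb leaves 96h..9Fh with AF set, so DAS subtracts 66h in total; for 10..15 it leaves A1h..A6h with AF clear, so DAS subtracts only 60h.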
> What would be your idea of a "fast" way to do it?
IIRC we've seen many variants in the fastest/shortest discussion
some time ago in CLAX.
My 8-byte solution (3.5 cycles) wins in the aspect of using
no other registers nor memory. The conditional branch will cost a
misprediction penalty if used in a loop (every 9th iteration, IIRC).
The short five-byte way (cmp/sbb/das) takes 10 cycles.
Unfortunately CMOV has neither an immediate nor an 8-bit form, so
mov edx,3007h
mov ebx,0
cmp al,0ah
cmovnc ebx,edx ;replaces jc: take the 07h only if al >= 10
add al,bl
add al,dh
may not suffer from branch penalties, but you see how awful...
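For comparison, the branching and the select-style variants can be written side by side in C. This is only a sketch with made-up function names; whether a compiler actually emits CMOV (or SETcc, or a branch) for the second form is up to the compiler and its options.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form, mirroring cmp / jc / add 7 / add 48. */
static char nibble_to_hex_branch(uint8_t n) {
    assert(n < 16);
    if (n >= 10) {
        n += 7;               /* skip the gap between '9' and 'A' */
    }
    return (char)(n + 48);
}

/* Select form, mirroring the CMOV idea: the adjustment is 0 or 7,
   chosen without a jump at the source level. */
static char nibble_to_hex_select(uint8_t n) {
    assert(n < 16);
    uint8_t adjust = (uint8_t)(-(n >= 10) & 7);
    return (char)(n + adjust + 48);
}
```

Both functions return the same character for every input 0..15; only the way the "+7" is applied differs.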
But single-nibble conversion loops will always be slower than
fixed-size 32- or 64-bit solutions like the dword conversion I use:
______________________________
;eax [bin] to edx:eax [HEX-ascii]:
; it uses only four registers and no memory
xor edx,edx
xor ebx,ebx
; expand nibbles to bytes:
shld edx,eax,4
shl eax,4
shl edx,4
shld edx,eax,4
shl eax,4
shl edx,4
shld edx,eax,4
shl eax,4
shl edx,4
shld edx,eax,4
shl eax,4
shld ebx,eax,4
shl eax,4
shl ebx,4
shld ebx,eax,4
shl eax,4
shl ebx,4
shld ebx,eax,4
shl eax,4
shl ebx,4
shld ebx,eax,4
shl eax,4
;copy:
mov ecx,ebx
mov eax,edx
;the algo:
add eax,06060606h
add ecx,06060606h
and eax,10101010h
and ecx,10101010h
shr eax,4
shr ecx,4
imul eax,07h
imul ecx,07h
lea eax,[eax+edx+30303030h]
lea edx,[ecx+ebx+30303030h]
;done, but the byte order may be the wrong way round for you, so I add:
bswap eax
bswap edx
_________;end
This needs about 45 cycles (incl BSWAP) on AMD,
but is quite long (128 bytes).
I'm curious how long it takes on an Intel CPU.
I played around with XMM code, but found that the overhead of the
loads/stores to memory eats all the advantage of PUNPCKLBW,...,POR.
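The "algo" part of the dword routine above (add 06060606h, mask 10101010h, shift, multiply by 7, add 30303030h) can be sketched in C as follows. Note one swap: the nibble expansion here uses shift-and-mask steps instead of the SHLD chain, and the final loop writes the bytes most significant first, which has the same effect as BSWAP followed by a little-endian store. All names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the four nibbles of a 16-bit value into four bytes,
   most significant nibble in the top byte: 0xABCD -> 0x0A0B0C0D. */
static uint32_t spread_nibbles(uint16_t half) {
    uint32_t x = half;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    return x;
}

/* Four bytes holding 0..15 each -> four ASCII hex digits. */
static uint32_t bytes_to_hex_ascii(uint32_t d) {
    /* Adding 6 sets bit 4 of a byte exactly when that byte is >= 10. */
    uint32_t ge10 = ((d + 0x06060606u) & 0x10101010u) >> 4;
    /* ge10 holds 01h per letter digit; *7 turns that into the 'A' offset. */
    return d + ge10 * 7u + 0x30303030u;
}

/* Convert v to 8 upper-case hex characters plus a terminating NUL. */
static void u32_to_hex(uint32_t v, char out[9]) {
    assert(out);
    uint32_t hi = bytes_to_hex_ascii(spread_nibbles((uint16_t)(v >> 16)));
    uint32_t lo = bytes_to_hex_ascii(spread_nibbles((uint16_t)v));
    for (int i = 0; i < 4; i++) {
        out[i]     = (char)(hi >> (24 - 8 * i));
        out[4 + i] = (char)(lo >> (24 - 8 * i));
    }
    out[8] = '\0';
}
```

No carry can cross a byte boundary in `bytes_to_hex_ascii`, since each byte is at most 0Fh before the additions, which is what makes the four digits safe to process in one register.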
__
wolfgang