Re: some advice

From: C (blackmarlin_at_asean-mail.com)
Date: 07/03/04


Date: 3 Jul 2004 05:45:00 -0700

psi <i@m.t> wrote in message news:<fro5e0p57m29l5nr730u50podpjmu8b2kt@4ax.com>...

[snip]

> >Translating to assembly (NASM) we get...
> >(using my macros -- see luxasm.sf.net
> >CVS::luxasm/_nasm/macros/)
> >
> >%include "process.aml" ; Macro defs
> >
> >type Num ; Define number type
> > variable len, dword
> > variable num, pointer
> >end_type
> >
> >procedure minoa, a, b ; const Num *a, *b
> > uses ecx, edx, esi, edi
> >
> > mov esi, [ local a ] ; Get parameters in regs
> > mov edi, [ local b ]
> > mov ecx, [ esi + Num.len ] ; Load lengths
> > mov edx, [ edi + Num.len ]
> > mov esi, [ esi + Num.num ] ; Load ptrs
> > mov edi, [ edi + Num.num ]
> > if ecx, equal, edx ; Check a->len == b->len
>
> it seems to me here --edx, --ecx

I would normally have the length adjusted beforehand,
though you could modify the following 'lea'
instructions to give:

lea edi, [ edi + edx * 4 - 4 ] ; Set aa, bb
lea esi, [ esi + ecx * 4 - 4 ]

> > lea edi, [ edi + edx * 4 ] ; Set aa, bb
> > lea esi, [ esi + ecx * 4 ]
> > do ; do {
> > mov eax, [ esi ] ; If *esi != *edi goto .diff
> > add esi, 4
>
> and sub esi, 4

Whoopsy daisy: that is why I always mention
the code is untested :-) [I actually meant
to write 'add esi, -4', though your 'sub'
will work just as well.]

> > cmp eax, [ edi ]
> > jne .diff
> > add edi, 4
>
> and sub edi, 4

Same mistake twice, ouch :-(

> > loop_until decrement_to_zero, ecx ; } while -- ecx > 0
> > clr eax ; Return eax = 0 (match)
> > return
> > end_if
> >
> >.diff:
> > sbb eax, eax ; eax = carry flag ? -1 : 0
> > lea eax, [ eax * 2 + 1 ] ; eax = ( eax == 0 ? 1 : -1 )
>
> nice trick, thank you

:-)
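
For anyone following along, here is a rough C model of what the
sbb/lea pair computes. It is only a sketch: the helper name
sign_from_carry is made up for illustration, and it assumes the
two words have just been compared with 'cmp x, y' and found unequal.

```c
#include <stdint.h>

/* Hypothetical helper: models 'sbb eax, eax' followed by
   'lea eax, [ eax * 2 + 1 ]' after a 'cmp x, y' of two
   unequal unsigned words. */
static int sign_from_carry(uint32_t x, uint32_t y)
{
    int carry = x < y;         /* cmp sets CF on an unsigned borrow (x < y) */
    int eax = carry ? -1 : 0;  /* sbb eax, eax  -> eax = -CF                */
    return eax * 2 + 1;        /* lea           -> -1 if x < y, else +1     */
}
```

The point of the trick is that no branch is needed to turn the
carry flag into the -1 / +1 result.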
 
> >end_procedure ; Return eax = 1:0:-1 (gt/eq/lt)
> >
> >You could also unroll the loop a bit, which should
> >make things even faster. (Note this is for unsigned
> >numbers only -- I have not worked out if signed
> >numbers will work in a similar fashion.) As normal,
> >code untested -- you have been warned, blah, blah.
> >
> >Also, if you want to optimise for size, then STD \\
> >REPE CMPSD may be what you want.
> >
> >C
> >2004-06-29
>
> this seems fast too. the loop seems faster

Yes, though I think the REP will be faster
on earlier x86 processors, but anything later
than a 586 will definitely perform better with
the loop.
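
As a reference point for testing either version, the comparison
can be modelled in C roughly as below. This is a sketch under
assumptions, not a translation of anyone's actual code: the struct
mirrors the 'type Num' macro above, the words are assumed to be
stored least significant first, and the numbers are assumed to be
normalised (no leading zero words), so a longer number is larger.
The name num_cmp is invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, mirroring 'type Num': a word count followed by
   a pointer to 32-bit words, least significant word first. */
typedef struct {
    uint32_t        len;  /* number of 32-bit words */
    const uint32_t *num;  /* the words themselves   */
} Num;

/* Returns 1, 0 or -1 for a > b, a == b, a < b (unsigned only). */
static int num_cmp(const Num *a, const Num *b)
{
    if (a->len != b->len)                 /* lengths differ: longer wins */
        return a->len < b->len ? -1 : 1;

    for (size_t i = a->len; i-- > 0; )    /* most significant word first */
        if (a->num[i] != b->num[i])
            return a->num[i] < b->num[i] ? -1 : 1;

    return 0;                             /* every word matched */
}
```

Walking from the top word downwards is what the 'lea' plus
'add esi, -4' sequence does in the assembly version.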
 
> _aminor1_u:
> push ebp
> mov ebp, esp

Normally I do not bother setting up a stack
frame using ebp -- it wastes CPU cycles and
is only useful if you are unsure of your
current position relative to the local
variables and/or parameters (i.e. cannot work
it out statically). Referencing locals &
parameters via esp is a better solution, and
frees up a register (ebp) for other uses.
(That was my main reason for writing a macro
library -- to manage such locals and parameter
references via esp automatically.)
                                   
> push ebx
> push esi
> push edi

This is only required when the function is
called from certain HLL compilers (e.g. GCC);
it can be skipped (with some time savings) if
the library is for pure assembler use only.

(I suspect that using 'mov' to the stack
frame would be faster than 'push' on later
x86 processors -- though I have not verified
this hypothesis.)

> ;------------------*/
> mov esi, [ebp+8]
> mov edi, [ebp+12]
> mov eax, [esi]

Instead of the following mov/cmp/jne you
could do...

     cmp eax, [ edi ]
     jne .esci

> mov ebx, [edi]
> cmp eax, ebx
> jne .esci

'add eax, -1' works better than 'dec eax'
on the Pentium 4, and makes no difference
on other processors.

> dec eax ; jz .ans
> mov esi, [esi + num_off]
> mov edi, [edi + num_off]

No need for the following 'or': the zero
flag is already set by the 'dec' (or substituted
'add') instruction, and the intervening 'mov'
instructions leave the flags untouched.

> or eax, eax
> jz .ans
> shl eax, 2

Aligning the loop can help speed things up
a bit. Try...

  align 16, db 0x90

> .loiiii:
> mov ebx, [esi + eax]
> cmp ebx, [edi + eax]
> jne .esci
> sub eax, 4
> jnz .loiiii

Yes, that will work as well, though the more
complex addressing modes used at the start of
a loop can have major performance penalties
on earlier x86 processors -- particularly
the 386. (This will not be an issue if other
parts of your code require a 586+ of course.)

> .ans:
> mov ebx, [esi]
> cmp ebx, [edi]
> jne .esci
> jmp short .fine

Why not replace the preceding jne/jmp with:
      je .fine
     
> .esci:
> sbb eax , eax ; CF==1 ? a= -1 : a= 0
> lea eax , [ 2*eax + 1 ]
> ; neg a
> .fine:
> pop edi
> pop esi
> pop ebx
> mov esp, ebp
> pop ebp

You have already heard my 'don't waste ebp on
a stack frame' rant -- I refer you to it again.
:-)

> ret
>
> _aminor1_u:
> {< k | k = s
> < b, i, j
> /*------------------*/
> i = [k+8]; j = [k+12]; a = [i]; b = [j]
> a==b ! .esci
> {--a /* jz .ans slow
> i = [i + num_off];
> j = [j + num_off];
> a|=a | jz .ans
> a<<=2 ;
> .loiiii:
> b =[i + a]
> b!=[j + a] ? .esci
> a-=4 | jnz .loiiii
> .ans:
> b =[i]
> b!=[j] ? .esci
> jmp short .fine
> }
> .esci:
> sbb a, a /* CF==1 ? a= -1 : a= 0
> lea a, [ 2*a + 1 ]
> /* neg a nel caso di minor()
> .fine:
> > b, i, j
> s = k | > k
> ret
> }

You know, that syntax really reminds me of SAss
(one of my early attempts at writing an assembler)
though there are some obvious differences. Where
can you get a copy of this assembler? I would
like to add it to my collection.

C
2004-07-03

PS: Do you happen to have routines to do long
multiplication / division / modulus? I seem
to have misplaced my versions of them and need
something similar for my current project (Luxasm).


