Re: compiler generated output
- From: "Gerd Isenberg" <spamtrap@xxxxxxxxxx>
- Date: Mon, 24 Oct 2005 22:17:37 +0000 (UTC)
Spiro Trikaliotis wrote:
> Hello Skarmander,
>
> Skarmander <spamtrap@xxxxxxxxxx> wrote:
>
> > Spiro Trikaliotis wrote:
>
> >> Skarmander <invalid@xxxxxxxxxxxxxx> wrote:
> >>
> >
> > Not my part of the text. This is Mark F. Haigh's post.
>
> Oh, I'm sorry. I was not aware I stripped the attribution to him, too.
> Anyway, from the number of ">", it was clear that this was not your
> text, wasn't it?
>
>
> >> Now, you are comparing something very weird. MSVC++ 6.0 is some years
> >> older than "last weeks gcc CVS".
> >>
> >
> > The original discussion wasn't comparing VC 6 with the latest gcc;
> > rather it was about whether the code generated by VC 6 in this case was
> > acceptable. One poster in comp.lang.c thought it was, quote, "crappy".
> > :-) I challenged him to show that improvement was actually possible, and
> > he (or rather gcc) came up with this.
>
> Yes, this is true. Anyway, this resulted in a comparison of both
> compilers, thus, my above statement still stands.
>
> > Don't use default settings, optimize for speed; and try to tell the
> > compiler to optimize for 686 or higher (but using the 80386 instruction
> > set only). That's what gcc was asked to do as well.
>
> Ok, I tried again, telling it to optimize for speed (/Ot) and to generate
> code for PPro, P-II, P-III (/G6; but I don't know how to restrict the
> compiler to generate code which uses only 386 instructions). The
> resulting code is (for % 8, NOT for % 8u):
>
> test!modTest1:
> 01001be0  8bff        mov    edi,edi
> 01001be2  55          push   ebp
> 01001be3  8bec        mov    ebp,esp
> 01001be5  0fb74508    movzx  eax,word ptr [ebp+0x8]
> 01001be9  0fb74d0c    movzx  ecx,word ptr [ebp+0xc]
> 01001bed  2bc1        sub    eax,ecx
> 01001bef  2507000080  and    eax,0x80000007
> 01001bf4  7905        jns    test!modTest1+0x1b (01001bfb)
> 01001bf6  48          dec    eax
> 01001bf7  83c8f8      or     eax,0xfffffff8
> 01001bfa  40          inc    eax
> 01001bfb  5d          pop    ebp
> 01001bfc  c20800      ret    0x8
>
> Thus, while replacing mov/and with movzx, the jump is still there. (The
> same code is generated for P-IV/Athlon).
>
OK, I got the same with MSVC 6. Not that bad for the old compiler ;-)
It looks funny, but is of course suboptimal, because the conditional
jump may be mispredicted.
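For those who prefer C to disassembly, the and/jns/dec/or/inc sequence above can be transliterated roughly as follows. This is only a sketch: the names rem8_branchy and check_rem8_branchy are made up for illustration, and the final conversion back to int32_t assumes the usual two's complement behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C rendering of the quoted x86 sequence (not the original source). */
int32_t rem8_branchy(int32_t x)
{
    uint32_t r = (uint32_t)x & 0x80000007u; /* and eax,0x80000007: keep sign bit + 3 low bits */
    if (r & 0x80000000u) {                  /* jns skips this block when the sign bit is clear */
        r -= 1u;                            /* dec eax */
        r |= 0xfffffff8u;                   /* or  eax,0xfffffff8 */
        r += 1u;                            /* inc eax */
    }
    return (int32_t)r;                      /* assumes two's complement conversion */
}

/* Compare against the C99 % operator, which truncates toward zero. */
void check_rem8_branchy(void)
{
    for (int32_t x = -64; x <= 64; ++x)
        assert(rem8_branchy(x) == x % 8);
}
```

The dec/or/inc triple is what maps a negative multiple of 8 back to 0 while the other negative values get their upper bits set; the price is the branch.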
If we compare signed modulo (%8) with unsigned modulo (&7):
 a  b  a-b   %8  &7   %8 binary            &7 binary
 0  0    0    0   0   0000 0000 0000 0000  0000 0000 0000 0000
 0  1   -1   -1   7   1111 1111 1111 1111  0000 0000 0000 0111
 0  2   -2   -2   6   1111 1111 1111 1110  0000 0000 0000 0110
 0  3   -3   -3   5   1111 1111 1111 1101  0000 0000 0000 0101
 0  4   -4   -4   4   1111 1111 1111 1100  0000 0000 0000 0100
 0  5   -5   -5   3   1111 1111 1111 1011  0000 0000 0000 0011
 0  6   -6   -6   2   1111 1111 1111 1010  0000 0000 0000 0010
 0  7   -7   -7   1   1111 1111 1111 1001  0000 0000 0000 0001
we see that the three low bits are equal, and that the upper 13 (29)
bits are either all zero (&7) or, for the negative differences listed,
all one (%8).
Therefore, the "optimal" signed modulo assembly for 32-bit DWORDs may
look like this:
mov eax, [a]
sub eax, [b]
cdq             ; sign extend eax to edx
and eax, 7      ; mask three lsb
and edx, ~7     ; mask 29 upper bits (shl edx, 3)
or  eax, edx    ; xor or add is also fine, since we have disjoint sets
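The relationship in the table can be checked with a few lines of C. This is a minimal sketch assuming a 32-bit two's complement int; the helper names low3, high29 and check_table are invented for the example.

```c
#include <assert.h>

/* Low three bits of a value (what "& 7" keeps). */
unsigned low3(int v) { return (unsigned)v & 7u; }

/* Upper 29 bits of a 32-bit value, shifted down to bit 0. */
unsigned high29(int v) { return (unsigned)v >> 3; }

void check_table(void)
{
    for (int d = -7; d <= 0; ++d) {  /* same range as the table: a = 0, b = 0..7 */
        int s = d % 8;               /* signed remainder (C99: truncates toward zero) */
        int u = d & 7;               /* masked value, as in the &7 column */
        assert(low3(s) == low3(u));  /* the three low bits agree */
        assert(high29(u) == 0u);     /* & 7 always clears the upper bits */
        /* upper bits are all ones exactly when the remainder itself is negative */
        assert(high29(s) == (s < 0 ? 0x1fffffffu : 0u));
    }
}
```

Note that the last assertion is keyed to the sign of the remainder, not of the difference: for a difference of 0 (and likewise for any multiple of 8) the remainder is 0 and its upper bits are clear.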
>
> Throwing in some more variants: ;)
>
> Interestingly, the code for AMD64 is different (and not only because of
> the other register sizes):
>
> test!modTest1:
> 00000001`00001c60  6689542410  mov    [rsp+0x10],dx
> 00000001`00001c65  66894c2408  mov    [rsp+0x8],cx
> 00000001`00001c6a  0fb7442408  movzx  eax,word ptr [rsp+0x8]
> 00000001`00001c6f  0fb74c2410  movzx  ecx,word ptr [rsp+0x10]
Oops - it goes through memory to zero-extend 16 bits into a 32-bit register!
Using short for parameters, locals or scalar globals seems a very bad
idea in general - it confuses the compiler.
See
Software Optimization Guide for
AMD Athlon™ 64 and AMD Opteron™ Processors
Chapter 2, C and C++ Source-Level Optimizations, p. 47
2.23 32-Bit Integral Data Types
> 00000001`00001c74  2bc1        sub    eax,ecx
> 00000001`00001c76  99          cdq
> 00000001`00001c77  83e207      and    edx,0x7
> 00000001`00001c7a  03c2        add    eax,edx
> 00000001`00001c7c  83e007      and    eax,0x7
> 00000001`00001c7f  2bc2        sub    eax,edx
> 00000001`00001c81  c3          ret
>
Looks quite nice - but misses the "and ~7" trick:
cdq             ; sign extend eax to edx
and eax, 7      ; mask three lsb
and edx, ~7     ; mask 29 upper bits (shl edx, 3)
or  eax, edx    ; xor or add is also fine, since we have disjoint sets
Interestingly, the sequence the compiler generated is the one given in
Software Optimization Guide for
AMD Athlon™ 64 and AMD Opteron™ Processors
Chapter 8, Integer Optimizations, p. 165
Remainder of Signed Division by 2^n or -(2^n)
; In: EAX = dividend
; Out: EAX = remainder
cdq                  ; Sign extend into EDX.
and edx, (2^n - 1)   ; Mask correction (abs(divisor) - 1)
add eax, edx         ; Apply pre-correction.
and eax, (2^n - 1)   ; Mask out remainder (abs(divisor) - 1)
sub eax, edx         ; Apply pre-correction if necessary.
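The cdq/and/add/and/sub pattern quoted from the guide can be written in C roughly like this. It is a sketch, not the guide's own code: the function name rem_pow2, its signature and the restriction 1 <= n <= 30 are assumptions made for the example, and the arithmetic is done on uint32_t to keep it free of signed overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of x divided by 2^n, 1 <= n <= 30 (illustrative name and signature). */
int32_t rem_pow2(int32_t x, unsigned n)
{
    uint32_t mask = (UINT32_C(1) << n) - 1u;   /* 2^n - 1 */
    uint32_t sign = x < 0 ? UINT32_MAX : 0u;   /* what cdq leaves in edx */
    uint32_t bias = sign & mask;               /* and edx, (2^n - 1) */
    uint32_t r = ((uint32_t)x + bias) & mask;  /* add eax, edx ; and eax, (2^n - 1) */
    return (int32_t)(r - bias);                /* sub eax, edx (two's complement assumed) */
}

/* Compare against the C99 % operator for a few divisors. */
void check_rem_pow2(void)
{
    for (unsigned n = 1; n <= 5; ++n)
        for (int32_t x = -100; x <= 100; ++x)
            assert(rem_pow2(x, n) == x % (int32_t)(1 << n));
}
```

Adding the bias before masking and subtracting it afterwards is what makes negative multiples of 2^n come out as 0.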
> Thus, it generates "the same kind of code" it generates for x86 in the
> case I generate mod 8u, not 8!
>
> Now, the code for mod 8u looks like:
>
> test!modTest2:
> 00000001`00001c90  6689542410  mov    [rsp+0x10],dx
> 00000001`00001c95  66894c2408  mov    [rsp+0x8],cx
> 00000001`00001c9a  0fb7442408  movzx  eax,word ptr [rsp+0x8]
> 00000001`00001c9f  0fb74c2410  movzx  ecx,word ptr [rsp+0x10]
> 00000001`00001ca4  2bc1        sub    eax,ecx
> 00000001`00001ca6  33d2        xor    edx,edx
> 00000001`00001ca8  b908000000  mov    ecx,0x8
> 00000001`00001cad  f7f1        div    ecx
> 00000001`00001caf  8bc2        mov    eax,edx
> 00000001`00001cb1  c3          ret
Oops - totally weird! It seems all optimization is disabled with 8u ;-)
The 64-bit MS compiler still needs some work - high time for
assembler programmers!
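What makes the div surprising is that the unsigned case needs no sign correction at all: a single mask gives the same result. A small sketch in C, where mod8u_div is only a guess at what the quoted modTest2 looks like (its source is not shown in the thread) and the other names are invented:

```c
#include <assert.h>

/* Hypothetical stand-in for the quoted modTest2. */
unsigned mod8u_div(unsigned short a, unsigned short b)
{
    return (a - b) % 8u;            /* a - b is an int; % 8u converts it to unsigned first */
}

/* Same result with a single mask instead of a division. */
unsigned mod8u_and(unsigned short a, unsigned short b)
{
    return (unsigned)(a - b) & 7u;
}

void check_mod8u(void)
{
    for (unsigned a = 0; a < 40; ++a)
        for (unsigned b = 0; b < 40; ++b)
            assert(mod8u_div((unsigned short)a, (unsigned short)b) ==
                   mod8u_and((unsigned short)a, (unsigned short)b));
}
```

The two agree for every input because the unsigned range (2^32 here, assuming a 32-bit int) is itself a multiple of 8, so the conversion of a negative difference does not disturb the three low bits.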
>
> This totally confuses me. Here, it generates the same kind of code it
> generates for x86 in the case "mod 8". (The code for modTest3 is almost
> the same, only adding
>
> 00000001`0000xxxx 0fb7c0 movzx eax,ax
>
> before the "xor edx,edx" line.)
>
>
> Thus, to summarize:
>
>            x86              AMD64
> mod 8      DIV, no jump     AND, but with jump
> mod 8u     AND, but jump    DIV, no jump used
>
>
> > When optimizing for speed, I seriously doubt any compiler would use lame
> > ducks like cdq and idiv, let alone for dividing by a constant.
CDQ is no lame duck; it is a one-cycle direct-path instruction.
IDIV is one: 26/42/74 cycles, vector path, for 16/32/64-bit idiv.
>
> Well, the MS compiler used them.
>
Use 32-bit types, and inspect the assembly for hotspots.
If necessary, give the compiler a hint, e.g. for the signed mod 8:
int deltaMod8 (int a, int b)
{
    int delta = a - b;
    int signx = delta >> 31; // arithmetic shift - like cdq
    // note: for a negative multiple of 8 (delta = -8, -16, ...) this
    // yields -8, where delta % 8 yields 0
    return (signx & ~7) | (delta & 7);
}
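It is worth checking such a hint against the plain % operator before relying on it. The sketch below restates the function above on int32_t (as delta_mod8_or) and puts a bias-based variant next to it; both names and the check function are invented for the example, and an arithmetic right shift of negative values is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* The posted hint, restated on int32_t so the block is self-contained. */
int32_t delta_mod8_or(int32_t a, int32_t b)
{
    int32_t delta = a - b;
    int32_t signx = delta >> 31;        /* assumed arithmetic shift: 0 or -1 */
    return (signx & ~7) | (delta & 7);
}

/* Bias-based variant: add 7 to negative values, mask, subtract it again. */
int32_t delta_mod8_bias(int32_t a, int32_t b)
{
    int32_t delta = a - b;
    int32_t bias  = (delta >> 31) & 7;  /* 7 if delta is negative, else 0 */
    return ((delta + bias) & 7) - bias;
}

void check_delta_mod8(void)
{
    for (int32_t a = -40; a <= 40; ++a)
        for (int32_t b = -40; b <= 40; ++b) {
            int32_t d = a - b;
            assert(delta_mod8_bias(a, b) == d % 8);
            if (d >= 0 || d % 8 != 0)
                assert(delta_mod8_or(a, b) == d % 8); /* agrees everywhere else */
            else
                assert(delta_mod8_or(a, b) == -8);    /* negative multiples of 8 */
        }
}
```

The or-based form saves an instruction but differs from % 8 when the difference is a negative multiple of 8; whether that matters depends on how the result is used, so the bias-based form is the safer default.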
- Gerd