Re: compiler generated output



Spiro Trikaliotis wrote:

> Hello Skarmander,
>
> Skarmander <spamtrap@xxxxxxxxxx> wrote:
>
> > Spiro Trikaliotis wrote:
>
> >> Skarmander <invalid@xxxxxxxxxxxxxx> wrote:
> >>
> >
> > Not my part of the text. This is Mark F. Haigh's post.
>
> Oh, I'm sorry. I was not aware I stripped the attribution to him, too.
> Anyway, from the number of ">", it was clear that this was not your
> text, wasn't it?
>
>
> >> Now, you are comparing something very weird. MSVC++ 6.0 is some years
> >> older than "last weeks gcc CVS".
> >>
> >
> > The original discussion wasn't comparing VC 6 with the latest gcc;
> > rather it was about whether the code generated by VC 6 in this case was
> > acceptable. One poster in comp.lang.c thought it was, quote, "crappy".
> > :-) I challenged him to show that improvement was actually possible, and
> > he (or rather gcc) came up with this.
>
> Yes, this is true. Anyway, this resulted in a comparison of both
> compilers, thus, my above statement still stands.
>
> > Don't use default settings, optimize for speed; and try to tell the
> > compiler to optimize for 686 or higher (but using the 80386 instruction
> > set only). That's what gcc was asked to do as well.
>
> Ok, I tried again, telling it to optimize for speed (/Ot) and to generate
> code for PPro, P-II, P-III (/G6; but I don't know how to restrict the
> compiler to code which only uses 386 instructions). The
> resulting code is (for % 8, NOT for % 8u):
>
> test!modTest1:
> 01001be0 8bff mov edi,edi
> 01001be2 55 push ebp
> 01001be3 8bec mov ebp,esp
> 01001be5 0fb74508 movzx eax,word ptr [ebp+0x8]
> 01001be9 0fb74d0c movzx ecx,word ptr [ebp+0xc]
> 01001bed 2bc1 sub eax,ecx
> 01001bef 2507000080 and eax,0x80000007
> 01001bf4 7905 jns test!modTest1+0x1b (01001bfb)
> 01001bf6 48 dec eax
> 01001bf7 83c8f8 or eax,0xfffffff8
> 01001bfa 40 inc eax
> 01001bfb 5d pop ebp
> 01001bfc c20800 ret 0x8
>
> Thus, while replacing mov/and with movzx, the jump is still there. (The
> same code is generated for P-IV/Athlon).
>

OK, I got the same with MSVC 6. Not that bad for the old compiler ;-)
Looks funny, but of course it is suboptimal because the conditional
jump can be mispredicted.

Comparing signed modulo (% 8) with the unsigned mask (& 7):

a  b  a-b   %8  &7   %8 binary              &7 binary
0  0    0    0   0   0000 0000 0000 0000    0000 0000 0000 0000
0  1   -1   -1   7   1111 1111 1111 1111    0000 0000 0000 0111
0  2   -2   -2   6   1111 1111 1111 1110    0000 0000 0000 0110
0  3   -3   -3   5   1111 1111 1111 1101    0000 0000 0000 0101
0  4   -4   -4   4   1111 1111 1111 1100    0000 0000 0000 0100
0  5   -5   -5   3   1111 1111 1111 1011    0000 0000 0000 0011
0  6   -6   -6   2   1111 1111 1111 1010    0000 0000 0000 0010
0  7   -7   -7   1   1111 1111 1111 1001    0000 0000 0000 0001

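We see that the three low bits are equal, and that the upper 13 bits
(29 for 32-bit values) are all zero for & 7 and all one for % 8 with a
negative dividend.

Therefore, the "optimal" signed modulo assembly for 32-bit DWORDs may
look like this (careful: for a negative multiple of 8 it yields -8,
where % 8 gives 0):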

mov eax, [a]
sub eax, [b]
cdq ; sign extend eax to edx
and eax, 7 ; mask three lsb
and edx, ~7 ; mask 29 upper bits (shl edx, 3)
or eax, edx ; xor or add is also fine, since we have disjoint sets
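
A quick sanity check of this sequence in C, as a sketch only: shortcut_mod8 and count_mismatches are names made up here, and the code assumes a two's complement int. It compares the shortcut with C's % 8 over a range and also counts how many of the differences fall anywhere other than on negative multiples of 8.

```c
/* Hypothetical name: C rendering of the cdq / and 7 / and ~7 / or sequence. */
int shortcut_mod8(int delta)
{
    int sign = -(delta < 0);           /* all ones if negative, like cdq */
    return (sign & ~7) | (delta & 7);  /* assumes two's complement int */
}

/* Counts the values in [lo, hi] where the shortcut differs from d % 8.
   *other receives the number of those differences that are NOT at a
   negative multiple of 8. */
int count_mismatches(int lo, int hi, int *other)
{
    int n = 0;
    *other = 0;
    for (int d = lo; d <= hi; ++d) {
        if (shortcut_mod8(d) != d % 8) {
            ++n;
            if (!(d < 0 && d % 8 == 0))
                ++*other;
        }
    }
    return n;
}
```

Over -65535..65535 (every possible difference of two 16-bit unsigned values) count_mismatches reports 8191 differences, all of them at negative multiples of 8, where the shortcut returns -8 and % returns 0.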



>
> Throwing in some more variants: ;)
>
> Interestingly, the code for AMD64 is different (and not only because of
> the other register sizes):
>
> test!modTest1:
> 00000001`00001c60 6689542410 mov [rsp+0x10],dx
> 00000001`00001c65 66894c2408 mov [rsp+0x8],cx
> 00000001`00001c6a 0fb7442408 movzx eax,word ptr [rsp+0x8]
> 00000001`00001c6f 0fb74c2410 movzx ecx,word ptr [rsp+0x10]

Oops - going through memory to zero-extend 16 bits into a 32-bit register!
Using short for parameters, locals or scalar globals seems to be a very
bad idea in general - it confuses the compiler.

See
Software Optimization Guide for
AMD Athlon™ 64 and AMD Opteron™ Processors
Chapter 2, C and C++ Source-Level Optimizations, page 47
2.23 32-Bit Integral Data Types


> 00000001`00001c74 2bc1 sub eax,ecx
> 00000001`00001c76 99 cdq
> 00000001`00001c77 83e207 and edx,0x7
> 00000001`00001c7a 03c2 add eax,edx
> 00000001`00001c7c 83e007 and eax,0x7
> 00000001`00001c7f 2bc2 sub eax,edx
> 00000001`00001c81 c3 ret
>

Looks quite nice - but misses the and ~7 trick.

cdq ; sign extend eax to edx
and eax, 7 ; mask three lsb
and edx, ~7 ; mask 29 upper bits (shl edx, 3)
or eax, edx ; xor or add is also fine, since we have disjoint sets

Interestingly, the sequence the compiler generated is the one given in

Software Optimization Guide for
AMD Athlon™ 64 and AMD Opteron™ Processors
Chapter 8, Integer Optimizations, page 165

Remainder of Signed Division by 2^n or -(2^n)
; In: EAX = dividend
; Out: EAX = remainder
cdq                    ; Sign extend into EDX.
and edx, (2^n - 1)     ; Mask correction (abs(divisor) - 1)
add eax, edx           ; Apply pre-correction.
and eax, (2^n - 1)     ; Mask out remainder (abs(divisor) - 1)
sub eax, edx           ; Apply pre-correction if necessary.
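
The same five steps can be written in C, roughly like this. rem_pow2 is a hypothetical helper name, not something from the guide, and the sketch assumes a 32-bit two's complement int and 1 <= n <= 30.

```c
/* Hypothetical helper: remainder of x divided by 2^n, with the sign of
   the dividend, i.e. the value C's % operator gives.
   Assumes 32-bit two's complement int and 1 <= n <= 30. */
int rem_pow2(int x, unsigned n)
{
    int mask = (int)((1u << n) - 1u);   /* 2^n - 1 */
    int bias = -(x < 0) & mask;         /* cdq + and edx: mask if x < 0, else 0 */
    return ((x + bias) & mask) - bias;  /* add eax,edx / and eax,mask / sub eax,edx */
}
```

The bias is only non-zero for negative x, so x + bias cannot overflow, and a negative multiple of 2^n comes out as 0, the same as with %.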


> Thus, it generates "the same kind of code" it generates for x86 in the
> case I generate mod 8u, not 8!
>
> Now, the code for mod 8u looks like:
>
> test!modTest2:
> 00000001`00001c90 6689542410 mov [rsp+0x10],dx
> 00000001`00001c95 66894c2408 mov [rsp+0x8],cx
> 00000001`00001c9a 0fb7442408 movzx eax,word ptr [rsp+0x8]
> 00000001`00001c9f 0fb74c2410 movzx ecx,word ptr [rsp+0x10]
> 00000001`00001ca4 2bc1 sub eax,ecx
> 00000001`00001ca6 33d2 xor edx,edx
> 00000001`00001ca8 b908000000 mov ecx,0x8
> 00000001`00001cad f7f1 div ecx
> 00000001`00001caf 8bc2 mov eax,edx
> 00000001`00001cb1 c3 ret


Oops - totally weird! It seems all optimization is disabled with 8u ;-)
The 64-bit MS compiler still needs some work - high time for
assembler programmers!
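
For comparison, the unsigned case needs no division and no sign correction at all, just a mask. A small sketch: delta_mod8u and its unsigned short parameters are assumptions made here (guessed from the movzx word ptr loads), not the actual source of modTest2.

```c
/* Hypothetical stand-in for the "mod 8u" test function.
   Assumes int is wider than unsigned short, so a - b is computed as int. */
unsigned delta_mod8u(unsigned short a, unsigned short b)
{
    /* Converting to unsigned and masking the three low bits gives the
       same value as (a - b) % 8u. */
    return (unsigned)(a - b) & 7u;
}
```

So after the sub, a single and eax, 7 would be enough here.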


>
> This totally confuses me. Here, it generates the same kind of code it
> generates for x86 in the case "mod 8". (The code for modTest3 is almost
> the same, only adding
>
> 00000001`0000xxxx 0fb7c0 movzx eax,ax
>
> before the "xor edx,edx" line.)
>
>
> Thus, to summarize:
>
>            x86               AMD64
> mod 8      DIV, no jump      AND, but with jump
> mod 8u     AND, but jump     DIV, no jump used
>
>
> > When optimizing for speed, I seriously doubt any compiler would use lame
> > ducks like cdq and idiv, let alone for dividing by a constant.

CDQ is no lame duck: it is a one-cycle direct-path instruction.
IDIV is one - a vector-path instruction taking 26/42/74 cycles for
16/32/64-bit operands.


>
> Well, the MS compiler used them.
>

Use 32-bit types, and inspect the assembly for hotspots.
If necessary, give the compiler a hint, e.g. for the signed mod 8:

int deltaMod8 (int a, int b)
{
    int delta = a - b;
    int signx = delta >> 31; // arithmetic shift on 32-bit int - like cdq
    // caution: gives -8 where % 8 gives 0 if delta is a negative multiple of 8
    return (signx & ~7) | (delta & 7);
}

- Gerd

