Re: No need to optimize in assembly anymore

From: Matt Taylor (para_at_tampabay.rr.com)
Date: 05/31/04


Date: Mon, 31 May 2004 01:50:29 +0000 (UTC)


"a" <a@a.com> wrote in message news:bSTpc.306$JL5.73@newsfe1-win...
> I didn't say optimization is undesirable. In fact for a compiler it should
> always be done. What I meant was that hand optimized code in assembly may
> no longer be useful as it might run well for one processor but poorly for
> another.

I answered this comment already and directly stated that it was false.

> The only optimizations that will be useful are the high level ones that work
> across CPU families such as loop unrolling, data alignment, function
> inlining etc. - precisely the ones that a good compiler will perform.

Compilers are also egregiously shortsighted. Until MSVC 7, the following
code would generate a mul and a div:

int x, y, z;
// ...
y = x / 10;
z = x % 10;

Now in MSVC 7.1 it generates a single div. This is several times slower than
the code that GCC 3.2 produces. Of course, MSVC gets it right if you remove
the modulo calculation.
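
For illustration, here is a minimal sketch of the same computation written so that the remainder is derived from the quotient with a multiply and a subtract, which means only one division (or one reciprocal multiplication, if the compiler chooses that) is needed. The names DivMod10 and divmod10 are made up for this example, and what any given compiler actually emits will depend on its version and flags.

```cpp
// Hypothetical helper type, not part of the original snippet.
struct DivMod10 {
    int quot;
    int rem;
};

// Computes x / 10 and x % 10 with a single division.
// Relies on the identity (x / 10) * 10 + (x % 10) == x.
DivMod10 divmod10(int x)
{
    DivMod10 r;
    r.quot = x / 10;           // the only division in the function
    r.rem  = x - r.quot * 10;  // remainder recovered by multiply and subtract
    return r;
}
```

Writing it this way by hand should not be necessary; the point is that the two results share almost all of their work.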

Another case which I found interesting involves STL's min function:

int min_int(int a, int b)
{
 return std::min(a, b);
}

MSVC 7.1:
 mov eax, DWORD PTR _b$[esp-4]
 cmp eax, DWORD PTR _a$[esp-4]
 lea eax, DWORD PTR _b$[esp-4]
 jl SHORT $L4471
 lea eax, DWORD PTR _a$[esp-4]
$L4471:
 mov eax, DWORD PTR [eax]
 ret 0

GCC 3.2:
 mov eax, DWORD PTR [esp+4]
 cmp DWORD PTR [esp+8], eax
 jge L2
 lea eax, [esp+8]
L3:
 mov eax, DWORD PTR [eax]
 ret
 .p2align 4,,7
L2:
 lea eax, [esp+4]
 jmp L3

This is particularly egregious because the compiler is inlining the std::min
function, and both GCC and MSVC make the mistake of preserving
pass-by-reference semantics in the inlined code. Even an assembly novice
would not likely do this.
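
To make the complaint concrete, here is a sketch of the by-value form that the inlined code could have been reduced to. min_by_value is a hypothetical name, not anything from the STL, and I am not claiming any particular compiler emits better code for it; it only shows that no addresses need to be taken once the arguments are plain ints.

```cpp
#include <algorithm>  // std::min

// The original wrapper: std::min takes and returns const int&,
// so a naive inline ends up selecting an address and then loading through it.
int min_int(int a, int b)
{
    return std::min(a, b);
}

// Hypothetical by-value equivalent: compare the values and pick one.
// No references are involved, so there is nothing to load through.
inline int min_by_value(int a, int b)
{
    return (b < a) ? b : a;  // same comparison order std::min uses
}
```

Both functions return the same result for every pair of ints; the difference is only in how much indirection the compiler has to see through.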

Let's also not forget MMX and SSE which many popular compilers cannot take
advantage of except through intrinsics. AFAIK Intel C++ is the only popular
C/C++ compiler that can vectorize loops. MSVC can't, and I haven't seen such
an option in GCC yet either. Last I checked, Borland can't. Watcom can't.
Etc.

Try casting a 32-bit int to a 64-bit int and multiplying by another 32-bit
int, for example. You would expect the compiler to use the extended multiply
instruction (mul/imul with 1 operand). Usually MSVC emits a call to _allmul;
it is very picky about that particular idiom. There are also many, many
cases involving templates or references/pointers where the compiler does a
piss-poor job of optimizing.
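
The multiply idiom in question looks like the sketch below. The function names are made up for the example, and <cstdint> postdates the compilers discussed here (on MSVC of that era you would write __int64 instead). Whether a given compiler turns this into a single one-operand mul/imul or into a call to a 64-bit multiply helper is exactly the thing to check in the generated assembly.

```cpp
#include <cstdint>

// Signed 32x32 -> 64 multiply. Only one operand is widened explicitly;
// the other is promoted by the usual arithmetic conversions.
std::int64_t mul32x32(std::int32_t a, std::int32_t b)
{
    return static_cast<std::int64_t>(a) * b;
}

// Unsigned variant of the same idiom.
std::uint64_t umul32x32(std::uint32_t a, std::uint32_t b)
{
    return static_cast<std::uint64_t>(a) * b;
}
```

Since both inputs are known to fit in 32 bits, the full 64-bit product is what a single extended multiply instruction already produces in edx:eax.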

Compilers don't reason about code like humans do. Compilers are built from
cases, and the cases are never exhaustive. If your software has no
particular need for performance, then you probably don't have a particular
need to be writing assembly for it. There are, however, a large number of
niche cases where the compiler performs poorly.

> Optimizations such as instruction rescheduling to prevent stalling the
> pipeline and reorganisation of code for better use of cache will be
> considered a waste of time as they will only work for a particular
> processor.

Optimal code for one processor is not optimal for another; however, optimal
code for one processor is still probably better than unscheduled code for
another processor. Timings are often very similar for primitive ops (ALU,
bswap, others). Scheduled code will try to minimize the amount of time it
takes to get through the critical path, and since timings are similar this
will be the same path for other processors. This will result in an
improvement for *all* processors.
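
A rough source-level analogy for the critical-path point, written in C++ rather than assembly: the sketch below sums an array once as a single dependency chain and once as two independent chains. The function names are hypothetical, and this is not instruction scheduling proper, only an illustration of why shortening the longest chain of dependent operations tends to help regardless of which processor runs it.

```cpp
#include <cstddef>

// One long dependency chain: every add must wait for the previous one.
unsigned sum_chain(const unsigned* v, std::size_t n)
{
    unsigned s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

// Two independent chains: the adds into s0 and s1 do not depend on
// each other, so the critical path is roughly half as long.
unsigned sum_split(const unsigned* v, std::size_t n)
{
    unsigned s0 = 0, s1 = 0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)        // odd element count: pick up the last element
        s0 += v[i];
    return s0 + s1;   // unsigned addition wraps, so the result matches sum_chain
}
```

How much the second form gains differs from one processor to the next, but it should not be slower than the first on any of them, which is the argument being made above.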

> Perhaps assembly days are numbered?.....
<snip>

People have been saying this for years and assembly is still actively used.
Unless compilers make a dramatic leap forward, this is unlikely to ever
become true--software is always pushing the envelope, though some people
tend to forget this because it isn't true of every market.

-Matt


