Re: Cannot optimize 64bit Linux code




<legrape@xxxxxxxxx> wrote in message news:83f5f291-4c86-48f6-8625-5ead760a46bf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I am porting a piece of C code to 64bit on Linux. I am using 64bit
integers. It is a floating point intensive code and when I compile
(gcc) on 64 bit machine, I don't see any runtime improvement when
optimizing with -O3. If I construct a small program I can get significant
(>4x) speed improvement using -O3 versus -g. If I compile on a 32 bit
machine, it runs 5x faster on the 64 bit machine than does the 64bit
compiled code.

It seems like something is inhibiting the optimization. Someone on
comp.lang.fortran suggested it might be an alignment problem. I am
trying to go through and eliminate all 32 bit integers right now (this
is a pretty large hunk of code). But thought I would survey this
group, in case it is something naive I am missing.

Any opinion is welcomed. I really need this to run up to speed, and I
need the big address space. Thanks in advance.



OT:

this is actually an issue related to the mismatch between current processor performance behavior, and the calling conventions used on Linux x86-64.

they were like:
let's base everything on a variant of the "register" calling convention, and use SSE for all the floating point math rather than crufty old x87.

the problem is that current processors don't quite agree, and in practice this sort of thing actually goes *slower*...

it seems, actually, that x87, lots of mem loads/stores, and complex addressing forms, can be used to better effect wrt performance than SSE, register-heavy approaches, and the use of "simple" addressing forms (in seeming opposition to current "optimization wisdom").

I can't give much explanation as to why this is exactly, but it has been my observation (periodic performance testing during the ongoing compiler-writing task...).
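
a rough way to check this on a given machine is to time the same floating-point kernel built both ways. the sketch below is only an illustration: axpy and time_axpy are made-up names, and the loop is a stand-in for whatever the real inner loops look like. assuming gcc on x86-64 with 32-bit libraries installed, one build could use "gcc -O3 -m64 -mfpmath=sse" and the other "gcc -O3 -m32 -mfpmath=387".

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical floating-point kernel: y[i] += alpha * x[i].
   It stands in for the real code's inner loops. */
void axpy(double alpha, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Apply the kernel `reps` times and return the elapsed CPU time in seconds.
   Each pass updates y in place, so every pass depends on the previous one. */
double time_axpy(double alpha, const double *x, double *y, size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        axpy(alpha, x, y, n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

call time_axpy from main with arrays large enough to run for a second or more, print the result, and compare the two builds. a single kernel like this says little about a large program, so treat the numbers as a hint rather than a verdict.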

my guess is that this is because processors heavily optimize these things, given that much existing x86 code uses them heavily (this may change in the future though, as 64 bit code becomes more prevalent...).


my guess is that the calling convention was designed according to some misguided sense of "optimization wisdom", rather than good solid benchmarks.

better performance could probably have been achieved at present just by pretending the x86-64 was just an x86 with more registers and guaranteed-present SSE.

not only this, but the convention is designed in such a way as to be awkward as well, and leaves open the question of how to effectively pull off varargs...
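
to make the varargs point concrete: under the System V AMD64 convention, integer arguments travel in general-purpose registers and doubles in SSE registers, so a variadic callee has to be able to pull its arguments from either class in order (the ABI has the caller pass an upper bound on the number of vector registers used in %al for this). a minimal sketch in portable C, where sum_pairs is a made-up function that mixes both classes:

```c
#include <stdarg.h>

/* Hypothetical example: takes `count` (int, double) pairs as variadic
   arguments and returns the sum of k * x over all pairs. */
double sum_pairs(int count, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, count);
    for (int i = 0; i < count; i++) {
        int k = va_arg(ap, int);        /* integer-class argument */
        double x = va_arg(ap, double);  /* floating-point-class argument */
        total += k * x;
    }
    va_end(ap);
    return total;
}
```

the source looks the same under any convention; the difference is in what the compiler has to generate for va_start and va_arg when the arguments may be spread across two register files and the stack.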



or, at least, this is what happens on my processor (an Athlon 64 X2 4400+).

I don't know if it is similar on Intel chips.


***
