Re: No need to optimize in assembly anymore
From: C (cc-news_at_hermes.mirlex.com)
Date: 05/18/04
- Next message: Herbert Kleebauer: "Re: No need to optimize in assembly anymore"
- Previous message: Bryan Parkoff: "Re: DirectX in HLA"
- In reply to: Matt Taylor: "Re: No need to optimize in assembly anymore"
- Next in thread: Matt Taylor: "Re: No need to optimize in assembly anymore"
- Reply: Matt Taylor: "Re: No need to optimize in assembly anymore"
Date: Mon, 17 May 2004 22:54:31 +0000 (UTC)
Matt Taylor wrote:
> "C" <cc-news@hermes.mirlex.com> wrote in message
> news:zb2qc.88$uu2.24@newsfe2-gui.server.ntli.net...
>
>>Partially. If you are talking about cycle counting then yes,
>>because these counts are non-deterministic (due to out-of-order
>>processing) and inconsistent across different processor
>>generations and manufacturers (due to different goals in
>>choosing hardware optimisations).
>
> <snip>
>
> Cycle-counting isn't quite that non-deterministic. The code usually
> falls into the same cadence regardless of the initial state upon entry
> due to dependencies.
You are correct: I oversimplified there. Still, for most purposes,
especially when considering other or future processors, the exact
time for a given sequence is too difficult to determine accurately.
Essentially, though one can get a rough idea of how well a piece of
code will perform vs. a similar piece, it is often little more than
an educated guess -- especially if one is considering whether a
minor replacement / reshuffle will produce an improvement.
> Cache misses are unpredictable, but there isn't really
> anything you can do at that level to avoid them.
Yes, though we are not totally hopeless there either, provided
the algorithm requires multiple accesses to memory. In these
cases you can try to localise memory accesses to increase the
probability of a cache hit. This is normally most effective
in loops which process large amounts of data, for example an
implementation of the FFT. Aligning data can likewise improve
the probability of a cache hit when multiple accesses must be
made to the same data structure.
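
As a rough illustration of localising memory accesses, here is a minimal C sketch. The function names are made up for this example and are not from the thread. Both functions compute the same sum over a matrix stored row by row; the first walks consecutive addresses, the second jumps by a whole row on every access, which on large inputs tends to make cache hits less likely.

```c
#include <stddef.h>

/* Hypothetical helper: sum a rows x cols matrix stored row by row,
   visiting elements in the order they sit in memory. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];   /* consecutive addresses */
    return total;
}

/* Same result, but each access is cols elements away from the last. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];   /* stride of cols ints */
    return total;
}
```

How much the two differ in practice depends on the matrix size relative to the cache and on the processor, so it is worth measuring rather than assuming.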
> Code scheduling tends to improve performance across all architectures.
> Even heavily pipelined machines like the Pentium-IV with massive
> capacity for in-flight ops see improvement when poorly-scheduled
> code is optimized in this fashion.
Yes, the P4 trace cache is an interesting concept -- it is a
pity other hardware constraints (such as it only having a
single decoder for input, or the slow shifts) reduced its
effectiveness and therefore the potency of the processor
overall.
> Out-of-order processing helps to hide the differences between
> CPUs, but it doesn't make a very good crutch.
True, this, of course, being due to the limited look-ahead in OoO
hardware. Though there is a big difference between unoptimised
code and well optimised code, OoO does considerably reduce the
difference between lightly optimised and heavily optimised code.
Indeed, having OoO hardware is _definitely_ not an excuse to
completely avoid optimisation, only to avoid squeezing every last
cycle out of the code. (That style of optimisation takes the
most programmer effort, yet its payoff is the part most reduced
by the OoO hardware.) [Though in some cases, such as heavily used
inner loops, such optimisations may be justified despite their
inherent non-portability.]
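
To make the dependency point concrete, here is a small C sketch; the function names are hypothetical. In the first loop every addition depends on the previous one, so the hardware has little independent work to overlap. The second splits the sum into two independent chains. Whether this helps at all depends on the compiler (which may do something similar itself) and on the processor.

```c
#include <stddef.h>
#include <stdint.h>

/* One long dependency chain: each add needs the previous result. */
uint32_t sum_one_chain(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Two independent chains that can, in principle, proceed in parallel.
   Unsigned arithmetic wraps, so the result matches sum_one_chain. */
uint32_t sum_two_chains(const uint32_t *v, size_t n)
{
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)              /* odd element left over */
        s0 += v[i];
    return s0 + s1;
}
```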
> Cycle-counting is also useful since most modern processors have similar
> weaknesses and strengths. Multiplies & shifts are a classic example;
> convert a constant divide to a constant multiply, and some constant
> multiplies will convert into shifts. Pentium-IV is a little bit
> different, but otherwise x86 processors generally favor the same
> simple operations.
Yes, that was an aspect I did not address in my post, primarily
because compilers and some assemblers will do this automatically.
And I guess that doing strength reduction has become so second
nature to me, even in HLL programming, that I no longer realise I
am doing it at all :-)
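
For readers who have not met strength reduction, here is a minimal C sketch of the two conversions mentioned in the quoted text: a constant divide turned into a multiply, and a constant multiply turned into shifts. The function names are made up, and a compiler will normally do this on its own for constant operands, so this is only to show what happens underneath.

```c
#include <stdint.h>

/* x / 10 as a multiply by a scaled reciprocal.
   0xCCCCCCCD is ceil(2^35 / 10); the 64-bit product is shifted
   right by 35 to discard the scaling. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* x * 10 as 8x + 2x, using two shifts and an add.
   Wraps modulo 2^32 exactly as x * 10 would. */
uint32_t mul10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```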
I also failed to mention those algorithmic implementations which
are only apparent / possible in assembler, such as extended precision
arithmetic. (Something I note you address elsewhere in this thread
and indeed, I too have discussed it recently [alt.lang.asm].) Well,
'tis either miss a few details or do a Beth style 'postius maximus'.
:-)
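
As a sketch of why extended precision arithmetic is more natural in assembler: standard C gives no direct access to the carry flag, so the carry out of the low half has to be recomputed with a comparison, whereas an x86 ADD followed by ADC propagates it directly. The u128 type and add_u128 name below are hypothetical, chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical 128-bit unsigned value built from two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128;

u128 add_u128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    /* If the low sum wrapped around, it is smaller than either input. */
    uint64_t carry = (r.lo < a.lo);
    r.hi = a.hi + b.hi + carry;   /* overflow out of hi is discarded */
    return r;
}
```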
Anyway, I think that the replies here have mostly disproved the
original poster's hypothesis. Since more reasons have been given
than I would have thought of immediately, this is turning into a
fairly interesting thread.
C
2004-05-18