Re: LOOP - Why so slow?



On Aug 5, 2:32 am, Richard Russell <spamt...@xxxxxxxxxx> wrote:
On 5 Aug, 01:10, Tim Roberts <spamt...@xxxxxxxxxx> wrote:

Compiler writers would rather generate simple instructions instead
of complicated ones with multiple side effects, so "loop" wasn't
getting used in compiled code. Because of this, Intel chose to
optimize the simpler, core, RISC-like instructions

Actually (referring to Tim), I have seen compilers emit this
instruction, and for reasons having nothing to do with creating a
"loop" (e.g., implementing a switch statement by emitting a sequence of
loop instructions that branch forward).



An interesting theory (no more than that, unless you have evidence).
It may well be true that compilers rarely emit the 'loop' instruction
but I would speculate that it has nothing to do with it being
"complicated" or having "multiple side effects" (neither of which is
true). Much more likely, I would say, is that it's because it has a
maximum jump displacement of -128 to +127 bytes.

This is no big deal for a compiler. After all, don't forget that until
the 80386 came along, compiler writers had to do this for *all*
conditional jump instructions. True, there is no easy way to
synthesize a "long loop" instruction (as you could for Jcc
instructions), but a compiler could easily switch to a different
sequence once it determined that the branch would be out of range.
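
As a rough illustration of the range check described above, here is a minimal Python sketch of how a code generator might choose between a short conditional jump and the pre-386 workaround (an inverted short jump around a near JMP). The function names, the INVERSE table and the label naming are made up for the example and are not taken from any real compiler; the displacement is assumed to be already computed relative to the end of the jump instruction.

```python
# Hypothetical sketch: names and label scheme are illustrative only.
INVERSE = {
    "jz": "jnz", "jnz": "jz",
    "jc": "jnc", "jnc": "jc",
    "jl": "jge", "jge": "jl",
    "jg": "jle", "jle": "jg",
}

def fits_rel8(disp):
    """True if disp fits in a signed 8-bit displacement (-128..+127)."""
    return -128 <= disp <= 127

def emit_cond_jump(cc, target, disp):
    """Return assembler lines for a conditional jump to target.

    In range: a plain short Jcc. Out of range: invert the condition,
    hop over an unconditional near JMP, and let the JMP go the distance.
    """
    if fits_rel8(disp):
        return [f"{cc} short {target}"]
    skip = f"skip_{target}"
    return [
        f"{INVERSE[cc]} short {skip}",
        f"jmp near {target}",
        f"{skip}:",
    ]

if __name__ == "__main__":
    print(emit_cond_jump("jz", "done", 100))
    print(emit_cond_jump("jz", "done", 300))
```

The out-of-range case costs an extra instruction, which is the "synthesized" long conditional jump referred to above.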


A compiler, which is outputting assembler source code, doesn't want to
have to worry about whether the distance from the 'loop' to its
destination will be more than this limit, especially as the whole
point of generating assembler source rather than machine code is to
hide low-level details of this kind.

Compilers already have to worry about this. Have you forgotten the
conditional jump instructions? The compiler has to emit a different
instruction sequence based on the distance to the target location. The
LOOP instruction is no different in this respect, except that you have
to emit a "dec ecx / jnz target" sequence rather than a long jump (and
possibly deal with the different side effects). Again, not a problem
for a modern compiler.
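
The same kind of selection can be sketched for LOOP. The Python below is only a toy model under stated assumptions: `emit_counted_branch` is a hypothetical helper, the displacement is assumed to be known, and the target is a 386 or later so that JNZ has a near form. It is not how any particular compiler does it.

```python
# Hypothetical sketch: helper name and output format are illustrative only.

def fits_rel8(disp):
    """True if disp fits in a signed 8-bit displacement (-128..+127)."""
    return -128 <= disp <= 127

def emit_counted_branch(target, disp):
    """Return assembler lines that decrement ECX and branch while it is non-zero.

    LOOP only has a rel8 form, so fall back to dec/jnz when the target is
    too far away. Note the fallback is not a drop-in replacement: DEC
    modifies the flags, LOOP does not, so the caller must not rely on
    flags surviving the branch.
    """
    if fits_rel8(disp):
        return [f"loop {target}"]
    return ["dec ecx", f"jnz {target}"]

if __name__ == "__main__":
    print(emit_counted_branch("top", -40))
    print(emit_counted_branch("top", -400))
```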



It's also worth noting that DEC ECX:JNZ has *more* side effects than
LOOP, in that DEC modifies every arithmetic flag except CF, whereas LOOP
leaves the flags untouched. If you look at Intel's
optimisation guide you will see that it recommends not using DEC at
all, but substituting SUB ECX,1 "because add and sub overwrite all
flags, whereas inc and dec do not, therefore creating false
dependencies on earlier instructions that set the flags".
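
To make the "false dependency" point concrete, here is a small Python model of the flag behaviour. It is a deliberately simplified sketch: only CF and ZF are tracked (SF, OF, AF and PF are left out), and the function names are invented for the illustration. The thing to notice is that the DEC result carries the old CF forward, so its flag output depends on whatever set CF earlier, while the SUB result does not.

```python
# Simplified model: tracks only CF and ZF; function names are hypothetical.
MASK32 = 0xFFFFFFFF

def dec32(value, flags):
    """Model DEC r32: writes ZF, leaves CF exactly as it was."""
    result = (value - 1) & MASK32
    new_flags = dict(flags)          # CF is inherited from the old flags
    new_flags["ZF"] = result == 0
    return result, new_flags

def sub32(value, operand, flags):
    """Model SUB r32, imm: writes both ZF and CF from this operation alone."""
    result = (value - operand) & MASK32
    new_flags = dict(flags)
    new_flags["ZF"] = result == 0
    new_flags["CF"] = value < operand   # unsigned borrow
    return result, new_flags

if __name__ == "__main__":
    stale = {"CF": True, "ZF": False}   # left over from some earlier instruction
    print(dec32(5, stale))              # CF is still the stale value
    print(sub32(5, 1, stale))           # CF recomputed, stale value gone
```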

Not being a hardware engineer, I can't tell you if this is marketing
BS that Intel made up. However, it's pretty clear that Intel is
deprecating INC and DEC and we can expect them to fall behind in the
performance department, just like LOOP. However, this is not an
explanation of why they haven't bothered to improve the performance of
LOOP. It's just another example of Intel deprecating some
instructions.




As a final counter-argument, although 'loop' may not be used by
compilers, it is (or would be, if it was fast) by assembly-language
programmers who want the most compact and/or fastest code.

Actually, it *is* used by compilers and assembly language programmers
who want compact code. It's the "fastest" code that is the problem.

Assembly
language code may be a small proportion of code written, but it is
still used where the very best performance is required. Therefore it
would be strange for Intel to choose deliberately to slow an
instruction which might be used in the most time-critical code of all.

While I agree that LOOP shouldn't be so slow, what makes you think
that LOOP would be a great choice for time critical code? Even if it
ran at the same speed as dec/jnz, how would that make much of a
difference (ignoring cache effects of the extra byte, of course)?

The big mistake with LOOP is that it is tied to ECX. It should have
been a DJNZ-style instruction like the 68000's DBcc or the 32000's ACB
(add, compare and branch), which would allow you to decrement *any*
register, not just ECX.
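
For what it's worth, the difference can be modelled in a few lines of Python. This is just a sketch: the register file is a plain dict and `dec_and_branch` is a hypothetical generalized instruction, not something x86 actually provides. LOOP is then the special case with the register fixed to ECX.

```python
# Toy model: register file as a dict; dec_and_branch is hypothetical.
MASK32 = 0xFFFFFFFF

def dec_and_branch(regs, reg):
    """Decrement regs[reg] with 32-bit wraparound; True means 'branch taken'."""
    regs[reg] = (regs[reg] - 1) & MASK32
    return regs[reg] != 0

def loop(regs):
    """x86 LOOP in this model: the counter is always ECX."""
    return dec_and_branch(regs, "ecx")

def count_iterations(regs, reg):
    """Run an empty loop body until the branch falls through; return the trip count."""
    n = 0
    while True:
        n += 1                      # the loop body would go here
        if not dec_and_branch(regs, reg):
            return n

if __name__ == "__main__":
    regs = {"ecx": 3, "esi": 5}
    print(count_iterations(regs, "esi"), regs)   # counts in ESI, ECX stays free
    print(count_iterations(regs, "ecx"), regs)
```

With a free choice of register, ECX stays available for other uses (shift counts, REP prefixes) inside the loop body.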

Personally, I wish they'd gotten rid of most of the deprecated
instructions when they moved over to the 64-bit version of the chip
and reused all those single-byte opcodes for more important things. I
realize the need for software compatibility, but in 64-bit mode they
didn't need to worry about this. Indeed, it's too bad they didn't just
redesign the ISA for 64-bit mode and use different opcodes (to make
often-used instructions smaller and eliminate the old 8088
instructions that few people use anymore). Yeah, decoding is probably
easier by keeping the old encodings, but given the way they handle the
stuff these days (uOps), I suspect that's not much of an issue.
hLater,
Randy Hyde
