Re: Relocatable/PIC asm (was intra-segment CALL and JMP)



"Hugh Aguilar" <hughaguilar96@xxxxxxxxx> wrote in message
news:1de1151a-e747-42ea-a545-cecadb988b75@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[...]
> I'm only writing in x86 assembly right now,
> because I need a Forth that runs on the desktop computer. It will be
> used to host a cross-compiler that will generate code for micro-
> controllers. The Forth running on the desktop computer doesn't have to
> be particularly fast because cross-compiling isn't a time-critical job
> (I don't know of *any* desktop application that is time-critical). I'm
> just complaining about the x86 register shortage because it is a
> hassle to not have enough registers and to have to continually juggle
> data back and forth between registers and memory. It is more of a
> convenience issue than a speed issue though --- you are certainly
> correct that the x86 is hugely fast. So long as I make at least a half-
> way effort at efficiency, the x86 Forth should be more than fast
> enough for what it is being used for.


As you should know from c.l.f., there are quite a few Forths available
for x86 already, from modern, compilable ANS to ancient fig-Forth.
The Forth Interest Group has a decent archive:
http://www.forth.org/

> Is it true that indirect addressing is shorter and faster when the
> EAX is used, rather than another register?

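Some instructions have a shorter, i.e., more compact, form that uses AL,
AX, or EAX. Those forms take the accumulator as an operand, e.g., with an
immediate or a direct memory offset. In 32-bit code, indirect addressing
through [EAX] is no shorter than through [EBX], [ECX], etc.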
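
To make the size difference concrete, here is a small C sketch that
tabulates a few hand-assembled 32-bit encodings. The byte values were
worked out by hand from the opcode and ModR/M layout, and the names
(struct encoding, encoded_length, print_encodings) are made up for this
illustration, so check them against your assembler's listing before
relying on them. It only shows encoding length and says nothing about
speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One hand-assembled 32-bit instruction: its text and its bytes.
   Illustrative values only; verify against an assembler listing. */
struct encoding {
    const char *text;
    unsigned char bytes[8];
    size_t length;
};

static const struct encoding encodings[] = {
    /* Accumulator short form: opcode 05 followed by imm32. */
    {"add eax, 0x12345678", {0x05, 0x78, 0x56, 0x34, 0x12}, 5},
    /* General form: opcode 81, ModR/M C3, then imm32. */
    {"add ebx, 0x12345678", {0x81, 0xC3, 0x78, 0x56, 0x34, 0x12}, 6},
    /* Indirect loads: same length through EAX and EBX. */
    {"mov ecx, [eax]", {0x8B, 0x08}, 2},
    {"mov ecx, [ebx]", {0x8B, 0x0B}, 2},
    /* EBP as a base needs a displacement, ESP needs a SIB byte. */
    {"mov ecx, [ebp]", {0x8B, 0x4D, 0x00}, 3},
    {"mov ecx, [esp]", {0x8B, 0x0C, 0x24}, 3},
};

/* Number of entries in the table above. */
size_t encoding_count(void)
{
    return sizeof encodings / sizeof encodings[0];
}

/* Length in bytes of table entry i. */
size_t encoded_length(size_t i)
{
    assert(i < encoding_count());
    return encodings[i].length;
}

/* Print each instruction with its length and bytes in hex. */
void print_encodings(void)
{
    for (size_t i = 0; i < encoding_count(); i++) {
        printf("%-22s %zu bytes:", encodings[i].text, encodings[i].length);
        for (size_t j = 0; j < encodings[i].length; j++)
            printf(" %02X", encodings[i].bytes[j]);
        printf("\n");
    }
}
```

Calling print_encodings() from a main() lists the table. The one-byte
saving on the accumulator form applies only when a full 32-bit immediate
is needed; for small immediates an assembler will normally pick a
sign-extended 8-bit form that is the same length for every register.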

Are the AX forms any faster? I don't think so, but I'd have to look that up
....

'reg, reg', 'acc, imm', and 'reg, imm' are the fastest instruction forms.

The more complicated the instruction or the more work it has to do, the
longer it'll take. However, that could still be shorter than multiple
simpler instructions, especially if the instruction is "pairable", i.e.,
pipeline-able, or partially so. Unfortunately for x86, most of the 1-cycle
instructions but only a few of the 2-cycle instructions seem to be
pairable or partially so, and instructions of 3, 4, 5, or more cycles are
generally not pairable. So shorter, faster instructions are probably the
way to go in general, given the timings and pipeline-ability. It's
possible that a RISC-like sequence is far worse than the single
instruction it replaces, so you'll want to check the timings. If the
sequence's total timing is less than, the same as, or only slightly more
than that of the high-cycle instruction, then the RISC-like sequence
should be faster once pipelining is taken into account.

> I'm using EAX as my top-of-stack on the assumption that
> fetching and storing through it will be faster.

I'm not sure if that is true or not. I don't have any mention in my notes
that the shorter AX forms are faster. I don't recall seeing anything
related in the timing sections of the manuals or in the optimization
manuals. If I did, I'd go look it up.

It is true that early tests (Koopman and Ertl, separately) showed that
Forths that kept one (TOS) or two (TOS, NOS) of the top stack items in
registers were faster than pure stack Forths. Keeping the top two stack
items in registers on x86 shortens the instruction sequences for some
Forth words, e.g., SWAP, TUCK, and NIP become much shorter and ROT
slightly shorter. However, it makes other sequences longer, i.e., slower,
e.g., DUP, DROP, and OVER become slightly longer. So, the execution
frequency of each word comes into play. Koopman and Ertl list SWAP, DUP,
and ROT as high-frequency words. ROT is bad either way: pure stack or two
top stack items in registers. The x86 does not like stack shifts: ROT,
ROLL, etc. SWAP is bad for pure stack and excellent for two top stack
items in registers, so the large gain on SWAP has to be weighed against
the small losses on DUP, DROP, and OVER.
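
The SWAP versus DROP trade-off can be sketched with a toy model in C.
Everything here is hypothetical and exists only for illustration: the
struct forth_stack type, the pure_* and cached_* functions, and the idea
of counting memory accesses. It is not taken from Koopman, Ertl, or any
real Forth. It counts loads and stores on the in-memory stack only, not
instructions or cycles, so it shows the direction of the trade-off
rather than real timings.

```c
#include <assert.h>

#define STACK_CELLS 64

/* Toy model: 'mem' stands for the in-memory data stack, 'tos' and 'nos'
   stand for two CPU registers, and 'accesses' counts loads and stores
   on 'mem'. All names are assumptions made for this sketch. */
struct forth_stack {
    int mem[STACK_CELLS];
    int depth;      /* number of cells currently held in mem */
    int tos, nos;   /* used only by the cached layout */
    long accesses;  /* loads + stores performed on mem */
};

static int load(struct forth_stack *s, int i)
{
    assert(i >= 0 && i < s->depth);
    s->accesses++;
    return s->mem[i];
}

static void store(struct forth_stack *s, int i, int v)
{
    assert(i >= 0 && i < STACK_CELLS);
    s->accesses++;
    s->mem[i] = v;
}

/* Pure stack layout: every item lives in mem. */
void pure_push(struct forth_stack *s, int v)
{
    s->depth++;
    store(s, s->depth - 1, v);
}

void pure_swap(struct forth_stack *s)
{
    assert(s->depth >= 2);
    int a = load(s, s->depth - 1);
    int b = load(s, s->depth - 2);
    store(s, s->depth - 1, b);
    store(s, s->depth - 2, a);
}

void pure_drop(struct forth_stack *s)
{
    assert(s->depth >= 1);
    s->depth--;                 /* only the stack pointer moves */
}

/* Inspection helper; deliberately not counted as an access. */
int pure_top(const struct forth_stack *s)
{
    assert(s->depth >= 1);
    return s->mem[s->depth - 1];
}

/* Cached layout: top two items in tos/nos, the rest in mem.
   For brevity the model always keeps two items cached. */
void cached_init(struct forth_stack *s, int first, int second)
{
    s->depth = 0;
    s->nos = first;
    s->tos = second;
    s->accesses = 0;
}

void cached_push(struct forth_stack *s, int v)
{
    s->depth++;
    store(s, s->depth - 1, s->nos);   /* spill old NOS to memory */
    s->nos = s->tos;
    s->tos = v;
}

void cached_swap(struct forth_stack *s)
{
    int t = s->tos;                   /* registers only, no memory */
    s->tos = s->nos;
    s->nos = t;
}

void cached_drop(struct forth_stack *s)
{
    assert(s->depth >= 1);
    s->tos = s->nos;
    s->nos = load(s, s->depth - 1);   /* refill NOS from memory */
    s->depth--;
}
```

In this model, SWAP costs four memory accesses in the pure layout and
none in the cached layout, while DROP costs none in the pure layout and
one in the cached layout.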

> On the old 16-bit systems, BX was typically used because it
> supported indirect addressing and AX didn't. Am I correct that EAX
> is the best choice nowadays?

.... asked the taxi driver ...

BX was used because 16-bit address modes only supported address
combinations using a few registers (BX, BP, SI, DI). SP was not one of
them, so BP was used to access the stack after being set to SP's value at
some point. So for HLLs, BP was typically used as a frame pointer for the
CDECL calling convention. 32-bit address modes support all base registers
except ESP in the same manner, with some limitations (e.g., EBP as a base
always needs a displacement). The 32-bit address modes also have the SIB
modes, which accept any register as the base, including ESP, and any
register except ESP as the index.

Interesting read about BP usage:
http://blogs.msdn.com/b/larryosterman/archive/2007/03/12/fpo.aspx

> Also, what is your opinion of the string instructions involving EDI
> and ESI? They were very important back in 8086 days. Should
> they be used now, or is hand-written code faster?

According to the timings in the manuals, the string instructions are
slower than hand-written code on modern x86. Because they are
"complicated" instructions, they can't be or just aren't pipelined.

> Is it also true that hand-written code is generally faster than LOOP?

LOOP has the same issue on modern x86 as the string instructions. So,
most likely ... It took a lot of cycles (5/6/11+). It probably still does.

> I'm using the CISC instructions mentioned above in my Forth
> because they are convenient. Is there a performance penalty
> for doing this?

Yes. You should use the more RISC-like x86 instructions today.

The optimization manuals say coders aren't using x86 instructions with
memory operands enough, i.e., the 2-cycle pairable forms. x86 programmers
are moving data into registers or onto the stack, manipulating it there,
and then moving it back to memory. Instead, they should use an
instruction that modifies the data in memory directly.

> Unless the speed penalty is egregious, I'll just ignore the issue
> and aim for convenience.

The old CISC-style x86 instructions can't be pipelined, so they slow down
execution. How much depends on how many instructions a specific
generation of x86 can pipeline at once, i.e., more and more as time
passes ...

Certain CISC combinations are still faster than non-CISC, e.g., repeat
prefixes when used with certain string instructions and I/O instructions.

Even so, those instructions are still compact and effective at what they
do. I.e., using them may reduce coding time, programming errors, or
overall complexity. If they're low-use, they won't have much impact on
the overall speed of the code. I mention that because for your Forth
interpreter, you've likely got a set of "primitives" or low-level
functions that do most of the work. Of those, there will be a few that
are executed a lot, i.e., that have a high execution frequency. Those are
the ones that will need to be "fast" on either the host or the target.


Rod Pemberton


