The Case Against RosAsm (#7) (LONG)
From: Randall Hyde (randyhyde_at_earthlink.net)
Date: 01/13/04
- Previous message: The Half A Wannabee: "Re: OMG"
- Next in thread: Betov: "Re: The Case Against RosAsm (#7) (LONG)"
- Reply: Betov: "Re: The Case Against RosAsm (#7) (LONG)"
- Reply: Gerhard W. Gruber: "Re: The Case Against RosAsm (#7) (LONG)"
- Reply: Frank Kotler: "Re: The Case Against RosAsm (#7) (LONG)"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 13 Jan 2004 06:18:26 GMT
RosAsm Can't Optimize
One of the reasons people write assembly code in
the first place is to produce efficient code.
"Optimizers will never replace hand optimized
code" is a common cry of assembly programmers
everywhere.
While this is certainly true, there are certain
optimizations that an assembler (or compiler) *can*
do that, even if they could be done just as well by
human beings, they're much better done by machine
as the optimizations are rather tedious to perform
and very mechanical in nature (i.e., perfectly
suited for a machine to do).
"Branch Displacement Optimization" is one of these
types of optimizations. I've written a separate
essay, which I've attached to the end of this post
(from the assembler/compiler technology area on
the MASM Forum) that explains the issues with
Branch displacement optimization, so I won't go
into the details here.
The first assembler I recall that provided
branch displacement optimization was (IIRC) a
product called "OptAsm" produced by SLR systems.
This was a MASM 5.1 compatible assembler available
in the late 1980s. It was very fast (faster than
TASM) and supported some really neat features like
branch displacement optimization. TASM and MASM
followed suit in the early 1990s. Today, most
commonly-used assemblers (e.g., MASM, TASM,
FASM, Gas, and HLA) support branch displacement
optimization. Even NASM, whose original authors
resisted the notion of this optimization, has
been updated to include this important feature
(thanks to Frank Kotler).
Here's what Rene has recently said about this feature:
>>>>> Rene Tournois:
About the jmps Size Optimizations, this is
a design choice. Given your competency level
in Assemblers design, i cannot rewrite again
and again what i wrote up there to explain to
readers what this design choice is for.
<<<<<<<
In the past, Rene has lamented the fact that
FASM has introduced this feature and that it
really should be something the assembler forces
the user to do because it will make that user
a better programmer and help them write better,
bug-free code (though how forcing the user to
endure the tedium of manually fixing out of range
branches reduces bugs is beyond me).
Every time this subject has come up, Rene has
stubbornly refused to even consider the fact that
the machine is better suited for branch optimizations
than the human programmer.
But good news! Rene seems to be getting attacks
from his own users now. And despite his claims that
he doesn't care whether anyone uses his product or
not, the fact that people are *rejecting* his product
because it doesn't do branch optimization is finally
starting to sink in. Here's a post of Rene's from
the RosAsm Forum:
>>>>>> Rene Tournois:
It seems that some assembly programmers consider
the fact of having the assembler automatically
optimizing the jumps sizes (JMP / Jcc Short/Long),
is an important point when choosing what Assembler to use.
My choice when designing RosAsm, was that the main
important point, with this, was to run an error case,
when a wanted short jump was longer than required,
because, forgetting to write some local Label Declaration
is a very usual error, we all do, and because such an
error, if not pointed out by the Assembler, at
Compile-Time, may be a real hell to point out,
with Run-Time Debugging.
So, is the Short-Long dilemna implemented in
RosAsm, and i will certainaly not kill such an
important advantage.
Nevertheless, given the actual implementation,
it remains perfectely possible to have the Assembler
optimizing the jumps size, in the "Long > short"
direction, what is the only interresting point, in
the optimizing approach.
This can be done a very simple way, without
changing anything at the syntax, and at the
way the errors Manager preverses us from bugs,
by runing the Encoder twice, if some Long jumps
are found be under the Byte Size limit. This could
be done, at the user point of view, by inserting
one another Item, in the Menu, saying [Compile and
optimize JMPs Sizes], or something like this.
<<<<<<<<<
Yeah! Rene is actually responding to demands from
the user base. Yeah! Rene is finally coming around
to realizing that Branch Displacement optimization
is something people demand. Can library support be
too far off?
The only scary thing is that if you know anything
all about branch displacment optimization, you'll
realize that Rene's discussion of the solution
(above) won't work. First of all, it can't be
done by running the "encoder" (code generator)
twice. Optimization is a multi-phase operation
(described later in this post, for Rene's
benefit). Second of all, Rene's refusal to
even *study* anything related to compiler theory
pretty much means he'll ignore the little essay
on this subject I've prepared and go off about how
I am too "incompetent" to understand the issues
(despite the fact that I *have* taught compiler
construction courses before). Oh well, his loss.
Maybe someday RosAsm will actually support
Branch Displacement optimization. Probably about
the same time Rene delivers on his promise to
support conditional assembly :-) (i.e., in a
couple of years). I sure hope he grabs a copy
of Tomasz' FASM assembler and studies that code
first. Otherwise, it's doubtful (based on the
bits and pieces of RosAsm source I've read)
that he's going to do a very successful job
of this.
You've got to ask yourself, "Do I want to use
an assembler that supports Branch Displacement
Optimization today? Or do I want to wait for
Rene to get his act together and add it to
RosAsm, maybe two years from now?"
Cheers,
Randy Hyde
==================================================
This is a short essay that attempts to explain the
basics of "displacement optimization" in machine
code such as on the x86. In posts I've made here
and on various newsgroups (comp.lang.asm.x86 and
alt.lang.asm) it is quite clear that many people
don't understand what this is all about, and even
those who do understand the basic issues can't
believe the problem is as difficult to solve
(optimally) as I claim that it is. In this essay
I'm hoping to explain the process and the problems
in such a way so to make all this clear.
What is This All About?
The first place to start is with the question
"what is branch displacement optimization?" On
certain CPUs (e.g., the x86), certain instructions
have a limited branch range because of the way the
CPU encodes the target address of the branch. For
example, a typical "JE" instruction on the x86 is
two bytes long - one byte for the opcode and one
byte for a relative displacement. This relative
displacement allows the instruction to transfer
control to some instruction within +/- 128 bytes
(roughly) of the JE instruction.
What happens if the target address is *not* within
range? Well, if the target address is within +/-
32K bytes, the 386 and later devices have a
special four-byte version of JE (two-byte opcode
and two-byte displacement) that can transfer
control to the label. If you're operating on a
chip earlier than the 386, you have to emit a
five-byte sequence like the following:
jne skipJmp ;two bytes
jmp target ;three bytes
skipJmp:
On any of these processors, if you want to jump
beyond a +/- 32K range, you have to use the above
sequence with a five-byte version of the JMP
instruction (bringing the total to seven bytes).
In early x86 assemblers (and in a few assemblers
that have been created lately), it was the
programmer's responsibility to manually encode the
displacement sizes of these jumps. Besides being a
pain in the *** to do manually, it was often the
case that the programmer *missed* several
optimization opportunities (hey, who wants to
count bytes in a program) and so the program was
not as optimal as it should have been.
It doesn't take a huge intuitive leap to realize
that this kind of grunt work is *exactly* the kind
of thing computers were designed to handle. As
such, in the late 1980's, various assemblers
started appearing that automatically handled the
correct branch displacements (in most cases,
anyway). I don't personally recall the name of the
product, but the first time I saw this feature was
in a MASM 5.1-compatible assembler (not MASM
itself); TASM had the feature next. I believe
Microsoft added this feature in MASM v6.0. Today,
assemblers like Gas, FASM, and NASM all suport
this feature (and probably others too, I haven't
looked that closely).
Why is this optimization even necessary? Well,
it's pretty obvious that if the target location of
a jump instruction is within +/- 128 bytes, you're
wasting two or more bytes if you're not using the
smaller version of the jump instruction.
Considering that most branches turn out to be
within this range and there are generally quite a
few branches in a program, just using the long
version of each branch can make the program quite
a bit larger than it needs to be.
The Problem
"So what's the big deal?" You're probably asking.
"Why not just scan through the file during
assembly and determine the distance to the target
location and pick the appropriate size for the
branch instruction?"
The problem is, the size of a given branch
instruction may determine the size(s) of some
other branch instruction(s). Consider the
following simple code sequence:
jmp target
<exactly 125 bytes of code>
jmp someOtherTarget
target:
If "someOtherTarget" is within the +/- 128 byte
range of the second jmp instruction above, you can
encode it with only two bytes on the x86;
otherwise you will need at least four bytes. If we
do encode this second jump instruction in two
bytes, then we can encode the first jump
instruction above in two bytes (the actual range
on the x86 is 127 bytes starting with the *next*
instruction after the conditional jump, or -128
bytes before the next instruction following the
conditional jump). OTOH, if the second jump
requires more than two bytes, so will the first
jump. Consider the following degenerate case:
jmp l1
<<125 bytes of code>>
jmp l2
l1:
<<125 bytes of code>>
jmp l3
l2:
<<125 bytes of code>>
jmp l4
l3:
.
.
.
etc.
The *last* jump in this sequence controls the
sizes of all the other jumps in the code! This is
because if the last one is a two-byte opcode, then
the next-to-last one can be two bytes, then the
one before than can be two bytes, then the one
before that...
Unfortunately, there is no easy way (and soon,
you'll see, efficient way) to determine the size
of these jump instructions in a single pass over
the object code the assembler generates. A typical
assembler will do the following:
1. Scan over all the opcodes and determine if any
branches can be reduced in size.
2. If the answer is yes, the assembler reduces the
size of those branches and adjusts all other
instructions whose target addresses and other
values have changed as a result of reducing the
size of the branch instructions.
3. If the answer was yes, repeat steps 1-3 until
you make a pass with no optimizations occuring.
Note, that in the worst case, this algorithm runs
in O(n^2) time (that is, the amount of time it
takes, worst case, is proportional to the square
of the number of branches in the program; double
the number of branches, it takes four times as
long to process them). This isn't very good. But
it isn't terrible, either.
Though the algorithm above looks easy enough, it
turns out to be insufficient. Consider the
following trivial code sequence:
target2:
jmp target1
<< 124 bytes of object code>>
jmp target2
target1:
If we start off assuming that the branches are
long (four bytes each), then a quick pass through
this section of code will suggest that
optimization is not possible. However, were you to
arbitrarily select one of these jumps and make it
a short jump, the other jump would also be short
and both jumps would be within range. Therefore,
an assembler that starts off using large
displacements and shrinks them down can miss
optimization opportunities such as this one.
Another possibility is to start with all short
displacements and expand them as necessary. The
reason many assemblers start with long
displacements and shrink them is for reasons of
robustness -- the code will work, even if it's
less optimal, if a defect causes you to miss an
optimization that could have been introduced into
the object code. OTOH, if you start with short
displacements and extend the ones that are out of
range, a defect in the "correction" process
results in defective code generation. As it turns
out, optimization isn't guaranteed when you start
small and work big, either. So those assembler
authors who choose the robust aren't necessarily
missing out on something.
OTOH, *most* branches are short to begin with, so
starting off with short displacements and
increasing the size of those that are out of range
is going to be faster (O(n^2) >> O(m^2) if n>m).
The hard part for people to believe is that
starting small and working big doesn't necessarily
guarantee an optimal sized object file. In fact,
and this is the part that really blows people
away, sometimes the smallest object file is
achieved by making certain branches larger that
could have been encoded as short branches!
Here's a short example of some MASM source code
that demonstrates this problem:
.386
.model flat, syscall
.code
_HLAMain proc
jmpLbl: jmp near ptr target
jmpSize = $-jmpLbl
byte 32 - jmpSize*2 dup (0)
target:
_HLAMain endp
end
Here's the assembly listing:
Code:
.386
.model flat, syscall
00000000 .code
00000000 _HLAMain proc
00000000 EB 1C jmpLbl: jmp target
00000002 = 00000002 jmpSize = $-jmpLbl
00000002 0000001C [ byte 32 - jmpSize*2 dup
(0)
00
]
0000001E target:
0000001E _HLAMain endp
end
Note that the "object code" is 30 bytes long
(1Eh). Also note that the jump instruction is two
bytes long (that is, it's a short jump).
Now look at what happens when we force the jump to
be a five-byte jump:
.386
.model flat, syscall
.code
_HLAMain proc
jmpLbl: jmp near ptr target
jmpSize = $-jmpLbl
byte 32 - jmpSize*2 dup (0)
target:
_HLAMain endp
end
;;; Listing:
.386
.model flat, syscall
00000000 .code
00000000 _HLAMain proc
00000000 E9 00000016 jmpLbl: jmp near ptr target
00000005 = 00000005 jmpSize = $-jmpLbl
00000005 00000016 [ byte 32 - jmpSize*2 dup
(0)
00
]
0000001B target:
0000001B _HLAMain endp
end
Note that although the jump is now *five* bytes
long (rather than two), the object module is
actually *shorter* (27 bytes [1Bh] rather than
30).
Therefore, you cannot guarantee that you've got
the shortest possible program by simply making all
the jumps as short as they can be and letting it
go at that.
Unfortunately, as this last example demonstrates,
it's quite difficult to determine the optimal
selection of short and long branch instructions in
order to optimize your code. In fact, there have
been some mathematical proofs that show that this
problem is "NP-Complete", which means that the
only way to guarantee you've got an optimal
solution is to try all possible combinations of
short and long displacements in the object file
and select the (or one of the) combination(s) that
is the shortest. Unfortunately, this is an
intractible problem. The algorithm given earlier
runs in quadratic time (O(n^2)), which isn't
great, but still tractible. The best known
algorithms for NP-Complete problems run in
exponential time (i.e., O(2^n) where 'n' is the
number of branches in the program). This means
that adding a *single* branch doubles the amount
of time needed to determine the optimal output
module. To put it bluntly, no computer available
today and, indeed, no computer we can ever imagine
building will be fast enough to handle the
branches found in a reasonably-sized application
program.
Yet assemblers available today *do* perform branch
displacements. So how come they don't run real
slow? The answer is that these assemblers *don't*
guarantee that you'll get an optimal program.
AFAIK, for example, none of them handle the last
case I gave. Most of them simply use the "start
large, work small" or "start small, work large"
algorithms given earlier (for those who have had a
data structures class/algorithms analysis course,
you can think of this as a form the "greedy"
algorithm). As such, while they produce *good*
code (typically better than what someone will
produce by hand), they do not necessarily produce
optimal code.
In fact, some assemblers limit the amount of
optimization they do in order to keep assembly
times low. TASM is famous for this; TASM has the
/M# directive that lets you specify the maximum
number of passes (this is, generally, done to
actually *increase* the default number of passes
that TASM performs). I don't have any information
on MASM. However, the fact that FASM usually
produces better code that MASM suggests that MASM
stops at one point or another, before FASM does. I
have no information on how NASM or Gas handles
branch displacement optimization, so I cannot
comment on their performance.
That's not to say that there is anything wrong
with the way these assemblers do their
"optimization." Most of the time they probably do
produce the optimal program, and when they don't,
they produce something very close to it. That's a
reasonable trade-off given the intractible
alternative.
Although branch displacement optimization is a
tedious chore and it's difficult to add to an
assembler, it is a prime example of one of those
things that computers are much better suited for
than human beings. A decade and a half ago it
might have been reasonable to expect the
programmer to manually take care of branch
displacement optimizations. But given the fact
that most reasonable assemblers today offer this
facility, it's moved into the "must-have" category
for anyone developing a serious assembler.
- Previous message: The Half A Wannabee: "Re: OMG"
- Next in thread: Betov: "Re: The Case Against RosAsm (#7) (LONG)"
- Reply: Betov: "Re: The Case Against RosAsm (#7) (LONG)"
- Reply: Gerhard W. Gruber: "Re: The Case Against RosAsm (#7) (LONG)"
- Reply: Frank Kotler: "Re: The Case Against RosAsm (#7) (LONG)"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]