Re: In search of the perfect Disassembler




Alex McDonald wrote:

| ===code snipped
| There's a great deal of confusion going on here, with both sides
| misunderstanding each other. The halting problem is to do with the
| _inspection_ of an algorithm. Given any arbitrary algorithm, there is
| no algorithm writable that can _inspect_ it and say definitively that
| it will halt.
|
| It is not difficult to disassemble this example above. Charles Crayne
| presented the simplest example,
|
| label: jmp label
|
| and that's trivial. So what does a disassembler do? Charles pointed out
| one analogy that's worth exploring, that of lossy jpeg compression.
|
| To get a bit more formal, assume that an assembler source program is a
| string S. (This could be a picture in bmp format). Then an assembler is
| a transformation function f that f(S) -> T, where T is the code (or the
| jpeg). There are possibly an infinity of S that transform to a single T;
| think of all the different types of assemblers (or compilers) that
| produce the same T from different source representations. There is
| therefore a loss of information content from S to T.
|
| A disassembler is a transformation function g(T) -> S'. There are many
| S' for a specific T; we may generate C, or assembler for RosAsm or HLA
| or...
|
| However, there should be no loss of information from T to S'; in
| other words f(g(T)) -> T; a disassembled program T should be compilable
| back to T. For example, the simplest example would be a disassembler
| generating a string of dbs for MASM from T; this will assemble under
| MASM to T. (RosAsm seems to fail on that score; it's broken by this
| definition).
|
| But there's no perfect disassembler, that is, one that has the property
| g(f(S)) -> S where S is the original assembler source. To do that
| requires that there is no information loss from S to T, that is, there
| is only one S that will generate a given T. Even though we restrict S
| to a single language, say RosAsm, there are many ways to specify the
| same T. (There may be languages where there is only a single S for each
| T; they would probably be unusable due to their lack of expressive
| power).

Why easy when it can be made complex ? :)

| Back to the halting problem. Can an assembler suffer from it? Certainly,
| with an x86 architecture, yes, if f(S) -> T, T', T'', ... This is the
| case of optimising JMP and replacing long jumps with short jumps. There
| is no algorithm that can inspect S and guarantee that it will halt
| having generated the optimal T (say, the shortest). All assemblers that
| do this run a number of passes, and give up after a maximum number of
| passes.

| Our disassembler suffers from the same fate. Here's an example;
|
| proc: call label1
| ret
| label1: pop eax
| jmp [eax]

Ok.
my disassembler (KESYS/HEXEDIT 1994) produces this info in the first
pass already: (if set to branch oriented mode, of course)

xx00 e8 01 00 00 00
1.[esp]=xx05 ;eip=xx06; esp[xx05]=e0ff58c3 ;esp-4
xx06 58
2.eax=[esp]=xx05; esp+4 ;eip+bytecount (+1 yet)
xx07 ff e0
3. eip=eax=xx05; SET redundant-flag (next = target)
xx05 c3
4.eip=[esp-4](from previous);esp+4 "EAX=xx05"=label1

[exactly the same way as my old Z80 ROM-burner (1979) disassembles,
I can't remember who produced it, it's still somewhere in my garage]

Even it will brainless follow all the detouring steps and
not seeing it's just a single return, it wont fail to
correct follow the code flow here without any need to run it.

I use the shorter (even it's breaks the call/ret pairing):
call L1
L1: pop eax
; ret ;and I continue right here

btw, I'd recommend the faster:
call L1
L1: mov eax,[esp]
ret

I only encounter problems when hardcoded (FAR)jump/calls
to outside of the known (like winAPI and similar) occur.

| Question; is the ret code or the value 0xC3? Is it dead code, or
| executed? We can only determine that correctly _by running it_.

See above.
If the returned value becomes data or remain a code label
depends on following actions.

| And before you think that you inspected the code, you didn't;
| you ran it in your head. Only by running it can you determine
| that it's a ret. You might think you can simulate the running;
| that is, try to sort of pseudo execute it. That's just like running
| it, only slower.

Here our opinions and experience become apart very far.

| A static inspection won't do. The best static inspections will require
| heuristics to improve their "goodness", and there are (like viruses)
| just too many possibilities to get close to 100%.

| So we have to run it to get closer to perfection; and if it never
| halts... The halting problem; given any arbitrary algorithm, there is no
| algorithm writable that can _inspect_ it and say definitively that it
| will halt.

Again:
if one of the code-branches end in an inderminant, perhaps endless
loop, or jump to locations not covered/outside/unknow/ then I consider
to mark this path as 'possible end' and continue with the rest of code.
What can me stop on a halt-sign ? :)

I wouldn't run a code just to see how it does if I had a smart
tool which can tell me ahead that it will hang. :)
My code analyser will never run the code in inspection,
it will track all values and calculate on DATA and CODE block
ranges and may detect complex functions rather than viewing
just single code lines or stop on indeterminant parts.

If the disassmbler sees any code part twice than it may continue
with the next path or elsewhere, there is no reason to stop when
a possibly endless loop is found.

Sure there are some tricky code parts which can exceed the power
(storage dept) of an analysing tool easy, but this are rare.
Self-modified/self-extracting/.. code may need several copies of the
whole story before it can succed.

| So how good can we get? I'm not sure what metrics we could use for
| "goodness" when it comes to disassembly. Certainly, a string of dbs
| seems inadequate. But perfection is not possible.

The only problem for "perfect" I see, is the lack on a doubtfree
syntax for long/short/immediate size/.. on all assemblers/compilers.

But to achive identical function by recompilation I can live
with a re-compiled file different in size by a few bytes as a
"not perfect" "but complete" solution.

__
wolfgang



.



Relevant Pages

  • Re: In search of the perfect Disassembler
    ... Given any arbitrary algorithm, there is ... Then an assembler is ... A disassembler is a transformation function g-> S'. ... A static inspection won't do. ...
    (alt.lang.asm)
  • Re: maz v. taz
    ... Master the Power Basic Programmer! ... difficult and inconvenient to use a separate editor, ... Those who want to use your assembler (why? ... disassembler didn't live up to expectations, ...
    (alt.lang.asm)
  • Re: The fun goes on
    ... Why are you so obsessed with hating Randall Hyde? ... resulting in people like me refusing to read your RosAsm documentation. ... > for, say around 3 years of deliriums about "The Perfect Disassembler", ... > explained how to write an Assembler, by the guy who never wrote ...
    (alt.lang.asm)
  • Re: Theoretical Computer Science and Disassemblers
    ... Detecting infinite loops is an undecidable problem. ... The "halting problem" has been ... guaranteed to halt and produce an appropriate response (generally "yes" ... view of a disassembler, this means that it is quite possible that your ...
    (alt.lang.asm)
  • Re: Theoretical Computer Science and Disassemblers
    ... > I could not solve the halting problem, and this never was my attempt. ... > And as I myself wouldn't hop in circles to see when/if a loop terminates, ... You can return false positives (claiming it will halt when, ... > At the moment my disassembler got a 'branch oriented' option. ...
    (alt.lang.asm)

Loading