Re: General structure of disassembler



[Jongware] wrote:
Last night I wrote ...


I have a lot of hobbies, and one of these is writing disassemblers for
various CPUs and data file formats. Unfortunately I always seem to run into
the same problem: data representation!
The usual approach is to start with a fixed amount of undifferentiated
'source' -- usually large to *very* large (well in the Mb's). I want to
display this, and have the user (or the program itself) group bytes together
as opcodes, strings, data, whatever.
Additionally, I'd like to insert comments next to this defined bytes as well

You mean: you add them manually, as annotation ? I'd put annotation in a seperate textfile, with their anchors represented as text, too. I'd expect them to be relatively small, so reprentation would not be a big priority.


as 'any' text on the lines immediately above or below (think labels and
such).

If you really want to add names to variables and arguments, you'll probably need more. Maybe you'll need some concept of scope.


I have experimented both with flags connected directly to the source bytes,
and dumping the whole source into a linked list structure. The pro of the
first option is that display-to-offset is 1-1 -- the con being that (a)
either it is impossible to insert additional lines, or (b) with a flag
'additional lines available', each source byte has to be scanned for this
value whenever the cursor moves up or down a few lines. The pro of the

Ahh, you are talking user-interface now. The basic problem would be to
syncronise 'cursors' (into {code,annotation,display*N} )
I would probably (for sanity!) just keep them as separate 'addressable'
domains, and navigate through them seperately. (but keep the cursors tied together, or at least 'within their box'.


linked lists is that it's indifferent to line insertions. Each linked list
item knows to which source byte it points and how many of these are
'included' into its line -- a number which also may be zero. The con is,
we're talking about MBs here! On loading a large file creating the initial
(untagged) state costs seconds, even on my 3GHz, and when that's done moving

Don't be distracted by the size. mmap()ing an x MB file can be done in a second, and scanning it to collect the jumps&labels will take a few seconds, also. But that's only at startup time...



I don't understand what you mean by 'line insertions'. If it is only
an annotation thing, it would not be a big deal: either you add 'text' to an existing 'anchor', or you'll have to create an anchor.



down to the end also takes a long time.
I have an idea about combining the two approaches (linked lists which may
also span multiple lines, thus circumventing the initial load delay) but I
foresee that when nearing complete disassembly almost every single byte will
have its own link -- again, thousands and thousands of linked list items,

For a 1 MB binary, you could afford to use 10MB, even 100MB of memory, couldn't you. It just pointers, pointing to nothing, but that's life!
Plenty of RAM, so it seems ...



so
while initially the program may run like hell, at the end it will... run
like hell, I suppose.
Any idea's?


Keep them separated.


.. and after a good night's sleep (not!) came up with the following. The good ol' binary tree.
- store the first line number (= 0) and count (= total # lines) in a tree
- divide the source in two blocks, storing the first line number and count of each of those 2 blocks
- keep doing that until each individual block is easy to manage with sequential linked lists (a few k's should do it)


When lines are split or joined all I have to do is propagate the new line

I don't know what a 'line' is. I think annotation should be hooked
to labels, adrresses, or blocks. You could want to add a 20 line annotation to a 4byte block!


[offtopic: there is an anecdote (think it was a DDJ interview) about Donald Knuth. Why he did *not* want to (make) create the ultimate WYSIWYG TeX-editor: because we don't need it. rendering a page is far more simpler (and faster!) than shifting it up by one pixel.
BTW That's my wording, not DEK's]


number and counts up its own leaf of the tree up to the root. In that way the root node will always hold the total number of lines, and finding any individual line is easy enough. Joining and splitting lines will change the

What is this line-thing, again? Is this just about displaying it, than ?

balance of the tree -- but probably not enough to justify re-iterating. Anyway, this structure doesn't even have to be saved along with the data, it can be rebuilt easy enough.
The only caveat is when the last line of a block is joined to the first of


For the 'block' thing: blocks are naturally nested.
Beginning with the bineray as one thing, and carving out chunks later on vs. detecting smallest blockt and amalgating is a bottom-up/top-down thing. (-> totally irrelevant)


For your data-structure: there would probably be a choice of a block doing 'absolute addressing' vs addressing relative to it's enclosing block. (that would add even more 'address-spaces' to your model)

HTH,
AvK
.



Relevant Pages

  • Re: General structure of disassembler
    ... > I have a lot of hobbies, and one of these is writing disassemblers for ... > various CPUs and data file formats. ... > linked lists is that it's indifferent to line insertions. ... number and counts up its own leaf of the tree up to the root. ...
    (comp.programming)
  • General structure of disassembler
    ... I have a lot of hobbies, and one of these is writing disassemblers for ... various CPUs and data file formats. ... I have an idea about combining the two approaches (linked lists which may ... while initially the program may run like hell, ...
    (comp.programming)