Re: What do I do with Art Of Assembly?

From: Herbert Kleebauer (klee_at_unibwm.de)
Date: 01/17/04


Date: Sat, 17 Jan 2004 18:06:37 +0100

Beth wrote:

> > Please give us your definition of an assembler (in a few
> > sentences and not in hundred of pages). Then we can first
> > discuss this definition an then, using this definition, prove
> > whether HLA is an assembler or not.
>
> Okay, here's a "first shot" at that:
>
> Assembly language is an augmented human-readable notation for
> formatting data more easily for _direct_ presentation to the CPU...

??????????

>
> Note what I have and haven't said in the above...

Now exactly that happened which I tried to avoid with the words:
(in a few sentences and not in hundred of pages)

I can't answer all this, so I will first give a general
statement and then just a few remarks to your posting.

In an extremely simplified view (ignoring any I/O) a computer
is a combination of a CPU and memory. The memory stores
instructions for the CPU as well as the data for the these
instructions. When a program is started, a part of the
memory is initialized with a sequence of bytes (the
code and data for this program) and the execution is
started by loading the program counter with an address
within this byte sequence. Programming is the generation
of this byte sequence.

The simplest way of programing is to use a hex editor.
If you know all the opcode values for all the different
processor instructions and calculate all operand and
branch address by hand, a hex editor is all you need
to generate this byte sequence, but it is very tiring
for larger programs. As an example we could write the
following program using a hex editor:

a0 07 00 fe c0 eb f9

Because it is very difficult to remember all the
numbers for the opcodes and to calculate all addresses
by hand, we could make life a little bit easier by using
symbolic names for the opcodes and addresses. Replacing
the opcode "a0" by "move.b" and the address "0007" by
"dat" etc. we get an equivalent representation of the
above program which is much easier to write and to
understand:

loop: move.b dat,r0
        inc.b r0
        br.b loop
dat: dc.b 10

If this program is feed to the assembler, we get the same
byte sequence as above:

00000000: a0 0007 loop: move.b dat,r0
00000003: fe c0 inc.b r0
00000005: eb f9 br.b loop
00000007: 0a dat: dc.b 10

loop........... 00000000 dat............ 00000007

We still have the functionality of the hex editor: the
dc.b, which is the main instruction of any assembler
because it is able to replace any other instruction.
Note that "move.b" also is not a processor instruction
(there are no processor instructions at all in an
assembler program, the processor instructions are
stored in the output file of the assembler) but an
assembler instruction which instructs the assembler to
write the appropriate opcode to the output file.

While the use of an assembler simplifies the generation
of a program it doesn't hide anything. You are using
nothing but the language of the processor (using
symbolic names for opcodes and addresses). After resolving
the symbolic names you can translate any line in the
assembler source to the corresponding byte sequence without
looking at any other line in the source. This makes
the assembler language the ideal language to learn
to understand the working of a CPU, which is the
main reason to learn assembly programming.

Such a "real" assembler surely isn't well suited to write
"real" applications. You can improve it by adding macro
support, but a Macro Assembler isn't an assembler anymore,
like a motor bicycle isn't a bicycle (In German we have to
words: Zweirad for a vehicle with two wheels and Fahrrad
for a Zweirad driven by muscle power. I asked an English
colleague who told me, that in English there is only the
one word bicycle which normally is interpreted as Fahrrad
and not as Zweirad). You can't translate any line of the
source independent from all the other lines anymore
(which is also true if you add branch optimization).

You can also add some HLL features to make programming
still more easier, but the result has nothing to do with
an assembler. Finally you will get a real HLL compiler
with inline assembler support. This inline assembler
mostly is a "real" assembler again.

> It's an "augmented" notation because we're NOT only concern with
> "machine instructions"...this you should know better than most, in
> fact, as your favourite _directive_ is "DB"...that's NOT a "machine
> instruction", it's an assembler directive...other examples would be
> "BITS 16/32" in NASM or would be implied in RosAsm's "[ Variable: D$
> ? ]" (merely a syntactical variant of "DD" there)...there are, of
> course, further "augmentations" in various assemblers to the "machine
> instructions" such as macros, section declarations, MASM's "HLLisms"
> like "invoke" and others...but those first examples I've given: "DB",
> "BITS 32" and "[ Variable: D$ ? ]" have been selected _specifically_
> to show that _NO_ x86 assembler is able to live without at least
> _SOME_ "assembler directives"...even DEBUG understands "DB" and "DW"

There are only assembler instructions and no processor instructions
in an assembler source file. The processor instructions are in the
output file of the assembler.

> in its assembly mode...NASM would be _completely unable_ to work out
> the encodings between 16-bit and 32-bit mode without the "BITS"
> directive (because the correct encoding to be used entirely depends on
> the "mode" the CPU is in...this _could_ be set via a command line
> option but that would stop mixed 16-bit / 32-bit coding, such as a DOS
> program that goes into protected mode)...

What do you want to say here? It is obvious that you have to tell
the assembler for which processor mode the output should be
generated (even my trivial assembler has this instruction:
seg16 / seg32 ).

> Further, note that I've said "formatting data"...this is because, of
> course, CPU instructions are _also_ "data"...they are just "numbers"
> in memory locations...what turns them into "code" is that the CPU
> fetches them as part of the code stream and interprets them, according
> to its encoding rules, as particular instructions...this is another
> important distinction for things like placing "DB" directives into a
> code stream to "emulate" a machine instruction that's not natively
> supported by the assembler...
>
> And this distinction is further of importance because it calls into
> question what is specifically meant by "instruction", anyway...ask
> wolfgang and he'll tell you that his special hex-encoder-assembler
> (much lower-level than most of what we use ;) has the ability to
> specify whether a particular "MOV" instruction is a "load" or "store"
> version of that instruction...because, indeed, on the x86, it's
> possible to encode even this most basic (and other most basic)
> instructions in _more than one way_...we can also look to other
> commonly used instructions and question them on this facet..."MOVS" is
> NOT a real machine instruction...there is a "MOVSB" encoding and then
> there's a "MOVSW/D" encoding (the actual size - 16 or 32 bit - being
> selected by the segement "mode", the presence or absence of prefixes
> and so forth)...this is why, on some assemblers, the "MOVS"
> instruction requires a "dummy" operand to specify the desired size
> (the opcode itself has NO operands at all, this is part of the
> "directive" nature of "MOVS" in many assemblers)...

What has all this to do with the definition of an assembler?
If there are to different opcodes, then either you have
to use two different symbolic names for these instructions
or if you use the same name, then it must be possible to
choose the right opcode by looking at the operands.

 
> Hence, "MOV" is a _directive_..."MOVS" is a _directive_...and then,
> seeing as we've had a discussion about the "JMP" / "Jxx" instructions
> and the issue of "jump size", then we can also see that this is a
> _directive_ too on a lot of assemblers...again, there is a short
> encoding, a near encoding and a far encoding (for the uncoditional
> "JMP")...furthermore, "JMP", like CALL, is capable of _direct or
> indirect_ operands (e.g. there's an encoding that uses an immediate in
> the instruction and an encoding which jumps indirectly via a memory
> address)...

Same answer as above.

 
> Hence, in total, there are - wait for it - _5_ different encodings of
> "JMP" (ignoring prefixes), which is a total of _9_ different actual
> forms of "JMP" as 4 of the encodings may take a size prefix to select
> 16/32 bit, depending on the current CPU "bitage")...does your
> assembler provide 9 different "JMP" mnemonics? If not, then where's
> our precious "1:1" requirement? Note, also, that any x86 assembler is
> going to find "1:1" incredibly difficult because of Intel's own
> encodings...some opcodes "double up" and may be 16 or 32 bit,
> depending on _both_ the prefix and the current CPU mode...worse, the
> meaning of the prefix reverses (toggles) in the different
> "modes"...so, that, simply, due to Intel's own encodings, it is
> _impossible_ to determine, from looking at the opcodes alone, what we
> actually have...it's NOT possible...you may "presume", of course, from
> knowing that it's Windows code (got to be 32-bit due to the OS rules
> and the prefixes must, therefore, "flip" to 16-bit mode) or by taking
> a 16-bit or 32-bit "perspective" on the matter and seeing which is
> "most likely"...but, strictly, no x86 assembler will be able to make
> the determination without the use of a "BITS"-like directive or to
> supply "ambiguous" mnemonics, leaving the "bitage" determination to a
> human being...
>
> You could once have claimed "1:1" and so forth on a different
> architecture...but Intel's messy encodings actually mean that this
> _cannot_ - when we're pedantic to the facts - be true for any x86
> assembler, even when you use "ambiguous" mnemonics (and I've never
> seen a _single_ assembler which does use "ambiguous" mnemonics like:
> "mov (e)ax, (e)bx" :)...not that it matters because, note, that if
> we've got a constant, then the interpretation is _crucial_...is it
> "mov ax, 1234h; int 21h" or "mov eax, 123421CDh"? Get it wrong and
> you've messed up your disassembly in a major way...not only are the
> instructions nothing alike but you've "swallowed" up another
> instruction if you wrongly take a 32-bit interpretation...

1:1 means, you can convert any line in the source code to
the corresponding output byte sequence (after resolving the
symbolic address) independent of any other line in the source
code. But, as stated above, you have to tell the assembler
for which processor mode (16/32 bit) the output should be
generated.

> Not even DEBUG can lay claim to being able to operate without
> directives in its assembly mode - it has "DB" and "DW" - and even if
> DEBUG was expanded to properly include 32-bit instructions and MMX/SSE
> and so forth (apparently, someone has done something like this :), it
> would still not be most people's "assembler of choice", exactly
> because it's limited...remembering that it doesn't use "labels" and
> you have to work out all those addresses yourself...
>
> Hence, we've got a pretty damning set of examples: "MOV" (also, "ADD",
> "ADC", "SUB", "SBB", "OR", "AND", etc. by extension of the same
> principle :), "MOVS" (also, "LODS", "SCAS", "STOS", "INS", "OUTS" by
> extension of the same principle ;), "JMP" (also, "JC", "JNC", "JO",
> "JNO", etc. by extension of the same principle...worse, throw in the
> "synonyms" too..."JE" is the exact same instruction as "JZ", it's just
> a convenient synonym...which makes that most certainly a
> "directive")...and there's _MUCH MORE_ than merely this...but, seeing
> as we've just covered a massive amount of the basic x86 instruction
> set, I think the point's been made, hasn't it?

Sorry, no point found.

> If your definition of x86 assembly includes "1:1" or "no directives"
> then your definition describes an impossibility...the Intel encodings
> and suggested mnemonics ("Intel syntax") themselves mean it's not
> possible...as I was explaining this point to Frank before, it's
> actually amusing because AT&T syntax _better_ fulfills that (but still
> can't manage it)...
>
> Plus, our "directives" are pretty powerful, you know...they select the
> encoding based on the _operand_ in a large amount of cases..."MOV"
> alone can cover a number of encodings..."MOV al," cuts things
> down...and "MOV eax," or "MOV ax," are dependent on the "bitage"...as
> well as the fact that most of these encodings are "reverseable"...and
> movements to segment register (CS, DS, etc.) or system registers
> (CR0, etc.) are encoded completely differently and have a much more
> limited application (can't write an immediate directly into a segment
> register with "MOV", for example)...yet, all these things are being
> covered by a _single_ "MOV" mnemonic!! Sounds suspiciously "HLL", in
> fact...

As I said above:
  What has all this to do with the definition of an assembler?
  If there are to different opcodes, then either you have
  to use two different symbolic names for these instructions
  or if you use the same name, then it must be possible to
  choose the right opcode by looking at the operands.

>
> Your notion of "purity" is actually a _myth_...the CPU itself and its
> encodings do not permit it themselves...
>
> But, you know, as interesting as that is, that _ISN'T_ my defence of
> HLA at all...that's just pointing out, before we even start, that
> you're trying to argue for something that itself is non-existent...in
> short, you've failed before you've started because this "purity" only
> exists in people's imaginations...if you really relate _any_
> assembly - that you may deem "pure" - to the actual encodings, there's
> an awful lot of jiggery-pokery involved..."isn't there?", she says to
> all those who've played with the encodings and might have attempted an
> assembler of some kind...

I have failed with what? I think you have failed to understand
what an assembler is.

 
> What, in fact, does HLA do? How does it differ from what you call
> "pure ASM" but I prefer to call "traditional ASM" (in answering this
> question, we'll begin to see why I make that distinction and stick by
> it)? Let's compare differences, shall we?
>
> Data declarations, under any assembler, are _directives_...there is no
> "Intel syntax" for them nor "standards" nor any agreement...the best
> we have is "DB", "DW", "DD" and so forth as a loose "de facto"
> standard for this...but then again, MASM lets you call these "BYTE",
> "WORD", "DWORD"...RosAsm calls them "B$", "W$", "D$"...NASM has a
> different set of names for uninitialised and initialised versions
> ("DB" vs. "RESB", "DW" vs. "RESW", etc. :)...some assemblers support
> structures, some don't (plus "unions" too)...and, note, though
> "structure support" usually includes some "dot notation" thing, it's
> quite possible to simply allow _data_ to be declared in structures
> (for the naming and "sizeof" benefits) but insist that instructions
> are written in a more "pure" way ("[ ebx + SecondMember ]"), not that
> there _is_ any "standard" or Intel ever specify this element of
> assembly language at all in their documentation...
>
> None of this relates to "machine instructions"...so, all of these
> variant implementations are as good as one another...and,
> fundamentally, "data is data is data"...a byte is always a byte, a
> dword is always a dword...it matters not what programming language you
> use, this is always the case and data declarations are semantically
> equivalent...whereas, we couldn't be too sure what the ASM for C's
> "VarA++" might equate to (more than one possibility there), we _know_
> 100% that "char VarA = 3;" _is_ equivalent to "VarA db 3"...whereas
> HLLs may "insert" things and we don't know what's happening "behind
> the scenes" with the _CODE_, there is no such problem with _data
> declarations_...
>
> Hence, HLA recognises this and uses 100% HLL-inspired data
> declarations (but also includes ASM-inspired alignment options too,
> for precise location of data...note, NO such thing under supposedly
> "real" RosAsm...it takes care of all the data and its alignment and so
> forth in totally 100% "HLL" way...for example, in one "title", I
> finish with a word data declaration and at the start of the "next"
> title, I have another word data declaration...these "titles" are all
> listed equally so when I say "next", I'm referring to the location of
> the "tab" in the control, goodness knows if that actually relates to
> its location in the source file...further, this data all gets thrown
> into the data section...there is NO method supplied by this "real"
> assembler in order to choose where - code or data - nor is a single
> thing mentioned about "alignment" or "padding" in the documentation...
>
> So, are my two word-sized data declarations aligned to the byte, word,
> dword, paragraph or page? Don't know...because they are less than
> 32-bits, does RosAsm "pad" these out to dword-sizes for "alignment"
> reasons? Don't know...
>
> Hence, we don't actually know with this supposedly "nothing behind the
> scenes real assembler" whether it's actually "safe" and "correct" to
> grab _both_ word-sized data declarations in one dword-sized operation
> (that is, "mov eax, FirstWord" which "picks up" the second word at the
> same time because - we Hope though RosAsm does not explain nor
> guarantee - that the second word is right next door in the data
> section...and we have no choice at all about it being in a data
> section, as RosAsm - in a way that's 100% "HLL" to me - deals with
> data "behind the scenes"...you can't control what it does, you don't
> really know what it's doing (alignment? Padding? No explanation of
> that in the documentation)...

If you can't determine where the variables are stored in the binary
output of a RosAsm program (by reading only the source code) then
RosAsm also isn't an assembler, but this doesn't make HLA an assembler.
 
 
> Well, sorry, if that's "pure ASM" in demonstration then HLA is far
> better...I can put the data wherever I like and I can align it...I
> _know_ - both because it's clear from the source and because Randy has
> detailed all the issues in his comprehensive reference manual and
> AoA - whether it's "safe" to grab two WORD-sized pieces of data with a
> single 32-bit instruction...better yet, I can use the "union"
> declarations to _specifically write out_ that this data can be
> accessed in more than one way, giving me the ability to assign
> meaningful symbolic names and make it totally understandable to anyone
> else reading that this data gets used as one 32-bit _and_ two 16-bit
> values...and there's not an ounce of difference between this "union"
> and using "DW" twice, then an "ORG" to "backtrack" and declaring a
> "DD" over the top of them...in a sense, all "union" does is give a
> much easier to read notation for getting the assembler to do exactly
> that...needless to be said, there is NOT the slightest difference
> whatsoever in what data actually ends up in the final file, regardless
> of syntax..."data is data is data"...
>
> HLA's use of a full and comprehensive "HLL-like" data declarations is
> the thing that I found absolutely brilliant...formatting out things
> like "tables of table to pointers to tables to procedures", which
> _are_ very useful because many of the best algorithms can use very
> complex data structures...it's a doddle...you can create a "template"
> structure with "data", "previous" and "next" pointers and then declare
> your "doubly linked list" (a common enough requirement for many good
> algorithms, right? :) with complete ease...doing the same using "DB"
> alone is a total fudging nightmare in comparison...

If this is brilliant, why you don't you use C++, C# or java? I never
said, that it isn't easier to write programs with HLA than an
assembler. All I said is, that HLA isn't an assembler an therefore
isn't suited to teach students assembly programming (to help them
to understand the working of the CPU). But if you don't want to
do assembly programming, I don't see any advantage of using HLA.
Use a HLL compiler and if necessary inline assembler or link to
code written in a real assembler.
 
> What about HLA's "procedure" assistance? You know, don't ask me too
> much about it, as I never use it...I switch them off with a couple of
> directives at the start of the source code and then do everything
> myself...because, you see, it's entirely true...you can simply
> _IGNORE_ these things...they DO NOT in any way "get in your
> way"...data typing? Same thing, in fact...HLA can do it but if you
> declare things as "BYTE", "WORD" and "DWORD" then the only thing HLA
> checks is _size_, just the same as any other assembler...those silly
> "IF" statements? Again, wouldn't know...never use them...but I did
> once have a go for some example code I posted up here and the HLA "IF"
> is much like the MASM one, except that, in fact, it's _more limited_
> and guarantees that it _WON'T_ play around with registers, something
> MASM does NOT guarantee...

If in some points MASM is even worse than HLA, makes this HLA an
assembler?
 
> The macro stuff, of course, is all _compile-time_ pre-processing...it
> all goes on during compilation and is all done by run-time...these are
> merely just methods of helping a programmer to write their code by
> wrapping up "common" things in simple macros...HLA macros are more
> powerful - that is, they provide more facilities for writing them -
> but they don't differ (other than syntax) to the essence of macros on
> any other assembler that has such things...
>
> The machine instructions themselves? Well, the operands are reversed
> and put into brackets...that's about all we can say there...nothing
> else is any different about those at all...oh, fair enough, SSE
> instructions aren't currently supported because Randy didn't have a
> machine to test that out on at the time...you can, of course, "DB"
> your own versions and use a macro so you'd never know the
> difference...many assemblers also don't have SSE, simply because those
> instructions turned up after the assembler was last updated...and
> nothing stops them being included, Randy just didn't have them
> originally for _practical_ reasons (no SSE-capable machine, can't test
> it so best to leave them out until Randy can get an SSE-capable
> machine...in the meantime, if it's needed then use "DB", just as you
> would do with any other assembler for non-native instructions)...
>
> You're fighting a phantom; Because you know what's really "HLL" here?
> This attitude...things are _ONLY_ "behind the scenes" in your mind
> because you've not even downloaded HLA and given it a go to see what
> it is...RosAsm's data declarations are pure HLL because only HLLs give
> me no choice or explanation of what's happening (heck, _C_ and _C++_
> are lower-level in some regards)..."Intel syntax" is HLL...these
> notiongs of "purity" demonstrate that you've not really _studied_ the
> encodings sufficiently and considered the ramifications of them to
> realise that "pure 1:1" is a complete nonsense on Intel CPUs...
>
> I'm a low-level, do-everything-yourself, explicit coder and I use
> HLA...and there's NO conflict between the two whatsoever...I know
> that's totally NOT what many people want to hear but it's _fact_ and
> it would be wrong of me to lie otherwise just to support some "phantom
> world" in your own imagination...to massage your ego that you're a
> "real" coder, with all the macho superiority nonsense that comes along
> with that...unfortunately for you, as you'd like it were I "clueless
> newbie" about this, I was not born yesterday about these things and do
> know what I'm talking about...I've hand-rolled my own PE files and
> it's not particularly difficult or complicated...you might think using
> "DB" to do this makes you some "macho" kind of guy but it's not that
> impressive...so, you read the PE file format documents? Big bloody
> deal...

You have done it, Rene has done it, I have done it and surely also
Randy and many many others have done it. And as you said, it's trivial.
I never said something else. All I said was, that everybody who
learns assembly programming should learn to understand ANY byte in
the generated binary file. Because the DOS com file format is much easier,
I also suggest, that this file format should be used for learning
assembly programming.
 
> Because, let's cut to the chase here, this has very little to really
> do with ASM - "pure" or otherwise - it all to do with being the
> "biggest" in the group...and Randy's sitting at the top in everyone's
> perceptions so you gain "kudos" as "guru", if you topple him from his
> position...it's an awkward little thing regards the "guru" status that
> Randy _has_ read those boring (and they really are boring, aren't
> they? ;) "decompiler theory" papers and he took the time to understand
> them...because what's a real pain about Randy getting in the way of
> this "supreme guru award" is that he takes a _professional_ attitude -
> he looks these things up and _makes sure_ he understands them -
> whereas you want the title but you don't want the work...you'd like to
> be able to "hack" your way to the top, yeah? "I'm a practical coder
> and this makes me the best!", right? Except, unlike some of the
> teachers you may have met, Randy can walk the walk just as well as he
> can talk the talk...one tends to keep quiet about this for
> "politeness" reasons but the size and complexity of HLA, the
> productivity of spitting that out while producing _professional_
> quality documentation _and_ he still has time to help people out and
> fight an argument or two on a few groups (hey, I stick to one group
> because I couldn't deal with the volume handling CLAX as well...I just
> "look-in" from time to time now instead :)...he's in a _totally
> different league_ to most of us...different ballpark, sunshine...

I never care about persons. If somebody says something, which I
think is wrong, then a say this and it doesn't matter if it is
a carpenter, a professor or somebody who got the nobel price.
   

> And _THAT_ is what this all really about...whereas other assemblers
> are rooted in '60s ideas and just try to desparately "extend" that to
> modern OSes, HLA just said "sod it!" and looked again at assembly with
> a modern perspective...but Randy _has_ stayed true to the roots and it
> is a _true assembly language_ at the core...and the "old guard" are
> just disturbed that they are "stuck" with the old ways and can't
> comprehend the "new"...
>
> This is simply an actual demonstration of the famous saying: "New
> ideas never succeed, they simply wait around for the old ideas - and
> the keepers of the old ideas - to eventually die off"...

If you invent a motorbike call it motorbike and not bicycle.
Nobody has said HLA is bad (ok, I think a HLL is better),
but HLA isn't an assembler and it shouldn't be called an
assembler and it is a very bad thing (for the students) if
it is used to teach assembly programming.