Re: Programming project #1



On 29 Sep 2005 05:31:45 -0700, "Evenbit" <nbaker2328@xxxxxxxxxxx> wrote:

>
>laura fairhead wrote:
>> Hi ppl,
>>
>> I was reading a thread in this group about ideas for projects
>> to get good at assembly language and was reminded of something
>> I meant to do for a long time and that was compiling a list
>> of programming projects/puzzles for people to use, as a "sourcebook"
>> of ideas. I used to have a file of these things but lost it a
>> longtime ago so I'm starting afresh and this time I will put them
>> onto my website so anyone can access them. Anyway I'm mainly
>> writing to post this first one which occurred to me the other
>> day as I heard a radio broadcast which sounded like coded
>> information ( all was computer generated digitised speech
>> of letters of the alphabet; Charlie, Papa, Uniform...).
>> I'd be most delighted if anyone produces a working program
>> to see what they come up with! Well, have fun... :)
>>
>
>Welcome back Laura,
>
>Seems you 'material girls' just gotta have your collections. :) Well
>I stumbled upon a list intended for Python programmers, but some might
>find it interesting to try these using ASM:
>
>http://www.daniweb.com/techtalkforums/thread32007-1.html
>
>Nathan.

Hello Nathan :)

Thanx, I'll take a look sometime for inspiration.

Well... I just managed to find a very old article I wrote and never
published that is related to the problem. I've edited and revised
it and corrected some errors in the maths, hopefully it is all
okay now but if anyone spots anything wrong could they please
tell me (comments welcome).

bestwishes
laura


======================================================================


The Substitution Cypher - Making & Breaking by Laura Fairhead
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Terms & Conventions
~~~~~~~~~~~~~~~~~~~

# is used to mean 'some constant'

; here and there I will use the semicolon to seperate
assembly commands rather like C programmers do but
assembler programmers are disallowed (unless of course
you have a little pre-processor;)

:= read 'is defined as'

BNF a sort of minimal form of BNF notation is sometimes used
where

a | b represent item a (exclusive)-OR item b

[ ] represent an optional item
[ ]* represent an item repeated 0 or more times
[ ]+ represent an item repeated 1 or more times

cryptology the science (and art) of code making/breaking

cryptanalysis the science of code breaking

cryptography the science of code making

it is a common blunder to try to make codes before one
has studied what can be done in the way of breaking them.
the example given in this text is one such case of this.

cleartext is the name given to the text to be encrypted. This is the
secret that an attacker is trying to get. in crypto circles
the term 'plaintext' is also used.

cyphertext is the name given to the encrypted text.

polymorphic capable of changing form, pressumably an arbitary one

permutation given a group of objects in a set order a permutation
of this group is an arbitary rearrangement of the order
of the objects.

group this objects arbitary the order the of set in a
permutation given a of a rearrangement group of

pgrou siht jectsob biarytar eht derro eht fo set ni a
peutamrtino gveni a fo a.........

Introduction
~~~~~~~~~~~~

The first cypher that everybody ever learns is the old Roman one where
you have two wheels. Both wheels have the letters A-Z printed equidistance
around their radii. The inner wheel is fixed and the outer wheel is
rotatable. To create a cypher the user simply rotates the outer wheel
to some position and fixes it. They then take their cleartext and for every
letter, find that letter on the inner wheel and replace it with the
letter that is at the same radial position on the outer wheel.

So if they had rotated the outer wheel simply one step out of place
(from an original position where the two lined up A-A,B-B,C-C,...) every
letter would be transposed one letter up/down in the alphabet.

You might have:-

Inner Wheel
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Outer Wheel
B C D E F G H I J K L M N O P Q R S T U V W X Y Z A

Then if your cleartext was:-

THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG

The cyphertext would be:-

UIF RVJDL CSPXO GPY KVNQFE PWFS UIF MBAZ EPH


This cypher is part of a general class of cyphers called substitution
cyphers.


Polymorphic Virii
~~~~~~~~~~~~~~~~~

A polymorphic virus has the property that it can (supposedly arbitarily)
change form. This is intended as a weapon against scan-location, in that
if the form of the virus is not well-defined any more it is not possible
to simply search for a sample of it in the host body anymore. Techniques
against polymorphic virii should have to be sufficiently complex such as
to defy their easy detection.

Virii writers would like not to be concerned with this weapon on devel-
opement of the main code, and hence the primary technique used is that of
an encryption shell:-

+-----------------------------------+
| |
| DECRYPTOR |
| |
| decrypts, and possibly relocates |
| the virus body |
| |
| |
+-----------------------------------+
| | +---------------------------+
| ENCRYPTED VIRUS BODY |------->| POLYMORPHIC ENGINE |
| | | |
| performs virus action and uses the| | creates ENCRYPTOR |
| polymorphic engine to create a | | and DECRYPTOR |
| new virus copy | | |
| | | the encryptor is used to |
+-----------------------------------+ | create the next encrypted |
| virus body. the decryptor |
| is prepened to the body |
| so that the new virus can |
| "untangle" itself. |
+---------------------------+

The encryptor/decryptor pair are usually (but see below!) created
by using a base cypher method which consists of a simple loop over
each byte of the destination:-

MOV SI,OFFSET startbody
MOV CX,length

lop:
LODSB

some sequence of:-
XOR AL,# , ADD AL,#, SUB AL,#, ROL AL,# etc...

STOSB
LOOP lop

This is a bit of an over simplification really but at this point
it suffices to notice that the instructions used to change the byte
all have instructions which will undo them.

For example, if the encryptor is generated with the code:-

ROL AL,1
ADD AL,2
NEG AL

The decryptor is generated with the code:-

NEG AL (NEG AL is the opposite of NEG AL)
SUB AL,2 (SUB AL,2 is the opposite of ADD AL,2)
ROR AL,1 (ROR AL,1 is the opposite of ROL AL,1)

It is quite simple to make a program that will do this. You pick
random instructions,values and use them to write the encryptor code.
Then you simply repeat that operation but (i) replace the instructions
with their opposites (ii) write the code backwards, this will generate
the opposite (decryptor) code.

If you look at the op-codes for the instructions you will find that
it is not a very difficult task to tabulate them and get your generator
going.

The next problem is that you will still have patterns in the
decryptor. Both the start and the end of the loop are always fixed,
so the virus is never going to change in respect of these bytes.
Also there is a difficulty with the code inside the loop. Since it
uses such a limited set of instructions it should be quite easy to
recognise it for what it is.

There are two techniques for dealing with these problems.

(i) JUNK CODE

Certain instructions and sequences of instructions that do not affect
the overall operation of the code can be randomly inserted. The first
line of this defence against identification goes:-

XCHG AX,AX
XCHG BX,BX
XCHG CX,CX
.
.
INC AX;DEC AX
INC BX;DEC BX
.
.
PUSH AX;POP AX
PUSH BX;POP BX
.
.

etc, etc....

The above instructions/sequences are all basically NOP's (but be
careful to be aware when something is a NOP and when it is 90%, for
example INC AX;DEC AX interfers with the flags). You can simply make
a great big table of these sorts of instructions and get the encryptor/
decryptor generator to randomly insert them as it feels like it (of course
they are not necessary for the encryptor so you might put in a control).

Some virii go even further and start jumbling up the control flow
by creating dummy subroutines and jumping to them on whim, or segmenting
the code by using JMP's.

Once you've got your junk code generator going you can integrate it
with the cypher engine in such a way that the junk is inserted in the
code of the loop start and the loop end also. This way the virus should
be more difficult to identify by just looking for a static loop start
or end (since now there won't be one).

(ii) REGISTER ALLOCATION

A further method that is used is register allocation. This comes in
2 forms. The first thing you could do is to code the cypher generator
so that it will first choose a register for each of the loop variables.
The opcodes generated then will use these registers so that they are
not fixed anymore. This can get pretty complex, especially since you
can make your code so that it uses AL to load the byte, performs some
operations on it, then moves it into BL, performs some operations on that,
only then saving the result. If the register usage, further, actually changes
during the loop, the code should be even more difficult to identify.

As you can see there is no end to how complex you can make your
polymorphic engine (you could end up writing a compiler!). I will not
cover more ground here as this is only intended to be a background piece.
If you want to read more about virus writing there is plenty of
documentation to find on the internet (some references included at
the bottom of this article).

Substitution cyphers - Making
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Take any cypher that is generated using the form:-

LODSB

[ XOR AL,# | ADD AL,# | SUB AL,# | ROL AL,# | ROR AL,#
| NOT AL | NEG AL | INC AL | DEC AL ]+

STOSB

Since this is just really a permutation function, I wondered whether
it was possible to generate any permutation function with just this
template.

The first thing you can do is to reduce the instructions, since some
are expressible in terms of others:-

SUB AL,# is the same as ADD AL,-#
ROR AL,# is the same as ROL AL,8-#

NOT AL is the same as XOR AL,0FFh
NEG AL is the same as XOR AL,0FFh ; INC AL

That gets rid of a few, now you only have:-

ADD AL,#
ROL AL,#
XOR AL,#
INC AL
DEC AL

You can wipe 2 out in one fell swoop:-

ADD AL,# is the same as repeat(#) INC AL
DEC AL in the same as repeat(0FFh) INC AL

also

ROL AL,# is the same as repeat(#) ROL AL,1

And curiously, note that:-

XOR AL,080h

is the same as

repeat(080h) INC AL

( did you ever notice how ADD AL,080h and SUB AL,080h were the same
operation if you ignored the CF? )

The fact is that XOR AL,abcdefgh (where abcdefgh are the binary digits
of the operand) can be substituted with:-

repeat(080h*a) INC AL
ROL AL,1
repeat(080h*b) INC AL
ROL AL,1
.
.
.
repeat(080h*h) INC AL
ROL AL,1

Now we only have 2 atomic operations left namely

ROL AL,1 and INC AL

Any of the functions using a sequence of the XOR,ROL,ADD,SUB... class
of operations can similarly be expressed by an equivalent sequence of
these two elements. The fact that some of these sequences will now be
ridiculously long does not bother us, we are solely concerned with viewing
the cypher function from a mathematical standpoint.

Subsequently I am going to use the notation

ROL # meaning repeat(#) ROL AL,1
INC # meaning repeat(#) INC AL

Permutation functions are commonly written in the form:-

(0 1 2 3 4 .............. 0FDh 0FEh 0FFh)
(1 2 3 4 5 .............. 0FEh 0FFh 000h)

With the top line indicating the domain, the bottom the result of the
function. If only one could prove that using our 2 operators it would
be possible to create the function:-

(0 1 2 3 4 .... j-2 j-1 j j+1 j+2 ........ 0FDh 0FEh 0FFh)
(j 1 2 3 4 .... j-2 j-1 0 j+1 j+2 ........ 0FDh 0FEh 0FFh)

If you had this simple swap function you could use it to build any
permutation function of your choice.

First, notice that this function swaps elements 0 and j. In order to
swap any 2 elements a and b you could just use the property that:-

swap(a,b) = swap (0,a)
swap (0,b)
swap (0,a)

Once you have the ability to swap it is a well-known mathematical truth
that you can create any desired order you like. Imagine a collection of
different coloured balls all in a line (if you visualize a lot of this
stuff like this it gets a lot easier). You have to get them into a certain
specified order however you are only allowed to do it by performing a series
of swaps. It is straighforward to do this; simply proceeding from one end
of the line to the other. For each place you check if the ball is already
in the correct place there is nothing to do, else you swap it with the
correct ball (which is always further down the line). Applying this to our
case where we have 256 values, it is plain to see that any order can be
achieved after a maximum of 256 swaps (well 255 actually, because the last
is never going to get swapped). It follows that if only we had this swap
function we could build any permutation function using a sequence of
swap functions.

After staring at the problem in the face for some deal of time and not
making much ground I decided that rather than look for a direct solution
it would be interesting to try to find some perhaps more easily solvable
sub-problems instead. Much thanx to Paul Hsieh through who's kind help
steered me this way and eventually to solving this problem.

Further from the mathematical notation illustrated above for representing
permutation functions, any permutation function may be divided into a set
of cycles. Given any number a, and a function f it is obvious that the
sequence:-

{ a,f(a),f(f(a)),f(f(f(a))),...... }

must eventually repeat (since the number of values is finite). Any permut-
ation function can therefore be divided into a series of these cycles and
they indeed characterise the function itself. Mathematicians will commonly
represent a permutation function in a form where they omit any values which
stay unaffected by the function and write the other ones in the order that
they cycle in groups deliminated with brackets.

I try to make this clearer with some examples

(i)

(0 1 2 3 4 5 6 7 8 9 A B C D E F)
(1 2 3 4 5 6 7 8 9 A B C D E F 0)

Would be represented simply as:-

(0 1 2 3 4 5 6 7 8 9 A B C D E F)

(ii)

(0 1 2 3 4 5 6 7 8 9 A B C D E F)
(0 1 2 3 4 7 8 9 A B C D E F 6 5)

Would be represented as:-

(5 7 9 B D F)(6 8 A C E)

The 0-4 are omitted as they stay the same.

(iii)

(0 1 2 3 4 5 6 7 8 9 A B C D E F)
(9 1 2 3 4 5 6 7 8 0 A B C D E F)

Would be represented as:-

(0 9)


You can see that this is a much more useful form of representing the
function, since it is highly illustrative of the STRUCTURE.


!! from here on ALL values are in HEXADECIMAL


I constructed a program that would take an input in the form

[ I # R # ]*

This represents your INC/ROL function, so if I wanted to test the
function:-

INC 49
ROL 1
INC 5F
ROL 7
INC 3
ROL 2

I'd just use the command line I49R1I5FR7I3R2

The program would output for me the function in the mathematical notation
illustrated above. This was an essential tool in investigating the nature
of functions using INC and ROL. When investigating theoretical things
one should never be afraid to do a bit of number crunching.

After a bit of intuition and a bit of dabbling I found the first very
interesting function, namely:-

INC FF
ROL 1
INC 2
ROL 7

This produces the permutation function (0 80). This really got me hooked
because what we have here is a function that affects only the 2 values.
Okay, so it's not our swap function, but it's along the right lines.

After some more dabbling I found another very interesting combination:-

ROL 1
INC 1
ROL 7
INC 80

This yields the sparkler (80 81 82 83 84 ...... FC FD FE FF). If you can
visualize it it starts to get very bright; we have one function with swaps
the number at the beginning of the first half of our values with the number
at the beginning of the second. We have another function which rotates the
second half of our values around. It is quite easy to see that all you need
to do is rotate the second half so many, swap, then rotate the second half
effectively back, in order to get a swap function. This swap function will
only swap 0 with a number j where 80<=j<=FF, but that's quite good enough
we must be almost done.

If you simply define f:=(ROL 1,INC 1,ROL 7,INC 80) and
g:=(INC FF,ROL 1,INC 2,ROL 7). You can then define h(n)=f^n.g.f^-n
Where f^-n would be repeat(n)(INC 80,ROL 1,INC FF,ROL 7), the logical
inverse of f^n.

The function h(n) is then equivalent to swap(0,100-n), or:-

(0 100-n)

so swap(0,v)=h(100-v)

Of course there are a couple of bits to clear up here; v has to fall
in the range 1 to 80, also we want to swap any 2 arbitary values.

Both problems are solved in one throw by falling on the property that
any two values must have a difference between 80 and FF. This is so because
in modulo arithmetic (where all the numbers wrap from FF to 0) there are
always two differences between any 2 values. Just in the same way that
you can read a clock face with the minutes approaching the hour, or you
can read it with the minutes passing the hour. You can always read the
time such that the number of minutes is less than or equal to 30.

5:45 could be said to be 15 minutes to six, or it could be said to be
45 minutes after five. If you wanted to always just have 30 minutes or
less in the minutes part, you could always choose a way of reading the time
such that your goal would be satisfied. The opposite is hence always true;
you could always read the time as having the minutes greater than or equal
to 30.

17:36 36 minutes past 5
18:01 59 minutes to 7
01:29 31 minutes to 2

If you have two values a,b both in [0,FF] you have 2 differences (or
deltas). Here;

0...........a........b..........FF

delta1=b-a
delta2=100-b+a

We can take it that a<=b, since swap(a,b) is the same as swap(b,a) so
your algorithm can just take a as being the smaller value without any
further problems.

Note that;

INC 100-t
swap(0,v)
INC t

Is the same as;

swap(t,t+v)

If delta1>=80 then we would have t=a, otherwise we want t=b.

Thus we would start by incrementing by 100-t;

if(b-a)>=80
t=a
else
t=b
INC 100-t

The difference v, would be b-a when t=a and 100-(b-a) when t=b
so;

swap(a,b)={
if(b-a)>=80
t=a; v=b-a
else
t=b; v=100-(b-a)
INC 100-t
h(100-v) ;swap(0,v)
INC t
}

Or in its grand entirety:-

swap(a,b)=
{
;
;prologue so that we always have a<=b
;
if(a>b)
{ t=a;a=b;b=t } ; or XCHG in our mother tongue
;
;find tranposition factor
;
if(b-a)>=80
t=a; v=b-a
else
t=b; v=100-(b-a)
;
;tranpose into range
;
INC 100-t
;
;swap
;
repeat(100-v) { ROL 1;INC 1;ROL 7;INC 80 }
{ INC FF;ROL 1;INC 2;ROL 7 }
repeat(100-v) { INC 80;ROL 1;INC FF;ROL 7 }
;
;tranpose back
;
INC t
}

Technically speaking all the INC's and ROL's in the above should have
the prefix "emit". This is a term compiler writers use to mean generate
*this* code. The INC's and ROL's are not part of the algorithm, they
are the program generated by it.

Of course, for those of you who like challenges there is a further
question. This algorithm illustrates how it is possible to give any
arbitary permutation using a code sequence of only INC's and ROL's.
( and consequently, of course, fills our aim that we could make ANY
substitution cypher using XOR's,ADD's,SUB's,NOT's,......... )

The question then suggests itself of how to find the OPTIMUM sequence
of INC's and ROL's that would give this arbitary permutation. A companion
problem to this might be finding the optimum sequence if you re-extended
the instruction set back so that you could use any XOR,NOT,ROL,ADD, etc..

I'm sure this is not a problem for the faint hearted ! I tried something
similar with logic simplification years ago and found myself getting
nowhere fast, however it is a most fascinating subject. I believe Mr Hsieh
is putting this problem on his puzzle page so I would look out there
for further reading...


Substitution cyphers - Breaking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I have studied dozens of polymorphic engines in virii and was simply
amazed by the apparent dearth of one iota of cryptographic knowledge among
their creators. It is most unfortunate that so great an effort was being
put into making the decryptor section of the code as polymorphic as possible
while little attention was given to the cypher method itself.

This is because the substitution cypher has to be one of the most trivial
of all cryptographic defences to break. I do not imbue myself with much
more than a very little knowledge of cryptography however was able to
design an algorithm that would be able to scan-locate a substition cypher
encrypted virus body with 100% efficiency and at speed.

The question you must ask is that if someone with very little cryptographic
know-how can achieve this, how worried must the AV experts be about it all?
I am sure they were laughing all the way to the bank, while scores of
virii writers thought they were being ingenious, the AV people were simply
performing the same tired old routine of scan-locating the unwanted
machine code.

If you assume that the cleartext is posessed, a substition cypher on it
will always have the same structure.

( 0 1 2 3 ....................... FC FD FE FF)
(o0 o1 o2 o3 ....................... oFC oFD oFE oFF)

So in order to scan for the original you simply look for the same
pattern that the original code makes, eg; given a source code of-

01 05 09 01 04 02 05 05 04

Convert it into the expression;

offset test (oxx= no test)
0 o01
1 o05
2 o09
3 =offset(0)
4 o04
5 o02
6 =offset(1)
7 =offset(1)
8 =offset(4)

This is then used by a scanning loop to locate the code in memory.
It's a wheeze to do this and very quickly too.


Epilogue
~~~~~~~~
Never make the mistake of assuming that the complexity of method of
composition of a problem implies the complexity of solving it.

I believe my point is highlighted in a most explicit form here with
the substitution cypher. One un-versed in the mathematics of what one was
actually doing may think that it should be more difficult to break an
encryption engine:-

LODSB
INC AL
ROL AL,3
XOR AL,0AAh
SUB AL,071h
NEG AL
DEC AL
DEC AL
XOR AL,055h
STOSB

Than one that used:-

LODSB
XOR AL,0AAh
STOSB

A code maker who assumed this would be making a serious blunder. No
matter how many instructions they added, their attacker would duefully
un-code their code without any additional effort whatsoever.

There is a common understanding of this sort of mistake generally
in mathematics. It is the mistake of soley relying on one's intuition.
Any scientist has to realize that intuition is a blunt and sometimes
dangerously inaccurate tool on which to rest.

So is the substitution cypher valueless as a cryptographic defence?

On the contrary, conceptually ANY and EVERY conceivable cypher
method MUST by definition be expressable as a substition cypher.
Given a set of cleartexts C={C_1,C_2,C_3,C_4,......,C_n} a given
cypher creates a cyphertext E={E_1,E_2,E_3,E_4,.....,E_n} where
E_i corresponds to C_i. In order to be decyphered there must be
a one-to-one correspondence between E and C, else it would not
be possible to deambiguiate encyphered texts from one another.
This means that a given cypher f is (albeit in abstract) really
just a substition mapping:-

(C_1 C_2 C_3 C_4 C_5 ................ C_n-2 C_n-1 C_n)
(E_1 E_2 E_3 E_4 E_5 ................ E_n-2 E_n-1 E_n)

The decryption function f^-1 is then:-

(E_1 E_2 E_3 E_4 E_5 ................ E_n-2 E_n-1 E_n)
(C_1 C_2 C_3 C_4 C_5 ................ C_n-2 C_n-2 C_n)

It may not be immediately obvious how this property is useful.


bibliography/references
~~~~~~~~~~~~~~~~~~~~~~~

usenet sci.crypt FAQ for intro
pobox.com/~qed Paul Hsieh web site

codebreakers site ftp://ftp.coderz.net
cypherphunks


======================================================================




--
echo alru_aafriehdab@xxxxxxxxxxxxx |sed 's/\(.\)\(.\)/\2\1/g'
.