How a linker works (continued)



In the last installment we looked at object files and what they
contain.

Some people insisted that I was generalizing too much: there could be C implementations without object files (like C interpreters), and C implementations that do not link separately compiled files but just parse and digest each module, leaving the whole code generation step to the linker, which works from an unknown representation.

Granted, weird implementations and special options may exist. Here I am speaking about the most common case, where the compiler produces traditional object files, stored on disk somewhere.

Those object files, viewed abstractly, contain:
(1) A symbol table that specifies which symbols are exported and which symbols are imported
(2) Several "Sections" containing the data of the program (code instructions, initialized tables, and just reserved space)
(3) A series of relocation records that specify which parts of the data (code or tables) must be patched by the linker to insert the external symbols required by the module
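
As a rough picture of the three parts listed above, here is a minimal C sketch. All type and field names (ObjSymbol, ObjSection, ObjReloc, ObjFile, count_imports) are invented for illustration; they are not the COFF structures and not what lcclnk uses internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of an object file; names are illustrative. */

typedef enum { SYM_EXPORTED, SYM_IMPORTED } SymKind;

typedef struct {
    const char *name;     /* symbol name, e.g. "_main" */
    SymKind     kind;     /* defined here, or needed from elsewhere */
    int         section;  /* index of the defining section, -1 if imported */
    uint32_t    offset;   /* offset inside that section */
} ObjSymbol;

typedef struct {
    const char    *name;  /* ".text", ".data", ".bss", ... */
    const uint8_t *data;  /* NULL for space that is only reserved */
    uint32_t       size;  /* size in bytes */
} ObjSection;

typedef struct {
    int      section;     /* section whose bytes must be patched */
    uint32_t offset;      /* where in that section */
    int      symbol;      /* index into the symbol table */
    int      type;        /* kind of patch to apply */
} ObjReloc;

typedef struct {
    ObjSymbol  *symbols;  size_t nsymbols;
    ObjSection *sections; size_t nsections;
    ObjReloc   *relocs;   size_t nrelocs;
} ObjFile;

/* Count the symbols this module expects other modules to provide. */
size_t count_imports(const ObjFile *f)
{
    assert(f != NULL);
    size_t n = 0;
    for (size_t i = 0; i < f->nsymbols; i++)
        if (f->symbols[i].kind == SYM_IMPORTED)
            n++;
    return n;
}
```

A real object format stores these tables as file offsets and packed records rather than pointers, but the division into symbols, sections and relocations is the same.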

The linking process
-------------------

The linker opens all the object files that it receives and builds a symbol table. In this table we have several sets of symbols:

(a) The set of defined symbols, not in the common section. All these symbols have a fixed address already.

(b) The set of symbols in the common section

(c) The set of undefined symbols that have been seen as externals but whose definition has not yet been processed.

Symbols can be moved from the undefined set into the common set or into the defined set, and from the common set into the defined set.

This needs some explanation. Suppose you have in the file file1.c the following declaration:

int iconst;

The symbol ‘iconst’ will be assigned to the common section that is initialized to zero at program startup. But consider what happens if you include ‘file2.c’ in the link, that contains the declaration:

int iconst = 53433;

The linker will move the symbol ‘iconst’ from the common section to the data section. The definition in file1.c will be lost. If you relied on "iconst" being zero at startup, you are now wrong.

And there are worse things that can be done:
file1.c:
int buf[256];

file2.c:

int buf[512];

The linker will leave ‘buf’ in the common section, but will set its size to the bigger value, i.e. 512. This is harmless, but beware if you add a definition in a file3.c:

int buf[4] = {0,0,0,0};

Your table will have a size of just four positions instead of 512!!
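
The rules at work in these examples can be sketched in a few lines of C. This is only an illustration of the behavior described above, not the code of lcclnk or of any other linker; the names SymState, LinkSymbol and merge_symbol are made up, and treating a second initialized definition as an error is an assumption, in line with the remark quoted further down that linkers usually catch multiple definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical symbol states, mirroring sets (a), (b) and (c) above. */
typedef enum { SYM_UNDEFINED, SYM_COMMON, SYM_DEFINED } SymState;

typedef struct {
    SymState state;
    size_t   size;   /* size in bytes; meaningful for common and defined */
} LinkSymbol;

/*
 * Merge one more occurrence of a symbol into the linker's table entry.
 * Returns 0 on success, -1 when two initialized definitions clash.
 */
int merge_symbol(LinkSymbol *entry, SymState seen, size_t size)
{
    assert(entry != NULL);
    switch (seen) {
    case SYM_UNDEFINED:
        /* A plain reference never changes what we already know. */
        return 0;
    case SYM_COMMON:
        if (entry->state == SYM_UNDEFINED) {
            entry->state = SYM_COMMON;
            entry->size = size;
        } else if (entry->state == SYM_COMMON && size > entry->size) {
            entry->size = size;      /* keep the biggest common size */
        }
        /* If already defined, the definition wins and its size stays. */
        return 0;
    case SYM_DEFINED:
        if (entry->state == SYM_DEFINED)
            return -1;               /* duplicate definition */
        entry->state = SYM_DEFINED;  /* overrides undefined or common */
        entry->size = size;
        return 0;
    }
    return -1;
}
```

Feeding it the three array declarations above in order (assuming 4-byte ints, so 1024, 2048 and 16 bytes) leaves the entry in the common set with size 2048 after the second file, and in the defined set with size 16 after the third.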

This can be tested, for instance, with the following two files:
file t1.c
int tab[12];

file t2.c
int tab[256];
int main(void){return 0;}

Linking t1.c and t2.c with MSVC 8, we obtain an executable *without any warnings*, not even at the highest warning level.

In the linker of lcc-win I added a warning:
in t1.obj warning: '_tab' (defined with size 48)
is redefined in t2.obj with size 1024

The GNU linker doesn't emit any warning either:
root@ubuntu:/tmp# gcc -Wall t1.c t2.c
root@ubuntu:/tmp#

The explanation commonly given for this behavior is that any definition in the "common" section (uninitialized data) is a "tentative definition", valid only until another definition is seen by the linker.

Dave Hanson, one of the authors of the original lcc compiler, told me this when we discussed this problem:

jacob:
>> is char *p;
>> a "tentative definition"?

Dave Hanson:

<<quote>>
For the record, the declaration for p is indeed a tentative definition, but that status persists only until the end of the compilation unit, i.e., the end of f1.c. Since there's no subsequent external definition of p, the tentative declaration acts exactly as if there was a file-scope declaration for p with an initializer equal to 0. (See Sec. 3.7.2 of the ANSI Standard, which I think is Sec. 6.7.2 of the ISO Standard). As a result, p is null at program startup--assuming there are no other similar declarations for p.

This example illustrates nicely a problem with the common storage model: You can't determine whether or not a declaration is also a definition by examining just the program text, and it's easy to get strange behaviors. In this example, there was only one definition, which passes without complaint from linkers. In the stricter definition/reference model, linkers would complain about multiple definitions when combining the object code for f1.c and f2.c. This example also shows why it's best to initialize globals, because linkers will usually catch these kinds of multiple definitions.

The common model also permits C's (admittedly weak) type system to be violated. I've seen programmers declare "int x[2]" in one file and "double x" in another one, for example, just so they can access x as a double and as a pair of ints.

For a good summary of the four models of external definitions/declarations, see Sec. 4.8 in Harbison & Steele, C: A Reference Manual, 4th ed., Prentice-Hall, 1995.

<<end quote>>
------------------------------------------------------------------------------

Relocating all symbols
----------------------

Let's come back to our linker, however. I will use lcclnk and Windows as examples, but on Unix and many other systems the operations done by the linker are very similar.

The next thing to do is to go through all symbols, and decide whether they will go into the final symbol table or not. Many of them are discarded, since they are local symbols of each compilation unit.

Global symbols need to be relocated, i.e. the ‘value’ of the symbol has to be set to its final address. This is easy now that the position of the section that contains the symbol is exactly known: we just go through them setting the value field to the right number.
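
A minimal sketch of that pass, assuming each global symbol records the index of its section and its offset inside it, and that the final address of every section has already been decided. The names GlobalSym and relocate_symbols are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      section;   /* index of the section that contains the symbol */
    uint32_t value;     /* in: offset within section; out: final address */
} GlobalSym;

/* section_base[i] is the final address assigned to section i. */
void relocate_symbols(GlobalSym *syms, size_t nsyms,
                      const uint32_t *section_base, size_t nsections)
{
    for (size_t i = 0; i < nsyms; i++) {
        assert(syms[i].section >= 0 &&
               (size_t)syms[i].section < nsections);
        syms[i].value += section_base[syms[i].section];
    }
}
```

With made-up section addresses 0x401000 and 0x402000, a symbol at offset 0x10 of the first section ends up with the value 0x401010.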


The algorithm outline is simple:
1. Read the relocation information from the object file.

2. According to the type of relocation, adjust the value of the symbol. The relocations supported by lcclnk are just a few: the PC-relative relocation (code 20), the normal 32-bit relocation (code 6), its variant without the image base (code 7), and two types of relocations for the debug information (codes 10 and 11).

3. Save the position within the executable file where the relocation is being done in the case of relocation type 6 (normal 32 bits relocation), to later build the .reloc section if this is needed.

Normally this is needed only when generating a DLL, since executables aren’t relocated under Windows.

The .reloc section of the executable is data for the program loader: it tells the loader which addresses it should patch when loading the file into memory.
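
Step 3 amounts to keeping a list of the patched positions. Here is a hedged sketch of such a list, with invented names (FixupList, record_fixup); it says nothing about the actual layout of the .reloc section, which has its own format. Only the absolute 32-bit relocation is recorded, because a PC-relative displacement stays correct when the whole image is loaded at a different address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define RELOC_DIR32 6   /* the normal 32-bit relocation code named above */

typedef struct {
    uint32_t *sites;    /* positions that the loader may need to patch */
    size_t    count;
    size_t    capacity;
} FixupList;

/*
 * Remember a patched position if the relocation type requires it.
 * Returns 1 if recorded, 0 if not needed, -1 on allocation failure.
 */
int record_fixup(FixupList *list, int type, uint32_t position)
{
    assert(list != NULL);
    if (type != RELOC_DIR32)
        return 0;       /* only absolute addresses change on rebasing */
    if (list->count == list->capacity) {
        size_t ncap = list->capacity ? list->capacity * 2 : 16;
        uint32_t *p = realloc(list->sites, ncap * sizeof *p);
        if (p == NULL)
            return -1;
        list->sites = p;
        list->capacity = ncap;
    }
    list->sites[list->count++] = position;
    return 1;
}
```

Start with a zeroed FixupList, call record_fixup for every relocation processed, and free the sites array once the .reloc data has been written.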

Linkers more complicated than lcc's support fancier features: for instance, a symbol can be included only once even if it appears several times, among many other things.

Performing the relocations
--------------------------
More specifically, what the linker does is fix the data and code references that each module makes to all the others, patching the code with the offsets of the external symbols, now that the positions of all sections are known. For a C source line like:

foo(5);

the linker reads the corresponding relocation record emitted by the compiler, and looks up the symbol ‘foo’ in the symbol table. It patches the zeroes that the assembler stored at the position following the call opcode with the relative offset from the point of the call to the address of foo. This allows the processor to execute a PC-relative call instruction: the 4 bytes after the call opcode contain a 32-bit offset to the address of foo.

Using the utility pedump, you can see this process. Consider the following well-known program:

#include <stdio.h>
int main(int argc,char *argv[])
{

printf("Hello\n");
}

Compile this with:
lcc -g2 hello.c
Now, disassemble hello.obj with pedump like this:
pedump /A hello.obj
You will see near the end of the long listing that follows, the disassembled text section:

section 00 (.text) size: 00020 file offs: 00220
--------------------------------------------------------------
_main: Size 18
--------------------------------------------------------------
[0000000] 55 pushl %ebp
[0000001] 89e5 movl %esp,%ebp
Line 5
[0000003] 6800000000 pushl $0 (_$2) (relocation)
[0000008] e800000000 call _printf (relocation)
[0000013] 83c404 addl $4,%esp
Line 6
[0000016] 5d popl %ebp
[0000017] c3 ret
[0000018] 0000 addb %al,(%eax)

Let’s follow the relocation to the function printf. You will see that pedump has a listing of the relocations that looks like this:
Section 01 (.text) relocations

Address  Type   Symbol Index  Symbol Name
-------  ----   ------------  -----------
      4  DIR32             4  _$2
      9  REL32            16  _printf

The linker will then take the bytes starting at address 4, and put there the address of symbol number 4 in the symbol table of hello.obj. It will look up the address of printf, and put the relative address, i.e. the difference between the address of printf and the address of main+9, in the bytes starting at byte 9.

As you can see, there are several types of relocations, each specifying a different way of doing these additions. The compiler emits only three types of relocations:
• Type 6: Direct 32-bit reference to the symbol's virtual address.
• Type 7: Direct 32-bit reference to the symbol's virtual address, base not included.
• Type 20: PC-relative 32-bit reference to the symbol's virtual address.

This last one is the one used in the relocation to printf. We also have to know that the relative call is relative to the next instruction, i.e. to byte 13 and not to byte 9. Happily for us, the linker knows this stuff...
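
To make the arithmetic concrete, here is a small sketch of how the three relocation types could be applied to a buffer holding the section's bytes. It is not the code of lcclnk: the function name apply_reloc and its parameters are invented, and adding the new value to whatever is already stored at the patched position (normally zeroes) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { REL_DIR32 = 6, REL_DIR32NB = 7, REL_REL32 = 20 };

/* Read and write a 32-bit little-endian value byte by byte (x86 order). */
static uint32_t get32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void put32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/*
 * Patch the 4 bytes at code[offset].
 *   section_addr: final address of the first byte of code[]
 *   target:       final address of the referenced symbol
 *   image_base:   load address of the image (used by type 7 only)
 * Returns 0 on success, -1 for an unsupported type.
 */
int apply_reloc(uint8_t *code, size_t size, uint32_t offset, int type,
                uint32_t section_addr, uint32_t target, uint32_t image_base)
{
    assert((size_t)offset + 4 <= size);
    uint32_t addend = get32(code + offset);   /* usually the zeroes */
    switch (type) {
    case REL_DIR32:
        put32(code + offset, addend + target);
        return 0;
    case REL_DIR32NB:
        put32(code + offset, addend + target - image_base);
        return 0;
    case REL_REL32:
        /* Relative to the next instruction: the patched field plus 4. */
        put32(code + offset, addend + target - (section_addr + offset + 4));
        return 0;
    default:
        return -1;
    }
}
```

With made-up addresses, say main at 0x401000 and printf at 0x401100, the REL32 relocation at byte 9 stores 0x401100 - (0x401000 + 13) = 0xF3, so the call becomes e8 f3 00 00 00.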

--------------------------------------------------------------------
The next installment will treat object libraries.

--
jacob navia
jacob at jacob point remcomp point fr
logiciels/informatique
http://www.cs.virginia.edu/~lcc-win32