The C Compilation Process

  • Compilation is the process of translating human readable source code into machine code that the processor can execute.
  • Binary Code is the machine code that systems execute.
  • Binary Executable Files, or Binaries, store the executable binary program, that is, the code and data belonging to each program.
#include <stdio.h>
#define FORMAT_STRING "%s"
#define MESSAGE "Hello, world!\n"

int main(int argc, char *argv[]) {
  printf(FORMAT_STRING, MESSAGE);
  return 0;
}

The C Compilation Process.

Preprocessor

  • Expands macros(#define) and #include directives into pure C code.
  • Every #include directive, the header is copied in its entirety.
  • Every #define directive is fully expanded everywhere it is used.
$ gcc -E -P compilation_example.c

typedef long unsigned int size_t;
typedef unsigned char __u_char;
typedef unsigned short int __u_short;
typedef unsigned int __u_int;
typedef unsigned long int __u_long;
/* ... */
extern int sys_nerr;
extern const char *const sys_errlist[];
extern int fileno (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__)) ;
extern int fileno_unlocked (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__)) ;
extern FILE *popen (const char *__command, const char *__modes) ;
extern int pclose (FILE *__stream);
extern char *ctermid (char *__s) __attribute__ ((__nothrow__ , __leaf__));
extern void flockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));
extern int ftrylockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__)) ;
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));

int main(int argc, char *argv[]) {
  printf("%s", "Hello, world!\n");
  return 0;
}

Compiler

  • Takes the preprocessed code and translates it into assembly language.
  • Most compilers also perform heavy optimization in this phase, typically configurable as an optimization level through command line switches such as options -O0 through -O3 in gcc.
  • Compilation phase produce assembly language and not machine code because it’s better to instead have a language dedicated compiler that emits generic assembly code and have a single universal assembler that can handle the final translation of assembly to machine code for every language.
  • Output of the compilation phase is an assembly file, which is in reasonably human-readable form, with symbolic information intact.
  • All references are purely symbolic.
  • Compilers use an optimization called dead code elimination to find instances of code that can never be reached in practice so that they can omit such useless code in the compiled binary.
  • Each source code file corresponds to one assembly file.
  • Takes .c file as input and produces .s assembly file.
$ gcc -S -masm=intel compilation_example.c
$ cat compilation_example.s

  .file "compilation_example.c"
  .intel_syntax noprefix
  .section .rodata
.LC0:
  .string "Hello, world!"
  .text
  .globl main
  .type main, @function
main:
.LFB0:
  .cfi_startproc
  push rbp
  .cfi_def_cfa_offset 16
  .cfi_offset 6, -16
  mov rbp, rsp
  .cfi_def_cfa_register 6
  sub rsp, 16
  mov DWORD PTR [rbp-4], edi
  mov QWORD PTR [rbp-16], rsi
  mov edi, OFFSET FLAT:.LC0
  call puts
  mov eax, 0
  leave
  .cfi_def_cfa 7, 8
  ret
  .cfi_endproc
.LFE0:
  .size main, .-main
  .ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
  .section .note.GNU-stack,"",@progbits

Assembler

  • Takes assembly files as input and produces object files (modules) as output.
  • Each assembly file corresponds to one object file.
  • Object files contain machine instructions that are in principle executable by the processor.
  • Takes .c file as input and produces .o object file.
$ gcc -c compilation_example.c
$ file compilation_example.o
compilation_example.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
  • Relocatable Files can be placed at any position in the memory. It’s an indication of the file being an object/module. It’s important since object files are compiled independently from each other and assembler has no way to know the order to link them into. Making them relocatable allows them to be linked in any order to construct a complete executable.
  • Object files contain Relocation Symbols that specify how function and variable references must be resolved. References that rely on a relocation symbol, such as an object file referencing one if its own functions/variables by absolute address, are known as Symbolic References.

Linker

  • Links together all object files together to form a single coherent executable, which will be loaded at a particular memory address.
  • Can incorporate an additional optimization pass called link-time optimization (LTO).
  • Linker resolves all symbolic references now that the arrangement of modules is known after linking.
  • Static libraries are merged into the executable allowing all references to be resolved entirely. Symbolic references to dynamic libraries are left unresolved even in the final executable (will be resolved during execution).
$ gcc compilation_example.c
$ file a.out
a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=d0e23ea731bce9de65619cadd58b14ecd8c015c7, not stripped

$ ./a.out
Hello, world!

 

Symbols and Stripped Binaries

  • Symbols keep track of symbolic names and records which binary code and data correspond to. They provide a mapping from high-level names to address and size. This information is required by linker.
$ readelf --syms a.out

Symbol table '.dynsym' contains 4 entries:
  Num: 	Value 				Size 	Type 	Bind 	Vis 	Ndx 	Name
  0: 		0000000000000000 	0 		NOTYPE  LOCAL 	DEFAULT UND
  1: 		0000000000000000 	0 		FUNC 	GLOBAL	DEFAULT UND 	puts@GLIBC_2.2.5 (2)
  2: 		0000000000000000 	0 		FUNC 	GLOBAL 	DEFAULT UND 	__libc_start_main@GLIBC_2.2.5 (2)
  3: 		0000000000000000 	0 		NOTYPE 	WEAK 	DEFAULT UND 	__gmon_start__

Symbol table '.symtab' contains 67 entries:
  Num: 	Value 				Size 	Type 	Bind 	Vis 	Ndx 	Name
  ...
  56: 	0000000000601030 	0 		OBJECT 	GLOBAL 	HIDDEN 	25 		__dso_handle
  57: 	00000000004005d0 	4 		OBJECT 	GLOBAL 	DEFAULT 16 		_IO_stdin_used
  58: 	0000000000400550 	101 	FUNC 	GLOBAL 	DEFAULT 14 		__libc_csu_init
  59: 	0000000000601040 	0 		NOTYPE 	GLOBAL 	DEFAULT 26 		_end
  60: 	0000000000400430 	42 		FUNC 	GLOBAL 	DEFAULT 14 		_start
  61: 	0000000000601038 	0 		NOTYPE 	GLOBAL 	DEFAULT 26 		__bss_start
  62: 	0000000000400526 	32 		FUNC 	GLOBAL 	DEFAULT 14 		main
  63: 	0000000000000000 	0 		NOTYPE 	WEAK 	DEFAULT UND 	_Jv_RegisterClasses
  64: 	0000000000601038 	0 		OBJECT 	GLOBAL 	HIDDEN 	25 		__TMC_END__
  65: 	0000000000000000 	0 		NOTYPE 	WEAK 	DEFAULT UND 	_ITM_registerTMCloneTable
  66: 	00000000004003c8 	0 		FUNC 	GLOBAL 	DEFAULT 11 		_init
  • Focusing on ‘main’, we can see it will be loaded at address ‘0x400526’ when the binary is loaded into memory and it’s size is 32bytes. ‘FUNC’ shows that we are dealing with a function symbol.
  • Debugging symbols are typically generated in DWARF format for ELF binaries (usually embedded inside) and PDB (Microsoft Portable Debugging) format for PE binaries (separate file).

Stripped Binaries

  • On stripping a binary, only a few symbols are left in the .dynsym symbol table. These are used to resolve dynamic dependencies (such as references to dynamic libraries) when the binary is loaded into memory, but they’re not much use when disassembling.
$ strip --strip-all a.out
$ file a.out

a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=d0e23ea731bce9de65619cadd58b14ecd8c015c7, stripped

$ readelf --syms a.out

Symbol table '.dynsym' contains 4 entries:
  Num:	Value 				Size 	Type 	Bind 	Vis 		Ndx 	Name
  0: 		0000000000000000 	0 		NOTYPE 	LOCAL 	DEFAULT 	UND
  1: 		0000000000000000 	0 		FUNC 	GLOBAL 	DEFAULT 	UND 	puts@GLIBC_2.2.5 (2)
  2: 		0000000000000000 	0 		FUNC 	GLOBAL 	DEFAULT 	UND 	__libc_start_main@GLIBC_2.2.5 (2)
  3: 		0000000000000000 	0 		NOTYPE 	WEAK 	DEFAULT 	UND 	__gmon_start__

 

Loading and Executing a Binary

Loading an ELF binary on a Linux-based system.

  • A binary’s representation in memory does not necessarily correspond one-to-one with its on-disk representation, like collapsing a string of zeros to a single one to save space, and re-expand while loading into the memory.
  • A new process is setup for the program to run in, including a virtual address space. Subsequently, the operating system maps an interpreter into the process’s virtual memory to load the binary and perform the necessary relocations. On Linux, the interpreter is typically a shared library called ld-linux.so. On Windows, the interpreter functionality is implemented as part of ntdll.dll. After loading the interpreter, the kernel transfers control to it, and the interpreter begins its work in user space.
  • The interpreter then maps the dynamic libraries required into the virtual address space (using mmap or an equivalent function) and then resolves any relocations left in the binary’s code sections to fill in the correct addresses for references to the dynamic libraries.
  • Linux ELF binaries come with a special section called .interp that specifies the path to the interpreter.
$ readelf -p .interp a.out

String dump of section '.interp':
  [0] /lib64/ld-linux-x86-64.so.2

Citation: Practical Binary Analysis.