# bootstrapping a (Linux x86-64) C compiler

Compilers nowadays are written in languages like C, which themselves need to be
compiled. But then, you need a C compiler to compile your C compiler! Of course,
the very first C compiler was not written in C.
First, people made assemblers, then simple programming languages,
and eventually it became possible to write a C compiler.
In this repository, we'll explore how that's done. Each directory here
is a "stage" in the process. The first one, `00`, is a hand-written
executable, and stage `05` is a C compiler. Each directory has its own
README explaining what's going on.

You can run `bootstrap.sh` to run through and test every stage.
To get HTML versions of all README pages, run `make`.

Note that the executables produced in this series will only run on
x86-64 Linux, because each OS/architecture combination would need its own separate
executable.

## table of contents

- [stage 00](00/README.md) - a program converting a text file with
hexadecimal digit pairs to a binary file.
- [stage 01](01/README.md) - a language with comments and 2-character
command codes.
- [stage 02](02/README.md) - a language with labels
- [stage 03](03/README.md) - a language with longer labels, better error messages, and less register manipulation
- [stage 04](04/README.md) - a language with nice functions and local variables
- [stage 04a](04a/README.md) - (interlude) a simple preprocessor
- [stage 05](05/README.md) - a C compiler capable of compiling TCC
- [stage 06](06/README.md) - an interpreter capable of executing Zig

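
To make stage 00's job concrete, here is a minimal C sketch of turning hexadecimal digit pairs into bytes. It is not the stage 00 program (that one is a hand-written executable), and it assumes the pairs are separated by whitespace, which may not match the real input format. The names `hex_digit_value` and `hex_pairs_to_bytes` are made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
int hex_digit_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Convert hex digit pairs such as "7f 45 4c 46" into bytes.
 * Assumption (ours, for illustration): pairs are separated by whitespace.
 * Returns the number of bytes written to out, or -1 if the input is
 * malformed or out (capacity cap) is too small.
 */
long hex_pairs_to_bytes(const char *text, unsigned char *out, size_t cap) {
    size_t n = 0;
    assert(text != NULL && out != NULL);
    while (*text != '\0') {
        if (isspace((unsigned char)*text)) {
            text++;
            continue;
        }
        int hi = hex_digit_value(text[0]);
        /* Only look at the second digit if the first one was valid. */
        int lo = hi < 0 ? -1 : hex_digit_value(text[1]);
        if (hi < 0 || lo < 0 || n == cap) return -1;
        out[n++] = (unsigned char)(hi << 4 | lo);
        text += 2;
    }
    return (long)n;
}
```

For example, `hex_pairs_to_bytes("7f 45 4c 46", buf, sizeof buf)` returns 4 and fills `buf` with the four bytes that start every ELF executable.
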
## prerequisite knowledge

If you want to follow along with this series, you'll probably want to know about:

- number bases -- if a number is preceded by 0x, 0o, or 0b in this series, that
means hexadecimal/octal/binary respectively. So 0xff = FF hexadecimal = 255
decimal.
- bits, bytes, kilobytes, etc.
- bitwise operations (not, or, and, xor, left shift, right shift)
- 2's complement
- ASCII, null-terminated strings
- how pointers work
- how floating-point numbers work
- what a compiler is
- what an executable file is
- what a system call is
- what a CPU is
- what a CPU architecture is
- what a CPU register is
- what the (call) stack is

If you're unfamiliar with x86-64 assembly, you should take a look at the instruction list below.

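
If some of the prerequisites listed above are rusty, a few of them can be checked directly in C. This is only an illustrative sketch with invented function names. Note that C writes octal with a leading `0` rather than the `0o` prefix used in this series.

```c
#include <assert.h>
#include <stdint.h>

/* 0xff (hexadecimal), 0377 (octal, C spelling) and 255 are the same number. */
int same_number_in_three_bases(void) {
    return 0xff == 255 && 0377 == 255;
}

/* 2's complement: negating x is the same as inverting its bits and adding 1. */
uint64_t twos_complement_negate(uint64_t x) {
    return ~x + 1;
}

/* Extract bit number `bit` (0 = least significant) with a shift and a mask. */
unsigned get_bit(uint64_t x, unsigned bit) {
    assert(bit < 64);
    return (unsigned)((x >> bit) & 1);
}
```

With these, `twos_complement_negate(1)` is a 64-bit value with every bit set, which is how -1 is stored in a 64-bit register.
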
## principles

- as simple as possible

Bootstrapping a compiler is not an easy task, so we're trying to make it as easy
as possible. We don't even necessarily need a standard-compliant C compiler; we
only need enough to compile someone else's C compiler. Specifically, we'll be
using [TCC](https://bellard.org/tcc/) since it's written (mostly) in C89.

- efficiency is not a concern

We will create big and slow executables, and that's okay. It doesn't really
matter if compiling TCC takes 30 seconds as opposed to 0.01 seconds; once
we compile it with itself, we should get the same executable either way.

## reflections on trusting trust

In 1984, Ken Thompson wrote the well-known article
[Reflections on Trusting Trust](http://users.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf).
This is one of the inspirations for this project. A brief summary is:
it's possible to create a malicious C compiler which will
replicate its own malicious functionality (e.g. detecting password-checking
routines to make them also accept another password the attacker knows) when used
to compile other C compilers. For all we know, such a compiler was used to
compile GCC, say, and so all programs around today could be compromised. Of
course, this is almost certainly not the case, but it's still an
interesting experiment to try to create a fully trustable compiler. This
project can't necessarily even do that though, because the Linux kernel, which
we depend on, is compiled from C, so we can't fully trust *it*. To
create a *fully* trustable compiler, you'd need to manually write
an operating system to a USB key with a circuit or something,
assuming you trust your CPU...
I'll leave that to someone else.

## instruction set

x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages
long! To make things simpler, we will only use a small subset.

Here are all the instructions we'll be using. If you're not familiar with
x86-64 assembly, you might want to look over these.

x86-64 has 16 integer registers: rax, rbx, rcx, rdx, rsp, rbp, rsi, rdi, r8, r9, r10, r11, r12, r13, r14, r15.
We will almost entirely be using the first 8 of these.
al refers to the bottom 8 bits of rax, likewise with bl, cl, dl;
ax refers to the bottom 16 bits of rax, likewise with bx, cx, dx;
eax refers to the bottom 32 bits of rax, likewise with ebx, ecx, edx.

x86-64 also has 16 floating-point registers: xmm0 through xmm15. We'll only be using
xmm0 and xmm1. These registers can hold either four 32-bit floating-point numbers (`float`s) or
two 64-bit floating-point numbers (`double`s), but we'll only be using them to hold either one
`float` or one `double`.

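
The relationship between rax and its smaller names can be modelled in C with masks and casts. This is a sketch for intuition only: the helper names are made up, and the sign-extension helper relies on the 2's complement narrowing conversion that GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/* What al, ax and eax would hold if rax held the value `rax`. */
uint8_t low8(uint64_t rax) { return (uint8_t)(rax & 0xff); }
uint16_t low16(uint64_t rax) { return (uint16_t)(rax & 0xffff); }
uint32_t low32(uint64_t rax) { return (uint32_t)(rax & 0xffffffff); }

/* Like `movsx rax, al`: bit 7 of al is copied into bits 8..63. */
uint64_t sign_extend8(uint8_t al) {
    /* Assumes the uint8_t -> int8_t conversion wraps (true for GCC/Clang). */
    return (uint64_t)(int64_t)(int8_t)al;
}

/* Like `movzx rax, al`: bits 8..63 are cleared. */
uint64_t zero_extend8(uint8_t al) {
    assert(al <= 0xff);
    return (uint64_t)al;
}
```

So if rax holds `0x1122334455667788`, then al is `0x88`, ax is `0x7788` and eax is `0x55667788`.
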
In the table below, `IMM64` means a 64-bit *immediate* (a constant number); `IMM32` and `IMM8` are 32-bit and 8-bit immediates.
`rdx:rax` refers to the 128-bit number you get by combining `rdx` (the upper 64 bits) and `rax` (the lower 64 bits).

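
As a rough illustration of how the Encoding column of the table below is read, this C sketch builds the bytes for `mov rax, IMM64` and `mov RDEST, RSRC`. It assumes something that holds for x86-64 in general, namely that immediates are stored lowest byte first (little-endian), and it uses the register numbers listed at the top of the table. The function names and the `enum reg` type are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers, as listed above the instruction table. */
enum reg { RAX = 0, RCX = 1, RDX = 2, RBX = 3, RSP = 4, RBP = 5, RSI = 6, RDI = 7 };

/* Emit `mov rax, IMM64`: 48 b8, then the immediate, lowest byte first. */
size_t encode_mov_rax_imm64(uint64_t imm, unsigned char out[10]) {
    assert(out != NULL);
    out[0] = 0x48;
    out[1] = 0xb8;
    for (int i = 0; i < 8; i++)
        out[2 + i] = (unsigned char)(imm >> (8 * i));
    return 10;
}

/* Emit `mov RDEST, RSRC`: 48 89 (DEST | SRC << 3 | 0xc0). */
size_t encode_mov_reg_reg(enum reg dest, enum reg src, unsigned char out[3]) {
    assert(out != NULL && dest <= RDI && src <= RDI);
    out[0] = 0x48;
    out[1] = 0x89;
    out[2] = (unsigned char)(dest | src << 3 | 0xc0);
    return 3;
}
```

For example, `encode_mov_reg_reg(RAX, RBX, buf)` produces `48 89 d8`. The final byte `d8` is the same register-pair byte that appears in the table's `add rax, rbx` (`48 01 d8`).
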
```
ax bx cx dx sp bp si di
 0  3  1  2  4  5  6  7

┌──────────────────────┬──────────────────────────┬──────────────────────────────────────────┐
│ Instruction          │ Encoding                 │ Description                              │
├──────────────────────┼──────────────────────────┼──────────────────────────────────────────┤
│ mov rax, IMM64       │ 48 b8 IMM64              │ set rax to the 64-bit value IMM64        │
│ mov rbx, IMM64       │ 48 bb IMM64              │ set rbx to the 64-bit value IMM64        │
│ add rax, IMM32       │ 48 05 IMM32              │ add IMM32 (signed) to rax                │
│ xor eax, eax         │ 31 c0                    │ set rax to 0 (shorter than mov rax, 0)   │
│ xor edx, edx         │ 31 d2                    │ set rdx to 0                             │
│ mov RDEST, RSRC      │ 48 89 (DEST|SRC<<3|0xc0) │ set register DEST to current             │
│                      │                          │ value of register SRC                    │
│ mov r8, rax          │ 49 89 c0                 │ set r8 to rax (only used for syscalls)   │
│ mov r9, rax          │ 49 89 c1                 │ set r9 to rax (only used for syscalls)   │
│ mov r10, rax         │ 49 89 c2                 │ set r10 to rax (only used for syscalls)  │
│ movsx rax, al        │ 48 0f be c0              │ sign-extend al to rax                    │
│ movsx rax, ax        │ 48 0f bf c0              │ sign-extend ax to rax                    │
│ movsx rax, eax       │ 48 63 c0                 │ sign-extend eax to rax                   │
│ movzx rax, al        │ 48 0f b6 c0              │ zero-extend al to rax                    │
│ movzx rax, ax        │ 48 0f b7 c0              │ zero-extend ax to rax                    │
│ mov eax, eax         │ 89 c0                    │ zero-extend eax to rax                   │
│ xchg rax, rbx        │ 48 93                    │ exchange the values of rax and rbx       │
│ mov [rbx], rax       │ 48 89 03                 │ store rax as 8 bytes at address rbx      │
│ mov rax, [rbx]       │ 48 8b 03                 │ load 8 bytes from address rbx into rax   │
│ mov [rbx], eax       │ 89 03                    │ store eax as 4 bytes at address rbx      │
│ mov eax, [rbx]       │ 8b 03                    │ load 4 bytes from address rbx into eax   │
│ mov [rbx], ax        │ 66 89 03                 │ store ax as 2 bytes at address rbx       │
│ mov ax, [rbx]        │ 66 8b 03                 │ load 2 bytes from address rbx into ax    │
│ mov [rbx], al        │ 88 03                    │ store al as 1 byte at address rbx        │
│ mov al, [rbx]        │ 8a 03                    │ load 1 byte from address rbx into al     │
│ mov rax, [rbp+IMM32] │ 48 8b 85 IMM32           │ load 8 bytes from address rbp+IMM32      │
│                      │                          │ into rax (note: IMM32 may be negative)   │
│ mov rax, [rsp+IMM32] │ 48 8b 84 24 IMM32        │ load 8 bytes from rsp+IMM32 into rax     │
│ mov [rbp+IMM32], rax │ 48 89 85 IMM32           │ store rax in 8 bytes at rbp+IMM32        │
│ mov [rsp+IMM32], rax │ 48 89 84 24 IMM32        │ store rax in 8 bytes at rsp+IMM32        │
│ mov [rsp], rbp       │ 48 89 2c 24              │ store rbp in 8 bytes at rsp              │
│ mov rbp, [rsp]       │ 48 8b 2c 24              │ load 8 bytes from rsp into rbp           │
│ lea rax, [rbp+IMM32] │ 48 8d 85 IMM32           │ set rax to rbp+IMM32                     │
│ lea rsp, [rbp+IMM32] │ 48 8d a5 IMM32           │ set rsp to rbp+IMM32                     │
│ int3                 │ cc                       │ raise trap signal (useful for debugging) │
│ movsq                │ 48 a5                    │ copy 8 bytes from [rsi] to [rdi]         │
│ rep movsb            │ f3 a4                    │ copy rcx bytes from [rsi] to [rdi]       │
│ push rax             │ 50                       │ push rax onto the stack                  │
│ pop rax              │ 58                       │ pop a value off the stack into rax       │
│ neg rax              │ 48 f7 d8                 │ set rax to -rax                          │
│ add rax, rbx         │ 48 01 d8                 │ add rbx to rax                           │
│ sub rax, rbx         │ 48 29 d8                 │ subtract rbx from rax                    │
│ imul rbx             │ 48 f7 eb                 │ set rdx:rax to rax * rbx (signed)        │
│ cqo                  │ 48 99                    │ sign-extend rax to rdx:rax               │
│ idiv rbx             │ 48 f7 fb                 │ divide rdx:rax by rbx (signed); put      │
│                      │                          │ quotient in rax, remainder in rdx        │
│ mul rbx              │ 48 f7 e3                 │ like imul, but unsigned                  │
│ div rbx              │ 48 f7 f3                 │ like idiv, but unsigned                  │
│ not rax              │ 48 f7 d0                 │ set rax to ~rax (bitwise not)            │
│ and rax, rbx         │ 48 21 d8                 │ set rax to rax & rbx (bitwise and)       │
│ or rax, rbx          │ 48 09 d8                 │ set rax to rax | rbx (bitwise or)        │
│ xor rax, rbx         │ 48 31 d8                 │ set rax to rax ^ rbx (bitwise xor)       │
│ shl rax, cl          │ 48 d3 e0                 │ set rax to rax << cl (left shift)        │
│ shl rax, IMM8        │ 48 c1 e0 IMM8            │ set rax to rax << IMM8                   │
│ shr rax, cl          │ 48 d3 e8                 │ set rax to rax >> cl (unsigned)          │
│ shr rax, IMM8        │ 48 c1 e8 IMM8            │ set rax to rax >> IMM8 (unsigned)        │
│ sar rax, cl          │ 48 d3 f8                 │ set rax to rax >> cl (signed)            │
│ sar rax, IMM8        │ 48 c1 f8 IMM8            │ set rax to rax >> IMM8 (signed)          │
│ sub rsp, IMM32       │ 48 81 ec IMM32           │ subtract IMM32 from rsp                  │
│ add rsp, IMM32       │ 48 81 c4 IMM32           │ add IMM32 to rsp                         │
│ cmp rax, rbx         │ 48 39 d8                 │ compare rax with rbx (see je, jl, etc.)  │
│ test rax, rax        │ 48 85 c0                 │ equivalent to cmp rax, 0                 │
│ jmp IMM32            │ e9 IMM32                 │ jump to offset IMM32 from here           │
│ je IMM32             │ 0f 84 IMM32              │ jump to IMM32 if equal                   │
│ jne IMM32            │ 0f 85 IMM32              │ jump if not equal                        │
│ jl IMM32             │ 0f 8c IMM32              │ jump if less than                        │
│ jg IMM32             │ 0f 8f IMM32              │ jump if greater than                     │
│ jle IMM32            │ 0f 8e IMM32              │ jump if less than or equal to            │
│ jge IMM32            │ 0f 8d IMM32              │ jump if greater than or equal to         │
│ jb IMM32             │ 0f 82 IMM32              │ jump if "below" (like jl but unsigned)   │
│ ja IMM32             │ 0f 87 IMM32              │ jump if "above" (like jg but unsigned)   │
│ jbe IMM32            │ 0f 86 IMM32              │ jump if below or equal to                │
│ jae IMM32            │ 0f 83 IMM32              │ jump if above or equal to                │
│ sete al              │ 0f 94 c0                 │ set al to 1 if equal; 0 otherwise        │
│ setne al             │ 0f 95 c0                 │ set al to 1 if not equal                 │
│ setl al              │ 0f 9c c0                 │ set al to 1 if less than                 │
│ setg al              │ 0f 9f c0                 │ set al to 1 if greater than              │
│ setle al             │ 0f 9e c0                 │ set al to 1 if less than or equal to     │
│ setge al             │ 0f 9d c0                 │ set al to 1 if greater than or equal to  │
│ setb al              │ 0f 92 c0                 │ set al to 1 if below                     │
│ seta al              │ 0f 97 c0                 │ set al to 1 if above                     │
│ setbe al             │ 0f 96 c0                 │ set al to 1 if below or equal to         │
│ setae al             │ 0f 93 c0                 │ set al to 1 if above or equal to         │
│ movq rax, xmm0       │ 66 48 0f 7e c0           │ set rax to xmm0                          │
│ movq xmm0, rax       │ 66 48 0f 6e c0           │ set xmm0 to rax                          │
│ movq xmm1, rax       │ 66 48 0f 6e c8           │ set xmm1 to rax                          │
│ movq xmm1, xmm0      │ f3 0f 7e c8              │ set xmm1 to xmm0                         │
│ cvtss2sd xmm0, xmm0  │ f3 0f 5a c0              │ convert xmm0 from float to double        │
│ cvtsd2ss xmm0, xmm0  │ f2 0f 5a c0              │ convert xmm0 from double to float        │
│ cvttsd2si rax, xmm0  │ f2 48 0f 2c c0           │ convert double in xmm0 to int in rax     │
│ cvtsi2sd xmm0, rax   │ f2 48 0f 2a c0           │ convert int in rax to double in xmm0     │
│ comisd xmm0, xmm1    │ 66 0f 2f c1              │ compare xmm0 and xmm1                    │
│ addsd xmm0, xmm1     │ f2 0f 58 c1              │ add xmm1 to xmm0                         │
│ subsd xmm0, xmm1     │ f2 0f 5c c1              │ subtract xmm1 from xmm0                  │
│ mulsd xmm0, xmm1     │ f2 0f 59 c1              │ multiply xmm0 by xmm1                    │
│ divsd xmm0, xmm1     │ f2 0f 5e c1              │ divide xmm0 by xmm1                      │
│ call rax             │ ff d0                    │ call the function at address rax         │
│ ret                  │ c3                       │ return from function                     │
│ syscall              │ 0f 05                    │ execute a system call                    │
│ nop                  │ 90                       │ do nothing                               │
└──────────────────────┴──────────────────────────┴──────────────────────────────────────────┘

SYSCALLS
The system call number is passed in rax.
Arguments are passed in
    rdi, rsi, rdx, r10, r8, r9
The return value is placed in rax.
The values of rsp, rbp and rbx are preserved, but other registers might change.
```
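
The system call convention above can be tried from C without writing any assembly. The sketch below uses the C library's generic `syscall()` wrapper (a Linux-specific function) to perform a `write`. `raw_write` is a name invented for this example, and the register assignments in the comment describe x86-64 Linux.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke the write system call through libc's generic syscall() wrapper.
 * On x86-64 Linux the wrapper loads the system call number into rax and
 * the three arguments into rdi, rsi and rdx, then executes `syscall`;
 * the kernel's result comes back in rax.
 */
long raw_write(int fd, const void *buf, size_t len) {
    assert(buf != NULL);
    return syscall(SYS_write, (long)fd, buf, (long)len);
}
```

Calling `raw_write(1, "hi\n", 3)` prints `hi` and returns 3. One difference from the bare instruction: on failure the wrapper returns -1 and sets `errno`, whereas the kernel itself returns a negative error code in rax.
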
## license

The following license does not apply to tcc's or musl's source code.

```
This project is in the public domain. Any copyright protections from any law
are forfeited by the author(s). No warranty is provided, and the author(s)
shall not be held liable in connection with it.
```

## contributing

If you notice a mistake or want to clarify something, you can submit a pull request
via GitHub, or email `pommicket at pommicket.com`.