stage 00 readme done
parent 9bcbd94e46
commit d052391270
7 changed files with 525 additions and 174 deletions
383
00/README.md
Normal file
@ -0,0 +1,383 @@
# stage 00

This directory contains the file `hexcompile`, a handwritten executable. It
takes an input file `A` containing space/newline/[any character]-separated
hexadecimal numbers and outputs them as bytes to the file `B`. On 64-bit Linux,
try running `./hexcompile` from this directory (I've already provided an `A`
file), and you will get a file named `B` containing the text `Hello, world!`.
This stage is needed so that you can use your favorite text editor to write
executables by hand (which have bytes outside of ASCII/UTF-8). I wrote it with
a program called hexedit, which can be found on most Linux distributions. Only
64-bit Linux is supported, because each OS/architecture combination would need
its own separate executable. The executable is 632 bytes long, and you could
definitely make it smaller if you wanted to, especially if you didn't limit it
to the set of instructions I've decided on. Let's take a look at what's inside
(`od -t x1 -An hexcompile`):

```
7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
02 00 3e 00 01 00 00 00 78 00 40 00 00 00 00 00
40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 40 00 38 00 01 00 00 00 00 00 00 00
01 00 00 00 07 00 00 00 78 00 00 00 00 00 00 00
78 00 40 00 00 00 00 00 00 00 00 00 00 00 00 00
00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00
00 10 00 00 00 00 00 00 48 b8 74 02 40 00 00 00
00 00 48 89 c7 48 b8 00 00 00 00 00 00 00 00 48
89 c6 48 89 c2 48 b8 02 00 00 00 00 00 00 00 0f
05 48 89 c5 48 b8 76 02 40 00 00 00 00 00 48 89
c7 48 b8 41 00 00 00 00 00 00 00 48 89 c6 48 b8
a4 01 00 00 00 00 00 00 48 89 c2 48 b8 02 00 00
00 00 00 00 00 0f 05 48 89 ef 48 b8 68 02 40 00
00 00 00 00 48 89 c6 48 b8 03 00 00 00 00 00 00
00 48 89 c2 48 b8 00 00 00 00 00 00 00 00 0f 05
48 89 c3 48 b8 03 00 00 00 00 00 00 00 48 39 d8
0f 8f 37 01 00 00 48 b8 68 02 40 00 00 00 00 00
48 89 c3 48 8b 03 48 89 c3 48 89 c7 48 b8 ff 00
00 00 00 00 00 00 48 21 d8 48 89 c6 48 b8 39 00
00 00 00 00 00 00 48 89 c3 48 89 f0 48 39 d8 0f
8f 1e 00 00 00 48 b8 30 00 00 00 00 00 00 00 48
f7 d8 48 89 f3 48 01 d8 e9 26 00 00 00 00 00 00
00 00 00 48 b8 a9 ff ff ff ff ff ff ff 48 89 f3
48 01 d8 e9 0b 00 00 00 00 00 00 00 00 00 00 00
00 00 00 48 89 c2 48 b8 ff 00 00 00 00 00 00 00
48 89 c3 48 89 f8 48 c1 e8 08 48 21 d8 48 93 48
b8 39 00 00 00 00 00 00 00 48 93 48 39 d8 0f 8f
1f 00 00 00 48 89 c3 48 b8 d0 ff ff ff ff ff ff
ff 48 01 d8 e9 2a 00 00 00 00 00 00 00 00 00 00
00 00 00 48 89 c3 48 b8 a9 ff ff ff ff ff ff 48
01 d8 e9 0c 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 48 89 c7 48 89 d0 48 c1 e0 04 48 89 fb
48 09 d8 48 93 48 b8 68 02 40 00 00 00 00 00 48
93 48 89 03 48 89 de 48 b8 04 00 00 00 00 00 00
00 48 89 c7 48 b8 01 00 00 00 00 00 00 00 48 89
c2 0f 05 e9 8f fe ff ff 00 00 00 00 00 48 b8 3c
00 00 00 00 00 00 00 0f 05 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 41 00 42 00
```

Okay, that doesn't tell us much. I'll annotate it below. You might notice that
all the numbers are backwards, e.g. `3e 00` for the number 0x003e (62 decimal).
This is because almost all modern architectures (including x86-64) are
little-endian, meaning that the *least significant byte* goes first, and the
most significant byte goes last. There are various reasons why this is easier to
deal with, but I won't explain that here.
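
To make the byte order concrete, here's a small Python sketch (Python isn't part of this series; it's just handy for checking arithmetic). The helper name `le` is made up for this example. It decodes two fields copied from the dump above.

```python
def le(hex_bytes: str) -> int:
    """Interpret space-separated hex bytes as a little-endian unsigned integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")

machine = le("3e 00")                    # the architecture field
entry = le("78 00 40 00 00 00 00 00")    # the entry point field

print(machine)       # 62
print(hex(entry))    # 0x400078
```

Reading the bytes right to left gives the same answer by eye: `78 00 40 00 ...` becomes `... 00 40 00 78`.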

## ELF header
This header has a bunch of metadata about the executable.

- `7f 45 4c 46` Special identifier saying that this is an ELF file (ELF is the
  format of almost all Linux executables)
- `02` 64-bit
- `01` Little-endian
- `01` ELF version 1 (there is no version 2 yet)
- `00 00 00 00 00 00 00 00 00` OS ABI and ABI version (both 0, the generic
  System V ABI), followed by 7 reserved padding bytes
- `02 00` Object type = executable file (not a dynamic library/etc.)
- `3e 00` Architecture x86-64
- `01 00 00 00` Version 1 of ELF, again
- `78 00 40 00 00 00 00 00` **Entry point of the executable** = 0x400078 (explained later)
- `40 00 00 00 00 00 00 00` Program header table offset in bytes from start of file (see below)
- `00 00 00 00 00 00 00 00` Section header table offset (we're not using sections)
- `00 00 00 00` Flags (not important)
- `40 00` The size of this header, in bytes = 64
- `38 00` Size of the program header (see below) = 56
- `01 00` Number of program headers = 1
- `00 00` Size of each section header (unused)
- `00 00` Number of section headers (unused)
- `00 00` Index of special .shstrtab section (unused)
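
If you'd like to check the field list above mechanically, this sketch unpacks the same 64 bytes with Python's `struct` module. The field names (`e_type`, `e_entry`, and so on) are the ones used in the ELF specification rather than anything this README defines, and the hex string is simply the first four rows of the dump.

```python
import struct

# The first 64 bytes of the dump (the ELF header).
ELF_HEADER = bytes.fromhex(
    "7f454c46020101000000000000000000"
    "02003e00010000007800400000000000"
    "40000000000000000000000000000000"
    "00000000400038000100000000000000"
)

# "<" means little-endian with no padding between fields.
FIELDS = ("e_ident e_type e_machine e_version e_entry e_phoff e_shoff "
          "e_flags e_ehsize e_phentsize e_phnum e_shentsize e_shnum e_shstrndx").split()
header = dict(zip(FIELDS, struct.unpack("<16sHHIQQQIHHHHHH", ELF_HEADER)))

for name, value in header.items():
    print(name, value if isinstance(value, bytes) else hex(value))
```

The interesting values are `e_entry` (`0x400078`), `e_phoff` (`0x40`, right after this header), and `e_phentsize` (`0x38`).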

## program header
The program header describes a segment of data that is loaded into memory when
the program starts. Normally, you would have more than one of these, maybe
one for code, one for read-only data, and one for read-write data, but to
simplify things we've only got one, which we'll use for any code and any data
we need. This means it'll have to be read-enabled, write-enabled, and
execute-enabled. Normally people don't do this, for security, but we won't worry
about that (don't compile any untrusted code with any compiler from this series!).
Without further ado, here are the contents of the program header:

- `01 00 00 00` Segment type 1 (this should be loaded into memory)
- `07 00 00 00` Flags = RWE (readable, writeable, and executable)
- `78 00 00 00 00 00 00 00` Offset in file = 120
- `78 00 40 00 00 00 00 00` Virtual address = 0x400078

**wait a minute, what's that?**

We just specified the *virtual address* of this segment. This is the virtual
memory address that the segment will be loaded to. Virtual memory means that
memory addresses in our program do not actually correspond to where the memory
is physically stored in RAM. There are many reasons for it, including allowing
different processes to have overlapping memory addresses, making sure that some
memory can't be read/written/executed, etc. You can read more about it
elsewhere.

- `00 00 00 00 00 00 00 00` Physical address (not applicable)
- `00 02 00 00 00 00 00 00` Size of this segment in the executable file = 512
  bytes
- `00 02 00 00 00 00 00 00` Size of this segment when loaded into memory = also
  512 bytes
- `00 10 00 00 00 00 00 00` Segment alignment = 4096 bytes
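
The same trick works for the program header. This is only a sketch for double-checking the list above: the variable names (`p_type`, `p_flags`, ...) come from the ELF specification, and the flag bit values (1 = execute, 2 = write, 4 = read) are the standard ELF ones.

```python
import struct

# Bytes 0x40..0x77 of the dump (the single program header).
PROGRAM_HEADER = bytes.fromhex(
    "01000000070000007800000000000000"
    "78004000000000000000000000000000"
    "00020000000000000002000000000000"
    "0010000000000000"
)
(p_type, p_flags, p_offset, p_vaddr,
 p_paddr, p_filesz, p_memsz, p_align) = struct.unpack("<IIQQQQQQ", PROGRAM_HEADER)

# Turn the flag bits into the letters used in the list above.
permissions = "".join(c for bit, c in ((4, "R"), (2, "W"), (1, "E")) if p_flags & bit)

print(permissions, hex(p_offset), hex(p_vaddr), p_filesz, p_memsz, p_align)
print(p_offset + p_filesz)   # 632: the segment runs exactly to the end of the file
```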

That last field, segment alignment, is needed because on default-settings Linux
each page (block) of memory is 4096 bytes long, and has to start at an address
that is a multiple of 4096. Our program needs to be loaded into a memory page,
so its *virtual address* needs to be a multiple of 4096. We're using `0x400000`.
But wait! Didn't we use `0x400078` for the virtual address? Well, yes, but that's
because the *data in the file* is loaded to address `0x400078`. The actual page
of memory that the OS will allocate for our code will start at `0x400000`. The
reason we need to start `0x78` bytes in is that Linux expects the data *in the
file* to be at the same position in the page as when it will be loaded, and it
appears at offset `0x78` in our file. Don't worry if you didn't understand all
of that.
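
If the alignment rule is easier to follow as arithmetic, here's a sketch of it. The constant names are made up for this example; the numbers are the ones from the program header.

```python
PAGE_SIZE = 0x1000            # segment alignment (4096)
FILE_OFFSET = 0x78            # where the segment starts in the file
VIRTUAL_ADDRESS = 0x400078    # where the segment starts in memory

# Round the virtual address down to the start of its page.
page_start = VIRTUAL_ADDRESS & ~(PAGE_SIZE - 1)
offset_in_page = VIRTUAL_ADDRESS % PAGE_SIZE

print(hex(page_start))                               # 0x400000
print(offset_in_page == FILE_OFFSET % PAGE_SIZE)     # True
```

The second line is the rule from the paragraph above: the position within the page has to be the same in the file and in memory.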

## the code

Now we get to the actual code in our executable (well, there's a bit of data here
too). We specified `0x400078` as the *entry point* of our executable, which
means that the program will start executing from there. That virtual address
corresponds to the start of the code right here:

The first thing we want to do is open our input file, `A`:

- `48 b8 74 02 40 00 00 00 00 00` `mov rax, 0x400274`
- `48 89 c7` `mov rdi, rax`
- `48 b8 00 00 00 00 00 00 00 00` `mov rax, 0`
- `48 89 c6` `mov rsi, rax`
- `48 89 c2` `mov rdx, rax`
- `48 b8 02 00 00 00 00 00 00 00` `mov rax, 2`
- `0f 05` `syscall`

These instructions execute syscall `2` with arguments `0x400274`, `0`, `0`.
The syscall number goes in `rax`, and the first three arguments go in `rdi`,
`rsi`, and `rdx`.
If you're familiar with C code, this is `open("A", O_RDONLY, 0)`.
A syscall is the mechanism which lets software ask the kernel to do things.
[Here](https://filippo.io/linux-syscall-table/) is a nice table of syscalls you
can look through if you're interested.
Syscall #2, on Linux, is `open`. It's used to open a file. On Linux, you can
read about it by running `man 2 open`.
The first argument, `0x400274`, is a pointer to some data at the very end of
this segment (scroll down). Specifically, it holds the byte `41` (ASCII `A`),
followed by `00` (null byte). This indicates the name of the file, "A". The
second argument (`O_RDONLY`, or 0) specifies that we will be reading from this
file. The third is only really needed when creating new files, but I've just
set it to 0, why not.
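
Here's a sketch of how that pointer resolves to a file name. `LOAD_BIAS`, `TAIL`, and `c_string_at` are names invented for this example. The only facts used are the load address from the program header, the file size, and the last four bytes of the dump.

```python
LOAD_BIAS = 0x400078 - 0x78          # virtual address minus file offset
FILE_SIZE = 632
TAIL = bytes.fromhex("41004200")     # the last four bytes of the file

def c_string_at(address: int) -> str:
    """Return the null-terminated string stored at a virtual address inside TAIL."""
    offset = address - LOAD_BIAS - (FILE_SIZE - len(TAIL))
    end = TAIL.index(0, offset)      # find the terminating null byte
    return TAIL[offset:end].decode("ascii")

print(c_string_at(0x400274))   # A
print(c_string_at(0x400276))   # B
```

The second address, `0x400276`, is the one used a little further down for the output file.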

This call gives us back a *file descriptor*, used later to read from the file,
in register `rax`.

- `48 89 c5` `mov rbp, rax` Store the file descriptor for later

Now we'll open the output file:

- `48 b8 76 02 40 00 00 00 00 00` `mov rax, 0x400276`
- `48 89 c7` `mov rdi, rax`
- `48 b8 41 00 00 00 00 00 00 00` `mov rax, 0x41`
- `48 89 c6` `mov rsi, rax`
- `48 b8 a4 01 00 00 00 00 00 00` `mov rax, 0o644`
- `48 89 c2` `mov rdx, rax`
- `48 b8 02 00 00 00 00 00 00 00` `mov rax, 2`
- `0f 05` `syscall`

These instructions execute the syscall `open("B", O_WRONLY|O_CREAT, 0644)`. This
is similar to our first one, but with some important differences. First, the
second argument specifies both that we are writing to a file (`0x01`), and that
we want to create the file if it doesn't exist (`0x40`). Second, the third
argument specifies the permissions that the file should be created with (`644`:
user read/write, group and others read-only). This isn't particularly important
to how the program works.
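
To see where `0x41` and `a4 01` come from, here's a short sketch. The flag values are written out by hand as the Linux x86-64 ones (that's an assumption of this example; other systems use different numbers), and `stat.filemode` is only used to print the permissions the way `ls -l` would.

```python
import stat

# Linux x86-64 values of the open() flags used above (see `man 2 open`).
O_RDONLY, O_WRONLY, O_CREAT = 0x00, 0x01, 0x40

flags = O_WRONLY | O_CREAT
mode = 0o644

print(hex(flags))                              # 0x41
print(hex(mode))                               # 0x1a4, stored little-endian as a4 01
print(stat.filemode(stat.S_IFREG | mode))      # -rw-r--r--
```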

- `48 89 ef` `mov rdi, rbp`
- `48 b8 68 02 40 00 00 00 00 00` `mov rax, 0x400268`
- `48 89 c6` `mov rsi, rax`
- `48 b8 03 00 00 00 00 00 00 00` `mov rax, 3`
- `48 89 c2` `mov rdx, rax`
- `48 b8 00 00 00 00 00 00 00 00` `mov rax, 0`
- `0f 05` `syscall`

Here we call syscall #0 (`read`) to read from a file. The arguments are:
- `fd (rdi) = rbp` read from the file descriptor we stored away earlier
- `buf (rsi) = 0x400268` output to a part of this segment I've left empty
- `count (rdx) = 3` read 3 bytes (two hex digits plus the separator character
  that follows them)

The number of bytes *actually* read (taking into account the fact that we might
have reached the end of the file) is stored in `rax`.

Note that we read the entire file 3 bytes at a time, which is a *terrible* idea
for performance. Syscalls take quite a while (3 microseconds or so, which would
make this very slow for a several-megabyte file), so modern programs tend to
read ~4KB at a time. But our programs will be small, and we don't care a lot
about performance, so it's okay.
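
For a rough feel of the cost, here's a back-of-the-envelope sketch. The 3 microsecond figure is the ballpark quoted above, the 5 MiB file size is hypothetical, and only the `read` calls are counted (the matching `write` calls would roughly double the first number).

```python
import math

SYSCALL_COST_US = 3   # rough per-syscall cost quoted above

def read_time_seconds(file_size: int, chunk_size: int) -> float:
    """Estimated time spent in read() calls alone, ignoring everything else."""
    calls = math.ceil(file_size / chunk_size)
    return calls * SYSCALL_COST_US / 1_000_000

size = 5 * 1024 * 1024   # a hypothetical 5 MiB input file
print(read_time_seconds(size, 3))      # a little over 5 seconds
print(read_time_seconds(size, 4096))   # a few milliseconds
```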

- `48 89 c3` `mov rbx, rax`
- `48 b8 03 00 00 00 00 00 00 00` `mov rax, 3`
- `48 39 d8` `cmp rax, rbx`
- `0f 8f 37 01 00 00` `jg 0x40024d`

Together, these instructions say to jump to a different part of the code
(explained later) if we ended up reading fewer than 3 bytes, i.e. we reached the
end of the file. Note that rather than specifying the *address* to jump to, we
specify the *relative address* (it's relative to the address of the first byte
after the jump instruction). In other words, we're adding `0x137` to the program
counter, `rip`. There are several reasons for this, including saving space.
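
Here's the relative-address calculation as a sketch. The function name `jump_target` is made up, and the instruction addresses (`0x400110` for this `jg`, `0x400243` for the backward `jmp` near the end of the program) are ones I worked out by counting bytes in the dump, so treat them as assumptions.

```python
def jump_target(address: int, instruction: bytes) -> int:
    """Target of a `jg rel32` (0f 8f) or `jmp rel32` (e9) located at `address`."""
    # The last four bytes are a signed little-endian offset...
    rel32 = int.from_bytes(instruction[-4:], "little", signed=True)
    # ...relative to the first byte after the instruction.
    return address + len(instruction) + rel32

print(hex(jump_target(0x400110, bytes.fromhex("0f8f37010000"))))   # 0x40024d
print(hex(jump_target(0x400243, bytes.fromhex("e98ffeffff"))))     # 0x4000d7
```

The second example shows why the offset has to be treated as signed: `8f fe ff ff` is `-0x171`, which is how a jump goes backwards.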

- `48 b8 68 02 40 00 00 00 00 00` `mov rax, 0x400268`
- `48 89 c3` `mov rbx, rax`
- `48 8b 03` `mov rax, qword [rbx]`

This copies out 8 bytes of the data that was just read into the 64-bit register
rax. We only read 3 bytes of data from the file, but the rest will just be
zeros (because that's what we put at offset `0x268` of the file).

- `48 89 c3` `mov rbx, rax`
- `48 89 c7` `mov rdi, rax`

Here we copy away this data for later use.

- `48 b8 ff 00 00 00 00 00 00 00` `mov rax, 0xff`
- `48 21 d8` `and rax, rbx`

This grabs the first byte of data we read and stores it in `rax`. This will be
the code of the first ASCII character of the hexadecimal number in our input
file.

- `48 89 c6` `mov rsi, rax`
- `48 b8 39 00 00 00 00 00 00 00` `mov rax, 0x39 ('9')`
- `48 89 c3` `mov rbx, rax`
- `48 89 f0` `mov rax, rsi`
- `48 39 d8` `cmp rax, rbx`
- `0f 8f 1e 00 00 00` `jg 0x400173`

These instructions compare that character code against the character code for
`9`. If it's greater, then it's one of the hex digits `a` through `f`, which are
handled separately later.

- `48 b8 30 00 00 00 00 00 00 00` `mov rax, 0x30 ('0')`
- `48 f7 d8` `neg rax`
- `48 89 f3` `mov rbx, rsi`
- `48 01 d8` `add rax, rbx`

Subtract the character code for `0` from the character code we read in, to get
the *number* corresponding to the first hex digit in the pair.

- `e9 26 00 00 00` `jmp 0x400193`

Go to a different part of the program (we'll get there later).

- `00 00 00 00 00 00`

Unneeded 0 bytes I left in, to make room in case I needed it.

Now we get to the `a`-`f` handling code:

- `48 b8 a9 ff ff ff ff ff ff ff` `mov rax, -87`
- `48 89 f3` `mov rbx, rsi`
- `48 01 d8` `add rax, rbx`
- `e9 0b 00 00 00` `jmp 0x400193`
- `00 00 00 00 00 00 00 00 00 00 00` (unused)

If our character code is one of `abcdef`, we add `-87` to it (i.e. subtract
`87`), to convert the character code to the numerical value of the digit. Here I
decided to just set `rax` to the two's complement encoding for `-87`, but you
could also use the `neg` instruction, like I did last time. <s>I just wanted to
show two different ways of doing it</s> I thought of the better way the second
time around.
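
Both branches can be checked with a few lines of Python. This is only a model of the arithmetic: `MASK64`, `add64`, and `digit_value` are names invented here, and Python integers don't wrap around on their own, so the 64-bit wraparound is done by hand.

```python
MASK64 = (1 << 64) - 1

def add64(a: int, b: int) -> int:
    """64-bit `add`: the result wraps around modulo 2**64."""
    return (a + b) & MASK64

minus_87 = -87 & MASK64
print(hex(minus_87))   # 0xffffffffffffffa9, i.e. the bytes a9 ff ff ff ff ff ff ff

def digit_value(char_code: int) -> int:
    """Mirror the two branches above: '0'-'9' subtract 48, anything greater subtracts 87."""
    if char_code > ord("9"):
        return add64(minus_87, char_code)
    return add64(-ord("0") & MASK64, char_code)

print(digit_value(ord("7")), digit_value(ord("a")), digit_value(ord("f")))   # 7 10 15
```

Note that only lowercase `a`-`f` come out right: an uppercase `A` (code 65) is also greater than `9`, but 65 - 87 is not a valid digit.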

Now we get to `0x400193`, the common place we jumped to from both branches.

- `48 89 c2` `mov rdx, rax`

Store away the first digit in the pair into `rdx`.

- `48 b8 ff 00 00 00 00 00 00 00` `mov rax, 0xff`
- `48 89 c3` `mov rbx, rax`
- `48 89 f8` `mov rax, rdi`
- `48 c1 e8 08` `shr rax, 8`
- `48 21 d8` `and rax, rbx`

Now we extract the second character code we read from the file.
The entire character code to number conversion is rewritten here, but slightly
differently this time because I came up with some new ideas.
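
The two extraction steps (mask for the first character, shift then mask for the second) look like this in Python. The input `4a` followed by a newline is a made-up example, and `rdi` is just a variable named after the register that holds the saved copy.

```python
buffer = b"4a\n" + bytes(5)               # 3 bytes from the file, 5 zero bytes after them
rdi = int.from_bytes(buffer, "little")    # mov rax, qword [rbx], saved in rdi

first = rdi & 0xFF                        # and with 0xff
second = (rdi >> 8) & 0xFF                # shr rax, 8, then and with 0xff

print(chr(first), chr(second))            # 4 a
```

Because the load is little-endian, the first byte of the file ends up in the lowest 8 bits of the register, which is why a plain mask is enough for it.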

- `48 93` `xchg rax, rbx`
- `48 b8 39 00 00 00 00 00 00 00` `mov rax, 0x39 ('9')`
- `48 93` `xchg rax, rbx`
- `48 39 d8` `cmp rax, rbx`
- `0f 8f 1f 00 00 00` `jg 0x4001e3 ('a'-'f' handling code)`
- `48 89 c3` `mov rbx, rax`
- `48 b8 d0 ff ff ff ff ff ff ff` `mov rax, -48`
- `48 01 d8` `add rax, rbx`
- `e9 2a 00 00 00` `jmp 0x400203`
- `00 00 00 00 00 00 00 00 00 00` (unused)

('a'-'f' handling)
- `48 89 c3` `mov rbx, rax`
- `48 b8 a9 ff ff ff ff ff ff` `mov rax, -87`
- `48 01 d8` `add rax, rbx`
- `e9 0c 00 00 00` `jmp 0x400203`
- `00 00 00 00 00 00 00 00 00 00 00 00` (unused)

(common code)
- `48 89 c7` `mov rdi, rax`

Okay, now we've read the first hex digit into `rdx`, and the second into `rdi`.

- `48 89 d0` `mov rax, rdx`
- `48 c1 e0 04` `shl rax, 4`
- `48 89 fb` `mov rbx, rdi`
- `48 09 d8` `or rax, rbx`

Okay, now we have the full hexadecimal number in `rax`!
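
The shift-and-or step is small enough to show in full. `combine` is a made-up name; the body is the same two operations as the `shl` and `or` above.

```python
def combine(high: int, low: int) -> int:
    """shl by 4 to make room for the second digit, then or it in."""
    return (high << 4) | low

print(hex(combine(0x4, 0xA)))   # 0x4a
print(hex(combine(0xF, 0xF)))   # 0xff
```

Shifting left by 4 bits is the same as multiplying by 16, which is exactly what the first digit of a two-digit hexadecimal number is worth.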

- `48 93` `xchg rax, rbx`
- `48 b8 68 02 40 00 00 00 00 00` `mov rax, 0x400268`
- `48 93` `xchg rax, rbx`
- `48 89 03` `mov qword [rbx], rax`

This stores the byte we want to write to the file at address `0x400268`. This is
the same address we used to read in the input text; again, it's just part of
this segment I've left blank.

- `48 89 de` `mov rsi, rbx`
- `48 b8 04 00 00 00 00 00 00 00` `mov rax, 4`
- `48 89 c7` `mov rdi, rax`
- `48 b8 01 00 00 00 00 00 00 00` `mov rax, 1`
- `48 89 c2` `mov rdx, rax`
- `0f 05` `syscall`

Here we call syscall #1, `write`, with arguments:

- `fd = 4` we could have stored away the file descriptor we got before for the
  output file, like we did with the input file, but I was out of easy-to-use
  registers! Instead, we can use the fact that Linux always hands out the
  lowest unused file descriptor. Normally 0, 1, and 2 are already taken by
  standard input, output, and error, so the input file got descriptor 3, and
  our output file, the second file we opened, will have descriptor 4.
- `buf = 0x400268` where we put our data
- `count = 1` write 1 byte

- `e9 8f fe ff ff` `jmp 0x4000d7`
- `00 00 00 00 00` (unused)

Now we go back to read in the next pair of digits! Finally...

- `48 b8 3c 00 00 00 00 00 00 00` `mov rax, 0x3c`
- `0f 05` `syscall`

This is where we conditionally jumped to way back when we determined if we
reached the end of the file. This just calls syscall #60, `exit`, to exit our
program nicely. We didn't specify the exit code, but that's okay for our
purposes.
And we could close the files (syscall #3), to tell Linux we're done with them,
but we don't need to. It'll close all our open file descriptors when our program
exits.


- `00 00 00 00 00 00 00 00 00 00 00 00 00 00 00` Unused bytes (I wasn't
  sure exactly how long the program would be)
- `00 00 00 00 00 00 00 00` This is where we read/wrote the file data (address
  `0x400268`)!
- `00 00 00 00` More unused bytes
- `41 00` Input file name, `"A"`
- `42 00` Output file name, `"B"`

That's quite a lot to take in for such a simple program, but here we are! We now
have something that will let us write individual bytes with an ordinary text
editor and get them translated into a binary file.
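
To tie the walkthrough together, here's a Python model of the whole loop. It's a sketch of the behavior described above, not a replacement for the executable: the function takes and returns `bytes` instead of touching files `A` and `B`, and the name `hexcompile` is reused only for convenience.

```python
def hexcompile(data: bytes) -> bytes:
    """Model of the loop above: consume 3 input bytes, emit 1 output byte."""
    out = bytearray()
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        if len(chunk) < 3:            # short read: the real program exits here
            break
        digits = []
        for code in chunk[:2]:        # the third byte (the separator) is ignored
            digits.append(code - 87 if code > ord("9") else code - 48)
        # Only the lowest byte of the result is written out.
        out.append(((digits[0] << 4) | digits[1]) & 0xFF)
    return bytes(out)

print(hexcompile(b"48 65 6c 6c 6f\n"))   # b'Hello'
```

Two consequences of the design show up in the model: every pair of digits needs exactly one character after it (a final pair with no trailing separator is silently dropped), and only lowercase `a`-`f` are converted correctly.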
147
00/README.txt
@ -1,147 +0,0 @@
--- stage 00 ---

This directory contains the file 'hexcompile', a handwritten executable.
It takes an input file A containing space/newline/[any character]-separated
hexadecimal numbers and outputs them as bytes to the file B. On 64-bit Linux,
try running ./hexcompile from this directory (I've already provided an A file),
and you will get a file named B containing the text "Hello, world!".
I made this program so that you can use your favorite text editor to write
executables by hand (which have bytes outside of ASCII/UTF-8).
I wrote it with a program called hexedit, which can be found on most Linux
distributions. Only 64-bit Linux is supported, because each OS/architecture
combination would need its own separate executable. The executable is 632 bytes
long, and you could definitely make it smaller if you wanted to. Let's take a
look at what's inside (see hexdump -C hexcompile):
7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
02 00 3e 00 01 00 00 00 78 00 40 00 00 00 00 00
40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 40 00 38 00 01 00 00 00 00 00 00 00
01 00 00 00 07 00 00 00 78 00 00 00 00 00 00 00
78 00 40 00 00 00 00 00 00 00 00 00 00 00 00 00
00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00
00 10 00 00 00 00 00 00 48 b8 74 02 40 00 00 00
00 00 48 89 c7 48 b8 00 00 00 00 00 00 00 00 48
89 c6 48 89 c2 48 b8 02 00 00 00 00 00 00 00 0f
05 48 89 c5 48 b8 76 02 40 00 00 00 00 00 48 89
c7 48 b8 41 00 00 00 00 00 00 00 48 89 c6 48 b8
a4 01 00 00 00 00 00 00 48 89 c2 48 b8 02 00 00
00 00 00 00 00 0f 05 48 89 c1 48 89 ef 48 b8 68
02 40 00 00 00 00 00 48 89 c6 48 b8 03 00 00 00
00 00 00 00 48 89 c2 48 b8 00 00 00 00 00 00 00
00 0f 05 48 89 c3 48 b8 03 00 00 00 00 00 00 00
48 39 d8 0f 8f 37 01 00 00 48 b8 68 02 40 00 00
00 00 00 48 89 c3 48 8b 03 48 89 c3 48 89 c7 48
b8 ff 00 00 00 00 00 00 00 48 21 d8 48 89 c6 48
b8 39 00 00 00 00 00 00 00 48 89 c3 48 89 f0 48
39 d8 0f 8f 1e 00 00 00 48 b8 30 00 00 00 00 00
00 00 48 f7 d8 48 89 f3 48 01 d8 e9 26 00 00 00
00 00 00 00 00 00 48 b8 a9 ff ff ff ff ff ff ff
48 89 f3 48 01 d8 e9 0b 00 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c2 48 b8 ff 00 00 00 00
00 00 00 48 89 c3 48 89 f8 48 c1 e8 08 48 21 d8
48 93 48 b8 39 00 00 00 00 00 00 00 48 93 48 39
d8 0f 8f 1f 00 00 00 48 89 c3 48 b8 d0 ff ff ff
ff ff ff ff 48 01 d8 e9 2a 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c3 48 b8 a9 ff ff ff ff
ff ff 48 01 d8 e9 0c 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c7 48 89 d0 48 c1 e0 04
48 89 fb 48 09 d8 48 93 48 b8 68 02 40 00 00 00
00 00 48 93 48 89 03 48 89 de 48 b8 04 00 00 00
00 00 00 00 48 89 c7 48 b8 01 00 00 00 00 00 00
00 48 89 c2 0f 05 e9 8f fe ff ff 00 00 00 00 00
48 b8 3c 00 00 00 00 00 00 00 0f 05 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 41 00 42 00

Okay, that doesn't tell us much, I'll annotate it below. You might notice that
all the numbers are backwards, e.g. 3e 00 for the number 0x003e (62 decimal).
This is because almost all modern architectures (including x86-64) are
little-endian, meaning that the *least significant byte* goes first, and the
most significant byte goes last. There are various reasons why this is easier to
deal with, which I won't explain here.

-- ELF header --
This header has a bunch of metadata about the executable.

7f 45 4c 46 - Special identifier saying that this is an ELF file (ELF is the
format of almost all Linux executables)
02 - 64-bit
01 - Little-endian
01 - ELF version 1 (there is no version 2 yet)
00 00 00 00 00 00 00 00 00 - Reserved (not important yet, but may be in a later
version of ELF)
02 00 - This is an executable file (not a dynamic library/etc)
3e 00 - Architecture x86-64
01 00 00 00 - Version 1 of ELF (minor version or something)
78 00 40 00 00 00 00 00 - **Entry point of the executable** = 0x400078 (explained later)
40 00 00 00 00 00 00 00 - Program header table offset in bytes from start of file (see below)
00 00 00 00 00 00 00 00 - Section header table offset (we're not using sections)
00 00 00 00 - Flags (not important)
40 00 - The size of this header, in bytes = 64
38 00 - Size of the program header (see below) = 56
01 00 - Number of program headers = 1
00 00 - Size of each section header (unused)
00 00 - Number of section headers (unused)
00 00 - Index of special .shstrtab section (unused)

-- Program header --
The program header describes a segment of data that is loaded into memory when
the program starts. Normally, you would have more than one of these, one for
code, one for read-only data, and one for read-write data, perhaps, but to
simplify things we've only got one, which we'll use for any code and any data
we need. This means it'll have to be read-enabled, write-enabled, *and*
execute-enabled. Normally people don't do this, for security, but we won't worry
about that (don't compile any untrusted code with any compiler from this series!)
Without further ado, here's the contents of the program header:

01 00 00 00 - Segment type 1 (this should be loaded into memory)
07 00 00 00 - Flags = RWE (readable, writeable, and executable)
78 00 00 00 00 00 00 00 - Offset in file = 120
78 00 40 00 00 00 00 00 - Virtual address = 0x400078
- Wait a minute, what's that? -
We just specified the *virtual address* of this segment. This is the virtual
memory address that the segment will be loaded to. Virtual memory means that
memory addresses in our program do not actually correspond to where the memory
is physically stored in RAM. There are many reasons for it, including allowing
different processes to have overlapping memory addresses, making sure that some
memory can't be read/written/executed, etc. You can read more about it
elsewhere.
00 00 00 00 00 00 00 00 - Physical address (not applicable)
00 02 00 00 00 00 00 00 - Size of this segment in the executable file = 512
bytes
00 02 00 00 00 00 00 00 - Size of this segment when loaded into memory = also
512 bytes
00 10 00 00 00 00 00 00 - Segment alignment = 4096 bytes
48 b8 74 02 40 00 00 00
00 00 48 89 c7 48 b8 00 00 00 00 00 00 00 00 48
89 c6 48 89 c2 48 b8 02 00 00 00 00 00 00 00 0f
05 48 89 c5 48 b8 76 02 40 00 00 00 00 00 48 89
c7 48 b8 41 00 00 00 00 00 00 00 48 89 c6 48 b8
a4 01 00 00 00 00 00 00 48 89 c2 48 b8 02 00 00
00 00 00 00 00 0f 05 48 89 c1 48 89 ef 48 b8 68
02 40 00 00 00 00 00 48 89 c6 48 b8 03 00 00 00
00 00 00 00 48 89 c2 48 b8 00 00 00 00 00 00 00
00 0f 05 48 89 c3 48 b8 03 00 00 00 00 00 00 00
48 39 d8 0f 8f 37 01 00 00 48 b8 68 02 40 00 00
00 00 00 48 89 c3 48 8b 03 48 89 c3 48 89 c7 48
b8 ff 00 00 00 00 00 00 00 48 21 d8 48 89 c6 48
b8 39 00 00 00 00 00 00 00 48 89 c3 48 89 f0 48
39 d8 0f 8f 1e 00 00 00 48 b8 30 00 00 00 00 00
00 00 48 f7 d8 48 89 f3 48 01 d8 e9 26 00 00 00
00 00 00 00 00 00 48 b8 a9 ff ff ff ff ff ff ff
48 89 f3 48 01 d8 e9 0b 00 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c2 48 b8 ff 00 00 00 00
00 00 00 48 89 c3 48 89 f8 48 c1 e8 08 48 21 d8
48 93 48 b8 39 00 00 00 00 00 00 00 48 93 48 39
d8 0f 8f 1f 00 00 00 48 89 c3 48 b8 d0 ff ff ff
ff ff ff ff 48 01 d8 e9 2a 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c3 48 b8 a9 ff ff ff ff
ff ff 48 01 d8 e9 0c 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c7 48 89 d0 48 c1 e0 04
48 89 fb 48 09 d8 48 93 48 b8 68 02 40 00 00 00
00 00 48 93 48 89 03 48 89 de 48 b8 04 00 00 00
00 00 00 00 48 89 c7 48 b8 01 00 00 00 00 00 00
00 48 89 c2 0f 05 e9 8f fe ff ff 00 00 00 00 00
48 b8 3c 00 00 00 00 00 00 00 0f 05 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 41 00 42 00
BIN
00/hexcompile
Binary file not shown.
99
README.md
Normal file
@ -0,0 +1,99 @@
# bootstrapping a (Linux x86-64) C compiler

Compilers nowadays are written in languages like C, which themselves need to be
compiled. But then, you need a C compiler to compile your C compiler! Of course,
the very first C compiler was not written in C (because how would it be
compiled?). Instead, it was slowly built up, starting from a very basic
assembler, eventually reaching a full-scale compiler. This process is known as
bootstrapping. In this repository, we'll explore how that's done. Each directory
represents a new "stage" in the process. The first one, `00`, is a hand-written
executable, and the last one will be a C compiler. Each directory has its own
README explaining what's going on.

You can run `bootstrap.sh` to run through and test every stage.
|
||||||
|
|
||||||
|
## the basics
|
||||||
|
|
||||||
|
In this series, I want to explain *everything* that's going on. I'm going to
|
||||||
|
need to assume some passing knowledge about computers, but here's a quick
|
||||||
|
overview of what you'll want to know before starting. I can't explain everything
|
||||||
|
so you may need to do your own research. You don't need to understand each of
|
||||||
|
these in full, just get a general idea at least:
|
||||||
|
|
||||||
|
- what an operating system is
|
||||||
|
- what memory is
|
||||||
|
- what a programming language is
|
||||||
|
- what a compiler is
|
||||||
|
- what an executable file is
|
||||||
|
- number bases -- if a number is preceded by 0x, 0o, or 0b in this series, that
|
||||||
|
means hexadecimal/octal/binary respectively. So 0xff = FF hexadecimal = 255
|
||||||
|
decimal.
|
||||||
|
- what a CPU is
|
||||||
|
- what a CPU architecture is
|
||||||
|
- what a CPU register is
|
||||||
|
- what a pointer is
|
||||||
|
- bits, bytes, kilobytes, etc.
|
||||||
|
- bitwise operations (not, or, and, xor, left shift, right shift)
|
||||||
|
- 2's complement
|
||||||
|
- null-terminated strings
|
||||||
|
- how floating-point numbers work
|
||||||
|
- maybe some basic Intel-style x86-64 assembly (you can probably pick it up on
|
||||||
|
the way though)
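To make the number-base notation above concrete, here is a small illustrative sketch in Python (chosen only for this example; it is not used anywhere in the project). Python happens to accept the same `0x`, `0o`, and `0b` prefixes, and `parse_prefixed` is a hypothetical helper introduced here.

```python
def parse_prefixed(text: str) -> int:
    """Parse a number written with a 0x, 0o, or 0b prefix (or plain decimal)."""
    # Base 0 tells int() to infer the base from the prefix.
    return int(text, 0)

# All three spellings denote the same value, 255 decimal.
values = [parse_prefixed(s) for s in ("0xff", "0o377", "0b11111111")]
print(values)                        # [255, 255, 255]
print(hex(255), oct(255), bin(255))  # 0xff 0o377 0b11111111
```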
|
||||||
|
|
||||||
|
|
||||||
|
## instruction set
|
||||||
|
|
||||||
|
x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages
|
||||||
|
long! So, it makes sense to select only a small subset of it to use for all the
|
||||||
|
stages of our compiler. The set I've chosen can be found in `instructions.txt`.
|
||||||
|
I think it achieves a pretty good balance between having few enough
|
||||||
|
instructions to be manageable and having enough instructions to be useable.
|
||||||
|
To be clear, you don't need to read that file to understand the series, at least
|
||||||
|
not right away.
|
||||||
|
|
||||||
|
## principles
|
||||||
|
|
||||||
|
- as simple as possible
|
||||||
|
|
||||||
|
Bootstrapping a compiler is not an easy task, so we're trying to make it as easy
|
||||||
|
as possible. We don't even necessarily need a standard-compliant C compiler; we
|
||||||
|
only need enough to compile someone else's C compiler, specifically TCC
|
||||||
|
(https://bellard.org/tcc/) since that's a compiler with very few dependencies.
|
||||||
|
|
||||||
|
- efficiency is not a concern
|
||||||
|
|
||||||
|
We will create big and slow executables, and that's okay. It doesn't really
|
||||||
|
matter if compiling TCC takes 8 seconds as opposed to 0.01 seconds; once we compile TCC
|
||||||
|
with itself, we'll get the same executable either way.
|
||||||
|
|
||||||
|
## reflections on trusting trust
|
||||||
|
|
||||||
|
In 1984, Ken Thompson wrote the well-known article
|
||||||
|
[*Reflections on Trusting Trust*](http://users.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf).
|
||||||
|
This is one of the things that inspired me to start this project. To summarize
|
||||||
|
the article: it is possible to create a malicious C compiler which will
|
||||||
|
replicate its own malicious functionalities (e.g. detecting password-checking
|
||||||
|
routines to make them also accept another password the attacker knows) when used
|
||||||
|
to compile other C compilers. For all we know, such a compiler was used to
|
||||||
|
compile GCC, say, and so all programs around today could be compromised. Of
|
||||||
|
course, this is almost certainly not the case, but it's still an
|
||||||
|
interesting experiment to try to create a fully trustable compiler. This
|
||||||
|
project can't necessarily even do that though, because the Linux kernel, which
|
||||||
|
we depend on, is compiled from C, so we can't fully trust *it*. To *truly*
|
||||||
|
create a fully trustable compiler, you'd need to manually write to a USB with a
|
||||||
|
circuit, create an operating system from nothing (without even a text editor),
|
||||||
|
and then follow this series, or maybe you don't even trust your CPU vendor...
|
||||||
|
I'll leave that to someone else.
|
||||||
|
|
||||||
|
## license
|
||||||
|
|
||||||
|
```
|
||||||
|
This project is in the public domain. Any copyright protections from any law
|
||||||
|
for this project are forfeited by the author(s). No warranty is provided for
|
||||||
|
this project, and the author(s) shall not be held liable in connection with it.
|
||||||
|
```
|
||||||
|
|
||||||
|
## contributing
|
||||||
|
|
||||||
|
If you notice a mistake or want to clarify something, you can submit a pull request
|
||||||
|
via GitHub, or email `pommicket at pommicket.com`. Translations are welcome!
|
25
README.txt
|
@ -1,25 +0,0 @@
|
||||||
--- bootstrapping a (Linux x86-64) C compiler ---
|
|
||||||
|
|
||||||
Compilers nowadays are written in languages like C, which themselves need to be
|
|
||||||
compiled. But then, you need a C compiler to compile your C compiler! Of course,
|
|
||||||
the very first C compiler was not written in C (because how would it be
|
|
||||||
compiled?). Instead, it was slowly built up, starting from a very basic
|
|
||||||
assembler, eventually reaching a full-scale compiler. This process is known as
|
|
||||||
bootstrapping. In this repository, we'll explore how that's done. Each directory
|
|
||||||
represents a new "stage" in the process. The first one, "00", is a hand-written
|
|
||||||
executable, and the last one will be a C compiler. Each directory has its own
|
|
||||||
README.txt explaining in full what's going on.
|
|
||||||
|
|
||||||
-- instruction set --
|
|
||||||
x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages
|
|
||||||
long! So, it makes sense to select only a small subset of it to use for all the
|
|
||||||
stages of our compiler. The set I've chosen can be found in instructions.txt (a
|
|
||||||
work in progress). I think it achieves a pretty good balance between
|
|
||||||
having few enough instructions to be manageable and having enough
|
|
||||||
instructions to be useable.
|
|
||||||
|
|
||||||
-- license --
|
|
||||||
|
|
||||||
This software is in the public domain. Any copyright protections from any law
|
|
||||||
for this software are forfeited by the author(s). No warranty is provided for
|
|
||||||
this software, and the author(s) shall not be held liable in connection with it.
|
|
39
bootstrap.sh
Executable file
|
@ -0,0 +1,39 @@
|
||||||
|
#!/bin/sh
|
||||||
|
|
||||||
|
# check OS/architecture
|
||||||
|
|
||||||
|
esc() {
|
||||||
|
: # comment out the following line to disable color output
|
||||||
|
printf '\33[%dm' "$1"
|
||||||
|
}
|
||||||
|
|
||||||
|
echo_red() {
|
||||||
|
esc 31
|
||||||
|
echo "$1"
|
||||||
|
esc 0
|
||||||
|
}
|
||||||
|
|
||||||
|
echo_green() {
|
||||||
|
esc 32
|
||||||
|
echo "$1"
|
||||||
|
esc 0
|
||||||
|
}
|
||||||
|
|
||||||
|
if uname -a | grep -i 'x86_64' | grep -i -q 'linux'; then
|
||||||
|
: # all good
|
||||||
|
else
|
||||||
|
echo_red "Only 64-bit Linux is supported. This doesn't seem to be 64-bit Linux."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
cd 00
|
||||||
|
rm -f B
|
||||||
|
./hexcompile A
|
||||||
|
if [ "$(cat B)" != 'Hello, world!' ]; then
|
||||||
|
echo_red 'Stage 00 failed.'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
rm -f B
|
||||||
|
cd ..
|
||||||
|
|
||||||
|
echo_green 'Done all stages!'
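The stage 00 check above expects `B` to contain `Hello, world!` after `hexcompile` has converted the hex numbers in `A` to bytes. A rough Python model of that conversion is sketched below. It is only an illustration, not how `hexcompile` is implemented, and the hex string is an assumed example rather than the actual contents of `A`.

```python
# Assumed example input: the ASCII codes of "Hello, world!" in hexadecimal.
hex_text = "48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21"

# bytes.fromhex ignores the whitespace between the two-digit numbers.
output = bytes.fromhex(hex_text)
print(output.decode("ascii"))  # Hello, world!
```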
|
|
@ -1,7 +1,9 @@
|
||||||
SYSCALL CALLING CONVENTION
|
Linux syscall calling convention:
|
||||||
rdi rsi rdx r10 r8 r9
|
rax - syscall number
|
||||||
|
rdi, rsi, rdx, r10, r8, r9 - arguments
|
||||||
return value placed in rax
|
return value placed in rax
|
||||||
|
|
||||||
|
Instruction set:
|
||||||
|
|
||||||
mov rax, imm64
|
mov rax, imm64
|
||||||
>48 b8 IMM64
|
>48 b8 IMM64
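As a sketch of how the `48 b8 IMM64` encoding above turns into bytes, the Python snippet below writes the immediate as 8 little-endian bytes. It is illustrative only: `mov_rax_imm64` is a hypothetical helper, not part of this repository. It also assumes the x86-64 Linux `exit` syscall number, 60, and the two-byte `syscall` instruction `0f 05`, neither of which is listed in this part of `instructions.txt`.

```python
import struct

def mov_rax_imm64(imm: int) -> bytes:
    """Encode `mov rax, imm64`: 48 b8, then the immediate as 8 little-endian bytes."""
    # Mask to 64 bits so negative values are stored in two's complement form.
    return bytes([0x48, 0xB8]) + struct.pack("<Q", imm & 0xFFFFFFFFFFFFFFFF)

SYSCALL = bytes([0x0F, 0x05])  # assumed encoding of the `syscall` instruction

# rax holds the syscall number; 60 is assumed to be `exit` on x86-64 Linux.
code = mov_rax_imm64(60) + SYSCALL
print(code.hex(" "))  # 48 b8 3c 00 00 00 00 00 00 00 0f 05
```

The result matches the byte sequence that appears near the end of the stage 00 hex dump.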
|
||||||
|
|