readme edits
This commit is contained in:
parent
befd4a6435
commit
3255cd32d7
7 changed files with 196 additions and 63 deletions
84
00/README.md
84
00/README.md
|
@ -5,13 +5,14 @@ takes input file `in00` containing space/newline/(any character)-separated
|
||||||
hexadecimal digit pairs (e.g. `3f`) and outputs them as bytes to the file
|
hexadecimal digit pairs (e.g. `3f`) and outputs them as bytes to the file
|
||||||
`out00`. On 64-bit Linux, try running `./hexcompile` from this directory (I've
|
`out00`. On 64-bit Linux, try running `./hexcompile` from this directory (I've
|
||||||
already provided an `in00` file, which you can take a look at), and you will get
|
already provided an `in00` file, which you can take a look at), and you will get
|
||||||
a file named `out00` containing the text `Hello, world!`. This stage is needed
|
a file named `out00` containing the text `Hello, world!`. This stage
|
||||||
so that you can use your favorite text editor to write executables by hand
|
lets you use your favorite text editor to write executables
|
||||||
(which have bytes outside of ASCII/UTF-8). I wrote it with a program called
|
(which have bytes outside of ASCII/UTF-8).
|
||||||
hexedit, which can be found on most Linux distributions. Only 64-bit Linux is
|
I made `hexcompile` with a program called
|
||||||
supported, because each OS/architecture combination would need its own separate
|
[hexedit](https://github.com/pixel/hexedit),
|
||||||
executable. The executable is just 632 bytes long, and you could definitely make
|
which can be found in most Linux package managers.
|
||||||
it even smaller if you wanted to. Let's take a look at what's inside (`od -t x1
|
The executable is just 632 bytes long.
|
||||||
|
Let's take a look at what's inside (`od -t x1
|
||||||
-An -v hexcompile`):
|
-An -v hexcompile`):
|
||||||
|
|
||||||
```
|
```
|
||||||
|
@ -74,12 +75,12 @@ version of ELF)
|
||||||
- `02 00` Object type = executable file (not a dynamic library/etc.)
|
- `02 00` Object type = executable file (not a dynamic library/etc.)
|
||||||
- `3e 00` Architecture x86-64
|
- `3e 00` Architecture x86-64
|
||||||
- `01 00 00 00` Version 1 of ELF, again
|
- `01 00 00 00` Version 1 of ELF, again
|
||||||
- `78 00 40 00 00 00 00 00` **Entry point of the executable** = 0x400078 (explained later)
|
- `78 00 40 00 00 00 00 00` **Entry point of the executable** = 0x400078
|
||||||
- `40 00 00 00 00 00 00 00` Program header table offset in bytes from start of file (see below)
|
- `40 00 00 00 00 00 00 00` Program header table offset in bytes from start of file
|
||||||
- `00 00 00 00 00 00 00 00` Section header table offset (we're not using sections)
|
- `00 00 00 00 00 00 00 00` Section header table offset (we're not using sections)
|
||||||
- `00 00 00 00` Flags (not important)
|
- `00 00 00 00` Flags (not important to us)
|
||||||
- `40 00` The size of this header, in bytes = 64
|
- `40 00` The size of this header, in bytes = 64
|
||||||
- `38 00` Size of the program header (see below) = 56
|
- `38 00` Size of the program header = 56
|
||||||
- `01 00` Number of program headers = 1
|
- `01 00` Number of program headers = 1
|
||||||
- `00 00` Size of each section header (unused)
|
- `00 00` Size of each section header (unused)
|
||||||
- `00 00` Number of section headers (unused)
|
- `00 00` Number of section headers (unused)
|
||||||
|
@ -88,8 +89,8 @@ version of ELF)
|
||||||
You might notice that all the numbers are backwards, e.g. `38 00` for the number
|
You might notice that all the numbers are backwards, e.g. `38 00` for the number
|
||||||
0x0038 (56 decimal). This is because almost all modern architectures (including
|
0x0038 (56 decimal). This is because almost all modern architectures (including
|
||||||
x86-64) are little-endian, meaning that the *least significant byte* goes first,
|
x86-64) are little-endian, meaning that the *least significant byte* goes first,
|
||||||
and the most significant byte goes last. There are various reasons why this is
|
and the most significant byte goes last.
|
||||||
easier to deal with, but I won't explain that here.
|
There are reasons for this ([see here](https://en.wikipedia.org/wiki/Endianness#Optimization), for example, if you're interested).
|
||||||
|
|
||||||
## program header
|
## program header
|
||||||
The program header describes a segment of data that is loaded into memory when
|
The program header describes a segment of data that is loaded into memory when
|
||||||
|
@ -108,11 +109,12 @@ Without further ado, here's the contents of the program header:
|
||||||
|
|
||||||
**wait a minute, what's that?**
|
**wait a minute, what's that?**
|
||||||
|
|
||||||
We just specified the *virtual address* of this segment. This is the virtual
|
This is the virtual
|
||||||
memory address that the segment will be loaded to. Virtual memory means that
|
memory address that the segment will be loaded to.
|
||||||
addresses in our program do not actually correspond to where the memory is
|
Nowadays, computers use virtual memory, meaning that
|
||||||
physically stored in RAM, with the CPU translating between virtual and physical
|
addresses in our program don't actually correspond to where the memory is
|
||||||
memory addresses. There are many reasons for this: making sure each process has
|
physically stored in RAM (the CPU translates between virtual and physical
|
||||||
|
memory addresses). There are many reasons for this: making sure each process has
|
||||||
its own memory space, memory protection, etc. You can read more about it
|
its own memory space, memory protection, etc. You can read more about it
|
||||||
elsewhere.
|
elsewhere.
|
||||||
|
|
||||||
|
@ -129,11 +131,10 @@ that is a multiple of 4096. Our program needs to be loaded into a memory page,
|
||||||
so its *virtual address* needs to be a multiple of 4096. We're using `0x400000`.
|
so its *virtual address* needs to be a multiple of 4096. We're using `0x400000`.
|
||||||
But wait! Didn't we use `0x400078` for the virtual address? Well, yes but that's
|
But wait! Didn't we use `0x400078` for the virtual address? Well, yes but that's
|
||||||
because the *data in the file* is loaded to address `0x400078`. The actual page
|
because the *data in the file* is loaded to address `0x400078`. The actual page
|
||||||
of memory that the OS will allocate for our code will start at `0x400000`. The
|
of memory that the OS will allocate for our segment will start at `0x400000`. The
|
||||||
reason we need to start `0x78` bytes in is that Linux expects the data *in the
|
reason we need to start `0x78` bytes in is that Linux expects the data in the
|
||||||
file* to be at the same position in the page as when it will be loaded, and it
|
file to be at the same position in the page as when it will be loaded, and it
|
||||||
appears at offset `0x78` in our file. But don't worry if you don't understand
|
appears at offset `0x78` in our file.
|
||||||
that.
|
|
||||||
|
|
||||||
## the code
|
## the code
|
||||||
|
|
||||||
|
@ -163,16 +164,17 @@ about it with `man 2 open`.
|
||||||
The first argument, `0x40026d`, is a pointer to some data at the very end of
|
The first argument, `0x40026d`, is a pointer to some data at the very end of
|
||||||
this segment (see further down). Specifically, it holds the bytes
|
this segment (see further down). Specifically, it holds the bytes
|
||||||
`69 6e 30 30 00`, the null-terminated ASCII string `"in00"`.
|
`69 6e 30 30 00`, the null-terminated ASCII string `"in00"`.
|
||||||
This indicates the name of the file. The second argument (`O_RDONLY`, or 0)
|
This indicates the name of the file. The second argument, `0`,
|
||||||
specifies that we will be reading from this file. There is a third argument to
|
specifies that we will (only) be reading from this file. There is a third argument to
|
||||||
this syscall (we'll get to it later), but it's not applicable here so we don't
|
this syscall (we'll get to it later), but it's not applicable here so we don't
|
||||||
set it.
|
set it.
|
||||||
|
|
||||||
This call gives us back a *file descriptor*, which can be used to read from the
|
This call gives us back a *file descriptor*, a number which we can use to read from the
|
||||||
file, in register `rax`. But we don't actually need to look at what file
|
file, in register `rax`. But we don't actually need to look at what file
|
||||||
descriptor Linux gave us. This is because Linux assigns file descriptor numbers
|
descriptor Linux gave us. This is because Linux assigns file descriptor numbers
|
||||||
sequentially, starting from `0` for standard input, `1` for standard output, `2`
|
sequentially, starting from
|
||||||
for standard error, and then `3, 4, 5, ...` for any files our program opens. So
|
[0 for stdin, 1 for stdout, 2 for stderr](https://en.wikipedia.org/wiki/Standard_streams),
|
||||||
|
and then 3, 4, 5, ... for any files our program opens. So
|
||||||
this file, the first one our program opens, will have descriptor `3`.
|
this file, the first one our program opens, will have descriptor `3`.
|
||||||
|
|
||||||
Now we open our output file:
|
Now we open our output file:
|
||||||
|
@ -194,8 +196,8 @@ file (`O_WRONLY = 0x01`), that we want to create it if it doesn't exist
|
||||||
(`O_TRUNC = 0x200`). Secondly, we are setting the third argument this time. It
|
(`O_TRUNC = 0x200`). Secondly, we are setting the third argument this time. It
|
||||||
specifies the permissions our file is created with (`0o755` means user
|
specifies the permissions our file is created with (`0o755` means user
|
||||||
read/write/execute, group/other read/execute). This is not very important to
|
read/write/execute, group/other read/execute). This is not very important to
|
||||||
the actual execution of the program, so don't worry if you don't know what it
|
the actual execution of the program, so don't worry if you don't know
|
||||||
means.
|
about UNIX permissions.
|
||||||
|
|
||||||
Now we can start reading from the file. We're going to loop back to this part of
|
Now we can start reading from the file. We're going to loop back to this part of
|
||||||
the code every time we want to read a new hexadecimal number from the input
|
the code every time we want to read a new hexadecimal number from the input
|
||||||
|
@ -210,7 +212,7 @@ file.
|
||||||
- `0f 05` `syscall`
|
- `0f 05` `syscall`
|
||||||
|
|
||||||
In C, this is `read(3, 0x40026a, 3)`. Here we call syscall #0, `read`, with
|
In C, this is `read(3, 0x40026a, 3)`. Here we call syscall #0, `read`, with
|
||||||
arguments:
|
three arguments:
|
||||||
|
|
||||||
- `fd = 3` This is the descriptor number of our input file.
|
- `fd = 3` This is the descriptor number of our input file.
|
||||||
- `buf = 0x40026a` This is the memory address we want Linux to output the data
|
- `buf = 0x40026a` This is the memory address we want Linux to output the data
|
||||||
|
@ -221,7 +223,6 @@ We're telling Linux to output to `0x40026a`, which is just a part of this
|
||||||
segment (see further down). Normally you would read to a different segment of
|
segment (see further down). Normally you would read to a different segment of
|
||||||
the program from where the code is, but we want this to be as simple as
|
the program from where the code is, but we want this to be as simple as
|
||||||
possible.
|
possible.
|
||||||
|
|
||||||
The number of bytes *actually read*, taking into account that we might have
|
The number of bytes *actually read*, taking into account that we might have
|
||||||
reached the end of the file, is stored in `rax`.
|
reached the end of the file, is stored in `rax`.
|
||||||
|
|
||||||
|
@ -231,8 +232,8 @@ reached the end of the file, is stored in `rax`.
|
||||||
- `0f 8f 50 01 00 00` `jg 0x400250`
|
- `0f 8f 50 01 00 00` `jg 0x400250`
|
||||||
|
|
||||||
This tells the CPU to jump to a later part of the code (address `0x400250`) if 3
|
This tells the CPU to jump to a later part of the code (address `0x400250`) if 3
|
||||||
is greater than the number of bytes read in (in other words, if we reached the
|
is greater than the number of bytes we got, in other words, if we reached the
|
||||||
end of the file). Note that we don't specifiy the *address* to jump to, but
|
end of the file. Note that we don't specifiy the *address* to jump to, but
|
||||||
instead the *relative address*, relative to the first byte after the jump
|
instead the *relative address*, relative to the first byte after the jump
|
||||||
instruction (so here we're saying to jump `0x150` bytes forward). There are
|
instruction (so here we're saying to jump `0x150` bytes forward). There are
|
||||||
reasons for this which I won't get into here.
|
reasons for this which I won't get into here.
|
||||||
|
@ -299,7 +300,7 @@ the one above:
|
||||||
- `48 89 fb` `mov rbx, rdi`
|
- `48 89 fb` `mov rbx, rdi`
|
||||||
- `48 09 d8` `or rax, rbx`
|
- `48 09 d8` `or rax, rbx`
|
||||||
|
|
||||||
Okay, now we have the byte specified by the two hex digits we read in `rax`.
|
Okay, now `rax` contains the byte specified by the two hex digits we read.
|
||||||
|
|
||||||
- `48 89 c3` `mov rbx, rax`
|
- `48 89 c3` `mov rbx, rax`
|
||||||
- `48 b8 6c 02 40 00 00 00 00 00` `mov rax, 0x40026c`
|
- `48 b8 6c 02 40 00 00 00 00 00` `mov rax, 0x40026c`
|
||||||
|
@ -343,8 +344,8 @@ These bytes aren't actually used by our program, and could be set to anything.
|
||||||
These are here because I wasn't sure how long the program would be when I
|
These are here because I wasn't sure how long the program would be when I
|
||||||
started, so I just set the segment size to 512 bytes, which turned out to be
|
started, so I just set the segment size to 512 bytes, which turned out to be
|
||||||
more than enough. I could have cut these out and edited all the addresses to get
|
more than enough. I could have cut these out and edited all the addresses to get
|
||||||
a smaller, cleaner executable, but I'm leaving them in because that's what you
|
a smaller executable, but really there's no point—modern
|
||||||
probably would do if you were doing this for non-instructional purposes.
|
computers can definitely handle 600-byte files.
|
||||||
|
|
||||||
- `31 c0` `mov rax, 0`
|
- `31 c0` `mov rax, 0`
|
||||||
- `48 89 c7` `mov rdi, rax`
|
- `48 89 c7` `mov rdi, rax`
|
||||||
|
@ -355,7 +356,7 @@ This is where we conditionally jumped to way back when we determined if we
|
||||||
reached the end of the file. This calls syscall #60, `exit`, with one argument,
|
reached the end of the file. This calls syscall #60, `exit`, with one argument,
|
||||||
0 (exit code 0, indicating we exited successfully).
|
0 (exit code 0, indicating we exited successfully).
|
||||||
|
|
||||||
You'd normally close the files first (with syscall #3), to tell Linux you're
|
Normally, you should close files descriptors (with syscall #3), to tell Linux you're
|
||||||
done with them, but we don't need to. It'll automatically close all our open
|
done with them, but we don't need to. It'll automatically close all our open
|
||||||
file descriptors when our program exits.
|
file descriptors when our program exits.
|
||||||
|
|
||||||
|
@ -373,8 +374,8 @@ editor and get them translated into a binary file.
|
||||||
|
|
||||||
There are many ways in which this is a bad program. It will *only* properly
|
There are many ways in which this is a bad program. It will *only* properly
|
||||||
handle lowercase hexadecimal digit pairs, separated by exactly one character,
|
handle lowercase hexadecimal digit pairs, separated by exactly one character,
|
||||||
with a terminating character. What's worse, a bad input file (maybe you
|
with a terminating character. What's worse, a bad input file (maybe someone
|
||||||
accidentally write `3F` instead of `3f`) won't print out a nice error message,
|
accidentally writes `3F` instead of `3f`) won't print out a nice error message,
|
||||||
but instead continue processing as usual, without any indication that anything's
|
but instead continue processing as usual, without any indication that anything's
|
||||||
gone wrong, giving you an unexpected result.
|
gone wrong, giving you an unexpected result.
|
||||||
Also, we only read in data *three bytes at a time*, and output one byte at a
|
Also, we only read in data *three bytes at a time*, and output one byte at a
|
||||||
|
@ -385,4 +386,5 @@ a while.
|
||||||
|
|
||||||
But these problems aren't really a big deal. We'll only be running this on
|
But these problems aren't really a big deal. We'll only be running this on
|
||||||
little programs and we'll be sure to check that our input is in the right
|
little programs and we'll be sure to check that our input is in the right
|
||||||
format. And with that, we are ready to move on to the next stage...
|
format. And with that, we are ready to move on to the
|
||||||
|
[next stage...](../01/README.md).
|
||||||
|
|
16
01/README.md
16
01/README.md
|
@ -3,7 +3,7 @@
|
||||||
The code for the compiler for this stage is in the file `in00`. And yes, that's
|
The code for the compiler for this stage is in the file `in00`. And yes, that's
|
||||||
an input to our previous program, `hexcompile`, from stage 00! To compile it,
|
an input to our previous program, `hexcompile`, from stage 00! To compile it,
|
||||||
run `../00/hexcompile` from this directory. You will get a file, `out00`. That
|
run `../00/hexcompile` from this directory. You will get a file, `out00`. That
|
||||||
is the executable for this stage's compiler. Run it (it will read from the file
|
is the executable for this stage's compiler. Run it (it'll read from the file
|
||||||
`in01` I've provided) and you'll get a file `out01`. That executable will print
|
`in01` I've provided) and you'll get a file `out01`. That executable will print
|
||||||
`Hello, world!` when run. Let's take a look at the input we're providing to the
|
`Hello, world!` when run. Let's take a look at the input we're providing to the
|
||||||
stage 01 compiler, `in01`:
|
stage 01 compiler, `in01`:
|
||||||
|
@ -98,10 +98,9 @@ I decided to put the data before the code this time (it made it a bit
|
||||||
easier to work with), so we start a little bit later than the first byte in our
|
easier to work with), so we start a little bit later than the first byte in our
|
||||||
segment. Second and more importantly, rather than 512 bytes, our segment is
|
segment. Second and more importantly, rather than 512 bytes, our segment is
|
||||||
0x21000 = 135168 bytes long! That's because we're storing a table of all the
|
0x21000 = 135168 bytes long! That's because we're storing a table of all the
|
||||||
commands our compiler supports. This table has one eight-byte entry for each
|
commands our compiler supports. This table has one 8-byte entry for each
|
||||||
pair of ASCII characters. There are 128 ASCII characters, so that means it's
|
pair of ASCII characters. There are 128 ASCII characters, so that means it's
|
||||||
`128 * 128 * 8 = 131072` bytes long. And there's a little bit of extra room for
|
`128 * 128 * 8 = 131072` bytes long. This large source file means that compiling our
|
||||||
the actual code. Funnily enough, this large source file means that compiling our
|
|
||||||
stage 01 compiler isn't instantaneous (remember how I said reading 3 bytes at a
|
stage 01 compiler isn't instantaneous (remember how I said reading 3 bytes at a
|
||||||
time would be slow?). On my system, it takes 0.13 seconds to run
|
time would be slow?). On my system, it takes 0.13 seconds to run
|
||||||
`../00/hexcompile`.
|
`../00/hexcompile`.
|
||||||
|
@ -150,7 +149,7 @@ This opens our output file, just like last time.
|
||||||
- `0f 05` `syscall`
|
- `0f 05` `syscall`
|
||||||
|
|
||||||
Here we read two bytes from our input file into memory address `0x400083`. Note
|
Here we read two bytes from our input file into memory address `0x400083`. Note
|
||||||
that this corresponds to those two blanks bytes at the start of our error
|
that this corresponds to those two blank bytes at the start of our error
|
||||||
message.
|
message.
|
||||||
|
|
||||||
- `48 89 c3` `mov rbx, rax`
|
- `48 89 c3` `mov rbx, rax`
|
||||||
|
@ -326,8 +325,8 @@ very end:
|
||||||
|
|
||||||
This is at the position for `||`, and it contains an ELF header. One thing you
|
This is at the position for `||`, and it contains an ELF header. One thing you
|
||||||
might notice is that we decided that each entry is 8 bytes long, but this one is
|
might notice is that we decided that each entry is 8 bytes long, but this one is
|
||||||
0x79 = 121 bytes long! It's okay, our code doesn't actually check that we're
|
0x79 = 121 bytes long! It's okay, our code doesn't check if we use more
|
||||||
using less than 8 bytes of data, but it means that the entries for certain
|
than 8 bytes of data, but it means that the entries for certain
|
||||||
commands, e.g. `}\n` will land right in the middle of the data for the ELF
|
commands, e.g. `}\n` will land right in the middle of the data for the ELF
|
||||||
header. But by a lucky coincidence, all those entries actually land on 0 bytes,
|
header. But by a lucky coincidence, all those entries actually land on 0 bytes,
|
||||||
so they'll just be treated as unrecognized (as they should be).
|
so they'll just be treated as unrecognized (as they should be).
|
||||||
|
@ -338,7 +337,8 @@ Like our last program, this one will be slow for large files. Again, that isn't
|
||||||
much of a problem for us. Also, if you forget a `;` at the end of a file, it'll
|
much of a problem for us. Also, if you forget a `;` at the end of a file, it'll
|
||||||
loop infinitely rather than printing a nice error message. I could have
|
loop infinitely rather than printing a nice error message. I could have
|
||||||
fixed this, but frankly I've had enough of writing code in hexadecimal. So let's
|
fixed this, but frankly I've had enough of writing code in hexadecimal. So let's
|
||||||
move on to stage 02, now that we have a nicer language on our hands. From now
|
move on to [stage 02](../02/README.md),
|
||||||
|
now that we have a nicer language on our hands. From now
|
||||||
on, since we have comments, I'm gonna do most of the explaining in the source file
|
on, since we have comments, I'm gonna do most of the explaining in the source file
|
||||||
itself, rather than the README. But there'll still be a bit of stuff there each
|
itself, rather than the README. But there'll still be a bit of stuff there each
|
||||||
time.
|
time.
|
||||||
|
|
|
@ -1,2 +1,7 @@
|
||||||
|
all: out01 out02 README.html
|
||||||
out01: in01
|
out01: in01
|
||||||
../01/out00
|
../01/out00
|
||||||
|
out02: out01
|
||||||
|
./out01
|
||||||
|
%.html: %.md ../markdown
|
||||||
|
../markdown $<
|
||||||
|
|
110
02/README.md
Normal file
110
02/README.md
Normal file
|
@ -0,0 +1,110 @@
|
||||||
|
# stage 02
|
||||||
|
|
||||||
|
The compiler for this stage is in the file `in01`, an input for our previous compiler.
|
||||||
|
The specifics of how this compiler works are in the comments in that file, but here I'll
|
||||||
|
give an overview.
|
||||||
|
Let's take a look at `in02`, an example input file for this compiler:
|
||||||
|
```
|
||||||
|
jm
|
||||||
|
:-co jump to code
|
||||||
|
::hw
|
||||||
|
'H
|
||||||
|
'e
|
||||||
|
'l
|
||||||
|
'l
|
||||||
|
'o
|
||||||
|
',
|
||||||
|
'
|
||||||
|
'w
|
||||||
|
'o
|
||||||
|
'r
|
||||||
|
'l
|
||||||
|
'd
|
||||||
|
'!
|
||||||
|
\n
|
||||||
|
::he end of hello world
|
||||||
|
::co start of code
|
||||||
|
//
|
||||||
|
// now we'll calculate the length of the hello world string
|
||||||
|
// by subtracting hw from he.
|
||||||
|
//
|
||||||
|
im
|
||||||
|
--he
|
||||||
|
BA
|
||||||
|
im
|
||||||
|
--hw
|
||||||
|
nA
|
||||||
|
+B
|
||||||
|
DA put length in rdx
|
||||||
|
// okay now we can write it
|
||||||
|
im
|
||||||
|
##1.
|
||||||
|
JA set rdi to 1 (stdout)
|
||||||
|
im
|
||||||
|
--hw
|
||||||
|
IA set rsi to a pointer to "Hello, world!\n"
|
||||||
|
im
|
||||||
|
##1. write
|
||||||
|
sy
|
||||||
|
im
|
||||||
|
##0. exit code 0
|
||||||
|
JA
|
||||||
|
im
|
||||||
|
##3c. exit = syscall 0x3c
|
||||||
|
sy
|
||||||
|
```
|
||||||
|
|
||||||
|
You can try adding more characters to the hello world message, and it'll just work;
|
||||||
|
the length of the text is computed automatically!
|
||||||
|
|
||||||
|
This time, commands are separated by newlines instead of semicolons.
|
||||||
|
Each line begins with a 2-character command identifier. There are some special identifiers though:
|
||||||
|
|
||||||
|
- `::` marks a *label*
|
||||||
|
- `--` outputs a label's (absolute) address
|
||||||
|
- `:-` outputs a label's relative address
|
||||||
|
- `##` outputs a number
|
||||||
|
|
||||||
|
All other commands work like they did in the previous compiler—if you scroll down in the
|
||||||
|
`in01` source file, you'll see the full command table.
|
||||||
|
|
||||||
|
## labels
|
||||||
|
|
||||||
|
Labels are the most important new feature of this language.
|
||||||
|
|
||||||
|
## two passes?
|
||||||
|
|
||||||
|
## other features
|
||||||
|
|
||||||
|
Now instead of writing out each of the 8 bytes making up a number,
|
||||||
|
we can just write it in hexadecimal (e.g. `##3c.` for `3c 00 00 00 00 00 00 00`),
|
||||||
|
and the compiler will automatically
|
||||||
|
extend it to 8 bytes.
|
||||||
|
This is especially nice because we don't need to write numbers backwards
|
||||||
|
for little-endianness anymore!
|
||||||
|
Numbers cannot appear at the end of a line (this was
|
||||||
|
to make the compiler simpler to write), so I'm adding a `.` at the end of
|
||||||
|
each one to avoid making that mistake.
|
||||||
|
|
||||||
|
Anything after a command is treated as a comment;
|
||||||
|
additionally `//` can be used for comments on their own lines.
|
||||||
|
I decided to implement them as simply as possible:
|
||||||
|
I just added the command `//` to the command table, which outputs the byte `0x90`—this
|
||||||
|
means "do nothing" (`nop`) in x86-64.
|
||||||
|
Note that this means that the following code will not work as expected:
|
||||||
|
```
|
||||||
|
im
|
||||||
|
// load the value 0x333 into rax
|
||||||
|
##333.
|
||||||
|
```
|
||||||
|
since `0x90` gets inserted between the "load immediate" instruction code, and the immediate.
|
||||||
|
|
||||||
|
## limitations
|
||||||
|
|
||||||
|
Many of the limitations of our previous compilers apply to this one. Also,
|
||||||
|
if you use a label without defining it, it uses address 0, rather than outputting
|
||||||
|
an error message. This could be fixed: if the value in the label table is 0, and if we are
|
||||||
|
on the second pass, output an error message. This compiler was already tedious enough
|
||||||
|
to implement, though!
|
||||||
|
But thanks to labels, for future compilers at least we won't have to calculate
|
||||||
|
any jump offsets manually.
|
1
Makefile
1
Makefile
|
@ -1,6 +1,7 @@
|
||||||
all: markdown README.html
|
all: markdown README.html
|
||||||
$(MAKE) -C 00
|
$(MAKE) -C 00
|
||||||
$(MAKE) -C 01
|
$(MAKE) -C 01
|
||||||
|
$(MAKE) -C 02
|
||||||
markdown: markdown.c
|
markdown: markdown.c
|
||||||
$(CC) -O2 -o markdown -Wall -Wconversion -Wshadow -std=c89 markdown.c
|
$(CC) -O2 -o markdown -Wall -Wconversion -Wshadow -std=c89 markdown.c
|
||||||
README.html: markdown README.md
|
README.html: markdown README.md
|
||||||
|
|
26
README.md
26
README.md
|
@ -3,9 +3,9 @@
|
||||||
Compilers nowadays are written in languages like C, which themselves need to be
|
Compilers nowadays are written in languages like C, which themselves need to be
|
||||||
compiled. But then, you need a C compiler to compile your C compiler! Of course,
|
compiled. But then, you need a C compiler to compile your C compiler! Of course,
|
||||||
the very first C compiler was not written in C (because how would it be
|
the very first C compiler was not written in C (because how would it be
|
||||||
compiled?). Instead, it was slowly built up, starting from a very basic
|
compiled?). Instead, it was built up over time, starting from a very basic
|
||||||
assembler, eventually reacing a full-scale compiler. This process is known as
|
assembler, eventually reaching a full-scale compiler.
|
||||||
bootstrapping. In this repository, we'll explore how that's done. Each directory
|
In this repository, we'll explore how that's done. Each directory
|
||||||
represents a new "stage" in the process. The first one, `00`, is a hand-written
|
represents a new "stage" in the process. The first one, `00`, is a hand-written
|
||||||
executable, and the last one will be a C compiler. Each directory has its own
|
executable, and the last one will be a C compiler. Each directory has its own
|
||||||
README explaining what's going on.
|
README explaining what's going on.
|
||||||
|
@ -13,15 +13,21 @@ README explaining what's going on.
|
||||||
You can run `bootstrap.sh` to run through and test every stage.
|
You can run `bootstrap.sh` to run through and test every stage.
|
||||||
To get HTML versions of all README pages, run `make`.
|
To get HTML versions of all README pages, run `make`.
|
||||||
|
|
||||||
## the basics
|
Note that the executables produced in this series will only run on
|
||||||
|
64-bit Linux, because each OS/architecture combination would need its own separate
|
||||||
|
executable.
|
||||||
|
|
||||||
|
The README for the first stage is [here](00/README.md).
|
||||||
|
|
||||||
|
## prerequisite knowledge
|
||||||
|
|
||||||
In this series, I want to explain *everything* that's going on. I'm going to
|
In this series, I want to explain *everything* that's going on. I'm going to
|
||||||
need to assume some passing knowledge, so here's a quick overview of what you'll
|
need to assume some passing knowledge, so here's a quick overview of what you'll
|
||||||
want to know before starting. I can't explain everything so you may need to do
|
want to know before starting.
|
||||||
your own research. You don't need to understand each of these in full, just get
|
You don't need to understand everything about each of these, just get
|
||||||
a general idea at least:
|
a general idea:
|
||||||
|
|
||||||
- what an operating system is
|
- what a system call is
|
||||||
- what memory is
|
- what memory is
|
||||||
- what a programming language is
|
- what a programming language is
|
||||||
- what a compiler is
|
- what a compiler is
|
||||||
|
@ -35,7 +41,7 @@ decimal.
|
||||||
- bits, bytes, kilobytes, etc.
|
- bits, bytes, kilobytes, etc.
|
||||||
- bitwise operations (not, or, and, xor, left shift, right shift)
|
- bitwise operations (not, or, and, xor, left shift, right shift)
|
||||||
- 2's complement
|
- 2's complement
|
||||||
- null-terminated strings
|
- ASCII, null-terminated strings
|
||||||
- how pointers work
|
- how pointers work
|
||||||
- how floating-point numbers work
|
- how floating-point numbers work
|
||||||
- maybe some basic Intel-style x86-64 assembly (you can probably pick it up on
|
- maybe some basic Intel-style x86-64 assembly (you can probably pick it up on
|
||||||
|
@ -73,7 +79,7 @@ with itself, we'll get the same executable either way.
|
||||||
|
|
||||||
In 1984, Ken Thompson wrote the well-known article
|
In 1984, Ken Thompson wrote the well-known article
|
||||||
[Reflections on Trusting Trust](http://users.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf).
|
[Reflections on Trusting Trust](http://users.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf).
|
||||||
This is one of the things that inspired me to start this project. To summarize
|
This is one of the inspirations for this project. To summarize
|
||||||
the article: it is possible to create a malicious C compiler which will
|
the article: it is possible to create a malicious C compiler which will
|
||||||
replicate its own malicious functionalities (e.g. detecting password-checking
|
replicate its own malicious functionalities (e.g. detecting password-checking
|
||||||
routines to make them also accept another password the attacker knows) when used
|
routines to make them also accept another password the attacker knows) when used
|
||||||
|
|
17
markdown.c
17
markdown.c
|
@ -57,7 +57,8 @@ static void output_md_text(FILE *out, int *flags, int line_number, const char *t
|
||||||
break;
|
break;
|
||||||
case '[': {
|
case '[': {
|
||||||
/* link */
|
/* link */
|
||||||
const char *label, *url, *label_end, *url_end;
|
char url2[256] = {0};
|
||||||
|
const char *label, *url, *label_end, *url_end, *dot;
|
||||||
int n_label, n_url;
|
int n_label, n_url;
|
||||||
|
|
||||||
label = p+1;
|
label = p+1;
|
||||||
|
@ -79,8 +80,16 @@ static void output_md_text(FILE *out, int *flags, int line_number, const char *t
|
||||||
|
|
||||||
n_label = (int)(label_end - label);
|
n_label = (int)(label_end - label);
|
||||||
n_url = (int)(url_end - url);
|
n_url = (int)(url_end - url);
|
||||||
fprintf(out, "<a href=\"%.*s\" target=\"_blank\">%.*s</a>",
|
if (n_url > sizeof url2-8)
|
||||||
n_url, url, n_label, label);
|
n_url = sizeof url2-8;
|
||||||
|
sprintf(url2, "%.*s", n_url, url);
|
||||||
|
dot = strrchr(url2, '.');
|
||||||
|
if (dot && strcmp(dot, ".md") == 0) {
|
||||||
|
/* replace links to md files with links to html files */
|
||||||
|
strcpy(dot, ".html");
|
||||||
|
}
|
||||||
|
fprintf(out, "<a href=\"%s\" target=\"_blank\">%.*s</a>",
|
||||||
|
url2, n_label, label);
|
||||||
p = url_end;
|
p = url_end;
|
||||||
} break;
|
} break;
|
||||||
case '-':
|
case '-':
|
||||||
|
@ -150,7 +159,7 @@ int main(int argc, char **argv) {
|
||||||
"<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n"
|
"<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n"
|
||||||
"<meta charset=\"utf-8\">\n"
|
"<meta charset=\"utf-8\">\n"
|
||||||
"<style>\n"
|
"<style>\n"
|
||||||
"body { font-family: sans-serif; }\n"
|
"body { font-family: serif; }\n"
|
||||||
"</style>\n"
|
"</style>\n"
|
||||||
"<title>%s</title>\n"
|
"<title>%s</title>\n"
|
||||||
"</head>\n"
|
"</head>\n"
|
||||||
|
|
Loading…
Add table
Add a link
Reference in a new issue