169 lines
5.8 KiB
Markdown
169 lines
5.8 KiB
Markdown
|
# stage 03
|
||
|
The code for this compiler (the file `in02`, an input for our [stage 02 compiler](../02/README.md))
|
||
|
is 2700 lines—quite a bit larger than the previous ones. And as we'll see, it's a lot more powerful too.
|
||
|
To compile it, run `../02/out01` from this directory.
|
||
|
Let's take a look at `in03`, the example program I've written for it:
|
||
|
```
|
||
|
B=:hello_world
|
||
|
call :puts
|
||
|
; exit code 0
|
||
|
J=d0
|
||
|
syscall x3c
|
||
|
|
||
|
:hello_world
|
||
|
str Hello, world!
|
||
|
xa
|
||
|
x0
|
||
|
|
||
|
; output null-terminated string in rbx
|
||
|
:puts
|
||
|
R=B
|
||
|
call :strlen
|
||
|
D=A
|
||
|
I=R
|
||
|
J=d1
|
||
|
syscall d1
|
||
|
return
|
||
|
|
||
|
; calculate length of string in rbx
|
||
|
:strlen
|
||
|
; keep pointer to start of string
|
||
|
D=B
|
||
|
I=B
|
||
|
:strlen_loop
|
||
|
C=1I
|
||
|
?C=0:strlen_loop_end
|
||
|
I+=d1
|
||
|
!:strlen_loop
|
||
|
:strlen_loop_end
|
||
|
I-=D
|
||
|
A=I
|
||
|
return
|
||
|
```
|
||
|
This language looks a lot nicer than the previous one. No more obscure two-letter label names
|
||
|
and commands! Furthermore, try changing `:strlen_loop` on line 31
|
||
|
to a typo like `:strlen_lop`. You should get:
|
||
|
```
|
||
|
Bad label 001f
|
||
|
```
|
||
|
Not only do we get an error message, we also get the line number
|
||
|
of the error! It's in hexadecimal, unfortunately, but that's
|
||
|
better than nothing.
|
||
|
|
||
|
I spent a while on this compiler (perhaps I went a bit overboard
|
||
|
on the features), because for the 02 language
|
||
|
was the first that was actually pleasant to use!
|
||
|
It's much less sophisticated than even most assembly languages,
|
||
|
but being able to use labels without having to worry about filling
|
||
|
in the offsets later made it way nicer to use than the previous
|
||
|
languages.
|
||
|
|
||
|
In addition to `in03`, this directory also has `ex03`,
|
||
|
which gives examples of all of the instructions supported by this compiler.
|
||
|
|
||
|
Seeing as this is a relatively large compiler,
|
||
|
here is an overview of how it works:
|
||
|
|
||
|
## functions
|
||
|
|
||
|
Thanks to labels, we can actually use functions in this compiler, without
|
||
|
it being a complete nightmare. Functions are called like this:
|
||
|
```
|
||
|
im
|
||
|
--fu
|
||
|
cl (this would call the function ::fu)
|
||
|
```
|
||
|
and at the end of each function, we get `re`, which returns from the function.
|
||
|
I've used the convention of storing return values in `rax` and
|
||
|
passing the argument to a unary function in `rbx`.
|
||
|
|
||
|
This compiler ended up having a lot of functions, some of them used in all sorts
|
||
|
of different places.
|
||
|
|
||
|
## execution
|
||
|
|
||
|
Just as with the 02 compiler, we need two passes:
|
||
|
the first one
|
||
|
computes the address of each label,
|
||
|
and the second one uses the correct addresses to
|
||
|
write the executable.
|
||
|
|
||
|
Each pass is a loop, which starts by incrementing
|
||
|
the line number (`::L#`). Then we read in a line
|
||
|
from the source file, `in03`. This is done one character
|
||
|
at a time, until a newline is reached. The line is stored
|
||
|
in the buffer `::LI`. In the remainder of the program we
|
||
|
(mostly) use the fact that the line is newline-terminated,
|
||
|
rather than keeping track of how long it is.
|
||
|
|
||
|
Once the line is read in, a bunch of tests are performed on it.
|
||
|
We start by looking at the first character: if it's a `;`,
|
||
|
the line is a comment; if it's a `!`, it's an unconditional jump; etc.
|
||
|
Failing that, we look at the second character, to see if it's
|
||
|
`=`, `+=`, `-=`, etc. If it doesn't match any of them, we use
|
||
|
the `::s=` (string equals) function, which conveniently lets you
|
||
|
set the terminator. We check if the line is equal to `"syscall"`
|
||
|
up to a terminator of `' '` to check if it's a syscall, for example.
|
||
|
|
||
|
## `+=`, et al.
|
||
|
|
||
|
We can emit the correct instruction for `D+=C` with:
|
||
|
|
||
|
- `mov rbx, rdx`
|
||
|
- `mov rax, rcx`
|
||
|
- `add rax, rbx`
|
||
|
- `mov rdx, rax`
|
||
|
|
||
|
A similar pattern can be used for `-=`, `&=`, etc.
|
||
|
This made it pretty easy to write the implementation of all of these:
|
||
|
there's one function for setting `rbx` to the first operand (`::B1`),
|
||
|
another for setting `rax` to the second operand (`::A2`), and another for
|
||
|
setting the first operand to `rax` (`::1A`). The implementations of
|
||
|
`+=`/`-=`/etc. just call those three functions, with a bit of stuff in between
|
||
|
to perform the corresponding operation.
|
||
|
A similar approach also works for loading/storing values in memory.
|
||
|
|
||
|
## label list
|
||
|
|
||
|
Instead of a label table, we now have a "label list" (or array
|
||
|
if you prefer) at `::LB`.
|
||
|
A pointer to the current end of the list is stored at `::L$`.
|
||
|
Each entry is the name of the label, including the `:`, then a newline,
|
||
|
then the 4-byte address.
|
||
|
`::ll` is used to look up labels. If it's the first pass,
|
||
|
`::ll` just returns 0. Otherwise, it looks up the label by
|
||
|
comparing it to each entry using `s=` with a terminator of `'\n'`.
|
||
|
If no label matches, we get an error.
|
||
|
|
||
|
## alignment
|
||
|
A lot of data used in this program is
|
||
|
[not correctly aligned](https://en.wikipedia.org/wiki/Bus_error#Unaligned_access)—e.g.
|
||
|
8-byte values are not always stored at an address that is a multiple of 8.
|
||
|
This would be a problem on some processors, but x86-64 can handle it.
|
||
|
It's still not a good idea in practice—reading unaligned memory
|
||
|
is much slower. But we're not really concerned about performance here,
|
||
|
and it would be a bit finnicky to align everything correctly.
|
||
|
However, I have introduced `align` into this language,
|
||
|
which you can put before a label to ensure that its address is aligned
|
||
|
to 8 bytes.
|
||
|
|
||
|
## errors
|
||
|
|
||
|
Errors are handled in functions beginning with `!`, e.g. `::!n` for "bad number".
|
||
|
Each of these ends up calling `::er`. `::er` prints
|
||
|
a string specific to the type of error, then
|
||
|
converts the line number to a string, and prints it.
|
||
|
The line number is always converted to a 4-digit hexadecimal number.
|
||
|
This means it won't fully work past 65,535 lines, but
|
||
|
let's hope we don't need to write any programs that long!
|
||
|
|
||
|
## limitations
|
||
|
|
||
|
Functions in this 03 language will probably overwrite the previous values
|
||
|
of registers. This can make it kind of annoying to call functions, since
|
||
|
you need to make sure you store away any information you'll need after the function.
|
||
|
And the language definitely won't be as nice to use as something with real variables. But overall,
|
||
|
I'm very happy with this compiler, considering it's written in a language with 2-letter label
|
||
|
names.
|
||
|
|