03 README

pommicket 2021-11-14 00:33:40 -05:00
parent f7f1f10cb0
commit 7bb8ab02f7
8 changed files with 263 additions and 46 deletions


@ -1,7 +1,7 @@
# stage 01
The code for the compiler for this stage is in the file `in00`. And yes, that's
an input to our previous program, `hexcompile`, from stage 00! To compile it,
an input to our [previous program](../00/README.html), `hexcompile`, from stage 00! To compile it,
run `../00/hexcompile` from this directory. You will get a file, `out00`. That
is the executable for this stage's compiler. Run it (it'll read from the file
`in01` I've provided) and you'll get a file `out01`. That executable will print


@ -1,6 +1,6 @@
# stage 02
The compiler for this stage is in the file `in01`, an input for our previous compiler.
The compiler for this stage is in the file `in01`, an input for our [previous compiler](../01/README.md).
So if you run `../01/out00`, you'll get the file `out01`, which is
this stage's compiler.
The specifics of how this compiler works are in the comments in `in01`, but here I'll
@ -187,5 +187,5 @@ if you use a label without defining it, it uses address 0, rather than outputtin
an error message. This could be fixed: if the value in the label table is 0 and we are
on the second pass, output an error message. Also, duplicate labels aren't detected.
But thanks to labels, for future compilers at least we won't have to calculate
any jump offsets manually.
But thanks to labels, at least we won't have to calculate
any jump offsets manually anymore. With that, let's move on to [stage 03](../03/README.md).


@ -1,4 +1,4 @@
all: out02 out03
all: out02 out03 README.html
out02: in02 ../02/out01
../02/out01
out03: out02 in03

03/README.md

@ -0,0 +1,168 @@
# stage 03
The code for this compiler (the file `in02`, an input for our [stage 02 compiler](../02/README.md))
is 2700 lines—quite a bit larger than the previous ones. And as we'll see, it's a lot more powerful too.
To compile it, run `../02/out01` from this directory.
Let's take a look at `in03`, the example program I've written for it:
```
B=:hello_world
call :puts
; exit code 0
J=d0
syscall x3c
:hello_world
str Hello, world!
xa
x0
; output null-terminated string in rbx
:puts
R=B
call :strlen
D=A
I=R
J=d1
syscall d1
return
; calculate length of string in rbx
:strlen
; keep pointer to start of string
D=B
I=B
:strlen_loop
C=1I
?C=0:strlen_loop_end
I+=d1
!:strlen_loop
:strlen_loop_end
I-=D
A=I
return
```
This language looks a lot nicer than the previous one. No more obscure two-letter label names
and commands! Furthermore, try changing `:strlen_loop` on line 31
to a typo like `:strlen_lop`. You should get:
```
Bad label 001f
```
Not only do we get an error message, we also get the line number
of the error! It's in hexadecimal, unfortunately, but that's
better than nothing.
I spent a while on this compiler (perhaps I went a bit overboard
on the features), because the 02 language
was the first that was actually pleasant to use!
It's much less sophisticated than even most assembly languages,
but being able to use labels without having to worry about filling
in the offsets later made it way nicer to use than the previous
languages.
In addition to `in03`, this directory also has `ex03`,
which gives examples of all of the instructions supported by this compiler.
Seeing as this is a relatively large compiler,
here is an overview of how it works:
## functions
Thanks to labels, we can actually use functions in this compiler, without
it being a complete nightmare. Functions are called like this:
```
im
--fu
cl (this would call the function ::fu)
```
and at the end of each function, we get `re`, which returns from the function.
I've used the convention of storing return values in `rax` and
passing the argument to a unary function in `rbx`.
This compiler ended up having a lot of functions, some of them used in all sorts
of different places.
## execution
Just as with the 02 compiler, we need two passes:
the first one
computes the address of each label,
and the second one uses the correct addresses to
write the executable.
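
To make the two-pass idea concrete, here is a minimal Python sketch of a toy assembler. It is only an illustration, not how `in02` is actually written: the `0xff` jump opcode, the 4-byte little-endian address, and the `assemble` function are all made up for this example.

```python
def assemble(lines):
    """Toy two-pass assembler.

    `:name` defines a label, `!:name` jumps to it, `xNN` is a literal byte.
    The 0xff opcode is a placeholder, not a real x86-64 encoding.
    """
    labels = {}
    out = bytearray()
    for second_pass in (False, True):
        out = bytearray()  # start the output over on each pass
        for line in lines:
            if line.startswith(":"):
                # Pass 1 records where each label ends up in the output.
                if not second_pass:
                    labels[line] = len(out)
            elif line.startswith("!"):
                name = line[1:]
                if not second_pass:
                    target = 0  # address not known yet; only the size matters
                elif name in labels:
                    target = labels[name]
                else:
                    raise ValueError("Bad label " + name)
                out += b"\xff" + target.to_bytes(4, "little")
            elif line.startswith("x"):
                out.append(int(line[1:], 16))
    return bytes(out)


program = ["!:end", "x4b", ":end", "x0"]
print(assemble(program).hex())
```

The important property is that every line produces the same number of bytes on both passes, so the addresses recorded on the first pass are still right on the second.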
Each pass is a loop, which starts by incrementing
the line number (`::L#`). Then we read in a line
from the source file, `in03`. This is done one character
at a time, until a newline is reached. The line is stored
in the buffer `::LI`. In the remainder of the program we
(mostly) use the fact that the line is newline-terminated,
rather than keeping track of how long it is.
Once the line is read in, a bunch of tests are performed on it.
We start by looking at the first character: if it's a `;`,
the line is a comment; if it's a `!`, it's an unconditional jump; etc.
Failing that, we look at the second character, to see if it's
`=`, `+=`, `-=`, etc. If it doesn't match any of them, we use
the `::s=` (string equals) function, which conveniently lets you
set the terminator. We check if the line is equal to `"syscall"`
up to a terminator of `' '` to check if it's a syscall, for example.
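
The order of those tests can be sketched in Python roughly like this. This is a simplified illustration covering only a handful of statement types; `str_equals` is my guess at what a terminator-based comparison like `::s=` does, and the `classify` function and the names it returns are hypothetical.

```python
def str_equals(a, b, terminator):
    """Compare two strings up to and including the first terminator character."""
    for ca, cb in zip(a, b):
        if ca != cb:
            return False
        if ca == terminator:
            return True
    return False  # one string ended before the terminator was reached


def classify(line):
    """Classify a newline-terminated source line (only a few kinds are handled)."""
    first = line[:1]
    if first == ";":
        return "comment"
    if first == "!":
        return "jump"
    if first == "?":
        return "conditional jump"
    if first == ":":
        return "label"
    # Failing that, look at what follows the first character.
    if line[1:2] == "=":
        return "assignment"
    if line[1:3] == "+=":
        return "add"
    if line[1:3] == "-=":
        return "subtract"
    # Finally, fall back to keyword comparisons with a chosen terminator.
    if str_equals(line, "syscall ", " "):
        return "syscall"
    return "unknown"


for example in ["; hi\n", "!:loop\n", "D+=d4\n", "syscall x3c\n"]:
    print(repr(example), "->", classify(example))
```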
## `+=`, et al.
We can emit the correct instructions for `D+=C` with:
- `mov rbx, rdx`
- `mov rax, rcx`
- `add rax, rbx`
- `mov rdx, rax`
A similar pattern can be used for `-=`, `&=`, etc.
This made it pretty easy to write the implementation of all of these:
there's one function for setting `rbx` to the first operand (`::B1`),
another for setting `rax` to the second operand (`::A2`), and another for
setting the first operand to `rax` (`::1A`). The implementations of
`+=`/`-=`/etc. just call those three functions, with a bit of stuff in between
to perform the corresponding operation.
A similar approach also works for loading/storing values in memory.
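
As a rough Python sketch of that three-helper structure (emitting assembly text rather than machine code, which is a simplification): the helper names below are hypothetical stand-ins for `::B1`, `::A2` and `::1A`, and only register operands and commutative operators are handled, since the exact sequence used for `-=` isn't spelled out here.

```python
# Register letters as listed at the top of ex03.
REGISTERS = {
    "A": "rax", "B": "rbx", "C": "rcx", "D": "rdx",
    "I": "rsi", "J": "rdi", "S": "rsp", "R": "rbp",
}
# Only commutative operators, so operand order doesn't matter.
OPERATIONS = {"+": "add", "&": "and", "|": "or", "^": "xor"}


def set_rbx_to_first(first):    # stand-in for ::B1
    return "mov rbx, " + REGISTERS[first]


def set_rax_to_second(second):  # stand-in for ::A2
    return "mov rax, " + REGISTERS[second]


def set_first_to_rax(first):    # stand-in for ::1A
    return "mov " + REGISTERS[first] + ", rax"


def emit_compound(statement):
    """Turn a statement like 'D+=C' into a list of instructions."""
    first, op, equals, second = statement[0], statement[1], statement[2], statement[3]
    if equals != "=" or op not in OPERATIONS:
        raise ValueError("unsupported statement: " + statement)
    return [
        set_rbx_to_first(first),
        set_rax_to_second(second),
        OPERATIONS[op] + " rax, rbx",
        set_first_to_rax(first),
    ]


print("\n".join(emit_compound("D+=C")))
```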
## label list
Instead of a label table, we now have a "label list" (or array
if you prefer) at `::LB`.
A pointer to the current end of the list is stored at `::L$`.
Each entry is the name of the label, including the `:`, then a newline,
then the 4-byte address.
`::ll` is used to look up labels. If it's the first pass,
`::ll` just returns 0. Otherwise, it looks up the label by
comparing it to each entry using `s=` with a terminator of `'\n'`.
If no label matches, we get an error.
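
Here is a small Python sketch of that layout and lookup. It assumes the 4-byte address is stored little-endian (which seems likely on x86-64, but is an assumption), and `add_label` and `lookup_label` are hypothetical names, not functions from the compiler.

```python
import struct


def add_label(label_list, name, address):
    """Append an entry: the name (including ':'), a newline, then a 4-byte address."""
    label_list += name.encode() + b"\n" + struct.pack("<I", address)


def lookup_label(label_list, name, second_pass):
    """Return the address of a label, or 0 on the first pass."""
    if not second_pass:
        return 0
    key = name.encode() + b"\n"
    pos = 0
    while pos < len(label_list):
        end = label_list.index(b"\n", pos)  # newline that terminates the name
        if label_list[pos:end + 1] == key:
            (address,) = struct.unpack_from("<I", label_list, end + 1)
            return address
        pos = end + 1 + 4  # skip the newline and the 4-byte address
    raise LookupError("Bad label " + name)


labels = bytearray()
add_label(labels, ":puts", 0x400100)
add_label(labels, ":strlen", 0x400120)
print(hex(lookup_label(labels, ":strlen", second_pass=True)))
```

Because each entry's address has a fixed size, the scan never mistakes an address byte that happens to be `0x0a` for the end of a name.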
## alignment
A lot of data used in this program is
[not correctly aligned](https://en.wikipedia.org/wiki/Bus_error#Unaligned_access)—e.g.
8-byte values are not always stored at an address that is a multiple of 8.
This would be a problem on some processors, but x86-64 can handle it.
It's still not a good idea in practice—reading unaligned memory
is much slower. But we're not really concerned about performance here,
and it would be a bit finicky to align everything correctly.
However, I have introduced `align` into this language,
which you can put before a label to ensure that its address is aligned
to 8 bytes.
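
The arithmetic behind aligning to 8 bytes is just rounding an address up to the next multiple of 8. A minimal Python sketch follows; `align_up` is a hypothetical helper, and how the compiler actually pads the output is not shown here.

```python
def align_up(address, alignment=8):
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


for address in (0, 13, 16, 17):
    padding = align_up(address) - address
    print(address, "->", align_up(address), "(padding:", padding, "bytes)")
```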
## errors
Errors are handled in functions beginning with `!`, e.g. `::!n` for "bad number".
Each of these ends up calling `::er`. `::er` prints
a string specific to the type of error, then
converts the line number to a string, and prints it.
The line number is always converted to a 4-digit hexadecimal number.
This means it won't fully work past 65,535 lines, but
let's hope we don't need to write any programs that long!
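
In Python terms, the message format looks something like the sketch below. The `format_error` name is hypothetical, and masking to the low 16 bits is only my assumption about what "4 digits" implies for very long files.

```python
def format_error(message, line_number):
    """Format an error as the message followed by a 4-digit hexadecimal line number."""
    # Assumption: only the low 16 bits fit in 4 hex digits.
    return "{} {:04x}".format(message, line_number & 0xFFFF)


print(format_error("Bad label", 31))
```

To go the other way when reading an error, `int("001f", 16)` gives the decimal line number, 31.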
## limitations
Functions in this 03 language will probably overwrite the previous values
of registers. This can make it kind of annoying to call functions, since
you need to make sure you store away any information you'll need after the function.
And the language definitely won't be as nice to use as something with real variables. But overall,
I'm very happy with this compiler, considering it's written in a language with 2-letter label
names.

03/ex03

@ -1,42 +1,87 @@
; You can use registers like variables: rax = A, rbx = B, rcx = C, rdx = D, rsi = I, rdi = J, rsp = S, rbp = R
; However, because of the way things are implemented, you should be careful about using A/B as variables:
; they sometimes might not work correctly, and will be overwritten by a lot of statements
; set register to...
; decimal
D=d123
; hexadecimal
D=x1ef
; another register
D=R we can have a comment here and in some other places. not after numbers or labels though.
; label address
D=:label
; add
D+=d4
D+=R
; subtract
D-=d123
D-=R
; left/right shift (only rcx is supported for variable shifts)
D<=C
D<=d33
D>=C
D>=x12
; arithmetic right shift
D]=d7
D]=C
D^=C
D|=C
D&=C
~C
B|=A
8D=C
A=1B
B>=d33
call :funciton
x4b
!:label
?J<B:label
:label
1B=C
; :l ba b
J=d0
A=d60
syscall x3c
align
:label
reserve d1000
B+=J
B<=d9
B-=J
?J=B:label
?A!B:label
?A>B:label
A=:label
x3c
; bitwise xor, or, and
D^=R
D|=R
D&=R
D^=d1
D|=d1
D&=d1
; bitwise not
; (this sets D to ~D)
~D
; dereference
; set 8 bytes at rdx to rbp
8D=R
; set 4 bytes at rdx to ebp
4D=R
2D=R
1D=R
; set rcx/ecx/cx/cl to 8/4/2/1 bytes at rdx
C=8D
C=4D
C=2D
C=1D
; call a function
call :function
; return
return
; label declarations
;:function
;:label
; literal byte
x4b
'H
'i
; string
str This text will appear in the executable!
; unconditional jump
!:label
; conditional jump
?R<S:label
?R=S:label
?R!S:label
?R>S:label
; (unsigned comparisons above/below)
?RaS:label
?RbS:label
; syscall
syscall x3c
; align to 8 bytes
align
; reserve some number of bytes of memory
reserve d1000
; signed/unsigned multiply/divide
imul
idiv
mul
div
:funciton
call A
str Here is some text which will be put in the executable!
?CaD:label
; e.g. to compute 5*3 into rcx (note rdx is wiped in the process):
A=d5
B=d3
mul


@ -2886,6 +2886,9 @@ jm
~~
::LI line buffer
~~
~~
~~
~~
::L$ end of current label list
--LB
::LB labels


@ -1,6 +1,6 @@
; write to stdout
B=:hello_world
call :puts
; exit code 0
J=d0
syscall x3c
@ -11,15 +11,15 @@ x0
; output null-terminated string in rbx
:puts
R=B
call :strlen
I=D
D=A
I=R
J=d1
syscall d1
return
; calculate length of string in rbx
; keeps pointer to start of string in rdx, end of string in rsi
:strlen
; keep pointer to start of string
D=B


@ -24,6 +24,7 @@ hexadecimal digit pairs to a binary file.
- [stage 01](01/README.md) - a language with comments, and 2-character
command codes.
- [stage 02](02/README.md) - a language with labels
- [stage 03](03/README.md) - a language with longer labels, better error messages, and less register manipulation
- more coming soon (hopefully)
## prerequisite knowledge
@ -93,10 +94,10 @@ compile GCC, say, and so all programs around today could be compromised. Of
course, this is practically definitely not the case, but it's still an
interesting experiment to try to create a fully trustable compiler. This
project can't necessarily even do that though, because the Linux kernel, which
we depend on, is compiled from C, so we can't fully trust *it*. To *truly*
create a fully trustable compiler, you'd need to manually write to a USB with a
circuit, create an operating system from nothing (without even a text editor),
and then follow this series, or maybe you don't even trust your CPU...
we depend on, is compiled from C, so we can't fully trust *it*. To
create a *fully* trustable compiler, you'd need to manually write
an operating system to a USB key with a circuit or something,
assuming you trust your CPU...
I'll leave that to someone else.
## license