03 README
parent f7f1f10cb0
commit 7bb8ab02f7
8 changed files with 263 additions and 46 deletions
|
@ -1,7 +1,7 @@
|
||||||
# stage 01
|
# stage 01
|
||||||
|
|
||||||
The code for the compiler for this stage is in the file `in00`. And yes, that's
|
The code for the compiler for this stage is in the file `in00`. And yes, that's
|
||||||
an input to our previous program, `hexcompile`, from stage 00! To compile it,
|
an input to our [previous program](../00/README.html), `hexcompile`, from stage 00! To compile it,
|
||||||
run `../00/hexcompile` from this directory. You will get a file, `out00`. That
|
run `../00/hexcompile` from this directory. You will get a file, `out00`. That
|
||||||
is the executable for this stage's compiler. Run it (it'll read from the file
|
is the executable for this stage's compiler. Run it (it'll read from the file
|
||||||
`in01` I've provided) and you'll get a file `out01`. That executable will print
|
`in01` I've provided) and you'll get a file `out01`. That executable will print
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
# stage 02
|
# stage 02
|
||||||
|
|
||||||
The compiler for this stage is in the file `in01`, an input for our previous compiler.
|
The compiler for this stage is in the file `in01`, an input for our [previous compiler](../01/README.md).
|
||||||
So if you run `../01/out00`, you'll get the file `out01`, which is
|
So if you run `../01/out00`, you'll get the file `out01`, which is
|
||||||
this stage's compiler.
|
this stage's compiler.
|
||||||
The specifics of how this compiler works are in the comments in `in01`, but here I'll
|
The specifics of how this compiler works are in the comments in `in01`, but here I'll
|
||||||
|
@ -187,5 +187,5 @@ if you use a label without defining it, it uses address 0, rather than outputtin
|
||||||
an error message. This could be fixed: if the value in the label table is 0 and we are
|
an error message. This could be fixed: if the value in the label table is 0 and we are
|
||||||
on the second pass, output an error message. Also, duplicate labels aren't detected.
|
on the second pass, output an error message. Also, duplicate labels aren't detected.
|
||||||
|
|
||||||
But thanks to labels, for future compilers at least we won't have to calculate
|
But thanks to labels, at least we won't have to calculate
|
||||||
any jump offsets manually.
|
any jump offsets manually anymore. With that, let's move on to [stage 03](../03/README.md).
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
all: out02 out03
|
all: out02 out03 README.html
|
||||||
out02: in02 ../02/out01
|
out02: in02 ../02/out01
|
||||||
../02/out01
|
../02/out01
|
||||||
out03: out02 in03
|
out03: out02 in03
|
||||||
|
|
168
03/README.md
Normal file
|
@ -0,0 +1,168 @@
|
||||||
|
# stage 03
|
||||||
|
The code for this compiler (the file `in02`, an input for our [stage 02 compiler](../02/README.md))
|
||||||
|
is 2700 lines—quite a bit larger than the previous ones. And as we'll see, it's a lot more powerful too.
|
||||||
|
To compile it, run `../02/out01` from this directory.
|
||||||
|
Let's take a look at `in03`, the example program I've written for it:
|
||||||
|
```
|
||||||
|
B=:hello_world
|
||||||
|
call :puts
|
||||||
|
; exit code 0
|
||||||
|
J=d0
|
||||||
|
syscall x3c
|
||||||
|
|
||||||
|
:hello_world
|
||||||
|
str Hello, world!
|
||||||
|
xa
|
||||||
|
x0
|
||||||
|
|
||||||
|
; output null-terminated string in rbx
|
||||||
|
:puts
|
||||||
|
R=B
|
||||||
|
call :strlen
|
||||||
|
D=A
|
||||||
|
I=R
|
||||||
|
J=d1
|
||||||
|
syscall d1
|
||||||
|
return
|
||||||
|
|
||||||
|
; calculate length of string in rbx
|
||||||
|
:strlen
|
||||||
|
; keep pointer to start of string
|
||||||
|
D=B
|
||||||
|
I=B
|
||||||
|
:strlen_loop
|
||||||
|
C=1I
|
||||||
|
?C=0:strlen_loop_end
|
||||||
|
I+=d1
|
||||||
|
!:strlen_loop
|
||||||
|
:strlen_loop_end
|
||||||
|
I-=D
|
||||||
|
A=I
|
||||||
|
return
|
||||||
|
```
|
||||||
|
This language looks a lot nicer than the previous one. No more obscure two-letter label names
|
||||||
|
and commands! Furthermore, try changing `:strlen_loop` on line 31
|
||||||
|
to a typo like `:strlen_lop`. You should get:
|
||||||
|
```
|
||||||
|
Bad label 001f
|
||||||
|
```
|
||||||
|
Not only do we get an error message, we also get the line number
|
||||||
|
of the error! It's in hexadecimal, unfortunately, but that's
|
||||||
|
better than nothing.
|
||||||
|
|
||||||
|
I spent a while on this compiler (perhaps I went a bit overboard
|
||||||
|
on the features), because the 02 language
|
||||||
|
was the first that was actually pleasant to use!
|
||||||
|
It's much less sophisticated than even most assembly languages,
|
||||||
|
but being able to use labels without having to worry about filling
|
||||||
|
in the offsets later made it way nicer to use than the previous
|
||||||
|
languages.
|
||||||
|
|
||||||
|
In addition to `in03`, this directory also has `ex03`,
|
||||||
|
which gives examples of all of the instructions supported by this compiler.
|
||||||
|
|
||||||
|
Seeing as this is a relatively large compiler,
|
||||||
|
here is an overview of how it works:
|
||||||
|
|
||||||
|
## functions
|
||||||
|
|
||||||
|
Thanks to labels, we can actually use functions in this compiler, without
|
||||||
|
it being a complete nightmare. Functions are called like this:
|
||||||
|
```
|
||||||
|
im
|
||||||
|
--fu
|
||||||
|
cl (this would call the function ::fu)
|
||||||
|
```
|
||||||
|
and each function ends with `re`, which returns to the caller.
|
||||||
|
I've used the convention of storing return values in `rax` and
|
||||||
|
passing the argument to a unary function in `rbx`.
|
||||||
|
|
||||||
|
This compiler ended up having a lot of functions, some of them used in all sorts
|
||||||
|
of different places.
|
||||||
|
|
||||||
|
## execution
|
||||||
|
|
||||||
|
Just as with the 02 compiler, we need two passes:
|
||||||
|
the first one
|
||||||
|
computes the address of each label,
|
||||||
|
and the second one uses the correct addresses to
|
||||||
|
write the executable.
|
||||||
|
|
||||||
|
Each pass is a loop, which starts by incrementing
|
||||||
|
the line number (`::L#`). Then we read in a line
|
||||||
|
from the source file, `in03`. This is done one character
|
||||||
|
at a time, until a newline is reached. The line is stored
|
||||||
|
in the buffer `::LI`. In the remainder of the program we
|
||||||
|
(mostly) use the fact that the line is newline-terminated,
|
||||||
|
rather than keeping track of how long it is.
|
||||||
|
|
||||||
|
Once the line is read in, a bunch of tests are performed on it.
|
||||||
|
We start by looking at the first character: if it's a `;`,
|
||||||
|
the line is a comment; if it's a `!`, it's an unconditional jump; etc.
|
||||||
|
Failing that, we look at the second character, to see if it's
|
||||||
|
`=`, `+=`, `-=`, etc. If it doesn't match any of them, we use
|
||||||
|
the `::s=` (string equals) function, which conveniently lets you
|
||||||
|
set the terminator. We check if the line is equal to `"syscall"`
|
||||||
|
up to a terminator of `' '` to check if it's a syscall, for example.
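The order of tests described above can be sketched in Python. This is only an illustration, not the code in `in02`: the names `matches_keyword` and `classify` are made up, only a handful of statement kinds are covered, and each line is assumed to be newline-terminated as in the `::LI` buffer.

```python
def matches_keyword(line: str, keyword: str, terminator: str) -> bool:
    """Compare the start of `line`, up to `terminator`, against `keyword`.

    A rough stand-in for the string-equals function with a caller-chosen
    terminator.
    """
    end = line.find(terminator)
    return end != -1 and line[:end] == keyword


def classify(line: str) -> str:
    """Return a rough statement kind for one newline-terminated line."""
    first = line[0]
    # Step 1: tests on the first character.
    if first == "\n":
        return "blank"
    if first == ";":
        return "comment"
    if first == ":":
        return "label"
    if first == "!":
        return "jump"
    if first == "?":
        return "conditional jump"
    # Step 2: failing that, look at what follows the first character.
    if line[1] == "=":
        return "assignment"
    if line[1:3] == "+=":
        return "add"
    if line[1:3] == "-=":
        return "subtract"
    # Step 3: fall back to keyword comparison with a chosen terminator.
    if matches_keyword(line, "syscall", " "):
        return "syscall"
    if matches_keyword(line, "call", " "):
        return "call"
    if matches_keyword(line, "return", "\n"):
        return "return"
    return "unknown"
```

For instance, `classify("syscall x3c\n")` takes the third step and compares everything before the first space against `"syscall"`.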
|
||||||
|
|
||||||
|
## `+=`, et al.
|
||||||
|
|
||||||
|
We can emit the correct instructions for `D+=C` with:
|
||||||
|
|
||||||
|
- `mov rbx, rdx`
|
||||||
|
- `mov rax, rcx`
|
||||||
|
- `add rax, rbx`
|
||||||
|
- `mov rdx, rax`
|
||||||
|
|
||||||
|
A similar pattern can be used for `-=`, `&=`, etc.
|
||||||
|
This made it pretty easy to write the implementation of all of these:
|
||||||
|
there's one function for setting `rbx` to the first operand (`::B1`),
|
||||||
|
another for setting `rax` to the second operand (`::A2`), and another for
|
||||||
|
setting the first operand to `rax` (`::1A`). The implementations of
|
||||||
|
`+=`/`-=`/etc. just call those three functions, with a bit of stuff in between
|
||||||
|
to perform the corresponding operation.
|
||||||
|
A similar approach also works for loading/storing values in memory.
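As a rough Python sketch of that three-function pattern: the helpers below return assembly text rather than machine code, and their names are hypothetical stand-ins for `::B1`, `::A2` and `::1A`. The register letters come from `ex03`. Only commutative operations are included, because with `rbx` holding the first operand and `rax` the second, the order matters for subtraction and the middle step for it isn't spelled out here.

```python
# Register letters as listed at the top of ex03.
REGISTERS = {
    "A": "rax", "B": "rbx", "C": "rcx", "D": "rdx",
    "I": "rsi", "J": "rdi", "S": "rsp", "R": "rbp",
}

# Only commutative operations (an assumption made to keep the sketch safe).
OPERATIONS = {"+=": "add", "&=": "and", "|=": "or", "^=": "xor"}


def load_first_into_rbx(first: str) -> str:
    """Stand-in for ::B1: set rbx to the first operand."""
    return f"mov rbx, {REGISTERS[first]}"


def load_second_into_rax(second: str) -> str:
    """Stand-in for ::A2: set rax to the second operand."""
    return f"mov rax, {REGISTERS[second]}"


def store_rax_into_first(first: str) -> str:
    """Stand-in for ::1A: set the first operand to rax."""
    return f"mov {REGISTERS[first]}, rax"


def compile_operation(statement: str) -> list[str]:
    """Translate a statement such as 'D+=C' into four instructions."""
    first, operator, second = statement[0], statement[1:3], statement[3]
    return [
        load_first_into_rbx(first),
        load_second_into_rax(second),
        f"{OPERATIONS[operator]} rax, rbx",  # the "bit of stuff in between"
        store_rax_into_first(first),
    ]
```

Calling `compile_operation("D+=C")` gives the same four instructions as the list above.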
|
||||||
|
|
||||||
|
## label list
|
||||||
|
|
||||||
|
Instead of a label table, we now have a "label list" (or array
|
||||||
|
if you prefer) at `::LB`.
|
||||||
|
A pointer to the current end of the list is stored at `::L$`.
|
||||||
|
Each entry is the name of the label, including the `:`, then a newline,
|
||||||
|
then the 4-byte address.
|
||||||
|
`::ll` is used to look up labels. If it's the first pass,
|
||||||
|
`::ll` just returns 0. Otherwise, it looks up the label by
|
||||||
|
comparing it to each entry using `::s=` with a terminator of `'\n'`.
|
||||||
|
If no label matches, we get an error.
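Here is a small Python sketch of that layout and lookup. It is not the code in `in02`: the function names are invented, a `bytearray` stands in for the memory at `::LB`, and the 4-byte address is assumed to be little-endian (the natural choice on x86-64, but not stated above).

```python
import struct


def add_label(label_list: bytearray, name: bytes, address: int) -> None:
    """Append one entry: the name (including ':'), a newline, a 4-byte address."""
    label_list.extend(name + b"\n" + struct.pack("<I", address))


def lookup_label(label_list: bytearray, name: bytes, first_pass: bool) -> int:
    """Return the address of `name`, or 0 on the first pass."""
    if first_pass:
        # Addresses aren't known yet, so any placeholder will do.
        return 0
    position = 0
    while position < len(label_list):
        # The name runs up to the newline terminator.
        newline = label_list.index(b"\n", position)
        if label_list[position:newline] == name:
            (address,) = struct.unpack_from("<I", label_list, newline + 1)
            return address
        # Skip the newline and the 4 address bytes to reach the next entry.
        position = newline + 5
    raise LookupError(f"Bad label {name.decode()}")
```

Because each entry is skipped by a fixed 5 bytes after its newline, an address that happens to contain the byte `0x0a` doesn't confuse the search.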
|
||||||
|
|
||||||
|
## alignment
|
||||||
|
A lot of data used in this program is
|
||||||
|
[not correctly aligned](https://en.wikipedia.org/wiki/Bus_error#Unaligned_access)—e.g.
|
||||||
|
8-byte values are not always stored at an address that is a multiple of 8.
|
||||||
|
This would be a problem on some processors, but x86-64 can handle it.
|
||||||
|
It's still not a good idea in practice—reading unaligned memory
|
||||||
|
can be slower, especially when an access crosses a cache line. But we're not really concerned about performance here,
|
||||||
|
and it would be a bit finicky to align everything correctly.
|
||||||
|
However, I have introduced `align` into this language,
|
||||||
|
which you can put before a label to ensure that its address is aligned
|
||||||
|
to 8 bytes.
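For reference, rounding an address up to the next multiple of 8 is a one-liner. This Python sketch shows the arithmetic only; how `align` is actually implemented in `in02` isn't described here, and the function names are made up.

```python
def align_to_8(address: int) -> int:
    """Round `address` up to the next multiple of 8 (unchanged if aligned)."""
    return (address + 7) & ~7


def padding_needed(address: int) -> int:
    """Number of padding bytes needed before an 8-byte-aligned label."""
    return align_to_8(address) - address
```

So an address of 13 would need 3 bytes of padding to reach 16, and an address that is already a multiple of 8 needs none.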
|
||||||
|
|
||||||
|
## errors
|
||||||
|
|
||||||
|
Errors are handled in functions beginning with `!`, e.g. `::!n` for "bad number".
|
||||||
|
Each of these ends up calling `::er`. `::er` prints
|
||||||
|
a string specific to the type of error, then
|
||||||
|
converts the line number to a string, and prints it.
|
||||||
|
The line number is always converted to a 4-digit hexadecimal number.
|
||||||
|
This means it won't fully work past 65,535 lines, but
|
||||||
|
let's hope we don't need to write any programs that long!
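The message format can be sketched in Python as follows. The function name is hypothetical, and keeping only the low 16 bits of the line number is an assumption about what happens past 65,535 lines, not something stated above.

```python
def format_error(message: str, line_number: int) -> str:
    """Build an error string like 'Bad label 001f'."""
    # Assumption: only the low 16 bits of the line number are shown.
    return f"{message} {line_number & 0xFFFF:04x}"
```

With this, a bad label on line 31 produces `Bad label 001f`, matching the example earlier.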
|
||||||
|
|
||||||
|
## limitations
|
||||||
|
|
||||||
|
Functions in this 03 language will probably overwrite the previous values
|
||||||
|
of registers. This can make it kind of annoying to call functions, since
|
||||||
|
you need to make sure you store away any information you'll need after the function.
|
||||||
|
And the language definitely won't be as nice to use as something with real variables. But overall,
|
||||||
|
I'm very happy with this compiler, considering it's written in a language with 2-letter label
|
||||||
|
names.
|
||||||
|
|
113
03/ex03
|
@ -1,42 +1,87 @@
|
||||||
|
; You can use registers like variables: rax = A, rbx = B, rcx = C, rdx = D, rsi = I, rdi = J, rsp = S, rbp = R
|
||||||
|
; However, because of the way things are implemented, you should be careful about using A/B as variables:
|
||||||
|
; they sometimes might not work correctly, and will be overwritten by a lot of statements
|
||||||
|
|
||||||
|
; set register to...
|
||||||
|
; decimal
|
||||||
|
D=d123
|
||||||
|
; hexadecimal
|
||||||
|
D=x1ef
|
||||||
|
; another register
|
||||||
|
D=R we can have a comment here and in some other places. not after numbers or labels though.
|
||||||
|
; label address
|
||||||
|
D=:label
|
||||||
|
; add
|
||||||
D+=d4
|
D+=d4
|
||||||
|
D+=R
|
||||||
|
; subtract
|
||||||
|
D-=d123
|
||||||
|
D-=R
|
||||||
|
; left/right shift (only rcx is supported for variable shifts)
|
||||||
|
D<=C
|
||||||
|
D<=d33
|
||||||
|
D>=C
|
||||||
|
D>=x12
|
||||||
|
; arithmetic right shift
|
||||||
D]=d7
|
D]=d7
|
||||||
D]=C
|
D]=C
|
||||||
D^=C
|
; bitwise xor, or, and
|
||||||
D|=C
|
D^=R
|
||||||
D&=C
|
D|=R
|
||||||
~C
|
D&=R
|
||||||
B|=A
|
D^=d1
|
||||||
8D=C
|
D|=d1
|
||||||
A=1B
|
D&=d1
|
||||||
B>=d33
|
; bitwise not
|
||||||
call :funciton
|
; (this sets D to ~D)
|
||||||
x4b
|
~D
|
||||||
!:label
|
; dereference
|
||||||
?J<B:label
|
; set 8 bytes at rdx to rbp
|
||||||
:label
|
8D=R
|
||||||
1B=C
|
; set 4 bytes at rdx to ebp
|
||||||
; :l ba b
|
4D=R
|
||||||
J=d0
|
2D=R
|
||||||
A=d60
|
1D=R
|
||||||
syscall x3c
|
; set rcx/ecx/cx/cl to 8/4/2/1 bytes at rdx
|
||||||
align
|
C=8D
|
||||||
:label
|
C=4D
|
||||||
reserve d1000
|
C=2D
|
||||||
B+=J
|
C=1D
|
||||||
B<=d9
|
; call a function
|
||||||
B-=J
|
call :function
|
||||||
?J=B:label
|
; return
|
||||||
?A!B:label
|
|
||||||
?A>B:label
|
|
||||||
A=:label
|
|
||||||
x3c
|
|
||||||
return
|
return
|
||||||
|
; label declarations
|
||||||
|
;:function
|
||||||
|
;:label
|
||||||
|
; literal byte
|
||||||
|
x4b
|
||||||
|
'H
|
||||||
|
'i
|
||||||
|
; string
|
||||||
|
str This text will appear in the executable!
|
||||||
|
; unconditional jump
|
||||||
|
!:label
|
||||||
|
; conditional jump
|
||||||
|
?R<S:label
|
||||||
|
?R=S:label
|
||||||
|
?R!S:label
|
||||||
|
?R>S:label
|
||||||
|
; (unsigned comparisons above/below)
|
||||||
|
?RaS:label
|
||||||
|
?RbS:label
|
||||||
|
; syscall
|
||||||
|
syscall x3c
|
||||||
|
; align to 8 bytes
|
||||||
|
align
|
||||||
|
; reserve some number of bytes of memory
|
||||||
|
reserve d1000
|
||||||
|
; signed/unsigned multiply/divide
|
||||||
imul
|
imul
|
||||||
idiv
|
idiv
|
||||||
mul
|
mul
|
||||||
div
|
div
|
||||||
:funciton
|
; e.g. to compute 5*3 into rcx (note rdx is wiped in the process):
|
||||||
call A
|
A=d5
|
||||||
str Here is some text which will be put in the executable!
|
B=d3
|
||||||
?CaD:label
|
mul
|
||||||
|
|
||||||
|
|
3
03/in02
|
@ -2886,6 +2886,9 @@ jm
|
||||||
~~
|
~~
|
||||||
::LI line buffer
|
::LI line buffer
|
||||||
~~
|
~~
|
||||||
|
~~
|
||||||
|
~~
|
||||||
|
~~
|
||||||
::L$ end of current label list
|
::L$ end of current label list
|
||||||
--LB
|
--LB
|
||||||
::LB labels
|
::LB labels
|
||||||
|
|
6
03/in03
|
@ -1,6 +1,6 @@
|
||||||
; write to stdout
|
|
||||||
B=:hello_world
|
B=:hello_world
|
||||||
call :puts
|
call :puts
|
||||||
|
; exit code 0
|
||||||
J=d0
|
J=d0
|
||||||
syscall x3c
|
syscall x3c
|
||||||
|
|
||||||
|
@ -11,15 +11,15 @@ x0
|
||||||
|
|
||||||
; output null-terminated string in rbx
|
; output null-terminated string in rbx
|
||||||
:puts
|
:puts
|
||||||
|
R=B
|
||||||
call :strlen
|
call :strlen
|
||||||
I=D
|
|
||||||
D=A
|
D=A
|
||||||
|
I=R
|
||||||
J=d1
|
J=d1
|
||||||
syscall d1
|
syscall d1
|
||||||
return
|
return
|
||||||
|
|
||||||
; calculate length of string in rbx
|
; calculate length of string in rbx
|
||||||
; keeps pointer to start of string in rdx, end of string in rsi
|
|
||||||
:strlen
|
:strlen
|
||||||
; keep pointer to start of string
|
; keep pointer to start of string
|
||||||
D=B
|
D=B
|
||||||
|
|
|
@ -24,6 +24,7 @@ hexadecimal digit pairs to a binary file.
|
||||||
- [stage 01](01/README.md) - a language with comments, and 2-character
|
- [stage 01](01/README.md) - a language with comments, and 2-character
|
||||||
command codes.
|
command codes.
|
||||||
- [stage 02](02/README.md) - a language with labels
|
- [stage 02](02/README.md) - a language with labels
|
||||||
|
- [stage 03](03/README.md) - a language with longer labels, better error messages, and less register manipulation
|
||||||
- more coming soon (hopefully)
|
- more coming soon (hopefully)
|
||||||
|
|
||||||
## prerequisite knowledge
|
## prerequisite knowledge
|
||||||
|
@ -93,10 +94,10 @@ compile GCC, say, and so all programs around today could be compromised. Of
|
||||||
course, this is practically definitely not the case, but it's still an
|
course, this is practically definitely not the case, but it's still an
|
||||||
interesting experiment to try to create a fully trustable compiler. This
|
interesting experiment to try to create a fully trustable compiler. This
|
||||||
project can't necessarily even do that though, because the Linux kernel, which
|
project can't necessarily even do that though, because the Linux kernel, which
|
||||||
we depend on, is compiled from C, so we can't fully trust *it*. To *truly*
|
we depend on, is compiled from C, so we can't fully trust *it*. To
|
||||||
create a fully trustable compiler, you'd need to manually write to a USB with a
|
create a *fully* trustable compiler, you'd need to manually write
|
||||||
circuit, create an operating system from nothing (without even a text editor),
|
an operating system to a USB key with a circuit or something,
|
||||||
and then follow this series, or maybe you don't even trust your CPU...
|
assuming you trust your CPU...
|
||||||
I'll leave that to someone else.
|
I'll leave that to someone else.
|
||||||
|
|
||||||
## license
|
## license
|
||||||
|
|