04b initial readme, guessing game, compiler fixes

pommicket 2022-01-06 23:29:59 -05:00
parent 3e73f6625c
commit 4cd2b7047c
8 changed files with 625 additions and 103 deletions

3
.gitignore vendored

@ -1,4 +1,7 @@
README.html README.html
out?? out??
out??? out???
*.out
tags
TAGS
markdown markdown

View file

@ -1,7 +1,11 @@
all: out03 all: out03 guessing_game.out out04b README.html
out03: in03 ../03/out02 out03: in03 ../03/out02
../03/out02 ../03/out02
%.html: %.md ../markdown %.html: %.md ../markdown
../markdown $< ../markdown $<
out04b: in04b out03
./out03
%.out: % out03
./out03 $< $@
clean: clean:
rm -f out* README.html rm -f out* README.html *.out

240
04b/README.md Normal file

@ -0,0 +1,240 @@
# stage 04b
As usual, the source for this compiler is `in03`, an input to the [previous compiler](../03/README.md).
`in04b` contains a hello world program written in the stage 4 language.
Here is the core of the program:
```
main()
function main
puts(.str_hello_world)
putc(10) ; newline
syscall(0x3c, 0)
```
As you can see, we can now pass arguments to functions. Now let's take a look at `putc`:
```
function putc
argument c
local p
p = &c
syscall(1, 1, p, 1)
return
```
It's so simple compared to previous languages! Rather than mess around with registers, we can now
declare local (and global) variables, and use them directly. Local variables are placed on the
stack. Since arguments are also placed on the stack,
by implementing local variables we get arguments for free. There is no difference
between the `local` and `argument` keywords in this language other than spelling.
In fact, the number of arguments in a function call is not checked against
the number of arguments the function declares. This does make it easy to screw things up by calling a function
with the wrong number of arguments, but it also means that we can provide a variable number of arguments
to the `syscall` function. Speaking of which, if you look at the bottom of `in04b`, you'll see:
```
function syscall
...
byte 0x48
byte 0x8b
byte 0x85
byte 0xf0
byte 0xff
byte 0xff
byte 0xff
...
```
Originally I was going to make `syscall` a built-in feature of the language, but then I realized that wasn't
necessary.
Instead, `syscall` is a function written manually in machine language.
We can take a look at its decompilation to make things clearer:
```
mov rax,[rbp-0x10]
mov rdi,rax
mov rax,[rbp-0x18]
mov rsi,rax
mov rax,[rbp-0x20]
mov rdx,rax
mov rax,[rbp-0x28]
mov r10,rax
mov rax,[rbp-0x30]
mov r8,rax
mov rax,[rbp-0x38]
mov r9,rax
mov rax,[rbp-0x8]
syscall
```
This just sets `rax`, `rdi`, `rsi`, etc. to the arguments the function was called with,
and then does a syscall.
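To make the mapping concrete, here is a small Python model of it. Python is only used for illustration, and `SYSCALL_REGISTERS`, `frame_offsets` and `registers_for_call` are made-up names, not part of the compiler. It assumes, as in the decompilation above, that the first argument lives at `rbp-8`, the second at `rbp-16`, and so on.

```python
# Registers in the order the syscall function fills them from its arguments.
# The first argument (the syscall number) ends up in rax.
SYSCALL_REGISTERS = ["rax", "rdi", "rsi", "rdx", "r10", "r8", "r9"]


def frame_offsets():
    """Return the rbp-relative offset each register is loaded from."""
    return {reg: -8 * (i + 1) for i, reg in enumerate(SYSCALL_REGISTERS)}


def registers_for_call(args):
    """Pair the arguments of a syscall(...) call with registers.

    Registers past the end of args are still loaded by the real function;
    they just hold whatever happened to be on the stack.
    """
    if len(args) > len(SYSCALL_REGISTERS):
        raise ValueError("too many arguments for a syscall")
    return dict(zip(SYSCALL_REGISTERS, args))


# syscall(0x3c, 0) from the hello world program: exit with status 0
print(registers_for_call([0x3c, 0]))
```

Since only the arguments that were actually passed matter to the kernel, the same function works for system calls taking anywhere from zero to six parameters.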
## functions and local variables
In this language, function arguments are placed onto the stack from left to right
and all arguments and local variables are 8 bytes.
As a reminder,
the stack is just an area of memory which is automatically extended downwards (on x86-64, at least).
So, how do we keep track of the location of local variables in the stack? We could do something like
this:
```
sub rsp, 24 ; make room for 3 variables
mov [rsp], 10 ; variable1 = 10
mov [rsp+8], 20 ; variable2 = 20
mov [rsp+16], 30 ; variable3 = 30
; ...
add rsp, 24 ; reset rsp
```
But now suppose that in the middle of the `; ...` code we want another local variable:
```
sub rsp, 8 ; make room for another variable
```
Well, since we've changed `rsp`, `variable1` is now at `rsp+8` instead of `rsp`,
`variable2` is at `rsp+16` instead of `rsp+8`, and
`variable3` is at `rsp+24` instead of `rsp+16`.
Also, we had better make sure we increment `rsp` by `32` now instead of `24`
to put it back in the right place.
It would be annoying (but by no means impossible) to keep track of all this.
We could just declare all local variables at the start of the function,
but that makes the language more annoying to use.
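The shifting offsets can be seen in a tiny Python model of the `rsp`-relative scheme. This is only a sketch: `RspFrame` is a hypothetical helper, and the starting value of `rsp` is arbitrary.

```python
class RspFrame:
    """Tracks where 8-byte variables live when they are addressed via rsp."""

    def __init__(self, rsp):
        self.initial_rsp = rsp
        self.rsp = rsp
        self.addresses = {}

    def reserve(self, names):
        """Model `sub rsp, 8*len(names)`; the first name gets the lowest address."""
        self.rsp -= 8 * len(names)
        for i, name in enumerate(names):
            self.addresses[name] = self.rsp + 8 * i

    def offset(self, name):
        """Offset of a variable from the current value of rsp."""
        return self.addresses[name] - self.rsp

    def cleanup_amount(self):
        """How much must be added to rsp to put it back where it started."""
        return self.initial_rsp - self.rsp


frame = RspFrame(0x7FFF0000)
frame.reserve(["variable1", "variable2", "variable3"])
before = {name: frame.offset(name) for name in frame.addresses}
frame.reserve(["variable4"])
after = {name: frame.offset(name) for name in frame.addresses}
print(before, after, frame.cleanup_amount())
```

Every existing offset grows by 8 after the second `reserve`, which is exactly the bookkeeping the compiler would have to redo for each new variable.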
Instead, we can use the `rbp` register to keep track of what `rsp` was
at the start of the function:
```
; save old value of rbp
sub rsp, 8
mov [rsp], rbp
; set rbp to initial value of rsp
mov rbp, rsp
lea rsp, [rbp-8] ; add variable1 (this instruction sets rsp to rbp-8)
mov [rbp-8], 10 ; variable1 = 10
lea rsp, [rbp-16] ; add variable2
mov [rbp-16], 20 ; variable2 = 20
lea rsp, [rbp-24] ; add variable3
mov [rbp-24], 30 ; variable3 = 30
; Note that variable1's address is still rbp-8; adding more variables didn't affect it.
; ...
; restore old values of rbp and rsp
mov rsp, rbp
mov rbp, [rsp]
add rsp, 8
```
This is actually the intended use of `rbp` (it *p*oints to the *b*ase of the stack frame).
Note that setting `rsp` to a fixed offset from `rbp`, rather than just doing `sub rsp, 8`, is important:
if we skip over some code with a local variable declaration, or execute a local declaration twice,
we want `rsp` to be in the right place.
The first three and last three instructions above are called the function *prologue* and *epilogue*.
They are the same for every function; a prologue is generated at the start of every function,
and an epilogue is generated for every return statement.
The return value is placed in `rax`.
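Here is a rough Python sketch of the bookkeeping a compiler could do for `local` declarations under this scheme. The names `encode_lea_rsp_rbp` and `declare_locals` are invented for the example and are not taken from the compiler source. The byte sequence `48 8d a5` followed by a little-endian 32-bit offset is the x86-64 encoding of `lea rsp, [rbp+imm32]`.

```python
import struct

# REX.W prefix, LEA opcode, ModRM byte selecting rsp <- [rbp + disp32]
LEA_RSP_RBP = bytes([0x48, 0x8D, 0xA5])


def encode_lea_rsp_rbp(offset):
    """Machine code for `lea rsp, [rbp+offset]` (offset may be negative)."""
    return LEA_RSP_RBP + struct.pack("<i", offset)


def declare_locals(names):
    """Give each 8-byte local a fixed rbp-relative offset.

    Returns the offset table and the code emitted for the declarations.
    """
    offsets = {}
    code = b""
    for index, name in enumerate(names, start=1):
        offsets[name] = -8 * index
        code += encode_lea_rsp_rbp(offsets[name])
    return offsets, code


offsets, code = declare_locals(["variable1", "variable2", "variable3"])
print(offsets)
print(code.hex())
```

Because each declaration sets `rsp` to an absolute position relative to `rbp`, running the same declaration twice (say, inside a loop) leaves `rsp` unchanged the second time.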
## global variables
Global variables are much simpler than local ones. The variable `:static_memory_end` in the compiler
keeps track of where to put the next global variable in memory. It is initialized at address `0x480000`,
which gives us 512KB for code (and strings). When a global variable is added, `:static_memory_end` is increased
by its size.
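This is a classic bump allocator, and can be sketched in a few lines of Python. `StaticMemory` is a hypothetical name; the start address is a parameter, and `0x480000` below is the value `in03` stores into `:static_memory_end` in this commit.

```python
class StaticMemory:
    """Hands out addresses for globals by bumping a single end pointer."""

    def __init__(self, start):
        self.end = start
        self.addresses = {}

    def add_global(self, name, size=8):
        """Place a global at the current end and move the end past it."""
        address = self.end
        self.addresses[name] = address
        self.end += size
        return address


memory = StaticMemory(0x480000)
# The first two declarations in guessing_game:
#   global 0x1000 exit_code
#   global y
memory.add_global("exit_code", 0x1000)
memory.add_global("y")
print({name: hex(addr) for name, addr in memory.addresses.items()}, hex(memory.end))
```

Nothing is ever freed, so a global's address is fixed the moment it is declared.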
## language description
Comments begin with `;` and may be put at the end of lines
with or without code.
Blank lines are ignored.
To make the compiler simpler, this language doesn't support fancy
expressions like `2 * (3 + 5) / 6`. There is a limited set of possible
expressions: *terms* and *rvalues*, both described below.
But first, each program is made up of a series of statements, and
each statement is one of the following:
- `global {name}` or `global {size} {name}` - declare a global variable with the given size, or 8 bytes if none is provided.
- `local {name}` - declare a local variable
- `argument {name}` - declare a function argument. this is functionally equivalent to `local`, so it just exists for readability.
- `function {name}` - declare a function
- `:{name}` - declare a label
- `goto {label}` - jump to the specified label
- `if {term} {operator} {term} goto {label}` -
conditionally jump to the specified label. `{operator}` should be one of
`==`, `<`, `>`, `>=`, `<=`, `!=`, `[`, `]`, `[=`, `]=`
(the last four do unsigned comparisons).
- `{lvalue} = {rvalue}` - set `lvalue` to `rvalue`
- `{lvalue} += {rvalue}` - add `rvalue` to `lvalue`
- `{lvalue} -= {rvalue}` - etc.
- `{lvalue} *= {rvalue}`
- `{lvalue} /= {rvalue}`
- `{lvalue} %= {rvalue}`
- `{lvalue} &= {rvalue}`
- `{lvalue} |= {rvalue}`
- `{lvalue} ^= {rvalue}`
- `{lvalue} <= {rvalue}` - left shift `lvalue` by `rvalue`
- `{lvalue} >= {rvalue}` - right shift `lvalue` by `rvalue`
- `{function}({term}, {term}, ...)` - function call, ignoring the return value
- `return {rvalue}`
- `string {str}` - places a literal string in the code
- `byte {number}` - places a literal byte in the code
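The difference between the signed and unsigned comparison operators in the `if` statement only shows up when the top bit of a value is set. Here is a hedged Python sketch of the idea; `compare` is a made-up helper, and it assumes that `[`, `]`, `[=`, `]=` are the unsigned counterparts of `<`, `>`, `<=`, `>=` respectively, and that values are 64 bits wide.

```python
import operator

MASK64 = (1 << 64) - 1

SIGNED = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
          ">": operator.gt, "<=": operator.le, ">=": operator.ge}
# Assumed pairing of the bracket operators with their signed look-alikes.
UNSIGNED = {"[": operator.lt, "]": operator.gt,
            "[=": operator.le, "]=": operator.ge}


def to_unsigned(value):
    """Reinterpret a value as an unsigned 64-bit integer."""
    return value & MASK64


def to_signed(value):
    """Reinterpret a value as a signed (2's complement) 64-bit integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def compare(op, a, b):
    """Evaluate `a op b` the way an `if ... goto` condition would."""
    if op in UNSIGNED:
        return UNSIGNED[op](to_unsigned(a), to_unsigned(b))
    return SIGNED[op](to_signed(a), to_signed(b))


print(compare("<", -1, 1), compare("[", -1, 1))
```

As a signed number `-1` is less than `1`, but its bit pattern is all ones, so as an unsigned number it is the largest possible value.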
Now let's get down into the weeds:
A *number* is one of:
- `{decimal number}` - e.g. `108` (note: there's no `d` prefix anymore)
- `0x{hexadecimal number}` - e.g. `0x2f` for 47
- `'{character}` - e.g. `'a` for 97 (the character code for `a`)
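The three number formats are easy to tell apart by their first characters. As an illustration only (the compiler does this in the 03 language, not Python, and `parse_number` is a hypothetical name), a parser for them might look like:

```python
def parse_number(text):
    """Parse a number literal: decimal, 0x-prefixed hex, or 'c character."""
    if text.startswith("'"):
        if len(text) != 2:
            raise ValueError("character literal must be ' plus one character")
        return ord(text[1])
    if text.startswith("0x"):
        return int(text[2:], 16)
    return int(text, 10)


for literal in ["108", "0x2f", "'a"]:
    print(literal, parse_number(literal))
```

Note that a character literal has no closing quote, so `'0` is the character code for the digit zero, as used in `c += '0` in the guessing game's `itos`.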
A *term* is one of:
- `{variable name}` - the value of a (local or global) variable
- `.{label name}` - the address of a label
- `{number}`
An *lvalue* is the left-hand side of an assignment expression,
and it is one of:
- `{variable}`
- `*1{variable}` - dereference 1 byte
- `*2{variable}` - dereference 2 bytes
- `*4{variable}` - dereference 4 bytes
- `*8{variable}` - dereference 8 bytes
An *rvalue* is an expression, which can be more complicated than a term.
An rvalue is one of:
- `{term}`
- `&{variable}` - address of variable
- `*1{variable}` / `*2{variable}` / `*4{variable}` / `*8{variable}` - dereference 1, 2, 4, or 8 bytes
- `~{term}` - bitwise not
- `{function}({term}, {term}, ...)`
- `{term} + {term}`
- `{term} - {term}`
- `{term} * {term}`
- `{term} / {term}`
- `{term} % {term}`
- `{term} & {term}`
- `{term} | {term}`
- `{term} ^ {term}`
- `{term} < {term}` - left shift
- `{term} > {term}` - right shift
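Since an rvalue holds at most one operator, an expression like `2 * (3 + 5) / 6` has to be spread over several statements. The Python sketch below models the binary operators on 64-bit values. It is an approximation under stated assumptions: `eval_binary` is a made-up name, `/` and `%` are taken to be signed and to truncate toward zero (the comments in earlier versions of `in04b` mention `imul` and `idiv`), `>` is taken to be an unsigned shift, and shift counts are restricted to 0 through 63.

```python
MASK64 = (1 << 64) - 1


def wrap64(value):
    """Wrap a Python integer to a signed 64-bit value."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def trunc_div(a, b):
    """Signed division that rounds toward zero (Python's // rounds down)."""
    quotient = abs(a) // abs(b)
    return -quotient if (a < 0) != (b < 0) else quotient


def eval_binary(op, a, b):
    """Evaluate `a op b` for the binary rvalue operators."""
    if op in ("<", ">") and not 0 <= b < 64:
        raise ValueError("shift count outside 0..63 is not modelled")
    results = {
        "+": lambda: a + b,
        "-": lambda: a - b,
        "*": lambda: a * b,
        "/": lambda: trunc_div(a, b),
        "%": lambda: a - b * trunc_div(a, b),
        "&": lambda: a & b,
        "|": lambda: a | b,
        "^": lambda: a ^ b,
        "<": lambda: a << b,
        ">": lambda: (a & MASK64) >> b,
    }
    return wrap64(results[op]())


# 2 * (3 + 5) / 6, one operator per statement:
t = eval_binary("+", 3, 5)   # t = 3 + 5
t = eval_binary("*", 2, t)   # t = 2 * t
t = eval_binary("/", t, 6)   # t = t / 6
print(t)
```

In the language itself the last two steps could also be written with the compound assignments `t *= 2` and `t /= 6`.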
That's quite a lot of stuff, and it makes for a pretty powerful
language, all things considered. To test out the language,
in addition to the hello world program, I also wrote a little
guessing game, which you can find in the file `guessing_game`.
It ended up being quite nice to write!
## limitations
Variables in this language do not have types. This makes it very easy to make mistakes like
treating numbers as pointers or vice versa.
A big annoyance with this language is the lack of local label names. Due to the limited nature
of branching in this language (`if ... goto ...` stands in for `if`, `else if`, `while`, etc.),
you need to use a lot of labels, and that means their names can get quite long. But at least unlike
the 03 language, you'll get an error if you use the same label name twice!
Overall, though, this language ended up being surprisingly powerful. With any luck, the next stage will
finally be a C compiler...

238
04b/guessing_game Normal file

@ -0,0 +1,238 @@
global 0x1000 exit_code
global y
y = 4
exit_code = main()
exit(exit_code)
function main
local secret_number
local guess
global 32 input_line
local p_line
p_line = &input_line
secret_number = getrand(100)
fputs(1, .str_intro)
:guess_loop
fputs(1, .str_guess)
syscall(0, 0, p_line, 30)
guess = stoi(p_line)
if guess < secret_number goto too_low
if guess > secret_number goto too_high
fputs(1, .str_got_it)
return 0
:too_low
fputs(1, .str_too_low)
goto guess_loop
:too_high
fputs(1, .str_too_high)
goto guess_loop
:str_intro
string I'm thinking of a number.
byte 10
byte 0
:str_guess
string Guess what it is:
byte 32
byte 0
:str_got_it
string You got it!
byte 10
byte 0
:str_too_low
string Too low!
byte 10
byte 0
:str_too_high
string Too high!
byte 10
byte 0
; get a "random" number from 0 to x using the system clock
function getrand
argument x
global 16 getrand_time
local ptime
local n
ptime = &getrand_time
syscall(228, 1, ptime)
ptime += 8 ; nanoseconds at offset 8 in struct timespec
n = *4ptime
n %= x
return n
; returns a pointer to a null-terminated string containing the number given
function itos
global 32 itos_string
argument x
local c
local p
p = &itos_string
p += 30
:itos_loop
c = x % 10
c += '0
*1p = c
x /= 10
if x == 0 goto itos_loop_end
p -= 1
goto itos_loop
:itos_loop_end
return p
; returns the number at the start of the given string
function stoi
argument s
local p
local n
local c
n = 0
p = s
:stoi_loop
c = *1p
if c < '0 goto stoi_loop_end
if c > '9 goto stoi_loop_end
n *= 10
n += c - '0
p += 1
goto stoi_loop
:stoi_loop_end
return n
function strlen
argument s
local c
local p
p = s
:strlen_loop
c = *1p
if c == 0 goto strlen_loop_end
p += 1
goto strlen_loop
:strlen_loop_end
return p - s
function fputs
argument fd
argument s
local length
length = strlen(s)
syscall(1, fd, s, length)
return
function fputn
argument fd
argument n
local s
s = itos(n)
fputs(fd, s)
return
function exit
argument status_code
syscall(0x3c, status_code)
function syscall
; I've done some testing, and this should be okay even if
; rbp-56 goes beyond the end of the stack.
; mov rax, [rbp-16]
byte 0x48
byte 0x8b
byte 0x85
byte 0xf0
byte 0xff
byte 0xff
byte 0xff
; mov rdi, rax
byte 0x48
byte 0x89
byte 0xc7
; mov rax, [rbp-24]
byte 0x48
byte 0x8b
byte 0x85
byte 0xe8
byte 0xff
byte 0xff
byte 0xff
; mov rsi, rax
byte 0x48
byte 0x89
byte 0xc6
; mov rax, [rbp-32]
byte 0x48
byte 0x8b
byte 0x85
byte 0xe0
byte 0xff
byte 0xff
byte 0xff
; mov rdx, rax
byte 0x48
byte 0x89
byte 0xc2
; mov rax, [rbp-40]
byte 0x48
byte 0x8b
byte 0x85
byte 0xd8
byte 0xff
byte 0xff
byte 0xff
; mov r10, rax
byte 0x49
byte 0x89
byte 0xc2
; mov rax, [rbp-48]
byte 0x48
byte 0x8b
byte 0x85
byte 0xd0
byte 0xff
byte 0xff
byte 0xff
; mov r8, rax
byte 0x49
byte 0x89
byte 0xc0
; mov rax, [rbp-56]
byte 0x48
byte 0x8b
byte 0x85
byte 0xc8
byte 0xff
byte 0xff
byte 0xff
; mov r9, rax
byte 0x49
byte 0x89
byte 0xc1
; mov rax, [rbp-8]
byte 0x48
byte 0x8b
byte 0x85
byte 0xf8
byte 0xff
byte 0xff
byte 0xff
; syscall
byte 0x0f
byte 0x05
return

149
04b/in03

@ -4,14 +4,18 @@ D=:global_variables
8C=D 8C=D
; initialize static_memory_end ; initialize static_memory_end
C=:static_memory_end C=:static_memory_end
; 0x40000 = 256KB for code ; 0x80000 = 512KB for code
D=x440000 D=x480000
8C=D 8C=D
; initialize labels_end ; initialize labels_end
C=:labels_end C=:labels_end
D=:labels D=:labels
8C=D 8C=D
I=8S
A=d2
?I>A:argv_file_names
; use default input/output filenames
; open input file ; open input file
J=:input_filename J=:input_filename
I=d0 I=d0
@ -25,6 +29,28 @@ D=:labels
syscall x2 syscall x2
J=A J=A
?J<0:output_file_error ?J<0:output_file_error
!:second_pass_starting_point
:argv_file_names
; open input file
J=S
; argv[1] is at *(rsp+16)
J+=d16
J=8J
I=d0
syscall x2
J=A
?J<0:input_file_error
; open output file
J=S
; argv[2] is at *(rsp+24)
J+=d24
J=8J
I=x241
D=x1ed
syscall x2
J=A
?J<0:output_file_error
:second_pass_starting_point :second_pass_starting_point
; write ELF header ; write ELF header
@ -161,15 +187,16 @@ call :string=
D=A D=A
?D!0:handle_if ?D!0:handle_if
; set delimiter to newline
C=xa
I=:line I=:line
J=:"function" J=:"function"
call :string= call :string=
D=A D=A
?D!0:handle_function ?D!0:handle_function
; set delimiter to newline
C=xa
I=:line I=:line
J=:"return\n" J=:"return\n"
call :string= call :string=
@ -203,6 +230,7 @@ I=:line
!:call_check_loop !:call_check_loop
:call_check_loop_end :call_check_loop_end
!:bad_statement
!:read_line !:read_line
@ -217,6 +245,7 @@ I=:line
J=d4 J=d4
I=:static_memory_end I=:static_memory_end
I=8I I=8I
I-=x400000
syscall x4d syscall x4d
; seek both files back to start ; seek both files back to start
J=d3 J=d3
@ -292,15 +321,6 @@ align
!:read_line !:read_line
:handle_local :handle_local
R=I
; emit sub rsp, 8
J=d4
I=:sub_rsp_8
D=d7
syscall x1
I=R
; skip ' ' ; skip ' '
I+=d1 I+=d1
@ -333,23 +353,36 @@ align
; update :local_variables_end ; update :local_variables_end
I=:local_variables_end I=:local_variables_end
8I=J 8I=J
; set rsp appropriately
C=:rbp_offset
J=d0
J-=D
4C=J
J=d4
I=:lea_rsp_[rbp_offset]
D=d7
syscall x1
; read the next line ; read the next line
!:read_line !:read_line
:sub_rsp_8 :lea_rsp_[rbp_offset]
x48 x48
x81 x8d
xec xa5
x08 :rbp_offset
x00 reserve d4
x00
x00
align align
:global_start :global_start
reserve d8 reserve d8
:global_variable_name :global_variable_name
reserve d8 reserve d8
:global_variable_size
reserve d8
:handle_global :handle_global
; ignore if this is the second pass ; ignore if this is the second pass
C=:second_pass C=:second_pass
@ -359,6 +392,27 @@ align
; skip ' ' ; skip ' '
I+=d1 I+=d1
C=1I
D='9
?C>D:global_default_size
; read specific size of global
call :read_number
D=A
C=:global_variable_size
8C=D
; check and skip space after number
C=1I
D=x20
?C!D:bad_number
I+=d1
!:global_cont
:global_default_size
; default size = 8
C=:global_variable_size
D=d8
8C=D
:global_cont
; store away pointer to variable name ; store away pointer to variable name
C=:global_variable_name C=:global_variable_name
8C=I 8C=I
@ -380,8 +434,11 @@ align
C=4D C=4D
4J=C 4J=C
J+=d4 J+=d4
; increase static_memory_end ; increase static_memory_end by size
C+=d8 D=:global_variable_size
D=8D
C+=D
D=:static_memory_end
4D=C 4D=C
; store null terminator ; store null terminator
1J=0 1J=0
@ -392,6 +449,12 @@ align
!:read_line !:read_line
:handle_function :handle_function
I=:line
; length of "function "
I+=d9
; make function name a label
call :add_label
; emit prologue ; emit prologue
J=d4 J=d4
I=:function_prologue I=:function_prologue
@ -450,14 +513,25 @@ align
; total length = 15 bytes ; total length = 15 bytes
:handle_label_definition :handle_label_definition
I=:line
I+=d1
call :add_label
!:read_line
align
:label_name
reserve d8
; add the label in rsi to the label list (with the current pc address)
:add_label
; ignore if this is the second pass ; ignore if this is the second pass
C=:second_pass C=:second_pass
C=1C C=1C
?C!0:read_line ?C!0:return_0
C=:label_name
8C=I
; make sure label only has identifier characters ; make sure label only has identifier characters
I=:line
I+=d1
:label_checking_loop :label_checking_loop
C=1I C=1I
D=xa D=xa
@ -470,8 +544,8 @@ align
!:bad_label !:bad_label
:label_checking_loop_end :label_checking_loop_end
I=:line C=:label_name
I+=d1 I=8C
J=:labels J=:labels
call :ident_lookup call :ident_lookup
C=A C=A
@ -479,8 +553,8 @@ align
J=:labels_end J=:labels_end
J=8J J=8J
I=:line C=:label_name
I+=d1 I=8C
call :ident_copy call :ident_copy
R=J R=J
@ -500,8 +574,7 @@ align
C=:labels_end C=:labels_end
8C=J 8C=J
; read the next line return
!:read_line
:handle_goto :handle_goto
J=d4 J=d4
@ -2004,6 +2077,15 @@ align
xa xa
x0 x0
:bad_statement
B=:bad_statement_error_message
!:program_error
:bad_statement_error_message
str Bad statement.
xa
x0
:bad_jump :bad_jump
B=:bad_jump_error_message B=:bad_jump_error_message
!:program_error !:program_error
@ -2205,6 +2287,7 @@ align
1J=D 1J=D
J-=d1 J-=d1
?I!0:eputn_loop ?I!0:eputn_loop
J+=d1
D=S D=S
D-=J D-=J
I=J I=J
@ -2271,7 +2354,7 @@ align
x20 x20
:"function" :"function"
str function str function
xa x20
:"==" :"=="
str == str ==
x20 x20

View file

@ -1,93 +1,42 @@
; declaration:
; global <name>
; local <name>
; argument <name>
; :<label>
; statement:
; <declaration>
; if <term> <==/</>/>=/<=/!=/[/]/[=/]=> <term> goto <label> NOTE: this uses signed comparisons
; goto <label>
; <lvalue> = <rvalue>
; <lvalue> += <rvalue>
; <lvalue> -= <rvalue>
; <function>(<term>, <term>, ...)
; return <rvalue>
; string <str>
; byte <number>
; term:
; <var>
; .<label>
; <number>
; number:
; 'c
; 12345
; 0xabc
; lvalue:
; <var>
; *1<var> / *2<var> / *4<var> / *8<var>
; rvalue:
; <term>
; &<var>
; *1<var> / *2<var> / *4<var> / *8<var>
; ~<term>
; <function>(<term>, <term>, ...)
; <term> + <term>
; <term> - <term>
; NOTE: *, /, % are signed (imul and idiv)
; <term> * <term>
; <term> / <term>
; <term> % <term>
; <term> & <term>
; <term> | <term>
; <term> ^ <term>
; <term> < <term> (left shift)
; <term> > <term> (unsigned right shift)
main() main()
:main function main
function
puts(.str_hello_world) puts(.str_hello_world)
putc(10) ; newline putc(10) ; newline
syscall(0x3c, 0) syscall(0x3c, 0)
:str_hello_world :str_hello_world
string Hello, world! string Hello, world!
byte 0 byte 0
:strlen function strlen
function
argument s argument s
local len
local c local c
local p local p
len = 0 p = s
:strlen_loop :strlen_loop
p = s + len
c = *1p c = *1p
if c == 0 goto strlen_loop_end if c == 0 goto strlen_loop_end
len += 1 p += 1
goto strlen_loop goto strlen_loop
:strlen_loop_end :strlen_loop_end
return len return p - s
:putc function putc
function
argument c argument c
local p local p
p = &c p = &c
syscall(1, 1, p, 1) syscall(1, 1, p, 1)
return return
:puts function puts
function
argument s argument s
local len local len
len = strlen(s) len = strlen(s)
syscall(1, 1, s, len) syscall(1, 1, s, len)
return return
:syscall function syscall
function
; I've done some testing, and this should be okay even if ; I've done some testing, and this should be okay even if
; rbp-56 goes beyond the end of the stack. ; rbp-56 goes beyond the end of the stack.
; mov rax, [rbp-16] ; mov rax, [rbp-16]

View file

@ -26,6 +26,8 @@ command codes.
- [stage 02](02/README.md) - a language with labels - [stage 02](02/README.md) - a language with labels
- [stage 03](03/README.md) - a language with longer labels, better error messages, and less register manipulation - [stage 03](03/README.md) - a language with longer labels, better error messages, and less register manipulation
- more coming soon (hopefully) - more coming soon (hopefully)
- [stage 04a](04a/README.md) - (interlude) a very simple preprocessor
- [stage 04b](04b/README.md) - a language with nice functions and local variables
## prerequisite knowledge ## prerequisite knowledge
@ -46,6 +48,7 @@ decimal.
- what a CPU is - what a CPU is
- what a CPU architecture is - what a CPU architecture is
- what a CPU register is - what a CPU register is
- what the (call) stack is
- bits, bytes, kilobytes, etc. - bits, bytes, kilobytes, etc.
- bitwise operations (not, or, and, xor, left shift, right shift) - bitwise operations (not, or, and, xor, left shift, right shift)
- 2's complement - 2's complement

View file

@ -43,6 +43,8 @@ mov rax, qword [rbp+imm32]
>48 8b 85 IMM32 (note: imm may be negative) >48 8b 85 IMM32 (note: imm may be negative)
lea rax, [rbp+imm32] lea rax, [rbp+imm32]
>48 8d 85 IMM32 (note: imm may be negative) >48 8d 85 IMM32 (note: imm may be negative)
lea rsp, [rbp+imm32]
>48 8d a5 IMM32 (note: imm may be negative)
mov qword [rbp+imm32], rax mov qword [rbp+imm32], rax
>48 89 85 IMM32 (note: imm may be negative) >48 89 85 IMM32 (note: imm may be negative)
mov qword [rsp+imm32], rax mov qword [rsp+imm32], rax