readme tweaks, mainly
commit 2288e47516 (parent 3255cd32d7)
13 changed files with 177 additions and 84 deletions
|
@ -3,3 +3,5 @@ out00: in00
|
||||||
./hexcompile
|
./hexcompile
|
||||||
%.html: %.md ../markdown
|
%.html: %.md ../markdown
|
||||||
../markdown $<
|
../markdown $<
|
||||||
|
clean:
|
||||||
|
rm -f out00 README.html
|
||||||
|
|
26
00/README.md
|
@ -102,7 +102,7 @@ execute-enabled. Normally people don't do this, for security, but we won't worry
|
||||||
about that (don't compile any untrusted code with any compiler from this series!)
|
about that (don't compile any untrusted code with any compiler from this series!)
|
||||||
Without further ado, here's the contents of the program header:
|
Without further ado, here's the contents of the program header:
|
||||||
|
|
||||||
- `01 00 00 00` Segment type 1 (this should be loaded into memory)
|
- `01 00 00 00` Segment type 1 (this segment should be loaded into memory)
|
||||||
- `07 00 00 00` Flags = RWE (readable, writeable, and executable)
|
- `07 00 00 00` Flags = RWE (readable, writeable, and executable)
|
||||||
- `78 00 00 00 00 00 00 00` Offset in file = 120 bytes
|
- `78 00 00 00 00 00 00 00` Offset in file = 120 bytes
|
||||||
- `78 00 40 00 00 00 00 00` Virtual address = 0x400078
|
- `78 00 40 00 00 00 00 00` Virtual address = 0x400078
|
||||||
|
@ -114,7 +114,7 @@ memory address that the segment will be loaded to.
|
||||||
Nowadays, computers use virtual memory, meaning that
|
Nowadays, computers use virtual memory, meaning that
|
||||||
addresses in our program don't actually correspond to where the memory is
|
addresses in our program don't actually correspond to where the memory is
|
||||||
physically stored in RAM (the CPU translates between virtual and physical
|
physically stored in RAM (the CPU translates between virtual and physical
|
||||||
memory addresses). There are many reasons for this: making sure each process has
|
addresses). There are many reasons for this: making sure each process has
|
||||||
its own memory space, memory protection, etc. You can read more about it
|
its own memory space, memory protection, etc. You can read more about it
|
||||||
elsewhere.
|
elsewhere.
|
||||||
|
|
||||||
|
@ -130,7 +130,7 @@ each page (block) of memory is 4096 bytes long, and has to start at an address
|
||||||
that is a multiple of 4096. Our program needs to be loaded into a memory page,
|
that is a multiple of 4096. Our program needs to be loaded into a memory page,
|
||||||
so its *virtual address* needs to be a multiple of 4096. We're using `0x400000`.
|
so its *virtual address* needs to be a multiple of 4096. We're using `0x400000`.
|
||||||
But wait! Didn't we use `0x400078` for the virtual address? Well, yes but that's
|
But wait! Didn't we use `0x400078` for the virtual address? Well, yes but that's
|
||||||
because the *data in the file* is loaded to address `0x400078`. The actual page
|
because the segment's data is loaded to address `0x400078`. The actual page
|
||||||
of memory that the OS will allocate for our segment will start at `0x400000`. The
|
of memory that the OS will allocate for our segment will start at `0x400000`. The
|
||||||
reason we need to start `0x78` bytes in is that Linux expects the data in the
|
reason we need to start `0x78` bytes in is that Linux expects the data in the
|
||||||
file to be at the same position in the page as when it will be loaded, and it
|
file to be at the same position in the page as when it will be loaded, and it
|
||||||
|
@ -156,7 +156,8 @@ These instructions execute syscall `2` with arguments `0x40026d`, `0`.
|
||||||
If you're familiar with C code, this is `open("in00", O_RDONLY)`.
|
If you're familiar with C code, this is `open("in00", O_RDONLY)`.
|
||||||
A syscall is the mechanism which lets software ask the kernel to do things.
|
A syscall is the mechanism which lets software ask the kernel to do things.
|
||||||
[Here](https://filippo.io/linux-syscall-table/) is a nice table of syscalls you
|
[Here](https://filippo.io/linux-syscall-table/) is a nice table of syscalls you
|
||||||
can look through if you're interested. You can also install `strace` (e.g. with
|
can look through if you're interested. You can also install
|
||||||
|
[strace](https://strace.io) (e.g. with
|
||||||
`sudo apt install strace`) and run `strace ./hexcompile` to see all the syscalls
|
`sudo apt install strace`) and run `strace ./hexcompile` to see all the syscalls
|
||||||
our program does.
|
our program does.
|
||||||
Syscall #2, on 64-bit Linux, is `open`. It's used to open a file. You can read
|
Syscall #2, on 64-bit Linux, is `open`. It's used to open a file. You can read
|
||||||
|
@ -175,13 +176,13 @@ descriptor Linux gave us. This is because Linux assigns file descriptor numbers
|
||||||
sequentially, starting from
|
sequentially, starting from
|
||||||
[0 for stdin, 1 for stdout, 2 for stderr](https://en.wikipedia.org/wiki/Standard_streams),
|
[0 for stdin, 1 for stdout, 2 for stderr](https://en.wikipedia.org/wiki/Standard_streams),
|
||||||
and then 3, 4, 5, ... for any files our program opens. So
|
and then 3, 4, 5, ... for any files our program opens. So
|
||||||
this file, the first one our program opens, will have descriptor `3`.
|
this file, the first one our program opens, will have descriptor 3.
|
||||||
|
|
||||||
Now we open our output file:
|
Now we open our output file:
|
||||||
|
|
||||||
- `48 b8 72 02 40 00 00 00 00 00` `mov rax, 0x400272`
|
- `48 b8 72 02 40 00 00 00 00 00` `mov rax, 0x400272`
|
||||||
- `48 89 c7` `mov rdi, rax`
|
- `48 89 c7` `mov rdi, rax`
|
||||||
- `48 b8 41 02 00 00 00 00 00 00` `mov rax, 0x41`
|
- `48 b8 41 02 00 00 00 00 00 00` `mov rax, 0x241`
|
||||||
- `48 89 c6` `mov rsi, rax`
|
- `48 89 c6` `mov rsi, rax`
|
||||||
- `48 b8 ed 01 00 00 00 00 00 00` `mov rax, 0o755`
|
- `48 b8 ed 01 00 00 00 00 00 00` `mov rax, 0o755`
|
||||||
- `48 89 c2` `mov rdx, rax`
|
- `48 89 c2` `mov rdx, rax`
|
||||||
|
@ -193,11 +194,12 @@ similar to our first call, with two important differences: first, we specify
|
||||||
`0x241` as the second argument. This tells Linux that we are writing to the
|
`0x241` as the second argument. This tells Linux that we are writing to the
|
||||||
file (`O_WRONLY = 0x01`), that we want to create it if it doesn't exist
|
file (`O_WRONLY = 0x01`), that we want to create it if it doesn't exist
|
||||||
(`O_CREAT = 0x40`), and that we want to delete any previous contents it had
|
(`O_CREAT = 0x40`), and that we want to delete any previous contents it had
|
||||||
(`O_TRUNC = 0x200`). Secondly, we are setting the third argument this time. It
|
(`O_TRUNC = 0x200`). Secondly, we're setting the third argument this time. It
|
||||||
specifies the permissions our file is created with (`0o755` means user
|
specifies the permissions our file is created with (`0o755` means user
|
||||||
read/write/execute, group/other read/execute). This is not very important to
|
read/write/execute, group/other read/execute). This is not very important to
|
||||||
the actual execution of the program, so don't worry if you don't know
|
the actual execution of the program, so don't worry if you don't know
|
||||||
about UNIX permissions.
|
about UNIX permissions.
|
||||||
|
Note that the output file's descriptor will be 4.
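
A quick sanity check of the flag arithmetic described above, as a Python sketch. The three flag values and the mode are the ones quoted in the README text; the variable names are only illustrative.

```python
# Flag values as quoted in the README (64-bit Linux).
O_WRONLY = 0x01
O_CREAT = 0x40
O_TRUNC = 0x200

# OR-ing the flags together gives the second argument to open.
flags = O_WRONLY | O_CREAT | O_TRUNC

# The third argument is the permission mode, written in octal.
mode = 0o755

print(hex(flags))
print(hex(mode))
```

The mode comes out as `0x1ed`, which is why the listing shows the bytes `ed 01` (little-endian) in the `mov rax, 0o755` instruction.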
|
||||||
|
|
||||||
Now we can start reading from the file. We're going to loop back to this part of
|
Now we can start reading from the file. We're going to loop back to this part of
|
||||||
the code every time we want to read a new hexadecimal number from the input
|
the code every time we want to read a new hexadecimal number from the input
|
||||||
|
@ -223,13 +225,13 @@ We're telling Linux to output to `0x40026a`, which is just a part of this
|
||||||
segment (see further down). Normally you would read to a different segment of
|
segment (see further down). Normally you would read to a different segment of
|
||||||
the program from where the code is, but we want this to be as simple as
|
the program from where the code is, but we want this to be as simple as
|
||||||
possible.
|
possible.
|
||||||
The number of bytes *actually read*, taking into account that we might have
|
The number of bytes *actually* read, taking into account that we might have
|
||||||
reached the end of the file, is stored in `rax`.
|
reached the end of the file, is stored in `rax`.
|
||||||
|
|
||||||
- `48 89 c3` `mov rbx, rax`
|
- `48 89 c3` `mov rbx, rax`
|
||||||
- `48 b8 03 00 00 00 00 00 00 00` `mov rax, 3`
|
- `48 b8 03 00 00 00 00 00 00 00` `mov rax, 3`
|
||||||
- `48 39 d8` `cmp rax, rbx`
|
- `48 39 d8` `cmp rax, rbx`
|
||||||
- `0f 8f 50 01 00 00` `jg 0x400250`
|
- `0f 8f 50 01 00 00` `jg +0x150 (0x400250)`
|
||||||
|
|
||||||
This tells the CPU to jump to a later part of the code (address `0x400250`) if 3
|
This tells the CPU to jump to a later part of the code (address `0x400250`) if 3
|
||||||
is greater than the number of bytes we got, in other words, if we reached the
|
is greater than the number of bytes we got, in other words, if we reached the
|
||||||
|
@ -307,7 +309,7 @@ Okay, now `rax` contains the byte specified by the two hex digits we read.
|
||||||
- `48 93` `xchg rax, rbx`
|
- `48 93` `xchg rax, rbx`
|
||||||
- `88 03` `mov byte [rbx], al`
|
- `88 03` `mov byte [rbx], al`
|
||||||
|
|
||||||
Write the byte to a specific memory location (address `0x40026c`).
|
Put the byte in a specific memory location (address `0x40026c`).
|
||||||
|
|
||||||
- `48 b8 04 00 00 00 00 00 00 00` `mov rax, 4`
|
- `48 b8 04 00 00 00 00 00 00 00` `mov rax, 4`
|
||||||
- `48 89 c7` `mov rdi, rax`
|
- `48 89 c7` `mov rdi, rax`
|
||||||
|
@ -356,7 +358,7 @@ This is where we conditionally jumped to way back when we determined if we
|
||||||
reached the end of the file. This calls syscall #60, `exit`, with one argument,
|
reached the end of the file. This calls syscall #60, `exit`, with one argument,
|
||||||
0 (exit code 0, indicating we exited successfully).
|
0 (exit code 0, indicating we exited successfully).
|
||||||
|
|
||||||
Normally, you should close file descriptors (with syscall #3), to tell Linux you're
|
Normally, you would close file descriptors (with syscall #3), to tell Linux you're
|
||||||
done with them, but we don't need to. It'll automatically close all our open
|
done with them, but we don't need to. It'll automatically close all our open
|
||||||
file descriptors when our program exits.
|
file descriptors when our program exits.
|
||||||
|
|
||||||
|
@ -387,4 +389,4 @@ a while.
|
||||||
But these problems aren't really a big deal. We'll only be running this on
|
But these problems aren't really a big deal. We'll only be running this on
|
||||||
little programs and we'll be sure to check that our input is in the right
|
little programs and we'll be sure to check that our input is in the right
|
||||||
format. And with that, we are ready to move on to the
|
format. And with that, we are ready to move on to the
|
||||||
[next stage...](../01/README.md).
|
[next stage...](../01/README.md)
|
||||||
|
|
|
@ -5,3 +5,5 @@ out00: in00
|
||||||
../00/hexcompile
|
../00/hexcompile
|
||||||
%.html: %.md ../markdown
|
%.html: %.md ../markdown
|
||||||
../markdown $<
|
../markdown $<
|
||||||
|
clean:
|
||||||
|
rm -f out00 out01 README.html
|
||||||
|
|
32
01/README.md
|
@ -8,7 +8,7 @@ is the executable for this stage's compiler. Run it (it'll read from the file
|
||||||
`Hello, world!` when run. Let's take a look at the input we're providing to the
|
`Hello, world!` when run. Let's take a look at the input we're providing to the
|
||||||
stage 01 compiler, `in01`:
|
stage 01 compiler, `in01`:
|
||||||
|
|
||||||
<pre><code>
|
```
|
||||||
|| ELF Header
|
|| ELF Header
|
||||||
;im;01;00;00;00;00;00;00;00 file descriptor for stdout
|
;im;01;00;00;00;00;00;00;00 file descriptor for stdout
|
||||||
;JA
|
;JA
|
||||||
|
@ -24,9 +24,9 @@ stage 01 compiler, `in01`:
|
||||||
;sy
|
;sy
|
||||||
;'H;'e;'l;'l;'o;',;' ;'w;'o;'r;'l;'d;'!;\n the string we're printing
|
;'H;'e;'l;'l;'o;',;' ;'w;'o;'r;'l;'d;'!;\n the string we're printing
|
||||||
;
|
;
|
||||||
</code></pre>
|
```
|
||||||
|
|
||||||
Look at that! There are comments! Much nicer than just hexadecimal digit pairs.
|
Look at that! There are even comments! Much nicer than just hexadecimal digit pairs.
|
||||||
|
|
||||||
## end result
|
## end result
|
||||||
|
|
||||||
|
@ -50,9 +50,9 @@ actually print out an error message and exit, rather than continuing as if
|
||||||
nothing happened! Try adding `xx;` to the end of the file `in01`, and running
|
nothing happened! Try adding `xx;` to the end of the file `in01`, and running
|
||||||
`./out00`. You should get the error message:
|
`./out00`. You should get the error message:
|
||||||
|
|
||||||
<pre><code>
|
```
|
||||||
xx not recognized.
|
xx not recognized.
|
||||||
</code></pre>
|
```
|
||||||
|
|
||||||
Pretty cool, huh?
|
Pretty cool, huh?
|
||||||
Anyways let's see how this compiler actually works.
|
Anyways let's see how this compiler actually works.
|
||||||
|
@ -63,7 +63,7 @@ Writing in our stage 00 language is much nicer than editing an
|
||||||
executable, because it's easier to move things around, and also, we can separate
|
executable, because it's easier to move things around, and also, we can separate
|
||||||
our program into lines! Let's take a look at the start:
|
our program into lines! Let's take a look at the start:
|
||||||
|
|
||||||
<pre><code>
|
```
|
||||||
7f 45 4c 46
|
7f 45 4c 46
|
||||||
02
|
02
|
||||||
01
|
01
|
||||||
|
@ -90,7 +90,7 @@ a8 00 40 00 00 00 00 00
|
||||||
00 10 02 00 00 00 00 00
|
00 10 02 00 00 00 00 00
|
||||||
00 10 02 00 00 00 00 00
|
00 10 02 00 00 00 00 00
|
||||||
00 10 00 00 00 00 00 00
|
00 10 00 00 00 00 00 00
|
||||||
</code></pre>
|
```
|
||||||
|
|
||||||
This is the ELF header and program header. It's just like our last one, but with
|
This is the ELF header and program header. It's just like our last one, but with
|
||||||
a couple of differences. First, our entry point is at offset 0xa8 instead of 0x78.
|
a couple of differences. First, our entry point is at offset 0xa8 instead of 0x78.
|
||||||
|
@ -113,7 +113,7 @@ recognized."`
|
||||||
- `00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00` (unused)
|
- `00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00` (unused)
|
||||||
|
|
||||||
Here's the data for our program. As you can see from my annotations, we have the
|
Here's the data for our program. As you can see from my annotations, we have the
|
||||||
input and output file, as well as the error message. The command part of the
|
input and output file names, as well as the error message. The command part of the
|
||||||
error message is left blank for now (we'll fill it in when the code is actually
|
error message is left blank for now (we'll fill it in when the code is actually
|
||||||
run).
|
run).
|
||||||
|
|
||||||
|
@ -182,8 +182,8 @@ program with exit code 0 (successful).
|
||||||
- `48 01 d8` `add rax, rbx`
|
- `48 01 d8` `add rax, rbx`
|
||||||
|
|
||||||
This here looks at the two bytes we read in (we'll call them `b1` and `b2`) and
|
This here looks at the two bytes we read in (we'll call them `b1` and `b2`) and
|
||||||
computes `b1 * 128 + b2` (more specifically `(b1 << 7) + b2`). This is the index
|
computes `b1 * 128 + b2` (more specifically `(b1 << 7) + b2`). This is the corresponding index
|
||||||
in our command table corresponding to the two characters from the input file.
|
in our command table.
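
The index computation can be sketched in Python as follows. The function names are made up for illustration, and the sketch only computes offsets relative to the start of the table, since the table's address is not shown in this hunk. The 8-byte entry size comes from the `shl rax, 3` that follows and from the README's later remark that each entry is 8 bytes long.

```python
def command_index(cmd: str) -> int:
    """Index of a two-character command: (b1 << 7) + b2."""
    b1, b2 = cmd.encode("ascii")  # the two bytes read from the input
    return (b1 << 7) + b2


def entry_offset(cmd: str) -> int:
    """Byte offset of the command's entry from the start of the table."""
    return command_index(cmd) << 3  # 8 bytes per entry


print(command_index("sy"), entry_offset("sy"))
print(command_index("||"), entry_offset("||"))
```

Because both characters are ASCII (below 128), `(b1 << 7) + b2` never collides for two different commands.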
|
||||||
|
|
||||||
- `48 c1 e0 03` `shl rax, 3`
|
- `48 c1 e0 03` `shl rax, 3`
|
||||||
- `48 89 c3` `mov rbx, rax`
|
- `48 89 c3` `mov rbx, rax`
|
||||||
|
@ -211,7 +211,7 @@ is `03 48 89 c3`. We set the length to 0 for unused entries.
|
||||||
So this code checks if the entry for this command starts with a zero byte. If it
|
So this code checks if the entry for this command starts with a zero byte. If it
|
||||||
does, that means the two characters we read in don't actually correspond to a
|
does, that means the two characters we read in don't actually correspond to a
|
||||||
real command. If that's the case, this next bit of code is executed (otherwise
|
real command. If that's the case, this next bit of code is executed (otherwise
|
||||||
it's skiped over):
|
it's skipped over):
|
||||||
|
|
||||||
- `48 b8 02 00 00 00 00 00 00 00` `mov rax, 2 (stderr)`
|
- `48 b8 02 00 00 00 00 00 00 00` `mov rax, 2 (stderr)`
|
||||||
- `48 89 c7` `mov rdi, rax`
|
- `48 89 c7` `mov rdi, rax`
|
||||||
|
@ -228,7 +228,7 @@ it's skiped over):
|
||||||
- `00 00 00 00 00 00 00 00 00 00 00 00 00 00` (unused)
|
- `00 00 00 00 00 00 00 00 00 00 00 00 00 00` (unused)
|
||||||
|
|
||||||
This prints our error message, now filled in with the specific unrecognized
|
This prints our error message, now filled in with the specific unrecognized
|
||||||
instruction, to standard error, and exits with code 1, to indicate failure.
|
instruction, to standard error, then exits with code 1, to indicate failure.
|
||||||
|
|
||||||
- `48 89 eb` `mov rbx, rbp`
|
- `48 89 eb` `mov rbx, rbp`
|
||||||
- `31 c0` `mov rax, 0`
|
- `31 c0` `mov rax, 0`
|
||||||
|
@ -273,7 +273,7 @@ all the way back to read the next command. Otherwise, we keep looping. This
|
||||||
skips over any comments/whitespace we might have between a command and the
|
skips over any comments/whitespace we might have between a command and the
|
||||||
following command.
|
following command.
|
||||||
|
|
||||||
And that's all the *code* for this compiler. Next comes some data.
|
And that's all the *code* for this compiler. Next comes the command table.
|
||||||
|
|
||||||
First, there's a whole bunch of unused 0s. Then there's the line
|
First, there's a whole bunch of unused 0s. Then there's the line
|
||||||
|
|
||||||
|
@ -293,7 +293,7 @@ Which is the encoding of the `syscall` instruction.
|
||||||
You can look through the rest of the table, if you want. But let's look at the
|
You can look through the rest of the table, if you want. But let's look at the
|
||||||
very end:
|
very end:
|
||||||
|
|
||||||
<code><pre>
|
```
|
||||||
78
|
78
|
||||||
7f 45 4c 46
|
7f 45 4c 46
|
||||||
02
|
02
|
||||||
|
@ -321,7 +321,7 @@ very end:
|
||||||
00 00 08 00 00 00 00 00
|
00 00 08 00 00 00 00 00
|
||||||
00 00 08 00 00 00 00 00
|
00 00 08 00 00 00 00 00
|
||||||
00 10 00 00 00 00 00 00
|
00 10 00 00 00 00 00 00
|
||||||
</code></pre>
|
```
|
||||||
|
|
||||||
This is at the position for `||`, and it contains an ELF header. One thing you
|
This is at the position for `||`, and it contains an ELF header. One thing you
|
||||||
might notice is that we decided that each entry is 8 bytes long, but this one is
|
might notice is that we decided that each entry is 8 bytes long, but this one is
|
||||||
|
@ -340,5 +340,5 @@ fixed this, but frankly I've had enough of writing code in hexadecimal. So let's
|
||||||
move on to [stage 02](../02/README.md),
|
move on to [stage 02](../02/README.md),
|
||||||
now that we have a nicer language on our hands. From now
|
now that we have a nicer language on our hands. From now
|
||||||
on, since we have comments, I'm gonna do most of the explaining in the source file
|
on, since we have comments, I'm gonna do most of the explaining in the source file
|
||||||
itself, rather than the README. But there'll still be a bit of stuff there each
|
itself, rather than the README. But there'll still be some stuff there each
|
||||||
time.
|
time.
|
||||||
|
|
|
@ -7,11 +7,12 @@ ff - Byte ff
|
||||||
'a - Character a (byte 0x61)
|
'a - Character a (byte 0x61)
|
||||||
'! - Character ! (byte 0x21)
|
'! - Character ! (byte 0x21)
|
||||||
etc.
|
etc.
|
||||||
|
\n - Newline (byte 0x0a)
|
||||||
|
|
||||||
zA - Zero rax
|
zA - Zero rax
|
||||||
im - Set rax to an immediate value, e.g.
|
im - Set rax to an immediate value, e.g.
|
||||||
im;05;00;00;00;00;00;00;00;
|
im;05;00;00;00;00;00;00;00;
|
||||||
will set rax to 5.
|
will set rax to 5.
|
||||||
|
|
||||||
ax bx cx dx sp bp si di
|
ax bx cx dx sp bp si di
|
||||||
A B C D S R I J
|
A B C D S R I J
|
||||||
|
|
|
@ -1,7 +1,9 @@
|
||||||
all: out01 out02 README.html
|
all: out01 out02 README.html
|
||||||
out01: in01
|
out01: in01
|
||||||
../01/out00
|
../01/out00
|
||||||
out02: out01
|
out02: out01 in02
|
||||||
./out01
|
./out01
|
||||||
%.html: %.md ../markdown
|
%.html: %.md ../markdown
|
||||||
../markdown $<
|
../markdown $<
|
||||||
|
clean:
|
||||||
|
rm -f out01 out02 README.html
|
||||||
|
|
116
02/README.md
|
@ -1,13 +1,15 @@
|
||||||
# stage 02
|
# stage 02
|
||||||
|
|
||||||
The compiler for this stage is in the file `in01`, an input for our previous compiler.
|
The compiler for this stage is in the file `in01`, an input for our previous compiler.
|
||||||
The specifics of how this compiler works are in the comments in that file, but here I'll
|
So if you run `../01/out00`, you'll get the file `out01`, which is
|
||||||
|
this stage's compiler.
|
||||||
|
The specifics of how this compiler works are in the comments in `in01`, but here I'll
|
||||||
give an overview.
|
give an overview.
|
||||||
Let's take a look at `in02`, an example input file for this compiler:
|
Let's take a look at `in02`, an example input file for this compiler:
|
||||||
```
|
```
|
||||||
jm
|
jm
|
||||||
:-co jump to code
|
:-co jump to code
|
||||||
::hw
|
::hw start of hello world
|
||||||
'H
|
'H
|
||||||
'e
|
'e
|
||||||
'l
|
'l
|
||||||
|
@ -23,11 +25,12 @@ jm
|
||||||
'!
|
'!
|
||||||
\n
|
\n
|
||||||
::he end of hello world
|
::he end of hello world
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
::co start of code
|
::co start of code
|
||||||
//
|
// calculate the length of the hello world string
|
||||||
// now we'll calculate the length of the hello world string
|
|
||||||
// by subtracting hw from he.
|
// by subtracting hw from he.
|
||||||
//
|
|
||||||
im
|
im
|
||||||
--he
|
--he
|
||||||
BA
|
BA
|
||||||
|
@ -36,7 +39,7 @@ im
|
||||||
nA
|
nA
|
||||||
+B
|
+B
|
||||||
DA put length in rdx
|
DA put length in rdx
|
||||||
// okay now we can write it
|
// okay now write it
|
||||||
im
|
im
|
||||||
##1.
|
##1.
|
||||||
JA set rdi to 1 (stdout)
|
JA set rdi to 1 (stdout)
|
||||||
|
@ -54,56 +57,123 @@ im
|
||||||
sy
|
sy
|
||||||
```
|
```
|
||||||
|
|
||||||
You can try adding more characters to the hello world message, and it'll just work;
|
We can compile it by running `./out01`. This will produce
|
||||||
the length of the text is computed automatically!
|
the executable `out02`, which you can run. It prints
|
||||||
|
`Hello, world!`.
|
||||||
|
|
||||||
This time, commands are separated by newlines instead of semicolons.
|
In this language,
|
||||||
Each line begins with a 2-character command identifier. There are some special identifiers though:
|
commands are separated by newlines instead of semicolons.
|
||||||
|
Each line begins with a 2-character command.
|
||||||
|
All of the commands from the previous compiler are here,
|
||||||
|
plus six new ones:
|
||||||
|
|
||||||
- `::` marks a *label*
|
- `::` marks a *label*
|
||||||
- `--` outputs a label's (absolute) address
|
- `--` outputs a label's (absolute) address
|
||||||
- `:-` outputs a label's relative address
|
- `:-` outputs a label's relative address
|
||||||
- `##` outputs a number
|
- `##` outputs a number
|
||||||
|
- `//` is for comments
|
||||||
All other commands work like they did in the previous compiler—if you scroll down in the
|
- `\n\n` does nothing (used for spacing)
|
||||||
`in01` source file, you'll see the full command table.
|
|
||||||
|
|
||||||
## labels
|
## labels
|
||||||
|
|
||||||
Labels are the most important new feature of this language.
|
Labels are the most important new feature of this language.
|
||||||
|
A line like
|
||||||
|
```
|
||||||
|
::xy
|
||||||
|
```
|
||||||
|
associates the name `xy` with the address of the next byte of the program.
|
||||||
|
In the example program, `hw` is associated with `0x40007d`,
|
||||||
|
which is the virtual memory address of the `Hello, world!` data.
|
||||||
|
We can then use
|
||||||
|
```
|
||||||
|
--xy
|
||||||
|
```
|
||||||
|
to output that address, and
|
||||||
|
```
|
||||||
|
:-xy
|
||||||
|
```
|
||||||
|
to output it relative to the current address.
|
||||||
|
So now instead of computing how far to jump, we can just jump to a label, e.g.
|
||||||
|
```
|
||||||
|
jm
|
||||||
|
:-xy (use the relative address, because jumps are relative in x86-64)
|
||||||
|
```
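
To make the note about relative jumps concrete: in x86-64, the 4-byte operand of a near jump is a signed little-endian offset, counted from the address of the instruction that follows the jump. The Python sketch below (the helper name `rel32` is made up here) decodes two operands that appear elsewhere in this commit.

```python
def rel32(operand_hex: str) -> int:
    """Decode a 4-byte little-endian signed jump offset."""
    return int.from_bytes(bytes.fromhex(operand_hex), "little", signed=True)


# Operand of `0f 8f 50 01 00 00` (the `jg` in 00/README.md).
forward = rel32("50 01 00 00")

# Operand of `jm;38;ff;ff;ff` (the "continue loop" jump in 02/in01).
backward = rel32("38 ff ff ff")

print(forward, backward)

# The jg lands at 0x400250, so the instruction after it sits here:
print(hex(0x400250 - forward))
```

A negative offset, like the second one, jumps backwards, which is what a loop needs. With labels, the compiler does this subtraction so we don't have to.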
|
||||||
|
And instead of figuring out the address of a piece of data, we can just use its label:
|
||||||
|
```
|
||||||
|
im
|
||||||
|
--xy
|
||||||
|
// rax now points to the data at the label "::xy"
|
||||||
|
```
|
||||||
|
|
||||||
|
This also lets us compute the length of the hello world string automatically!
|
||||||
|
By taking the address of the end of the string (`he`) and subtracting the
|
||||||
|
start (`hw`), we get the length in bytes.
|
||||||
|
So you can try adding more characters to the hello world message, and it'll just work.
|
||||||
|
|
||||||
|
All labels must be two ASCII characters. The address of each label is stored
|
||||||
|
as a 32-bit number in the "label table". This is sort of like the command table—the
|
||||||
|
index of the label `xy` is `128 * x + y`. Specifically, the entry for `xy` is at
|
||||||
|
`0x420000 + 4 * (128 * x + y)`, since the label table starts at `0x420000`
|
||||||
|
and each entry is 4 bytes.
|
||||||
|
When we encounter `::xy`, we get the current position in the output file
|
||||||
|
(using `lseek`), add the address of the start of the file (`0x400000`),
|
||||||
|
and store that in the label table.
|
||||||
|
When we encounter `:-xy` or `--xy`, we look up `xy` in the label table,
|
||||||
|
and write the address (subtracting the current address for `:-`) to the output file.
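
The label-table lookup described above can be sketched in Python like this. The constants are taken from the text; the function name is only illustrative.

```python
LABEL_TABLE = 0x420000  # start of the label table, per the README
ENTRY_SIZE = 4          # each entry holds a 32-bit address


def label_entry_address(label: str) -> int:
    """Address of the table entry for a two-character ASCII label."""
    x, y = label.encode("ascii")
    return LABEL_TABLE + ENTRY_SIZE * (128 * x + y)


print(hex(label_entry_address("hw")))
print(hex(label_entry_address("he")))
```

Note that this is the address of the table *entry*, not the address the label refers to; the entry for `hw` is where the value `0x40007d` would be stored.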
|
||||||
|
|
||||||
## two passes?
|
## two passes?
|
||||||
|
|
||||||
|
This compiler actually needs to read through the source code,
|
||||||
|
and output an executable, twice.
|
||||||
|
This is because a label may be defined *after* it is used, e.g.:
|
||||||
|
```
|
||||||
|
jm
|
||||||
|
:-aa jump forward
|
||||||
|
...
|
||||||
|
::aa this is where we're jumping to
|
||||||
|
...
|
||||||
|
```
|
||||||
|
In the first pass, the `:-aa` will
|
||||||
|
treat `aa` as having an address of 0. Then when
|
||||||
|
we get to `::aa`, the address in the label table will be corrected.
|
||||||
|
At the end of the first pass, we seek back to the start
|
||||||
|
of the input and output files,
|
||||||
|
and run the exact same code for the second pass.
|
||||||
|
But this time, the correct address of `aa` is used, namely the
|
||||||
|
one we calculated in the first pass.
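
Here is a toy Python model of the two-pass idea. It is a sketch under loud assumptions, not the compiler's actual logic: it only handles `::` and `--`, it pretends every other line emits exactly 1 byte, and it treats a reference as 4 bytes. What it does show is that running the same code twice is enough to resolve a forward reference.

```python
def assemble(lines, passes=2):
    """Toy resolver: returns the list of addresses emitted on each pass."""
    labels = {}   # label name -> address; a missing label counts as 0
    history = []
    for _ in range(passes):
        address = 0x400000  # start of the file in memory, per the README
        emitted = []
        for line in lines:
            if line.startswith("::"):
                labels[line[2:4]] = address   # define (or correct) the label
            elif line.startswith("--"):
                emitted.append(labels.get(line[2:4], 0))
                address += 4                  # assumed size of a reference
            else:
                address += 1                  # assumed size of anything else
        history.append(emitted)
    return history


first, second = assemble(["--aa", "xx", "xx", "::aa"])
print([hex(a) for a in first])
print([hex(a) for a in second])
```

On the first pass the reference to `aa` comes out as 0; on the second pass it comes out as the address recorded when `::aa` was reached. The layout is identical in both passes because a reference takes up the same number of bytes whether or not its value is known yet.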
|
||||||
|
|
||||||
|
|
||||||
## other features
|
## other features
|
||||||
|
|
||||||
Now instead of writing out each of the 8 bytes making up a number,
|
Now instead of writing out each of the 8 bytes making up a number,
|
||||||
we can just write it in hexadecimal (e.g. `##3c.` for `3c 00 00 00 00 00 00 00`),
|
we can just write it in hexadecimal, e.g. `##1c4.` for `c4 01 00 00 00 00 00 00`.
|
||||||
and the compiler will automatically
|
|
||||||
extend it to 8 bytes.
|
|
||||||
This is especially nice because we don't need to write numbers backwards
|
This is especially nice because we don't need to write numbers backwards
|
||||||
for little-endianness anymore!
|
for little-endianness anymore!
|
||||||
Numbers cannot appear at the end of a line (this was
|
Numbers cannot appear at the end of a line (this made
|
||||||
to make the compiler simpler to write), so I'm adding a `.` at the end of
|
the compiler simpler to write), so I'm adding a `.` at the end of
|
||||||
each one to avoid making that mistake.
|
each one to avoid making that mistake.
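
As a rough Python sketch of what `##` has to do: accumulate the hex digits one at a time (shift left by 4, add the digit), then write the result as 8 little-endian bytes. The function name is hypothetical, and the input handling is simplified; it assumes the text starts with `##` and ends with the `.` convention described above.

```python
def encode_number(text: str) -> bytes:
    """Turn something like '##1c4.' into 8 little-endian bytes."""
    digits = text[2:].rstrip(".")  # drop the leading '##' and trailing '.'
    value = 0
    for ch in digits:
        value = (value << 4) + int(ch, 16)  # shift in one hex digit
    return value.to_bytes(8, "little")


print(encode_number("##1c4.").hex())
print(encode_number("##3c.").hex())
```

The shift-and-add loop is the same idea as the digit loop visible in the `02/in01` hunk further down (the lines around "okay we now have a digit in rbx").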
|
||||||
|
|
||||||
Anything after a command is treated as a comment;
|
Anything after a command is treated as a comment;
|
||||||
additionally `//` can be used for comments on their own lines.
|
additionally `//` can be used for comments on their own lines.
|
||||||
I decided to implement them as simply as possible:
|
I decided to implement this as simply as possible:
|
||||||
I just added the command `//` to the command table, which outputs the byte `0x90`—this
|
I just added the command `//` to the command table, which outputs the byte `0x90`—this
|
||||||
means "do nothing" (`nop`) in x86-64.
|
means ["do nothing"](https://en.wikipedia.org/wiki/No-op)
|
||||||
Note that this means that the following code will not work as expected:
|
in x86-64.
|
||||||
|
Note that the following code will not work as expected:
|
||||||
```
|
```
|
||||||
im
|
im
|
||||||
// load the value 0x333 into rax
|
// load the value 0x333 into rax
|
||||||
##333.
|
##333.
|
||||||
```
|
```
|
||||||
since `0x90` gets inserted between the "load immediate" instruction code, and the immediate.
|
since `0x90` gets inserted between the "load immediate" instruction code and the immediate.
|
||||||
|
`\n\n` works identically, and lets us space out code a bit. But be careful:
|
||||||
|
the number of blank lines must be a multiple of 3!
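
To see why the `//` example above misbehaves, here is a Python sketch of the bytes involved. It assumes that `im` emits `48 b8` (the `mov rax, imm64` opcode used throughout `00/README.md`) and that `##333.` emits 8 little-endian bytes; both are inferences from this commit rather than something quoted from the compiler source.

```python
MOV_RAX_IMM64 = bytes.fromhex("48 b8")  # assumed output of `im`
NOP = bytes.fromhex("90")               # output of `//`, per the README
number = (0x333).to_bytes(8, "little")  # assumed output of `##333.`

# What the three lines of the example would produce, in order.
broken = MOV_RAX_IMM64 + NOP + number

# The CPU takes the 8 bytes right after the opcode as the immediate.
loaded = int.from_bytes(broken[2:10], "little")
leftover = broken[10:]

print(hex(loaded), leftover)
```

Under those assumptions, `rax` gets `0x33390` instead of `0x333`, and one stray zero byte is left over to be decoded as the start of the next instruction.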
|
||||||
|
|
||||||
## limitations
|
## limitations
|
||||||
|
|
||||||
Many of the limitations of our previous compilers apply to this one. Also,
|
Many of the limitations of our previous compilers apply to this one. Also,
|
||||||
if you use a label without defining it, it uses address 0, rather than outputting
|
if you use a label without defining it, it uses address 0, rather than outputting
|
||||||
an error message. This could be fixed: if the value in the label table is 0, and if we are
|
an error message. This could be fixed: if the value in the label table is 0 and we are
|
||||||
on the second pass, output an error message. This compiler was already tedious enough
|
on the second pass, output an error message. This compiler was already tedious enough
|
||||||
to implement, though!
|
to implement, though!
|
||||||
But thanks to labels, for future compilers at least we won't have to calculate
|
But thanks to labels, for future compilers at least we won't have to calculate
|
||||||
|
|
24
02/in01
|
@ -3,7 +3,7 @@
|
||||||
;'i;'n;'0;'2;00 (0x40007d) input filename
|
;'i;'n;'0;'2;00 (0x40007d) input filename
|
||||||
;'o;'u;'t;'0;'2;00 (0x400082) output filename
|
;'o;'u;'t;'0;'2;00 (0x400082) output filename
|
||||||
;00;00;' ;'n;'o;'t;' ;'r;'e;'c;'o;'g;'n;'i;'z;'e;'d;\n;00;00;00;00;00;00 (0x400088) error message/where we read to
|
;00;00;' ;'n;'o;'t;' ;'r;'e;'c;'o;'g;'n;'i;'z;'e;'d;\n;00;00;00;00;00;00 (0x400088) error message/where we read to
|
||||||
;00 (0x4000a0) stores which pass we're on (1 for second pass)
|
;00 (0x4000a0) stores which pass we're on (0 for first pass, 1 for second pass)
|
||||||
;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00 (0x4000a8) used for output
|
;00;00;00;00;00;00;00;00 (0x4000a8) used for output
|
||||||
unused padding
|
unused padding
|
||||||
|
@ -180,11 +180,11 @@ okay it's 0-9
|
||||||
|
|
||||||
;+B
|
;+B
|
||||||
;BA
|
;BA
|
||||||
okay we now have a digit in RBX
|
okay we now have a digit in rbx
|
||||||
;AR
|
;AR
|
||||||
;<I;04
|
;<I;04
|
||||||
;+B
|
;+B
|
||||||
;RA store away in RBP
|
;RA store away in rbp
|
||||||
;jm;38;ff;ff;ff continue loop
|
;jm;38;ff;ff;ff continue loop
|
||||||
|
|
||||||
unused padding
|
unused padding
|
||||||
|
@ -195,7 +195,7 @@ unused padding
|
||||||
;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00
|
||||||
|
|
||||||
okay we have a full number in RBP, time to write it to the file
|
okay we have a full number in rbp, time to write it to the file.
|
||||||
start by putting it at address 0x4000a8
|
start by putting it at address 0x4000a8
|
||||||
;im;a8;00;40;00;00;00;00;00
|
;im;a8;00;40;00;00;00;00;00
|
||||||
;BA
|
;BA
|
||||||
|
@ -210,7 +210,7 @@ now write
|
||||||
;IA
|
;IA
|
||||||
;im;08;00;00;00;00;00;00;00 write 8 bytes
|
;im;08;00;00;00;00;00;00;00 write 8 bytes
|
||||||
;DA
|
;DA
|
||||||
;im;01;00;00;00;00;00;00;00 write
|
;im;01;00;00;00;00;00;00;00 write
|
||||||
;sy
|
;sy
|
||||||
|
|
||||||
;jm;c3;03;00;00 skip to newline
|
;jm;c3;03;00;00 skip to newline
|
||||||
|
@ -327,11 +327,11 @@ subtract current address
|
||||||
;nA;+B
|
;nA;+B
|
||||||
;RA store relative address in rbp
|
;RA store relative address in rbp
|
||||||
|
|
||||||
now we want to write eax to the output file.
|
now we want to write ebp to the output file.
|
||||||
start by putting it at address 0x4000a8
|
start by putting it at address 0x4000a8
|
||||||
;im;a8;00;40;00;00;00;00;00
|
;im;a8;00;40;00;00;00;00;00
|
||||||
;BA
|
;BA
|
||||||
;AR put relative address in rax
|
;AR
|
||||||
;sd
|
;sd
|
||||||
|
|
||||||
now write
|
now write
|
||||||
|
@ -341,7 +341,7 @@ now write
|
||||||
;IA
|
;IA
|
||||||
;im;04;00;00;00;00;00;00;00 4 bytes
|
;im;04;00;00;00;00;00;00;00 4 bytes
|
||||||
;DA
|
;DA
|
||||||
;im;01;00;00;00;00;00;00;00 write
|
;im;01;00;00;00;00;00;00;00 write
|
||||||
;sy
|
;sy
|
||||||
|
|
||||||
;jm;66;01;00;00 skip to newline
|
;jm;66;01;00;00 skip to newline
|
||||||
|
@ -368,7 +368,7 @@ it's not a label or a number. let's look it up in the instruction table.
|
||||||
;BA
|
;BA
|
||||||
;RA store away address of command text in rbp
|
;RA store away address of command text in rbp
|
||||||
;zA;lb
|
;zA;lb
|
||||||
;DA number of bytes to write (used for syscall if no error)
|
;DA number of bytes to write (used for syscall if command exists)
|
||||||
;BA
|
;BA
|
||||||
;zA
|
;zA
|
||||||
;cm;jn;54;00;00;00 check if # of bytes is 0, if not, skip outputting error
|
;cm;jn;54;00;00;00 check if # of bytes is 0, if not, skip outputting error
|
||||||
|
@ -392,7 +392,7 @@ this is a real command
|
||||||
;im;01;00;00;00;00;00;00;00 add 1 because we don't want to write the length
|
;im;01;00;00;00;00;00;00;00 add 1 because we don't want to write the length
|
||||||
;+B
|
;+B
|
||||||
;IA address of data to write
|
;IA address of data to write
|
||||||
;im;04;00;00;00;00;00;00;00 out file descriptor
|
;im;04;00;00;00;00;00;00;00 out file descriptor
|
||||||
;JA
|
;JA
|
||||||
;im;01;00;00;00;00;00;00;00 write
|
;im;01;00;00;00;00;00;00;00 write
|
||||||
;sy
|
;sy
|
||||||
|
@ -1777,7 +1777,7 @@ the formatting changed appropriately.
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00
|
;01;90;00;00;00;00;00;00 \n\n
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
|
@ -6550,7 +6550,7 @@ the formatting changed appropriately.
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;01;90;00;00;00;00;00;00
|
;01;90;00;00;00;00;00;00 // comments
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
;00;00;00;00;00;00;00;00
|
;00;00;00;00;00;00;00;00
|
||||||
|
|
11
02/in02
11
02/in02
|
@ -1,6 +1,6 @@
|
||||||
jm
|
jm
|
||||||
:-co jump to code
|
:-co jump to code
|
||||||
::hw
|
::hw start of hello world
|
||||||
'H
|
'H
|
||||||
'e
|
'e
|
||||||
'l
|
'l
|
||||||
|
@ -16,11 +16,12 @@ jm
|
||||||
'!
|
'!
|
||||||
\n
|
\n
|
||||||
::he end of hello world
|
::he end of hello world
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
::co start of code
|
::co start of code
|
||||||
//
|
// calculate the length of the hello world string
|
||||||
// now we'll calculate the length of the hello world string
|
|
||||||
// by subtracting hw from he.
|
// by subtracting hw from he.
|
||||||
//
|
|
||||||
im
|
im
|
||||||
--he
|
--he
|
||||||
BA
|
BA
|
||||||
|
@ -29,7 +30,7 @@ im
|
||||||
nA
|
nA
|
||||||
+B
|
+B
|
||||||
DA put length in rdx
|
DA put length in rdx
|
||||||
// okay now we can write it
|
// okay now write it
|
||||||
im
|
im
|
||||||
##1.
|
##1.
|
||||||
JA set rdi to 1 (stdout)
|
JA set rdi to 1 (stdout)
|
||||||
|
|
6
Makefile
6
Makefile
|
@ -2,6 +2,12 @@ all: markdown README.html
|
||||||
$(MAKE) -C 00
|
$(MAKE) -C 00
|
||||||
$(MAKE) -C 01
|
$(MAKE) -C 01
|
||||||
$(MAKE) -C 02
|
$(MAKE) -C 02
|
||||||
|
clean:
|
||||||
|
$(MAKE) -C 00 clean
|
||||||
|
$(MAKE) -C 01 clean
|
||||||
|
$(MAKE) -C 02 clean
|
||||||
|
rm -f markdown
|
||||||
|
rm -f README.html
|
||||||
markdown: markdown.c
|
markdown: markdown.c
|
||||||
$(CC) -O2 -o markdown -Wall -Wconversion -Wshadow -std=c89 markdown.c
|
$(CC) -O2 -o markdown -Wall -Wconversion -Wshadow -std=c89 markdown.c
|
||||||
README.html: markdown README.md
|
README.html: markdown README.md
|
||||||
|
|
27
README.md
27
README.md
|
@ -17,7 +17,14 @@ Note that the executables produced in this series will only run on
|
||||||
64-bit Linux, because each OS/architecture combination would need its own separate
|
64-bit Linux, because each OS/architecture combination would need its own separate
|
||||||
executable.
|
executable.
|
||||||
|
|
||||||
The README for the first stage is [here](00/README.md).
|
## table of contents
|
||||||
|
|
||||||
|
- [stage 00](00/README.md) - a program converting a text file with
|
||||||
|
hexadecimal digit pairs to a binary file.
|
||||||
|
- [stage 01](01/README.md) - a language with comments, and 2-character
|
||||||
|
command codes.
|
||||||
|
- [stage 02](02/README.md) - a language with labels
|
||||||
|
- more coming soon (hopefully)
|
||||||
|
|
||||||
## prerequisite knowledge
|
## prerequisite knowledge
|
||||||
|
|
||||||
|
@ -44,8 +51,7 @@ decimal.
|
||||||
- ASCII, null-terminated strings
|
- ASCII, null-terminated strings
|
||||||
- how pointers work
|
- how pointers work
|
||||||
- how floating-point numbers work
|
- how floating-point numbers work
|
||||||
- maybe some basic Intel-style x86-64 assembly (you can probably pick it up on
|
- some basic Intel-style x86-64 assembly
|
||||||
the way though)
|
|
||||||
|
|
||||||
It will help you a lot to know how to program (with any programming language),
|
It will help you a lot to know how to program (with any programming language),
|
||||||
but it's not strictly necessary.
|
but it's not strictly necessary.
|
||||||
|
@ -53,12 +59,11 @@ but it's not strictly necessary.
|
||||||
## instruction set
|
## instruction set
|
||||||
|
|
||||||
x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages
|
x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages
|
||||||
long! So, it makes sense to select only a small subset of it to use for all the
|
long! So it makes sense to select only a small subset of it to use.
|
||||||
stages of our compiler. The set I've chosen can be found in `instructions.txt`.
|
The set I've chosen can be found in `instructions.txt`.
|
||||||
I think it achieves a pretty good balance between having few enough
|
I think it achieves a pretty good balance between having few enough
|
||||||
instructions to be manageable and having enough instructions to be useable.
|
instructions to be manageable and having enough instructions to be useable.
|
||||||
To be clear, you don't need to read that file to understand the series, at least
|
To be clear, you don't need to read that file to understand the series.
|
||||||
not right away.
|
|
||||||
|
|
||||||
## principles
|
## principles
|
||||||
|
|
||||||
|
@ -91,15 +96,15 @@ project can't necessarily even do that though, because the Linux kernel, which
|
||||||
we depend on, is compiled from C, so we can't fully trust *it*. To *truly*
|
we depend on, is compiled from C, so we can't fully trust *it*. To *truly*
|
||||||
create a fully trustable compiler, you'd need to manually write to a USB with a
|
create a fully trustable compiler, you'd need to manually write to a USB with a
|
||||||
circuit, create an operating system from nothing (without even a text editor),
|
circuit, create an operating system from nothing (without even a text editor),
|
||||||
and then follow this series, or maybe you don't even trust your CPU vendor...
|
and then follow this series, or maybe you don't even trust your CPU...
|
||||||
I'll leave that to someone else
|
I'll leave that to someone else.
|
||||||
|
|
||||||
## license
|
## license
|
||||||
|
|
||||||
```
|
```
|
||||||
This project is in the public domain. Any copyright protections from any law
|
This project is in the public domain. Any copyright protections from any law
|
||||||
for this project are forfeited by the author(s). No warranty is provided for
|
are forfeited by the author(s). No warranty is provided, and the author(s)
|
||||||
this project, and the author(s) shall not be held liable in connection with it.
|
shall not be held liable in connection with it.
|
||||||
```
|
```
|
||||||
|
|
||||||
## contributing
|
## contributing
|
||||||
|
|
|
@ -101,3 +101,4 @@ syscall
|
||||||
>0f 05
|
>0f 05
|
||||||
nop
|
nop
|
||||||
>90
|
>90
|
||||||
|
(more will be added as needed)
|
||||||
|
|
|
@ -58,7 +58,8 @@ static void output_md_text(FILE *out, int *flags, int line_number, const char *t
|
||||||
case '[': {
|
case '[': {
|
||||||
/* link */
|
/* link */
|
||||||
char url2[256] = {0};
|
char url2[256] = {0};
|
||||||
const char *label, *url, *label_end, *url_end, *dot;
|
const char *label, *url, *label_end, *url_end;
|
||||||
|
char *dot;
|
||||||
int n_label, n_url;
|
int n_label, n_url;
|
||||||
|
|
||||||
label = p+1;
|
label = p+1;
|
||||||
|
@ -88,7 +89,7 @@ static void output_md_text(FILE *out, int *flags, int line_number, const char *t
|
||||||
/* replace links to md files with links to html files */
|
/* replace links to md files with links to html files */
|
||||||
strcpy(dot, ".html");
|
strcpy(dot, ".html");
|
||||||
}
|
}
|
||||||
fprintf(out, "<a href=\"%s\" target=\"_blank\">%.*s</a>",
|
fprintf(out, "<a href=\"%s\">%.*s</a>",
|
||||||
url2, n_label, label);
|
url2, n_label, label);
|
||||||
p = url_end;
|
p = url_end;
|
||||||
} break;
|
} break;
|
||||||
|
|
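The last two hunks change how the markdown converter rewrites links: `dot` becomes a non-const `char *` because `strcpy` writes through it, and the `target="_blank"` attribute is dropped. Below is a minimal sketch of that suffix rewrite in isolation. The helper name `md_url_to_html` and its signature are assumptions made for illustration and do not appear in `markdown.c`; only the `.md` to `.html` replacement via `strcpy(dot, ".html")` comes from the diff.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of markdown.c): copy `url` into `out`,
 * replacing a trailing ".md" with ".html".
 * Returns 0 on success, -1 if `out` is too small.
 */
static int md_url_to_html(const char *url, char *out, size_t out_size) {
    size_t len = strlen(url);
    char *dot; /* must be non-const: we write through it below */

    /* ".md" -> ".html" grows the string by 2 bytes, plus 1 for the NUL */
    if (len + 3 > out_size)
        return -1;

    memcpy(out, url, len + 1); /* copy including the terminating NUL */

    dot = strrchr(out, '.');   /* last '.' in the mutable copy */
    if (dot != NULL && strcmp(dot, ".md") == 0)
        strcpy(dot, ".html");  /* overwrite the extension in place */

    return 0;
}
```

Because `dot` points into the writable copy rather than into the original const text, declaring it as `char *` avoids casting away constness, which appears to be the motivation for the declaration change in the diff.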