readme tweaks, mainly

pommicket 2021-11-10 12:55:41 -05:00
parent 3255cd32d7
commit 2288e47516
13 changed files with 177 additions and 84 deletions


@ -3,3 +3,5 @@ out00: in00
./hexcompile ./hexcompile
%.html: %.md ../markdown %.html: %.md ../markdown
../markdown $< ../markdown $<
clean:
rm -f out00 README.html


@ -102,7 +102,7 @@ execute-enabled. Normally people don't do this, for security, but we won't worry
about that (don't compile any untrusted code with any compiler from this series!) about that (don't compile any untrusted code with any compiler from this series!)
Without further ado, here's the contents of the program header: Without further ado, here's the contents of the program header:
- `01 00 00 00` Segment type 1 (this should be loaded into memory) - `01 00 00 00` Segment type 1 (this segment should be loaded into memory)
- `07 00 00 00` Flags = RWE (readable, writeable, and executable) - `07 00 00 00` Flags = RWE (readable, writeable, and executable)
- `78 00 00 00 00 00 00 00` Offset in file = 120 bytes - `78 00 00 00 00 00 00 00` Offset in file = 120 bytes
- `78 00 40 00 00 00 00 00` Virtual address = 0x400078 - `78 00 40 00 00 00 00 00` Virtual address = 0x400078
@ -114,7 +114,7 @@ memory address that the segment will be loaded to.
Nowadays, computers use virtual memory, meaning that Nowadays, computers use virtual memory, meaning that
addresses in our program don't actually correspond to where the memory is addresses in our program don't actually correspond to where the memory is
physically stored in RAM (the CPU translates between virtual and physical physically stored in RAM (the CPU translates between virtual and physical
memory addresses). There are many reasons for this: making sure each process has addresses). There are many reasons for this: making sure each process has
its own memory space, memory protection, etc. You can read more about it its own memory space, memory protection, etc. You can read more about it
elsewhere. elsewhere.
@ -130,7 +130,7 @@ each page (block) of memory is 4096 bytes long, and has to start at an address
that is a multiple of 4096. Our program needs to be loaded into a memory page, that is a multiple of 4096. Our program needs to be loaded into a memory page,
so its *virtual address* needs to be a multiple of 4096. We're using `0x400000`. so its *virtual address* needs to be a multiple of 4096. We're using `0x400000`.
But wait! Didn't we use `0x400078` for the virtual address? Well, yes, but that's But wait! Didn't we use `0x400078` for the virtual address? Well, yes, but that's
because the *data in the file* is loaded to address `0x400078`. The actual page because the segment's data is loaded to address `0x400078`. The actual page
of memory that the OS will allocate for our segment will start at `0x400000`. The of memory that the OS will allocate for our segment will start at `0x400000`. The
reason we need to start `0x78` bytes in is that Linux expects the data in the reason we need to start `0x78` bytes in is that Linux expects the data in the
file to be at the same position in the page as when it will be loaded, and it file to be at the same position in the page as when it will be loaded, and it
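
The relationship between the file offset `0x78`, the virtual address `0x400078`, and the page at `0x400000` can be checked with a little arithmetic. Here is a minimal C sketch; the helper names are made up for illustration, and the 4096-byte page size is the one given in the text.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* page size stated in the text */

/* Start of the page containing addr (round down to a multiple of 4096). */
uint64_t page_base(uint64_t addr) {
    return addr & ~(uint64_t)(PAGE_SIZE - 1);
}

/* Position of addr within its page. */
uint64_t page_offset(uint64_t addr) {
    return addr & (PAGE_SIZE - 1);
}

/* Hypothetical helper: nonzero if a segment's file offset and virtual
   address sit at the same position within a page. */
int same_position_in_page(uint64_t file_offset, uint64_t vaddr) {
    return page_offset(file_offset) == page_offset(vaddr);
}
```

With these, `assert(page_base(0x400078) == 0x400000)` and `assert(same_position_in_page(0x78, 0x400078))` both hold, while pairing offset `0x78` with address `0x400000` would fail the second check.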
@ -156,7 +156,8 @@ These instructions execute syscall `2` with arguments `0x40026d`, `0`.
If you're familiar with C code, this is `open("in00", O_RDONLY)`. If you're familiar with C code, this is `open("in00", O_RDONLY)`.
A syscall is the mechanism which lets software ask the kernel to do things. A syscall is the mechanism which lets software ask the kernel to do things.
[Here](https://filippo.io/linux-syscall-table/) is a nice table of syscalls you [Here](https://filippo.io/linux-syscall-table/) is a nice table of syscalls you
can look through if you're interested. You can also install `strace` (e.g. with can look through if you're interested. You can also install
[strace](https://strace.io) (e.g. with
`sudo apt install strace`) and run `strace ./hexcompile` to see all the syscalls `sudo apt install strace`) and run `strace ./hexcompile` to see all the syscalls
our program does. our program does.
Syscall #2, on 64-bit Linux, is `open`. It's used to open a file. You can read Syscall #2, on 64-bit Linux, is `open`. It's used to open a file. You can read
@ -175,13 +176,13 @@ descriptor Linux gave us. This is because Linux assigns file descriptor numbers
sequentially, starting from sequentially, starting from
[0 for stdin, 1 for stdout, 2 for stderr](https://en.wikipedia.org/wiki/Standard_streams), [0 for stdin, 1 for stdout, 2 for stderr](https://en.wikipedia.org/wiki/Standard_streams),
and then 3, 4, 5, ... for any files our program opens. So and then 3, 4, 5, ... for any files our program opens. So
this file, the first one our program opens, will have descriptor `3`. this file, the first one our program opens, will have descriptor 3.
Now we open our output file: Now we open our output file:
- `48 b8 72 02 40 00 00 00 00 00` `mov rax, 0x400272` - `48 b8 72 02 40 00 00 00 00 00` `mov rax, 0x400272`
- `48 89 c7` `mov rdi, rax` - `48 89 c7` `mov rdi, rax`
- `48 b8 41 02 00 00 00 00 00 00` `mov rax, 0x41` - `48 b8 41 02 00 00 00 00 00 00` `mov rax, 0x241`
- `48 89 c6` `mov rsi, rax` - `48 89 c6` `mov rsi, rax`
- `48 b8 ed 01 00 00 00 00 00 00` `mov rax, 0o755` - `48 b8 ed 01 00 00 00 00 00 00` `mov rax, 0o755`
- `48 89 c2` `mov rdx, rax` - `48 89 c2` `mov rdx, rax`
@ -193,11 +194,12 @@ similar to our first call, with two important differences: first, we specify
`0x241` as the second argument. This tells Linux that we are writing to the `0x241` as the second argument. This tells Linux that we are writing to the
file (`O_WRONLY = 0x01`), that we want to create it if it doesn't exist file (`O_WRONLY = 0x01`), that we want to create it if it doesn't exist
(`O_CREAT = 0x40`), and that we want to delete any previous contents it had (`O_CREAT = 0x40`), and that we want to delete any previous contents it had
(`O_TRUNC = 0x200`). Secondly, we are setting the third argument this time. It (`O_TRUNC = 0x200`). Secondly, we're setting the third argument this time. It
specifies the permissions our file is created with (`0o755` means user specifies the permissions our file is created with (`0o755` means user
read/write/execute, group/other read/execute). This is not very important to read/write/execute, group/other read/execute). This is not very important to
the actual execution of the program, so don't worry if you don't know the actual execution of the program, so don't worry if you don't know
about UNIX permissions. about UNIX permissions.
Note that the output file's descriptor will be 4.
Now we can start reading from the file. We're going to loop back to this part of Now we can start reading from the file. We're going to loop back to this part of
the code every time we want to read a new hexadecimal number from the input the code every time we want to read a new hexadecimal number from the input
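
The `0x41` to `0x241` correction in this hunk can be double-checked against the bytes themselves. Below is a small C sketch, assuming 64-bit Linux flag values from `<fcntl.h>`; `le64`, `output_open_flags`, and the two arrays are names invented here, with the array contents copied from the instruction bytes above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Immediate operands copied from the two `48 b8 ...` instructions above. */
const unsigned char flags_imm[8] = {0x41, 0x02, 0, 0, 0, 0, 0, 0};
const unsigned char mode_imm[8]  = {0xed, 0x01, 0, 0, 0, 0, 0, 0};

/* Decode 8 bytes stored least-significant byte first (little-endian). */
uint64_t le64(const unsigned char bytes[8]) {
    uint64_t value = 0;
    int i;
    for (i = 7; i >= 0; i--)
        value = (value << 8) | bytes[i];
    return value;
}

/* Flags for the output file: write-only, create if missing, truncate. */
int output_open_flags(void) {
    return O_WRONLY | O_CREAT | O_TRUNC;
}
```

On Linux, `le64(flags_imm)` and `output_open_flags()` both come out to `0x241`, and `le64(mode_imm)` is `0755`, so the bytes agree with the corrected mnemonic rather than the old `0x41`.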
@ -223,13 +225,13 @@ We're telling Linux to output to `0x40026a`, which is just a part of this
segment (see further down). Normally you would read to a different segment of segment (see further down). Normally you would read to a different segment of
the program from where the code is, but we want this to be as simple as the program from where the code is, but we want this to be as simple as
possible. possible.
The number of bytes *actually read*, taking into account that we might have The number of bytes *actually* read, taking into account that we might have
reached the end of the file, is stored in `rax`. reached the end of the file, is stored in `rax`.
- `48 89 c3` `mov rbx, rax` - `48 89 c3` `mov rbx, rax`
- `48 b8 03 00 00 00 00 00 00 00` `mov rax, 3` - `48 b8 03 00 00 00 00 00 00 00` `mov rax, 3`
- `48 39 d8` `cmp rax, rbx` - `48 39 d8` `cmp rax, rbx`
- `0f 8f 50 01 00 00` `jg 0x400250` - `0f 8f 50 01 00 00` `jg +0x150 (0x400250)`
This tells the CPU to jump to a later part of the code (address `0x400250`) if 3 This tells the CPU to jump to a later part of the code (address `0x400250`) if 3
is greater than the number of bytes we got, in other words, if we reached the is greater than the number of bytes we got, in other words, if we reached the
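
The new `jg +0x150 (0x400250)` notation shows both the encoded offset and the resulting address. On x86-64 the offset is counted from the end of the jump instruction, which a short C sketch can make concrete. The function names are hypothetical, and the jump's own address is not stated in this diff; it is only inferred from the two numbers shown.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a relative jump: rel32 is added to the address of the
   instruction that follows the jump. */
uint64_t jump_target(uint64_t insn_addr, unsigned insn_len, int32_t rel32) {
    return insn_addr + insn_len + (uint64_t)(int64_t)rel32;
}

/* Inverse: the address the jump must sit at to reach `target`. */
uint64_t jump_address(uint64_t target, unsigned insn_len, int32_t rel32) {
    return target - insn_len - (uint64_t)(int64_t)rel32;
}
```

The `0f 8f 50 01 00 00` encoding is 6 bytes long, so `jump_address(0x400250, 6, 0x150)` gives `0x4000fa`, assuming both numbers in the diff are right. Negative offsets (backward jumps) work the same way.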
@ -307,7 +309,7 @@ Okay, now `rax` contains the byte specified by the two hex digits we read.
- `48 93` `xchg rax, rbx` - `48 93` `xchg rax, rbx`
- `88 03` `mov byte [rbx], al` - `88 03` `mov byte [rbx], al`
Write the byte to a specific memory location (address `0x40026c`). Put the byte in a specific memory location (address `0x40026c`).
- `48 b8 04 00 00 00 00 00 00 00` `mov rax, 4` - `48 b8 04 00 00 00 00 00 00 00` `mov rax, 4`
- `48 89 c7` `mov rdi, rax` - `48 89 c7` `mov rdi, rax`
@ -356,7 +358,7 @@ This is where we conditionally jumped to way back when we determined if we
reached the end of the file. This calls syscall #60, `exit`, with one argument, reached the end of the file. This calls syscall #60, `exit`, with one argument,
0 (exit code 0, indicating we exited successfully). 0 (exit code 0, indicating we exited successfully).
Normally, you should close file descriptors (with syscall #3), to tell Linux you're Normally, you would close file descriptors (with syscall #3), to tell Linux you're
done with them, but we don't need to. It'll automatically close all our open done with them, but we don't need to. It'll automatically close all our open
file descriptors when our program exits. file descriptors when our program exits.
@ -387,4 +389,4 @@ a while.
But these problems aren't really a big deal. We'll only be running this on But these problems aren't really a big deal. We'll only be running this on
little programs and we'll be sure to check that our input is in the right little programs and we'll be sure to check that our input is in the right
format. And with that, we are ready to move on to the format. And with that, we are ready to move on to the
[next stage...](../01/README.md). [next stage...](../01/README.md)


@ -5,3 +5,5 @@ out00: in00
../00/hexcompile ../00/hexcompile
%.html: %.md ../markdown %.html: %.md ../markdown
../markdown $< ../markdown $<
clean:
rm -f out00 out01 README.html


@ -8,7 +8,7 @@ is the executable for this stage's compiler. Run it (it'll read from the file
`Hello, world!` when run. Let's take a look at the input we're providing to the `Hello, world!` when run. Let's take a look at the input we're providing to the
stage 01 compiler, `in01`: stage 01 compiler, `in01`:
<pre><code> ```
|| ELF Header || ELF Header
;im;01;00;00;00;00;00;00;00 file descriptor for stdout ;im;01;00;00;00;00;00;00;00 file descriptor for stdout
;JA ;JA
@ -24,9 +24,9 @@ stage 01 compiler, `in01`:
;sy ;sy
;'H;'e;'l;'l;'o;',;' ;'w;'o;'r;'l;'d;'!;\n the string we're printing ;'H;'e;'l;'l;'o;',;' ;'w;'o;'r;'l;'d;'!;\n the string we're printing
; ;
</code></pre> ```
Look at that! There are comments! Much nicer than just hexadecimal digit pairs. Look at that! There are even comments! Much nicer than just hexadecimal digit pairs.
## end result ## end result
@ -50,9 +50,9 @@ actually print out an error message and exit, rather than continuing as if
nothing happened! Try adding `xx;` to the end of the file `in01`, and running nothing happened! Try adding `xx;` to the end of the file `in01`, and running
`./out00`. You should get the error message: `./out00`. You should get the error message:
<pre><code> ```
xx not recognized. xx not recognized.
</code></pre> ```
Pretty cool, huh? Pretty cool, huh?
Anyways let's see how this compiler actually works. Anyways let's see how this compiler actually works.
@ -63,7 +63,7 @@ Writing in our stage 00 language is much nicer than editing an
executable, because it's easier to move things around, and also, we can separate executable, because it's easier to move things around, and also, we can separate
our program into lines! Let's take a look at the start: our program into lines! Let's take a look at the start:
<pre><code> ```
7f 45 4c 46 7f 45 4c 46
02 02
01 01
@ -90,7 +90,7 @@ a8 00 40 00 00 00 00 00
00 10 02 00 00 00 00 00 00 10 02 00 00 00 00 00
00 10 02 00 00 00 00 00 00 10 02 00 00 00 00 00
00 10 00 00 00 00 00 00 00 10 00 00 00 00 00 00
</code></pre> ```
This is the ELF header and program header. It's just like our last one, but with This is the ELF header and program header. It's just like our last one, but with
a couple of differences. First, our entry point is at offset 0xa8 instead of 0x78. a couple of differences. First, our entry point is at offset 0xa8 instead of 0x78.
@ -113,7 +113,7 @@ recognized."`
- `00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00` (unused) - `00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00` (unused)
Here's the data for our program. As you can see from my annotations, we have the Here's the data for our program. As you can see from my annotations, we have the
input and output file, as well as the error message. The command part of the input and output file names, as well as the error message. The command part of the
error message is left blank for now (we'll fill it in when the code is actually error message is left blank for now (we'll fill it in when the code is actually
run). run).
@ -182,8 +182,8 @@ program with exit code 0 (successful).
- `48 01 d8` `add rax, rbx` - `48 01 d8` `add rax, rbx`
This here looks at the two bytes we read in (we'll call them `b1` and `b2`) and This here looks at the two bytes we read in (we'll call them `b1` and `b2`) and
computes `b1 * 128 + b2` (more specifically `(b1 << 7) + b2`). This is the index computes `b1 * 128 + b2` (more specifically `(b1 << 7) + b2`). This is the corresponding index
in our command table corresponding to the two characters from the input file. in our command table.
- `48 c1 e0 03` `shl rax, 3` - `48 c1 e0 03` `shl rax, 3`
- `48 89 c3` `mov rbx, rax` - `48 89 c3` `mov rbx, rax`
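
The index computation described in this hunk, followed by `shl rax, 3`, amounts to the arithmetic below. This is a C sketch only: the function names are invented, and the 8-byte entry size is taken from the README's later remark that each entry is 8 bytes long.

```c
#include <assert.h>
#include <stddef.h>

#define ENTRY_SIZE 8 /* bytes per command-table entry; `shl rax, 3` multiplies by 8 */

/* Index of the two-character command (b1, b2) in the table. */
size_t command_index(unsigned char b1, unsigned char b2) {
    assert(b1 < 128 && b2 < 128); /* assumes 7-bit ASCII input */
    return ((size_t)b1 << 7) + b2;
}

/* Byte offset of that entry from the start of the table. */
size_t command_offset(unsigned char b1, unsigned char b2) {
    return command_index(b1, b2) * ENTRY_SIZE;
}
```

For example, `command_index('s', 'y')` is `0x39f9`, so the entry for `sy` would start `0x1cfc8` bytes into the table. Shifting by 7 works because ASCII characters fit in 7 bits, so no two pairs collide.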
@ -211,7 +211,7 @@ is `03 48 89 c3`. We set the length to 0 for unused entries.
So this code checks if the entry for this command starts with a zero byte. If it So this code checks if the entry for this command starts with a zero byte. If it
does, that means the two characters we read in don't actually correspond to a does, that means the two characters we read in don't actually correspond to a
real command. If that's the case, this next bit of code is executed (otherwise real command. If that's the case, this next bit of code is executed (otherwise
it's skiped over): it's skipped over):
- `48 b8 02 00 00 00 00 00 00 00` `mov rax, 2 (stderr)` - `48 b8 02 00 00 00 00 00 00 00` `mov rax, 2 (stderr)`
- `48 89 c7` `mov rdi, rax` - `48 89 c7` `mov rdi, rax`
@ -228,7 +228,7 @@ it's skiped over):
- `00 00 00 00 00 00 00 00 00 00 00 00 00 00` (unused) - `00 00 00 00 00 00 00 00 00 00 00 00 00 00` (unused)
This prints our error message, now filled in with the specific unrecognized This prints our error message, now filled in with the specific unrecognized
instruction, to standard error, and exits with code 1, to indicate failure. instruction, to standard error, then exits with code 1, to indicate failure.
- `48 89 eb` `mov rbx, rbp`
- `31 c0` `mov rax, 0` - `31 c0` `mov rax, 0`
@ -273,7 +273,7 @@ all the way back to read the next command. Otherwise, we keep looping. This
skips over any comments/whitespace we might have between a command and the skips over any comments/whitespace we might have between a command and the
following command. following command.
And that's all the *code* for this compiler. Next comes some data. And that's all the *code* for this compiler. Next comes the command table.
First, there's a whole bunch of unused 0s. Then there's the line First, there's a whole bunch of unused 0s. Then there's the line
@ -293,7 +293,7 @@ Which is the encoding of the `syscall` instruction.
You can look through the rest of the table, if you want. But let's look at the You can look through the rest of the table, if you want. But let's look at the
very end: very end:
<code><pre> ```
78 78
7f 45 4c 46 7f 45 4c 46
02 02
@ -321,7 +321,7 @@ very end:
00 00 08 00 00 00 00 00 00 00 08 00 00 00 00 00
00 00 08 00 00 00 00 00 00 00 08 00 00 00 00 00
00 10 00 00 00 00 00 00 00 10 00 00 00 00 00 00
</code></pre> ```
This is at the position for `||`, and it contains an ELF header. One thing you This is at the position for `||`, and it contains an ELF header. One thing you
might notice is that we decided that each entry is 8 bytes long, but this one is might notice is that we decided that each entry is 8 bytes long, but this one is
@ -340,5 +340,5 @@ fixed this, but frankly I've had enough of writing code in hexadecimal. So let's
move on to [stage 02](../02/README.md), move on to [stage 02](../02/README.md),
now that we have a nicer language on our hands. From now now that we have a nicer language on our hands. From now
on, since we have comments, I'm gonna do most of the explaining in the source file on, since we have comments, I'm gonna do most of the explaining in the source file
itself, rather than the README. But there'll still be a bit of stuff there each itself, rather than the README. But there'll still be some stuff there each
time. time.


@ -7,11 +7,12 @@ ff - Byte ff
'a - Character a (byte 0x61) 'a - Character a (byte 0x61)
'! - Character ! (byte 0x21) '! - Character ! (byte 0x21)
etc. etc.
\n - Newline (byte 0x0a)
zA - Zero rax zA - Zero rax
im - Set rax to an immediate value, e.g. im - Set rax to an immediate value, e.g.
im;05;00;00;00;00;00;00;00; im;05;00;00;00;00;00;00;00;
will set rax to 5. will set rax to 5.
ax bx cx dx sp bp si di ax bx cx dx sp bp si di
A B C D S R I J A B C D S R I J


@ -1,7 +1,9 @@
all: out01 out02 README.html all: out01 out02 README.html
out01: in01 out01: in01
../01/out00 ../01/out00
out02: out01 out02: out01 in02
./out01 ./out01
%.html: %.md ../markdown %.html: %.md ../markdown
../markdown $< ../markdown $<
clean:
rm -f out01 out02 README.html


@ -1,13 +1,15 @@
# stage 02 # stage 02
The compiler for this stage is in the file `in01`, an input for our previous compiler. The compiler for this stage is in the file `in01`, an input for our previous compiler.
The specifics of how this compiler works are in the comments in that file, but here I'll So if you run `../01/out00`, you'll get the file `out01`, which is
this stage's compiler.
The specifics of how this compiler works are in the comments in `in01`, but here I'll
give an overview. give an overview.
Let's take a look at `in02`, an example input file for this compiler: Let's take a look at `in02`, an example input file for this compiler:
``` ```
jm jm
:-co jump to code :-co jump to code
::hw ::hw start of hello world
'H 'H
'e 'e
'l 'l
@ -23,11 +25,12 @@ jm
'! '!
\n \n
::he end of hello world ::he end of hello world
::co start of code ::co start of code
// // calculate the length of the hello world string
// now we'll calculate the length of the hello world string
// by subtracting hw from he. // by subtracting hw from he.
//
im im
--he --he
BA BA
@ -36,7 +39,7 @@ im
nA nA
+B +B
DA put length in rdx DA put length in rdx
// okay now we can write it // okay now write it
im im
##1. ##1.
JA set rdi to 1 (stdout) JA set rdi to 1 (stdout)
@ -54,56 +57,123 @@ im
sy sy
``` ```
You can try adding more characters to the hello world message, and it'll just work; We can compile it by running `./out01`. This will produce
the length of the text is computed automatically! the executable `out02`, which you can run. It prints
`Hello, world!`.
This time, commands are separated by newlines instead of semicolons. In this language,
Each line begins with a 2-character command identifier. There are some special identifiers though: commands are separated by newlines instead of semicolons.
Each line begins with a 2-character command.
All of the commands from the previous compiler are here,
plus six new ones:
- `::` marks a *label* - `::` marks a *label*
- `--` outputs a label's (absolute) address - `--` outputs a label's (absolute) address
- `:-` outputs a label's relative address - `:-` outputs a label's relative address
- `##` outputs a number - `##` outputs a number
- `//` is for comments
All other commands work like they did in the previous compiler—if you scroll down in the - `\n\n` does nothing (used for spacing)
`in01` source file, you'll see the full command table.
## labels ## labels
Labels are the most important new feature of this language. Labels are the most important new feature of this language.
A line like
```
::xy
```
associates the name `xy` with the address of the next byte of the program.
In the example program, `hw` is associated with `0x40007d`,
which is the virtual memory address of the `Hello, world!` data.
We can then use
```
--xy
```
to output that address, and
```
:-xy
```
to output it relative to the current address.
So now instead of computing how far to jump, we can just jump to a label, e.g.
```
jm
:-xy (use the relative address, because jumps are relative in x86-64)
```
And instead of figuring out the address of a piece of data, we can just use its label:
```
im
--xy
// rax now points to the data at the label "::xy"
```
This also lets us compute the length of the hello world string automatically!
By taking the address of the end of the string (`he`) and subtracting the
start (`hw`), we get the length in bytes.
So you can try adding more characters to the hello world message, and it'll just work.
All labels must be two ASCII characters. The address of each label is stored
as a 32-bit number in the "label table". This is sort of like the command table—the
index of the label `xy` is `128 * x + y`. Specifically, the entry for `xy` is at
`0x420000 + 4 * (128 * x + y)`, since the label table starts at `0x420000`
and each entry is 4 bytes.
When we encounter `::xy`, we get the current position in the output file
(using `lseek`), add the address of the start of the file (`0x400000`),
and store that in the label table.
When we encounter `:-xy` or `--xy`, we look up `xy` in the label table,
and write the address (subtracting the current address for `:-`) to the output file.
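
The label-table arithmetic can be written out in C as a rough sketch. The function names are hypothetical; the constants `0x420000` and `0x400000` and the formula come from the text above. What exactly counts as the "current address" for `:-` is left to the caller here, since the text doesn't spell it out.

```c
#include <assert.h>
#include <stdint.h>

#define LABEL_TABLE 0x420000u /* start of the label table, from the text */
#define FILE_BASE   0x400000u /* address the start of the output file is loaded at */

/* Address of the 4-byte table entry for label "xy". */
uint32_t label_entry_address(unsigned char x, unsigned char y) {
    assert(x < 128 && y < 128); /* labels are two ASCII characters */
    return LABEL_TABLE + 4u * (128u * x + y);
}

/* Value stored for `::xy`: output-file position plus the load address. */
uint32_t label_value(uint32_t file_position) {
    return FILE_BASE + file_position;
}

/* Value written for `:-xy`: label address minus the current address. */
int32_t relative_address(uint32_t label_addr, uint32_t current_addr) {
    return (int32_t)(label_addr - current_addr);
}
```

For the example program, `label_value(0x7d)` gives `0x40007d`, the address quoted for `hw`, and the entry for `hw` itself would live at `label_entry_address('h', 'w')`, which is `0x42d1dc`.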
## two passes? ## two passes?
This compiler actually needs to read through the source code,
and output an executable, twice.
This is because a label may be defined *after* it is used, e.g.:
```
jm
:-aa jump forward
...
::aa this is where we're jumping to
...
```
In the first pass, the `:-aa` will
treat `aa` as having an address of 0. Then when
we get to `::aa`, the address in the label table will be corrected.
At the end of the first pass, we seek back to the start
of the input and output files,
and run the exact same code for the second pass.
But this time, the correct address of `aa` is used, namely the
one we calculated in the first pass.
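
Here is a toy model of the two passes in C. Everything in it (the `struct line` representation, `run_pass`, the three-line `example` program) is invented for illustration and is not how `in01` is written. It also assumes a jump is one opcode byte plus a 4-byte offset measured from the end of the instruction.

```c
#include <assert.h>
#include <stdint.h>

enum kind { JUMP_TO, LABEL, BYTES };

struct line {
    enum kind kind;
    char name[3];  /* label name, for JUMP_TO and LABEL */
    unsigned size; /* byte count, for BYTES */
};

/* jm :-aa, then 14 bytes of data, then ::aa */
const struct line example[] = {
    { JUMP_TO, "aa", 0 },
    { BYTES,   "",   14 },
    { LABEL,   "aa", 0 },
};

static uint32_t labels[128 * 128]; /* all 0 until a label is defined */

unsigned label_index(const char *name) {
    return 128u * (unsigned char)name[0] + (unsigned char)name[1];
}

/* One pass over the program. Returns the offset emitted for the first jump.
   The label table is kept between calls, like between the real passes. */
int32_t run_pass(const struct line *prog, int n) {
    uint32_t addr = 0x400000u;
    int32_t first_rel = 0;
    int seen_jump = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (prog[i].kind == LABEL) {
            labels[label_index(prog[i].name)] = addr;
        } else if (prog[i].kind == JUMP_TO) {
            uint32_t next = addr + 5; /* opcode byte + 4-byte offset */
            int32_t rel = (int32_t)(labels[label_index(prog[i].name)] - next);
            if (!seen_jump) { first_rel = rel; seen_jump = 1; }
            addr = next;
        } else {
            addr += prog[i].size;
        }
    }
    return first_rel;
}
```

The first `run_pass(example, 3)` returns a meaningless offset (`aa` is still 0 when the jump is reached), but it leaves the right address in the table. The second call returns 14, the distance over the data.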
## other features ## other features
Now instead of writing out each of the 8 bytes making up a number, Now instead of writing out each of the 8 bytes making up a number,
we can just write it in hexadecimal (e.g. `##3c.` for `3c 00 00 00 00 00 00 00`), we can just write it in hexadecimal, e.g. `##1c4.` for `c4 01 00 00 00 00 00 00`.
and the compiler will automatically
extend it to 8 bytes.
This is especially nice because we don't need to write numbers backwards This is especially nice because we don't need to write numbers backwards
for little-endianness anymore! for little-endianness anymore!
Numbers cannot appear at the end of a line (this was Numbers cannot appear at the end of a line (this made
to make the compiler simpler to write), so I'm adding a `.` at the end of the compiler simpler to write), so I'm adding a `.` at the end of
each one to avoid making that mistake. each one to avoid making that mistake.
Anything after a command is treated as a comment; Anything after a command is treated as a comment;
additionally `//` can be used for comments on their own lines. additionally `//` can be used for comments on their own lines.
I decided to implement them as simply as possible: I decided to implement this as simply as possible:
I just added the command `//` to the command table, which outputs the byte `0x90`—this I just added the command `//` to the command table, which outputs the byte `0x90`—this
means "do nothing" (`nop`) in x86-64. means ["do nothing"](https://en.wikipedia.org/wiki/No-op)
Note that this means that the following code will not work as expected: in x86-64.
Note that the following code will not work as expected:
``` ```
im im
// load the value 0x333 into rax // load the value 0x333 into rax
##333. ##333.
``` ```
since `0x90` gets inserted between the "load immediate" instruction code, and the immediate. since `0x90` gets inserted between the "load immediate" instruction code and the immediate.
`\n\n` works identically, and lets us space out code a bit. But be careful:
the number of blank lines must be a multiple of 3!
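
The `##` conversion can be sketched in C like this. The shift-by-4-and-add loop mirrors the comments in `in01`; the function names and the choice to accept only lowercase `a`-`f` are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a (lowercase) hex digit. */
int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Parse the digits after "##", stopping at the first non-digit
   (for example the trailing '.'). */
uint64_t parse_number(const char *digits) {
    uint64_t value = 0;
    int d;
    while ((d = hex_digit(*digits++)) >= 0)
        value = (value << 4) + (uint64_t)d;
    return value;
}

/* Byte i (0 = first written) of value in little-endian order. */
unsigned char le_byte(uint64_t value, int i) {
    return (unsigned char)(value >> (8 * i));
}
```

So `parse_number("1c4.")` is `0x1c4`, and bytes 0 through 7 of that value are `c4 01 00 00 00 00 00 00`, matching the example above.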
## limitations ## limitations
Many of the limitations of our previous compilers apply to this one. Also, Many of the limitations of our previous compilers apply to this one. Also,
if you use a label without defining it, it uses address 0, rather than outputting if you use a label without defining it, it uses address 0, rather than outputting
an error message. This could be fixed: if the value in the label table is 0, and if we are an error message. This could be fixed: if the value in the label table is 0 and we are
on the second pass, output an error message. This compiler was already tedious enough on the second pass, output an error message. This compiler was already tedious enough
to implement, though! to implement, though!
But thanks to labels, for future compilers at least we won't have to calculate But thanks to labels, for future compilers at least we won't have to calculate

24
02/in01

@ -3,7 +3,7 @@
;'i;'n;'0;'2;00 (0x40007d) input filename ;'i;'n;'0;'2;00 (0x40007d) input filename
;'o;'u;'t;'0;'2;00 (0x400082) output filename ;'o;'u;'t;'0;'2;00 (0x400082) output filename
;00;00;' ;'n;'o;'t;' ;'r;'e;'c;'o;'g;'n;'i;'z;'e;'d;\n;00;00;00;00;00;00 (0x400088) error message/where we read to ;00;00;' ;'n;'o;'t;' ;'r;'e;'c;'o;'g;'n;'i;'z;'e;'d;\n;00;00;00;00;00;00 (0x400088) error message/where we read to
;00 (0x4000a0) stores which pass we're on (1 for second pass) ;00 (0x4000a0) stores which pass we're on (0 for first pass, 1 for second pass)
;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 (0x4000a8) used for output ;00;00;00;00;00;00;00;00 (0x4000a8) used for output
unused padding unused padding
@ -180,11 +180,11 @@ okay it's 0-9
;+B ;+B
;BA ;BA
okay we now have a digit in RBX okay we now have a digit in rbx
;AR ;AR
;<I;04 ;<I;04
;+B ;+B
;RA store away in RBP ;RA store away in rbp
;jm;38;ff;ff;ff continue loop ;jm;38;ff;ff;ff continue loop
unused padding unused padding
@ -195,7 +195,7 @@ unused padding
;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00;00
okay we have a full number in RBP, time to write it to the file okay we have a full number in rbp, time to write it to the file.
start by putting it at address 0x4000a8 start by putting it at address 0x4000a8
;im;a8;00;40;00;00;00;00;00 ;im;a8;00;40;00;00;00;00;00
;BA ;BA
@ -210,7 +210,7 @@ now write
;IA ;IA
;im;08;00;00;00;00;00;00;00 write 8 bytes ;im;08;00;00;00;00;00;00;00 write 8 bytes
;DA ;DA
;im;01;00;00;00;00;00;00;00 write ;im;01;00;00;00;00;00;00;00 write
;sy ;sy
;jm;c3;03;00;00 skip to newline ;jm;c3;03;00;00 skip to newline
@ -327,11 +327,11 @@ subtract current address
;nA;+B ;nA;+B
;RA store relative address in rbp ;RA store relative address in rbp
now we want to write eax to the output file. now we want to write ebp to the output file.
start by putting it at address 0x4000a8 start by putting it at address 0x4000a8
;im;a8;00;40;00;00;00;00;00 ;im;a8;00;40;00;00;00;00;00
;BA ;BA
;AR put relative address in rax ;AR
;sd ;sd
now write now write
@ -341,7 +341,7 @@ now write
;IA ;IA
;im;04;00;00;00;00;00;00;00 4 bytes ;im;04;00;00;00;00;00;00;00 4 bytes
;DA ;DA
;im;01;00;00;00;00;00;00;00 write ;im;01;00;00;00;00;00;00;00 write
;sy ;sy
;jm;66;01;00;00 skip to newline ;jm;66;01;00;00 skip to newline
@ -368,7 +368,7 @@ it's not a label or a number. let's look it up in the instruction table.
;BA ;BA
;RA store away address of command text in rbp ;RA store away address of command text in rbp
;zA;lb ;zA;lb
;DA number of bytes to write (used for syscall if no error) ;DA number of bytes to write (used for syscall if command exists)
;BA ;BA
;zA ;zA
;cm;jn;54;00;00;00 check if # of bytes is 0, if not, skip outputting error ;cm;jn;54;00;00;00 check if # of bytes is 0, if not, skip outputting error
@ -392,7 +392,7 @@ this is a real command
;im;01;00;00;00;00;00;00;00 add 1 because we don't want to write the length ;im;01;00;00;00;00;00;00;00 add 1 because we don't want to write the length
;+B ;+B
;IA address of data to write ;IA address of data to write
;im;04;00;00;00;00;00;00;00 out file descriptor ;im;04;00;00;00;00;00;00;00 out file descriptor
;JA ;JA
;im;01;00;00;00;00;00;00;00 write ;im;01;00;00;00;00;00;00;00 write
;sy ;sy
@ -1777,7 +1777,7 @@ the formatting changed appropriately.
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 ;01;90;00;00;00;00;00;00 \n\n
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
@ -6550,7 +6550,7 @@ the formatting changed appropriately.
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;01;90;00;00;00;00;00;00 ;01;90;00;00;00;00;00;00 // comments
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00
;00;00;00;00;00;00;00;00 ;00;00;00;00;00;00;00;00

11
02/in02

@ -1,6 +1,6 @@
jm jm
:-co jump to code :-co jump to code
::hw ::hw start of hello world
'H 'H
'e 'e
'l 'l
@ -16,11 +16,12 @@ jm
'! '!
\n \n
::he end of hello world ::he end of hello world
::co start of code ::co start of code
// // calculate the length of the hello world string
// now we'll calculate the length of the hello world string
// by subtracting hw from he. // by subtracting hw from he.
//
im im
--he --he
BA BA
@ -29,7 +30,7 @@ im
nA nA
+B +B
DA put length in rdx DA put length in rdx
// okay now we can write it // okay now write it
im im
##1. ##1.
JA set rdi to 1 (stdout) JA set rdi to 1 (stdout)


@ -2,6 +2,12 @@ all: markdown README.html
$(MAKE) -C 00 $(MAKE) -C 00
$(MAKE) -C 01 $(MAKE) -C 01
$(MAKE) -C 02 $(MAKE) -C 02
clean:
$(MAKE) -C 00 clean
$(MAKE) -C 01 clean
$(MAKE) -C 02 clean
rm -f markdown
rm -f README.html
markdown: markdown.c markdown: markdown.c
$(CC) -O2 -o markdown -Wall -Wconversion -Wshadow -std=c89 markdown.c $(CC) -O2 -o markdown -Wall -Wconversion -Wshadow -std=c89 markdown.c
README.html: markdown README.md README.html: markdown README.md


@ -17,7 +17,14 @@ Note that the executables produced in this series will only run on
64-bit Linux, because each OS/architecture combination would need its own separate 64-bit Linux, because each OS/architecture combination would need its own separate
executable. executable.
The README for the first stage is [here](00/README.md). ## table of contents
- [stage 00](00/README.md) - a program converting a text file with
hexadecimal digit pairs to a binary file.
- [stage 01](01/README.md) - a language with comments, and 2-character
command codes.
- [stage 02](02/README.md) - a language with labels
- more coming soon (hopefully)
## prerequisite knowledge ## prerequisite knowledge
@ -44,8 +51,7 @@ decimal.
- ASCII, null-terminated strings - ASCII, null-terminated strings
- how pointers work - how pointers work
- how floating-point numbers work - how floating-point numbers work
- maybe some basic Intel-style x86-64 assembly (you can probably pick it up on - some basic Intel-style x86-64 assembly
the way though)
It will help you a lot to know how to program (with any programming language), It will help you a lot to know how to program (with any programming language),
but it's not strictly necessary. but it's not strictly necessary.
@ -53,12 +59,11 @@ but it's not strictly necessary.
## instruction set ## instruction set
x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages
long! So, it makes sense to select only a small subset of it to use for all the long! So it makes sense to select only a small subset of it to use.
stages of our compiler. The set I've chosen can be found in `instructions.txt`. The set I've chosen can be found in `instructions.txt`.
I think it achieves a pretty good balance between having few enough I think it achieves a pretty good balance between having few enough
instructions to be manageable and having enough instructions to be useable. instructions to be manageable and having enough instructions to be useable.
To be clear, you don't need to read that file to understand the series, at least To be clear, you don't need to read that file to understand the series.
not right away.
## principles ## principles
@ -91,15 +96,15 @@ project can't necessarily even do that though, because the Linux kernel, which
we depend on, is compiled from C, so we can't fully trust *it*. To *truly* we depend on, is compiled from C, so we can't fully trust *it*. To *truly*
create a fully trustable compiler, you'd need to manually write to a USB with a create a fully trustable compiler, you'd need to manually write to a USB with a
circuit, create an operating system from nothing (without even a text editor), circuit, create an operating system from nothing (without even a text editor),
and then follow this series, or maybe you don't even trust your CPU vendor... and then follow this series, or maybe you don't even trust your CPU...
I'll leave that to someone else I'll leave that to someone else.
## license ## license
``` ```
This project is in the public domain. Any copyright protections from any law This project is in the public domain. Any copyright protections from any law
for this project are forfeited by the author(s). No warranty is provided for are forfeited by the author(s). No warranty is provided, and the author(s)
this project, and the author(s) shall not be held liable in connection with it. shall not be held liable in connection with it.
``` ```
## contributing ## contributing


@ -101,3 +101,4 @@ syscall
>0f 05 >0f 05
nop nop
>90 >90
(more will be added as needed)


@ -58,7 +58,8 @@ static void output_md_text(FILE *out, int *flags, int line_number, const char *t
case '[': { case '[': {
/* link */ /* link */
char url2[256] = {0}; char url2[256] = {0};
const char *label, *url, *label_end, *url_end, *dot; const char *label, *url, *label_end, *url_end;
char *dot;
int n_label, n_url; int n_label, n_url;
label = p+1; label = p+1;
@ -88,7 +89,7 @@ static void output_md_text(FILE *out, int *flags, int line_number, const char *t
/* replace links to md files with links to html files */ /* replace links to md files with links to html files */
strcpy(dot, ".html"); strcpy(dot, ".html");
} }
fprintf(out, "<a href=\"%s\" target=\"_blank\">%.*s</a>", fprintf(out, "<a href=\"%s\">%.*s</a>",
url2, n_label, label); url2, n_label, label);
p = url_end; p = url_end;
} break; } break;
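
The `const char *dot` to `char *dot` change matters because `strcpy(dot, ".html")` writes through the pointer. A standalone C sketch of the same idea follows. The function `md_url_to_html` is hypothetical, and finding the extension with `strrchr` is an assumption, since this hunk doesn't show how `markdown.c` locates `dot`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy url into out (capacity out_size), replacing a trailing ".md"
   with ".html". Returns 0 on success, -1 if out might be too small. */
int md_url_to_html(const char *url, char *out, size_t out_size) {
    char *dot; /* non-const: we write through it */
    size_t len = strlen(url);
    /* ".html" is 2 characters longer than ".md", plus 1 for the NUL.
       This is conservative for URLs that don't end in ".md". */
    if (len + 3 > out_size)
        return -1;
    strcpy(out, url);
    dot = strrchr(out, '.');
    if (dot != NULL && strcmp(dot, ".md") == 0)
        strcpy(dot, ".html");
    return 0;
}
```

Given a large enough buffer, `../01/README.md` becomes `../01/README.html`, while a URL like `https://strace.io` is copied unchanged.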