readme edits

2021-11-10 00:52:34 -05:00 · 2021-11-10 00:52:34 -05:00 · 3255cd32d7
commit 3255cd32d7
parent befd4a6435
7 changed files with 196 additions and 63 deletions
--- a/00/README.md
+++ b/00/README.md
@ -5,13 +5,14 @@ takes input file `in00` containing space/newline/(any character)-separated
 hexadecimal digit pairs (e.g. `3f`) and outputs them as bytes to the file
 `out00`. On 64-bit Linux, try running `./hexcompile` from this directory (I've
 already provided an `in00` file, which you can take a look at), and you will get
-a file named `out00` containing the text `Hello, world!`.  This stage is needed
-so that you can use your favorite text editor to write executables by hand
-(which have bytes outside of ASCII/UTF-8). I wrote it with a program called
-hexedit, which can be found on most Linux distributions.  Only 64-bit Linux is
-supported, because each OS/architecture combination would need its own separate
-executable. The executable is just 632 bytes long, and you could definitely make
-it even smaller if you wanted to. Let's take a look at what's inside (`od -t x1
+a file named `out00` containing the text `Hello, world!`.  This stage
+lets you use your favorite text editor to write executables
+(which have bytes outside of ASCII/UTF-8).
+I made `hexcompile` with a program called
+[hexedit](https://github.com/pixel/hexedit),
+which can be found in most Linux package managers.
+The executable is just 632 bytes long.
+Let's take a look at what's inside (`od -t x1
 -An -v hexcompile`):

 ```
@ -74,12 +75,12 @@ version of ELF)
 - `02 00` Object type = executable file (not a dynamic library/etc.)
 - `3e 00` Architecture x86-64
 - `01 00 00 00` Version 1 of ELF, again 
- `78 00 40 00 00 00 00 00` **Entry point of the executable** = 0x400078 (explained later)
- `40 00 00 00 00 00 00 00` Program header table offset in bytes from start of file (see below)
+- `78 00 40 00 00 00 00 00` **Entry point of the executable** = 0x400078
+- `40 00 00 00 00 00 00 00` Program header table offset in bytes from start of file
 - `00 00 00 00 00 00 00 00` Section header table offset (we're not using sections)
- `00 00 00 00` Flags (not important)
+- `00 00 00 00` Flags (not important to us)
 - `40 00` The size of this header, in bytes = 64
- `38 00` Size of the program header (see below) = 56
+- `38 00` Size of the program header = 56
 - `01 00` Number of program headers = 1
 - `00 00` Size of each section header (unused)
 - `00 00` Number of section headers (unused)
@ -88,8 +89,8 @@ version of ELF)
 You might notice that all the numbers are backwards, e.g. `38 00` for the number
 0x0038 (56 decimal). This is because almost all modern architectures (including
 x86-64) are little-endian, meaning that the *least significant byte* goes first,
-and the most significant byte goes last. There are various reasons why this is
-easier to deal with, but I won't explain that here.
+and the most significant byte goes last.
+There are reasons for this ([see here](https://en.wikipedia.org/wiki/Endianness#Optimization), for example, if you're interested).

 ## program header
 The program header describes a segment of data that is loaded into memory when
@ -108,11 +109,12 @@ Without further ado, here's the contents of the program header:

 **wait a minute, what's that?**

-We just specified the *virtual address* of this segment. This is the virtual
-memory address that the segment will be loaded to. Virtual memory means that
-addresses in our program do not actually correspond to where the memory is
-physically stored in RAM, with the CPU translating between virtual and physical
-memory addresses. There are many reasons for this: making sure each process has
+This is the virtual
+memory address that the segment will be loaded to.
+Nowadays, computers use virtual memory, meaning that
+addresses in our program don't actually correspond to where the memory is
+physically stored in RAM (the CPU translates between virtual and physical
+memory addresses). There are many reasons for this: making sure each process has
 its own memory space, memory protection, etc. You can read more about it
 elsewhere.

@ -129,11 +131,10 @@ that is a multiple of 4096. Our program needs to be loaded into a memory page,
 so its *virtual address* needs to be a multiple of 4096. We're using `0x400000`.
 But wait! Didn't we use `0x400078` for the virtual address? Well, yes but that's
 because the *data in the file* is loaded to address `0x400078`. The actual page
-of memory that the OS will allocate for our code will start at `0x400000`. The
-reason we need to start `0x78` bytes in is that Linux expects the data *in the
-file* to be at the same position in the page as when it will be loaded, and it
-appears at offset `0x78` in our file. But don't worry if you don't understand
-that.
+of memory that the OS will allocate for our segment will start at `0x400000`. The
+reason we need to start `0x78` bytes in is that Linux expects the data in the
+file to be at the same position in the page as when it will be loaded, and it
+appears at offset `0x78` in our file.

 ## the code

@ -163,16 +164,17 @@ about it with `man 2 open`.
 The first argument, `0x40026d`, is a pointer to some data at the very end of
 this segment (see further down). Specifically, it holds the bytes
 `69 6e 30 30 00`, the null-terminated ASCII string `"in00"`.
-This indicates the name of the file. The second argument (`O_RDONLY`, or 0)
-specifies that we will be reading from this file. There is a third argument to
+This indicates the name of the file. The second argument, `0`,
+specifies that we will (only) be reading from this file. There is a third argument to
 this syscall (we'll get to it later), but it's not applicable here so we don't
 set it.

-This call gives us back a *file descriptor*, which can be used to read from the
+This call gives us back a *file descriptor*, a number which we can use to read from the
 file, in register `rax`. But we don't actually need to look at what file
 descriptor Linux gave us. This is because Linux assigns file descriptor numbers
-sequentially, starting from `0` for standard input, `1` for standard output, `2`
-for standard error, and then `3, 4, 5, ...` for any files our program opens. So
+sequentially, starting from
+[0 for stdin, 1 for stdout, 2 for stderr](https://en.wikipedia.org/wiki/Standard_streams),
+and then 3, 4, 5, ... for any files our program opens. So
 this file, the first one our program opens, will have descriptor `3`.

 Now we open our output file:
@ -194,8 +196,8 @@ file (`O_WRONLY = 0x01`), that we want to create it if it doesn't exist
 (`O_TRUNC = 0x200`). Secondly, we are setting the third argument this time.  It
 specifies the permissions our file is created with (`0o755` means user
 read/write/execute, group/other read/execute). This is not very important to
-the actual execution of the program, so don't worry if you don't know what it
-means.
+the actual execution of the program, so don't worry if you don't know 
+about UNIX permissions.

 Now we can start reading from the file. We're going to loop back to this part of
 the code every time we want to read a new hexadecimal number from the input
@ -210,7 +212,7 @@ file.
 - `0f 05` `syscall`

 In C, this is `read(3, 0x40026a, 3)`. Here we call syscall #0, `read`, with
-arguments:
+three arguments:

 - `fd = 3` This is the descriptor number of our input file.
 - `buf = 0x40026a` This is the memory address we want Linux to output the data
@ -221,7 +223,6 @@ We're telling Linux to output to `0x40026a`, which is just a part of this
 segment (see further down). Normally you would read to a different segment of
 the program from where the code is, but we want this to be as simple as
 possible.
-
 The number of bytes *actually read*, taking into account that we might have
 reached the end of the file, is stored in `rax`.

@ -231,8 +232,8 @@ reached the end of the file, is stored in `rax`.
 - `0f 8f 50 01 00 00` `jg 0x400250`

 This tells the CPU to jump to a later part of the code (address `0x400250`) if 3
-is greater than the number of bytes read in (in other words, if we reached the
-end of the file). Note that we don't specifiy the *address* to jump to, but
+is greater than the number of bytes we got, in other words, if we reached the
+end of the file. Note that we don't specifiy the *address* to jump to, but
 instead the *relative address*, relative to the first byte after the jump
 instruction (so here we're saying to jump `0x150` bytes forward). There are
 reasons for this which I won't get into here.
@ -299,7 +300,7 @@ the one above:
 - `48 89 fb` `mov rbx, rdi`
 - `48 09 d8` `or rax, rbx`

-Okay, now we have the byte specified by the two hex digits we read in `rax`.
+Okay, now `rax` contains the byte specified by the two hex digits we read.

 - `48 89 c3` `mov rbx, rax`
 - `48 b8 6c 02 40 00 00 00 00 00` `mov rax, 0x40026c`
@ -343,8 +344,8 @@ These bytes aren't actually used by our program, and could be set to anything.
 These are here because I wasn't sure how long the program would be when I
 started, so I just set the segment size to 512 bytes, which turned out to be
 more than enough. I could have cut these out and edited all the addresses to get
-a smaller, cleaner executable, but I'm leaving them in because that's what you
-probably would do if you were doing this for non-instructional purposes.
+a smaller executable, but really there's no point—modern
+computers can definitely handle 600-byte files.

 - `31 c0` `mov rax, 0`
 - `48 89 c7` `mov rdi, rax`
@ -355,7 +356,7 @@ This is where we conditionally jumped to way back when we determined if we
 reached the end of the file. This calls syscall #60, `exit`, with one argument,
 0 (exit code 0, indicating we exited successfully).

-You'd normally close the files first (with syscall #3), to tell Linux you're
+Normally, you should close files descriptors (with syscall #3), to tell Linux you're
 done with them, but we don't need to. It'll automatically close all our open
 file descriptors when our program exits.

@ -373,8 +374,8 @@ editor and get them translated into a binary file.

 There are many ways in which this is a bad program. It will *only* properly
 handle lowercase hexadecimal digit pairs, separated by exactly one character,
-with a terminating character. What's worse, a bad input file (maybe you
-accidentally write `3F` instead of `3f`) won't print out a nice error message,
+with a terminating character. What's worse, a bad input file (maybe someone
+accidentally writes `3F` instead of `3f`) won't print out a nice error message,
 but instead continue processing as usual, without any indication that anything's
 gone wrong, giving you an unexpected result.
 Also, we only read in data *three bytes at a time*, and output one byte at a
@ -385,4 +386,5 @@ a while.

 But these problems aren't really a big deal. We'll only be running this on
 little programs and we'll be sure to check that our input is in the right
-format. And with that, we are ready to move on to the next stage...
+format. And with that, we are ready to move on to the
+[next stage...](../01/README.md).