start 00 readme

This commit is contained in:
pommicket 2021-08-30 19:09:20 -07:00
parent 78d2d93c7c
commit 9bcbd94e46
5 changed files with 173 additions and 1 deletions

1
00/.gitignore vendored
View file

@ -1,2 +1 @@
A
B

1
00/A Normal file
View file

@ -0,0 +1 @@
48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 0a

147
00/README.txt Normal file
View file

@ -0,0 +1,147 @@
--- stage 00 ---
This directory contains the file 'hexcompile', a handwritten executable.
It takes an input file A containing space/newline/[any character]-separated
hexadecimal numbers and outputs them as bytes to the file B. On 64-bit Linux,
try running ./hexcompile from this directory (I've already provided an A file),
and you will get a file named B containing the text "Hello, world!".
I made this program so that you can use your favorite text editor to write
executables by hand (which have bytes outside of ASCII/UTF-8).
I wrote it with a program called hexedit, which can be found on most Linux
distributions. Only 64-bit Linux is supported, because each OS/architecture
combination would need its own separate executable. The executable is 632 bytes
long, and you could definitely make it smaller if you wanted to. Let's take a
look at what's inside (see hexdump -C hexcompile):
7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
02 00 3e 00 01 00 00 00 78 00 40 00 00 00 00 00
40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 40 00 38 00 01 00 00 00 00 00 00 00
01 00 00 00 07 00 00 00 78 00 00 00 00 00 00 00
78 00 40 00 00 00 00 00 00 00 00 00 00 00 00 00
00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00
00 10 00 00 00 00 00 00 48 b8 74 02 40 00 00 00
00 00 48 89 c7 48 b8 00 00 00 00 00 00 00 00 48
89 c6 48 89 c2 48 b8 02 00 00 00 00 00 00 00 0f
05 48 89 c5 48 b8 76 02 40 00 00 00 00 00 48 89
c7 48 b8 41 00 00 00 00 00 00 00 48 89 c6 48 b8
a4 01 00 00 00 00 00 00 48 89 c2 48 b8 02 00 00
00 00 00 00 00 0f 05 48 89 c1 48 89 ef 48 b8 68
02 40 00 00 00 00 00 48 89 c6 48 b8 03 00 00 00
00 00 00 00 48 89 c2 48 b8 00 00 00 00 00 00 00
00 0f 05 48 89 c3 48 b8 03 00 00 00 00 00 00 00
48 39 d8 0f 8f 37 01 00 00 48 b8 68 02 40 00 00
00 00 00 48 89 c3 48 8b 03 48 89 c3 48 89 c7 48
b8 ff 00 00 00 00 00 00 00 48 21 d8 48 89 c6 48
b8 39 00 00 00 00 00 00 00 48 89 c3 48 89 f0 48
39 d8 0f 8f 1e 00 00 00 48 b8 30 00 00 00 00 00
00 00 48 f7 d8 48 89 f3 48 01 d8 e9 26 00 00 00
00 00 00 00 00 00 48 b8 a9 ff ff ff ff ff ff ff
48 89 f3 48 01 d8 e9 0b 00 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c2 48 b8 ff 00 00 00 00
00 00 00 48 89 c3 48 89 f8 48 c1 e8 08 48 21 d8
48 93 48 b8 39 00 00 00 00 00 00 00 48 93 48 39
d8 0f 8f 1f 00 00 00 48 89 c3 48 b8 d0 ff ff ff
ff ff ff ff 48 01 d8 e9 2a 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c3 48 b8 a9 ff ff ff ff
ff ff 48 01 d8 e9 0c 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c7 48 89 d0 48 c1 e0 04
48 89 fb 48 09 d8 48 93 48 b8 68 02 40 00 00 00
00 00 48 93 48 89 03 48 89 de 48 b8 04 00 00 00
00 00 00 00 48 89 c7 48 b8 01 00 00 00 00 00 00
00 48 89 c2 0f 05 e9 8f fe ff ff 00 00 00 00 00
48 b8 3c 00 00 00 00 00 00 00 0f 05 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 41 00 42 00
Okay, that doesn't tell us much, I'll annotate it below. You might notice that
all the numbers are backwards, e.g. 3e 00 for the number 0x003e (62 decimal).
This is because almost all modern architectures (including x86-64) are
little-endian, meaning that the *least significant byte* goes first, and the
most significant byte goes last. There are various reasons why this is easier to
deal with, which I won't explain here.
-- ELF header --
This header has a bunch of metadata about the executable.
7f 45 4c 46 - Special identifier saying that this is an ELF file (ELF is the
format of almost all Linux executables)
02 - 64-bit
01 - Little-endian
01 - ELF version 1 (there is no version 2 yet)
00 00 00 00 00 00 00 00 00 - Reserved (not important yet, but may be in a later
version of ELF)
02 00 - This is an executable file (not a dynamic library/etc)
3e 00 - Architecture x86-64
01 00 00 00 - Version 1 of ELF (minor version or something)
78 00 40 00 00 00 00 00 - **Entry point of the executable** = 0x400078 (explained later)
40 00 00 00 00 00 00 00 - Program header table offset in bytes from start of file (see below)
00 00 00 00 00 00 00 00 - Section header table offset (we're not using sections)
00 00 00 00 - Flags (not important)
40 00 - The size of this header, in bytes = 64
38 00 - Size of the program header (see below) = 56
01 00 - Number of program headers = 1
00 00 - Size of each section header (unused)
00 00 - Number of section headers (unused)
00 00 - Index of special .shstrtab section (unused)
-- Program header --
The program header describes a segment of data that is loaded into memory when
the program starts. Normally, you would have more than one of these, one for
code, one for read-only data, and one for read-write data, perhaps, but to
simplify things we've only got one, which we'll use for any code and any data
we need. This means it'll have to be read-enabled, write-enabled, *and*
execute-enabled. Normally people don't do this, for security, but we won't worry
about that (don't compile any untrusted code with any compiler from this series!)
Without further ado, here's the contents of the program header:
01 00 00 00 - Segment type 1 (this should be loaded into memory)
07 00 00 00 - Flags = RWE (readable, writeable, and executable)
78 00 00 00 00 00 00 00 - Offset in file = 120
78 00 40 00 00 00 00 00 - Virtual address = 0x400078
- Wait a minute, what's that? -
We just specified the *virtual address* of this segment. This is the virtual
memory address that the segment will be loaded to. Virtual memory means that
memory addresses in our program do not actually correspond to where the memory
is physically stored in RAM. There are many reasons for it, including allowing
different processes to have overlapping memory addresses, making sure that some
memory can't be read/written/executed, etc. You can read more about it
elsewhere.
00 00 00 00 00 00 00 00 - Physical address (not applicable)
00 02 00 00 00 00 00 00 - Size of this segment in the executable file = 512
bytes
00 02 00 00 00 00 00 00 - Size of this segment when loaded into memory = also
512 bytes
00 10 00 00 00 00 00 00 - Segment alignment = 4096 bytes
48 b8 74 02 40 00 00 00
00 00 48 89 c7 48 b8 00 00 00 00 00 00 00 00 48
89 c6 48 89 c2 48 b8 02 00 00 00 00 00 00 00 0f
05 48 89 c5 48 b8 76 02 40 00 00 00 00 00 48 89
c7 48 b8 41 00 00 00 00 00 00 00 48 89 c6 48 b8
a4 01 00 00 00 00 00 00 48 89 c2 48 b8 02 00 00
00 00 00 00 00 0f 05 48 89 c1 48 89 ef 48 b8 68
02 40 00 00 00 00 00 48 89 c6 48 b8 03 00 00 00
00 00 00 00 48 89 c2 48 b8 00 00 00 00 00 00 00
00 0f 05 48 89 c3 48 b8 03 00 00 00 00 00 00 00
48 39 d8 0f 8f 37 01 00 00 48 b8 68 02 40 00 00
00 00 00 48 89 c3 48 8b 03 48 89 c3 48 89 c7 48
b8 ff 00 00 00 00 00 00 00 48 21 d8 48 89 c6 48
b8 39 00 00 00 00 00 00 00 48 89 c3 48 89 f0 48
39 d8 0f 8f 1e 00 00 00 48 b8 30 00 00 00 00 00
00 00 48 f7 d8 48 89 f3 48 01 d8 e9 26 00 00 00
00 00 00 00 00 00 48 b8 a9 ff ff ff ff ff ff ff
48 89 f3 48 01 d8 e9 0b 00 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c2 48 b8 ff 00 00 00 00
00 00 00 48 89 c3 48 89 f8 48 c1 e8 08 48 21 d8
48 93 48 b8 39 00 00 00 00 00 00 00 48 93 48 39
d8 0f 8f 1f 00 00 00 48 89 c3 48 b8 d0 ff ff ff
ff ff ff ff 48 01 d8 e9 2a 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c3 48 b8 a9 ff ff ff ff
ff ff 48 01 d8 e9 0c 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 48 89 c7 48 89 d0 48 c1 e0 04
48 89 fb 48 09 d8 48 93 48 b8 68 02 40 00 00 00
00 00 48 93 48 89 03 48 89 de 48 b8 04 00 00 00
00 00 00 00 48 89 c7 48 b8 01 00 00 00 00 00 00
00 48 89 c2 0f 05 e9 8f fe ff ff 00 00 00 00 00
48 b8 3c 00 00 00 00 00 00 00 0f 05 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 41 00 42 00

Binary file not shown.

25
README.txt Normal file
View file

@ -0,0 +1,25 @@
--- boostrapping a (Linux x86-64) C compiler ---
Compilers nowadays are written in languages like C, which themselves need to be
compiled. But then, you need a C compiler to compile your C compiler! Of course,
the very first C compiler was not written in C (because how would it be
compiled?). Instead, it was slowly built up, starting from a very basic
assembler, eventually reacing a full-scale compiler. This process is known as
bootstrapping. In this repository, we'll explore how that's done. Each directory
represents a new "stage" in the process. The first one, "00", is a hand-written
executable, and the last one will be a C compiler. Each directory has its own
README.txt explaining in full what's going on.
-- instruction set --
x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages
long! So, it makes sense to select only a small subset of it to use for all the
stages of our compiler. The set I've chosen can be found in instructions.txt (a
work in progress). I think it achieves a pretty good balance between
having few enough instructions to be manageable and having enough
instructions to be useable.
-- license --
This software is in the public domain. Any copyright protections from any law
for this software are forfeited by the author(s). No warranty is provided for
this software, and the author(s) shall not be held liable in connection with it.