more 05 readme
This commit is contained in:
parent
06def8fb86
commit
59b7931165
1 changed files with 98 additions and 2 deletions
100
05/README.md
100
05/README.md
|
@ -16,11 +16,107 @@ it, you should get the output
|
||||||
Hello, world!
|
Hello, world!
|
||||||
```
|
```
|
||||||
|
|
||||||
So now we just compile TCC with itself, and we're done, right?
|
## the C compiler
|
||||||
Well, not quite...
|
|
||||||
|
The C compiler for this stage is written in the [04 language](../04/README.md), using the [04a preprocessor](../04a/README.md)
|
||||||
|
and is spread out across multiple files:
|
||||||
|
|
||||||
|
```
|
||||||
|
util.b - various utilities (syscall, puts, memset, etc.)
|
||||||
|
constants.b - numerical and string constants used by the rest of the program
|
||||||
|
idents.b - functions for creating mappings from identifiers to arbitrary 64-bit values
|
||||||
|
preprocess.b - preprocesses C files
|
||||||
|
tokenize.b - turns preprocessing tokens into tokens (see explanation below)
|
||||||
|
parse.b - turns tokens into a nice representation of the program
|
||||||
|
codegen.b - turns parse.b's representation into actual code
|
||||||
|
main.b - puts everything together
|
||||||
|
```
|
||||||
|
|
||||||
|
The whole thing is ~12,000 lines of code, which is ~280KB when compiled.
|
||||||
|
|
||||||
|
### the C standard
|
||||||
|
|
||||||
|
In 1989, the C programming language was standardized by the [ANSI](https://en.wikipedia.org/wiki/American_National_Standards_Institute).
|
||||||
|
|
||||||
|
The C89 standard (in theory) defines which C programs are legal, and exactly what any particular legal C program does.
|
||||||
|
A draft of it, which is about as good as the real thing, is [available here](http://port70.net/~nsz/c/c89/c89-draft.html).
|
||||||
|
|
||||||
|
Since 1989, more features have been added to C, and so more C standards have been published.
|
||||||
|
To keep things simple, our compiler only supports the features from C89 (with a few exceptions).
|
||||||
|
|
||||||
|
|
||||||
|
### compiling a C program
|
||||||
|
|
||||||
|
Compiling a C program involves several "translation phases" (C89 standard § 2.1.1.2).
|
||||||
|
Here, I'll only be outlining the process our C compiler uses. The technical details
|
||||||
|
of the standard are slightly different.
|
||||||
|
|
||||||
|
First, each time a backslash is immediately followed by a newline, both are deleted, e.g.
|
||||||
|
```
|
||||||
|
Hel\
|
||||||
|
lo,
|
||||||
|
wo\
|
||||||
|
rld!
|
||||||
|
```
|
||||||
|
becomes
|
||||||
|
```
|
||||||
|
Hello,
|
||||||
|
world!
|
||||||
|
```
|
||||||
|
Well, we actually turn this into
|
||||||
|
```
|
||||||
|
Hello,
|
||||||
|
|
||||||
|
world!
|
||||||
|
|
||||||
|
```
|
||||||
|
so that line numbers are preserved for errors (this doesn't change the meaning of any program).
|
||||||
|
This feature exists so that you can spread one line of code across multiple lines, which is useful sometimes.
|
||||||
|
|
||||||
|
Then, comments are deleted (technically, replaced with spaces), and the file is split up into
|
||||||
|
*preprocesing tokens*. A preprocessing token is one of:
|
||||||
|
|
||||||
|
- A number (e.g. `5`, `10.2`, `3.6.6`)
|
||||||
|
- A string literal (e.g. `"Hello"`)
|
||||||
|
- A symbol (e.g. `<`, `{`, `.`)
|
||||||
|
- An identifier (e.g. `int`, `x`, `main`)
|
||||||
|
- A character constant (e.g. `'a'`, `'\n'`)
|
||||||
|
- A space character
|
||||||
|
- A newline character
|
||||||
|
|
||||||
|
Note that preprocessing tokens are just strings of characters, and aren't assigned any meaning yet; `3.6.6e-.3` is a valid
|
||||||
|
"preprocessing number" even though it's gibberish.
|
||||||
|
|
||||||
|
Next, preprocessor directives are executed. These include things like
|
||||||
|
```
|
||||||
|
#define A_NUMBER 4
|
||||||
|
```
|
||||||
|
which will replace every preprocessing token consisting of the identifier `A_NUMBER` in the rest of the program with `4`. Also in this phase,
|
||||||
|
```
|
||||||
|
#include "X"
|
||||||
|
```
|
||||||
|
is replaced with the (preprocessing tokens in the) file named `X`.
|
||||||
|
|
||||||
|
Then preprocessing tokens are turned into *tokens*.
|
||||||
|
Tokens are one of:
|
||||||
|
|
||||||
|
- A keyword (e.g. `int`, `while`)
|
||||||
|
- A symbol (e.g. `<`, `-`, `{`)
|
||||||
|
- An identifier (e.g. `main`, `f`, `x_3`)
|
||||||
|
- An integer literal (e.g. `77`, `0x123`)
|
||||||
|
- A character literal (e.g. `'a'`, `'\n'`)
|
||||||
|
- A floating-point literal (e.g. `3.6`, `5e10`)
|
||||||
|
|
||||||
|
## limitations
|
||||||
|
|
||||||
|
## modifications of tcc's source code
|
||||||
|
|
||||||
|
|
||||||
## the nightmare begins
|
## the nightmare begins
|
||||||
|
|
||||||
|
So now we just compile TCC with itself, and we're done, right?
|
||||||
|
Well, not quite...
|
||||||
|
|
||||||
The issue here is that to compile TCC/GCC with TCC, we need libc, the C standard library functions.
|
The issue here is that to compile TCC/GCC with TCC, we need libc, the C standard library functions.
|
||||||
Our C compiler just includes these functions in the standard header files, but normally
|
Our C compiler just includes these functions in the standard header files, but normally
|
||||||
the code for them is located in a separate library file (called something like
|
the code for them is located in a separate library file (called something like
|
||||||
|
|
Loading…
Add table
Add a link
Reference in a new issue