Add stage 06: Lua bootstrap

The goal of stage 06 is to try parsing Zig syntax in Lua. I pulled in
lpeglabel 1.2.0 and parser-gen from GitHub to get started. All of this
needs to be cleaned up rather soon.

Lua bootstraps using tcc and musl from the previous stage. Since musl
0.6.0 doesn't support dynamic linking, this build of Lua doesn't support
shared libraries. I couldn't easily patch musl with dlopen and friends,
so instead I link statically and call the dependencies through the C API.
Dawid Sobczak 2023-07-06 11:48:59 +01:00
parent 2ae045cf8a
commit e6b88d5a0f
170 changed files with 72518 additions and 2 deletions

06/parser-gen/README.md Normal file

@ -0,0 +1,397 @@
# parser-gen
A Lua parser generator that makes it possible to describe grammars in a [PEG](https://en.wikipedia.org/wiki/Parsing_expression_grammar) syntax. The tool will parse a given input using a provided grammar and if the matching is successful produce an AST as an output with the captured values using [Lpeg](http://www.inf.puc-rio.br/~roberto/lpeg/). If the matching fails, labelled errors can be used in the grammar to indicate failure position, and recovery grammars are generated to continue parsing the input using [LpegLabel](https://github.com/sqmedeiros/lpeglabel). The tool can also automatically generate error labels and recovery grammars for LL(1) grammars.
parser-gen is a [GSoC 2017](https://developers.google.com/open-source/gsoc/) project, and was completed with the help of my mentor [@sqmedeiros](https://github.com/sqmedeiros) from [LabLua](http://www.lua.inf.puc-rio.br/). A blog documenting the progress of the project can be found [here](https://parsergen.blogspot.com/2017/08/parser-generator-based-on-lpeglabel.html).
---
# Table of contents
* [Requirements](#requirements)
* [Syntax](#syntax)
* [Grammar Syntax](#grammar-syntax)
* [Example: Tiny Parser](#example-tiny-parser)
# Requirements
```
lua >= 5.1
lpeglabel >= 1.2.0
```
# Syntax
### compile
This function generates a PEG parser from the grammar description.
```lua
local pg = require "parser-gen"
grammar = pg.compile(input,definitions [, errorgen, noast])
```
*Arguments*:
`input` - A string containing a PEG grammar description. For complete PEG syntax see the grammar section of this document.
`definitions` - a table of custom functions and definitions used inside the grammar, for example `{equals=equals}`, where `equals` is a function.
`errorgen` - **EXPERIMENTAL** optional boolean parameter (default: false); when enabled, error labels are generated automatically. Works well only on LL(1) grammars. Custom error labels take precedence over automatically generated ones.
`noast` - optional boolean parameter (default: false); when enabled, no AST is generated for the parse.
*Output*:
`grammar` - a compiled grammar on success; an error is thrown on failure.
### setlabels
If custom error labels are used, the function *setlabels* allows setting their description (and custom recovery pattern):
```lua
pg.setlabels(t)
```
Example table of a simple error and one with a custom recovery expression:
```lua
-- grammar rule: " ifexp <- 'if' exp 'then'^missingThen stmt 'end'^missingEnd "
local t = {
missingEnd = "Missing 'end' in if expression",
missingThen = {"Missing 'then' in if expression", " (!stmt .)* "} -- a custom recovery pattern
}
pg.setlabels(t)
```
If the recovery pattern is not set, then the one specified by the rule SYNC will be used. It is by default set to:
```lua
SKIP <- %s / %nl -- a space ' ' or newline '\n' character
SYNC <- .? (!SKIP .)*
```
Learn more about special rules in the grammar section.
### parse
This operation attempts to match a grammar to the given input.
```lua
result, errors = pg.parse(input, grammar [, errorfunction])
```
*Arguments*:
`input` - an input string that the tool will attempt to parse.
`grammar` - a compiled grammar.
`errorfunction` - an optional function that will be called if an error is encountered, with the arguments `desc` for the error description set using `setlabels()`; location indicators `line` and `col`; the remaining string before failure `sfail` and a custom recovery expression `trec` if available.
Example:
```lua
local errs = 0
local function printerror(desc,line,col,sfail,trec)
errs = errs+1
print("Error #"..errs..": "..desc.." before '"..sfail.."' on line "..line.."(col "..col..")")
end
result, errors = pg.parse(input,grammar,printerror)
```
*Output*:
If the parse is successful, the function returns `result`, an abstract syntax tree containing the captures, and `errors`, a table of any errors encountered. If the parse was unsuccessful, `result` is **nil**.
Also, if the `noast` option is enabled when compiling the grammar, the function instead returns the longest match length or any custom captures used.
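As a minimal sketch of an `errorfunction`, the callback below collects errors into a table instead of printing them. It relies only on the argument order described above (`desc`, `line`, `col`, `sfail`, `trec`). The names `make_collector`, `on_error` and `collected` are hypothetical and not part of parser-gen.

```lua
-- Hypothetical helper: builds an error callback plus the table it fills.
local function make_collector()
  local collected = {}
  local function on_error(desc, line, col, sfail, trec)
    collected[#collected + 1] = {
      desc = desc,
      line = line,
      col = col,
      sfail = sfail,
      has_recovery = trec ~= nil, -- true when a custom recovery expression was supplied
    }
  end
  return on_error, collected
end

local on_error, collected = make_collector()
-- With parser-gen loaded, the callback would be passed as the third argument:
--   local result, errors = pg.parse(input, grammar, on_error)
-- Here it is called directly so the sketch runs without the library.
on_error("Missing 'then' in if expression", 1, 6, "b:=1", nil)
```

Collecting errors this way keeps reporting separate from parsing, which may be convenient when the caller wants to sort or filter messages before showing them.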
### calcline
Calculates line and column information regarding position i of the subject (exported from the relabel module).
```lua
line, col = pg.calcline(subject, position)
```
*Arguments*:
`subject` - subject string
`position` - position inside the string, for example, the one given by automatic AST generation.
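To illustrate what a line/column calculation of this kind involves, here is a small pure-Lua sketch. It is an illustrative reimplementation under the assumption that lines and columns are 1-based and separated by `\n`; it is not the code used by relabel, and the name `line_col` is hypothetical. Edge cases (for example a position that points at a newline) may be handled differently by the library.

```lua
-- Illustrative reimplementation; not the code used by relabel.
local function line_col(subject, position)
  local line, last_newline = 1, 0
  -- Count the newlines that occur strictly before `position`.
  for i = 1, position - 1 do
    if subject:byte(i) == 10 then -- 10 is the byte value of "\n"
      line = line + 1
      last_newline = i
    end
  end
  -- The column is the distance from the last newline seen.
  return line, position - last_newline
end

local line, col = line_col("read x;\nwrite x", 9) -- position 9 is the "w" of "write"
```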
### usenodes
When AST generation is enabled, this function will enable the "node" mode, where only rules tagged with a `node` prefix will generate AST entries. Must be used before compiling the grammar.
```lua
pg.usenodes(value)
```
*Arguments*:
`value` - a boolean value that enables or disables node mode
# Grammar Syntax
The grammar used for this tool is described using a PEG-like syntax that is identical to the one provided by the [re](http://www.inf.puc-rio.br/~roberto/lpeg/re.html) module, with the labelled-failure extension provided by the [relabel](https://github.com/sqmedeiros/lpeglabel) module. The one exception is numbered error labels, which parser-gen does not support; any grammar that works with relabel and avoids them should work with parser-gen.
Since a parser generated with parser-gen automatically consumes space characters, builds ASTs and generates errors, additional extensions have been added based on the [ANTLR](http://www.antlr.org/) syntax.
### Basic syntax
The syntax of parser-gen grammars is somewhat similar to regex syntax. The next table summarizes the tool's syntax. A `p` represents an arbitrary pattern; `num` represents a number (`[0-9]+`); `name` represents an identifier (`[a-zA-Z][a-zA-Z0-9_]*`). `defs` is the definitions table provided when compiling the grammar. Note that error names must be set using `setlabels()` before compiling the grammar. Constructions are listed in order of decreasing precedence.
<table border="1">
<tbody><tr><td><b>Syntax</b></td><td><b>Description</b></td></tr>
<tr><td><code>( p )</code></td> <td>grouping</td></tr>
<tr><td><code>'string'</code></td> <td>literal string</td></tr>
<tr><td><code>"string"</code></td> <td>literal string</td></tr>
<tr><td><code>[class]</code></td> <td>character class</td></tr>
<tr><td><code>.</code></td> <td>any character</td></tr>
<tr><td><code>%name</code></td>
<td>pattern <code>defs[name]</code> or a pre-defined pattern</td></tr>
<tr><td><code>name</code></td><td>non terminal</td></tr>
<tr><td><code>&lt;name&gt;</code></td><td>non terminal</td></tr>
<tr><td><code>%{name}</code></td> <td>error label</td></tr>
<tr><td><code>{}</code></td> <td>position capture</td></tr>
<tr><td><code>{ p }</code></td> <td>simple capture</td></tr>
<tr><td><code>{: p :}</code></td> <td>anonymous group capture</td></tr>
<tr><td><code>{:name: p :}</code></td> <td>named group capture</td></tr>
<tr><td><code>{~ p ~}</code></td> <td>substitution capture</td></tr>
<tr><td><code>{| p |}</code></td> <td>table capture</td></tr>
<tr><td><code>=name</code></td> <td>back reference
</td></tr>
<tr><td><code>p ?</code></td> <td>optional match</td></tr>
<tr><td><code>p *</code></td> <td>zero or more repetitions</td></tr>
<tr><td><code>p +</code></td> <td>one or more repetitions</td></tr>
<tr><td><code>p^num</code></td> <td>exactly <code>n</code> repetitions</td></tr>
<tr><td><code>p^+num</code></td>
<td>at least <code>n</code> repetitions</td></tr>
<tr><td><code>p^-num</code></td>
<td>at most <code>n</code> repetitions</td></tr>
<tr><td><code>p^name</code></td> <td>match p or throw error label name.</td></tr>
<tr><td><code>p -&gt; 'string'</code></td> <td>string capture</td></tr>
<tr><td><code>p -&gt; "string"</code></td> <td>string capture</td></tr>
<tr><td><code>p -&gt; num</code></td> <td>numbered capture</td></tr>
<tr><td><code>p -&gt; name</code></td> <td>function/query/string capture
equivalent to <code>p / defs[name]</code></td></tr>
<tr><td><code>p =&gt; name</code></td> <td>match-time capture
equivalent to <code>lpeg.Cmt(p, defs[name])</code></td></tr>
<tr><td><code>&amp; p</code></td> <td>and predicate</td></tr>
<tr><td><code>! p</code></td> <td>not predicate</td></tr>
<tr><td><code>p1 p2</code></td> <td>concatenation</td></tr>
<tr><td><code>p1 //{name [, name, ...]} p2</code></td> <td>specifies recovery pattern p2 for p1
when one of the labels is thrown</td></tr>
<tr><td><code>p1 / p2</code></td> <td>ordered choice</td></tr>
<tr><td>(<code>name &lt;- p</code>)<sup>+</sup></td> <td>grammar</td></tr>
</tbody></table>
The grammar below matches balanced parentheses:
```lua
balanced <- "(" ([^()] / balanced)* ")"
```
For more examples check out the [re](http://www.inf.puc-rio.br/~roberto/lpeg/re.html) page, see the Tiny parser below or the [Lua parser](https://github.com/vsbenas/parser-gen/blob/master/parsers/lua-parser.lua) written with this tool.
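To make the behaviour of the `balanced` rule concrete, the sketch below hand-writes the same recognizer as a recursive function in plain Lua. This is a swapped-in technique (recursive descent rather than a compiled PEG), it ignores the automatic space skipping that parser-gen adds, and the function name and return convention (position after the match, or `nil`) are assumptions made for illustration.

```lua
-- Hand-written equivalent of:  balanced <- "(" ([^()] / balanced)* ")"
-- Returns the position just after the match, or nil if there is no match at `i`.
local function balanced(s, i)
  if s:sub(i, i) ~= "(" then return nil end
  i = i + 1
  while true do
    local c = s:sub(i, i)
    if c == "" then
      return nil                 -- input ended before the closing ")"
    elseif c == ")" then
      return i + 1               -- closing parenthesis found
    elseif c == "(" then
      local j = balanced(s, i)   -- nested group: recurse
      if not j then return nil end
      i = j
    else
      i = i + 1                  -- any character other than "(" or ")"
    end
  end
end

local after = balanced("(a(b)c)", 1)
```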
### Error labels
Error labels are provided by the relabel syntax `%{errorname}` (`errorname` must follow the `[A-Za-z][A-Za-z0-9_]*` format). Error labels are usually written as `'a' ('b' / %{errB}) 'c'`, which throws the error label if `'b'` is not matched. Because this form is verbose, a shorthand is also allowed, `'a' 'b'^errB 'c'`, which gives a cleaner description of grammars. Note: all errors must be defined in a table using `parser-gen.setlabels()` before compiling and parsing the grammar.
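Since label names must follow the format given above, a label table can be checked before it is handed to `setlabels()`. The sketch below is a hypothetical helper, not part of parser-gen. It assumes each value is either a description string or a table of the form `{description, recovery}` as in the `setlabels` example, with the recovery pattern optional.

```lua
-- Hypothetical pre-check for a label table; returns a sorted list of problems.
local function check_labels(labels)
  local problems = {}
  for name, value in pairs(labels) do
    if type(name) ~= "string" or not name:match("^[A-Za-z][A-Za-z0-9_]*$") then
      problems[#problems + 1] = "bad label name: " .. tostring(name)
    elseif type(value) == "table" then
      -- Assumed shape: { description [, recovery pattern] }
      local desc_ok = type(value[1]) == "string"
      local rec_ok = value[2] == nil or type(value[2]) == "string"
      if not (desc_ok and rec_ok) then
        problems[#problems + 1] = "bad entry for label: " .. name
      end
    elseif type(value) ~= "string" then
      problems[#problems + 1] = "bad entry for label: " .. name
    end
  end
  table.sort(problems)
  return problems
end

local problems = check_labels({
  missingEnd = "Missing 'end' in if expression",
  missingThen = {"Missing 'then' in if expression", " (!stmt .)* "},
})
```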
### Tokens
Non-terminals with names in all capital letters, i.e. `[A-Z]+`, are considered tokens and are treated as a single object in parsing. That is, the whole string matched by a token is captured in a single AST entry and space characters are not consumed. Consider two examples:
```lua
-- a token non-terminal
grammar = pg.compile [[
WORD <- [A-Z]+
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", "AA"}
```
```lua
-- a non-token non-terminal
grammar = pg.compile [[
word <- [A-Z]+
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="word", "A", "A", "A"}
```
### Fragments
If a token definition is preceded by the `fragment` keyword, then the parser does not build an AST entry for that token. Essentially, these rules are used to simplify grammars without building unnecessarily complicated ASTs. Example of `fragment` usage:
```lua
grammar = pg.compile [[
WORD <- LETTER+
fragment LETTER <- [A-Z]
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", "AA"}
```
Without using `fragment`:
```lua
grammar = pg.compile [[
WORD <- LETTER+
LETTER <- [A-Z]
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", {rule="LETTER", "A"}, {rule="LETTER", "A"}}
```
### Nodes
When node mode is enabled using `pg.usenodes(true)`, only rules prefixed with the `node` keyword will generate AST entries:
```lua
grammar = pg.compile [[
node WORD <- LETTER+
LETTER <- [A-Z]
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", "AA"}
```
### Special rules
There are two special rules used by the grammar:
#### SKIP
The `SKIP` rule identifies which characters to skip in a grammar. For example, most programming languages do not take into account any space or newline characters. By default, SKIP is set to:
```lua
SKIP <- %s / %nl
```
This rule can be extended to contain semicolons `';'`, comments, or any other patterns that the parser can safely ignore.
Character skipping can be disabled by using:
```lua
SKIP <- ''
```
#### SYNC
This rule specifies the general recovery expression both for custom errors and automatically generated ones. By default:
```lua
SYNC <- .? (!SKIP .)*
```
The default SYNC rule consumes any characters until the next character matched by SKIP, usually a space or a newline. That means, if some statement in a program is invalid, the parser will continue parsing after a space or a newline character.
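The effect of the default rule `.? (!SKIP .)*` can be sketched in plain Lua as a function that advances a position through the input. This is only an illustration of the recovery behaviour described above, assuming SKIP is the default whitespace class (approximated here with Lua's `%s`). The name `default_sync` is hypothetical and the library does not work this way internally.

```lua
-- Illustration of:  SYNC <- .? (!SKIP .)*   with the default SKIP (whitespace).
-- Returns the position where parsing would resume after recovery.
local function default_sync(input, pos)
  -- `.?` : consume one character if any input remains.
  if pos <= #input then
    pos = pos + 1
  end
  -- `(!SKIP .)*` : consume characters until the next whitespace or end of input.
  while pos <= #input and not input:sub(pos, pos):match("%s") do
    pos = pos + 1
  end
  return pos
end

local resume = default_sync("foo bar", 1) -- stops at the space after "foo"
```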
For some programming languages it might be useful to skip to a semicolon or a keyword, since they usually indicate the end of a statement, so SYNC could be something like:
```lua
HELPER <- ';' / 'end' / SKIP -- etc
SYNC <- (!HELPER .)* SKIP* -- we can consume the spaces after syncing with them as well
```
Recovery grammars can be disabled by using:
```lua
SYNC <- ''
```
# Example: Tiny parser
Below is the full code from *parsers/tiny-parser.lua*:
```lua
local pg = require "parser-gen"
local peg = require "peg-parser"
local errs = {errMissingThen = "Missing Then"} -- one custom error
pg.setlabels(errs)
--warning: experimental error generation function is enabled. If the grammar isn't LL(1), set errorgen to false
local errorgen = true
local grammar = pg.compile([[
program <- stmtsequence !.
stmtsequence <- statement (';' statement)*
statement <- ifstmt / repeatstmt / assignstmt / readstmt / writestmt
ifstmt <- 'if' exp 'then'^errMissingThen stmtsequence elsestmt? 'end'
elsestmt <- ('else' stmtsequence)
repeatstmt <- 'repeat' stmtsequence 'until' exp
assignstmt <- IDENTIFIER ':=' exp
readstmt <- 'read' IDENTIFIER
writestmt <- 'write' exp
exp <- simpleexp (COMPARISONOP simpleexp)*
COMPARISONOP <- '<' / '='
simpleexp <- term (ADDOP term)*
ADDOP <- [+-]
term <- factor (MULOP factor)*
MULOP <- [*/]
factor <- '(' exp ')' / NUMBER / IDENTIFIER
NUMBER <- '-'? [0-9]+
KEYWORDS <- 'if' / 'repeat' / 'read' / 'write' / 'then' / 'else' / 'end' / 'until'
RESERVED <- KEYWORDS ![a-zA-Z]
IDENTIFIER <- !RESERVED [a-zA-Z]+
HELPER <- ';' / %nl / %s / KEYWORDS / !.
SYNC <- (!HELPER .)*
]], _, errorgen)
local errors = 0
local function printerror(desc,line,col,sfail,trec)
errors = errors+1
print("Error #"..errors..": "..desc.." on line "..line.."(col "..col..")")
end
local function parse(input)
errors = 0
result, errors = pg.parse(input,grammar,printerror)
return result, errors
end
if arg[1] then
-- argument must be in quotes if it contains spaces
res, errs = parse(arg[1])
peg.print_t(res)
peg.print_r(errs)
end
local ret = {parse=parse}
return ret
```
For input: `lua tiny-parser-nocap.lua "if a b:=1"` we get:
```lua
Error #1: Missing Then on line 1(col 6)
Error #2: Expected stmtsequence on line 1(col 9)
Error #3: Expected 'end' on line 1(col 9)
-- ast:
rule='program',
pos=1,
{
rule='stmtsequence',
pos=1,
{
rule='statement',
pos=1,
{
rule='ifstmt',
pos=1,
'if',
{
rule='exp',
pos=4,
{
rule='simpleexp',
pos=4,
{
rule='term',
pos=4,
{
rule='factor',
pos=4,
{
rule='IDENTIFIER',
pos=4,
'a',
},
},
},
},
},
},
},
},
-- error table:
[1] => {
[msg] => 'Missing Then' -- custom error is used over the automatically generated one
[line] => '1'
[col] => '6'
[label] => 'errMissingThen'
}
[2] => {
[msg] => 'Expected stmtsequence' -- automatically generated errors
[line] => '1'
[col] => '9'
[label] => 'errorgen6'
}
[3] => {
[msg] => 'Expected 'end''
[line] => '1'
[col] => '9'
[label] => 'errorgen4'
}
```
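The AST printed above reads as a nested Lua table, so it can be traversed with ordinary table code. The sketch below assumes each node is a table with `rule` and `pos` fields plus array-style children (strings for matched text, tables for nested nodes), which is how the output above appears. The hand-built `ast` and the helper `leaves` are hypothetical and only mirror part of that output.

```lua
-- Collects the string leaves of an AST-shaped table in left-to-right order.
local function leaves(node, out)
  out = out or {}
  for _, child in ipairs(node) do
    if type(child) == "table" then
      leaves(child, out)          -- nested rule: descend
    else
      out[#out + 1] = child       -- matched text
    end
  end
  return out
end

-- Hand-built fragment shaped like the ifstmt node shown above.
local ast = {
  rule = "ifstmt", pos = 1,
  "if",
  { rule = "exp", pos = 4,
    { rule = "IDENTIFIER", pos = 4, "a" },
  },
}

local text = table.concat(leaves(ast), " ")
```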