Add stage 06: Lua bootstrap
The goal of stage 06 is to try to parse Zig syntax in Lua. I pulled in lpeglabel 1.2.0 and parser-gen off GitHub to get started. All of this needs to be cleaned up rather soon. Lua bootstraps using tcc and musl from the previous stage. Since musl 0.6.0 doesn't support dynamic linking, this build of Lua doesn't support shared libraries. I couldn't easily patch musl with dlopen and friends, so instead I link statically and call dependencies through the C API.
This commit is contained in:
parent 2ae045cf8a
commit e6b88d5a0f

170 changed files with 72518 additions and 2 deletions
397 06/parser-gen/README.md (new file)

@@ -0,0 +1,397 @@
# parser-gen

A Lua parser generator that makes it possible to describe grammars in a [PEG](https://en.wikipedia.org/wiki/Parsing_expression_grammar) syntax. The tool will parse a given input using a provided grammar and, if the matching is successful, produce an AST as an output with the captured values using [LPeg](http://www.inf.puc-rio.br/~roberto/lpeg/). If the matching fails, labelled errors can be used in the grammar to indicate the failure position, and recovery grammars are generated to continue parsing the input using [LPegLabel](https://github.com/sqmedeiros/lpeglabel). The tool can also automatically generate error labels and recovery grammars for LL(1) grammars.

parser-gen is a [GSoC 2017](https://developers.google.com/open-source/gsoc/) project, and was completed with the help of my mentor [@sqmedeiros](https://github.com/sqmedeiros) from [LabLua](http://www.lua.inf.puc-rio.br/). A blog documenting the progress of the project can be found [here](https://parsergen.blogspot.com/2017/08/parser-generator-based-on-lpeglabel.html).

---

# Table of contents

* [Requirements](#requirements)
* [Syntax](#syntax)
* [Grammar Syntax](#grammar-syntax)
* [Example: Tiny Parser](#example-tiny-parser)

# Requirements

```
lua >= 5.1
lpeglabel >= 1.2.0
```

# Syntax

### compile

This function generates a PEG parser from the grammar description.

```lua
local pg = require "parser-gen"
grammar = pg.compile(input, definitions [, errorgen, noast])
```

*Arguments*:

`input` - a string containing a PEG grammar description. For the complete PEG syntax, see the grammar section of this document.

`definitions` - a table of custom functions and definitions used inside the grammar, for example `{equals=equals}`, where `equals` is a function.

`errorgen` - **EXPERIMENTAL** optional boolean parameter (default: false); when enabled, error labels are generated automatically. Works well only on LL(1) grammars. Custom error labels have precedence over automatically generated ones.

`noast` - optional boolean parameter (default: false); when enabled, no AST is generated for the parse.

*Output*:

`grammar` - a compiled grammar on success; throws an error on failure.

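For example, a grammar that refers to an entry of the `definitions` table through a function capture might be compiled like this (an illustrative sketch; the grammar and the `tonum` entry are not part of this README):

```lua
local pg = require "parser-gen"

-- definitions table: names that the grammar can refer to (here via `-> tonum`)
local defs = { tonum = tonumber }

-- pg.compile returns a compiled grammar, or throws an error if the description is invalid
local grammar = pg.compile([[
	list   <- NUMBER (',' NUMBER)* !.
	NUMBER <- { [0-9]+ } -> tonum
]], defs)
```
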
### setlabels

If custom error labels are used, the function *setlabels* allows setting their description (and custom recovery pattern):
```lua
pg.setlabels(t)
```
Example table of a simple error and one with a custom recovery expression:
```lua
-- grammar rule: " ifexp <- 'if' exp 'then'^missingThen stmt 'end'^missingEnd "
local t = {
	missingEnd = "Missing 'end' in if expression",
	missingThen = {"Missing 'then' in if expression", " (!stmt .)* "} -- a custom recovery pattern
}
pg.setlabels(t)
```
If the recovery pattern is not set, then the one specified by the rule SYNC will be used. It is by default set to:
```lua
SKIP <- %s / %nl -- a space ' ' or newline '\n' character
SYNC <- .? (!SKIP .)*
```
Learn more about special rules in the grammar section.

### parse

This operation attempts to match a grammar to the given input.

```lua
result, errors = pg.parse(input, grammar [, errorfunction])
```

*Arguments*:

`input` - an input string that the tool will attempt to parse.

`grammar` - a compiled grammar.

`errorfunction` - an optional function that will be called if an error is encountered, with the arguments `desc` for the error description set using `setlabels()`; location indicators `line` and `col`; the remaining string before failure `sfail`; and a custom recovery expression `trec`, if available.

Example:
```lua
local errs = 0
local function printerror(desc,line,col,sfail,trec)
	errs = errs+1
	print("Error #"..errs..": "..desc.." before '"..sfail.."' on line "..line.."(col "..col..")")
end

result, errors = pg.parse(input,grammar,printerror)
```
*Output*:

If the parse is successful, the function returns an abstract syntax tree containing the captures (`result`) and a table of any encountered `errors`. If the parse was unsuccessful, `result` is going to be **nil**.

Also, if the `noast` option is enabled when compiling the grammar, the function will instead return the longest match length or any custom captures used.

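For instance, with `noast` enabled the result is a match length rather than a tree (a minimal sketch; the grammar below is only for illustration):

```lua
local pg = require "parser-gen"

-- compile with noast enabled (errorgen is left at its default, false)
local g = pg.compile([[ WORD <- [A-Z]+ !. ]], {}, false, true)

local len, errs = pg.parse("ABC", g) -- len is the longest match length, not an AST
```
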
### calcline

Calculates line and column information regarding a position in the subject string (exported from the relabel module).

```lua
line, col = pg.calcline(subject, position)
```

*Arguments*:

`subject` - the subject string

`position` - a position inside the string, for example, one given by automatic AST generation.

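For example (assuming `input` is the string that was parsed and `node.pos` is a position taken from the generated AST):

```lua
local line, col = pg.calcline(input, node.pos)
print("node starts at line "..line..", column "..col)
```
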
### usenodes

When AST generation is enabled, this function will enable "node" mode, where only rules tagged with a `node` prefix will generate AST entries. Must be used before compiling the grammar.

```lua
pg.usenodes(value)
```

*Arguments*:

`value` - a boolean value that enables or disables node mode

# Grammar Syntax

The grammar used for this tool is described using a PEG-like syntax that is identical to the one provided by the [re](http://www.inf.puc-rio.br/~roberto/lpeg/re.html) module, with the extension of labelled failures provided by the [relabel](https://github.com/sqmedeiros/lpeglabel) module (except numbered labels). That is, all grammars that work with relabel should work with parser-gen, as long as numbered error labels are not used, since they are not supported by parser-gen.

Since a parser generated with parser-gen automatically consumes space characters, builds ASTs and generates errors, additional extensions have been added, based on the [ANTLR](http://www.antlr.org/) syntax.

### Basic syntax

The syntax of parser-gen grammars is somewhat similar to regex syntax. The following table summarizes the tool's syntax. `p` represents an arbitrary pattern; `num` represents a number (`[0-9]+`); `name` represents an identifier (`[a-zA-Z][a-zA-Z0-9_]*`); `defs` is the definitions table provided when compiling the grammar. Note that error names must be set using `setlabels()` before compiling the grammar. Constructions are listed in order of decreasing precedence.

<table border="1">
<tbody><tr><td><b>Syntax</b></td><td><b>Description</b></td></tr>
<tr><td><code>( p )</code></td> <td>grouping</td></tr>
<tr><td><code>'string'</code></td> <td>literal string</td></tr>
<tr><td><code>"string"</code></td> <td>literal string</td></tr>
<tr><td><code>[class]</code></td> <td>character class</td></tr>
<tr><td><code>.</code></td> <td>any character</td></tr>
<tr><td><code>%name</code></td> <td>pattern <code>defs[name]</code> or a pre-defined pattern</td></tr>
<tr><td><code>name</code></td> <td>non terminal</td></tr>
<tr><td><code>&lt;name&gt;</code></td> <td>non terminal</td></tr>
<tr><td><code>%{name}</code></td> <td>error label</td></tr>
<tr><td><code>{}</code></td> <td>position capture</td></tr>
<tr><td><code>{ p }</code></td> <td>simple capture</td></tr>
<tr><td><code>{: p :}</code></td> <td>anonymous group capture</td></tr>
<tr><td><code>{:name: p :}</code></td> <td>named group capture</td></tr>
<tr><td><code>{~ p ~}</code></td> <td>substitution capture</td></tr>
<tr><td><code>{| p |}</code></td> <td>table capture</td></tr>
<tr><td><code>=name</code></td> <td>back reference</td></tr>
<tr><td><code>p ?</code></td> <td>optional match</td></tr>
<tr><td><code>p *</code></td> <td>zero or more repetitions</td></tr>
<tr><td><code>p +</code></td> <td>one or more repetitions</td></tr>
<tr><td><code>p^num</code></td> <td>exactly <code>num</code> repetitions</td></tr>
<tr><td><code>p^+num</code></td> <td>at least <code>num</code> repetitions</td></tr>
<tr><td><code>p^-num</code></td> <td>at most <code>num</code> repetitions</td></tr>
<tr><td><code>p^name</code></td> <td>match <code>p</code> or throw error label <code>name</code></td></tr>
<tr><td><code>p -> 'string'</code></td> <td>string capture</td></tr>
<tr><td><code>p -> "string"</code></td> <td>string capture</td></tr>
<tr><td><code>p -> num</code></td> <td>numbered capture</td></tr>
<tr><td><code>p -> name</code></td> <td>function/query/string capture, equivalent to <code>p / defs[name]</code></td></tr>
<tr><td><code>p => name</code></td> <td>match-time capture, equivalent to <code>lpeg.Cmt(p, defs[name])</code></td></tr>
<tr><td><code>& p</code></td> <td>and predicate</td></tr>
<tr><td><code>! p</code></td> <td>not predicate</td></tr>
<tr><td><code>p1 p2</code></td> <td>concatenation</td></tr>
<tr><td><code>p1 //{name [, name, ...]} p2</code></td> <td>specifies recovery pattern <code>p2</code> for <code>p1</code> when one of the labels is thrown</td></tr>
<tr><td><code>p1 / p2</code></td> <td>ordered choice</td></tr>
<tr><td>(<code>name <- p</code>)<sup>+</sup></td> <td>grammar</td></tr>
</tbody></table>

The grammar below is used to match balanced parentheses:

```lua
balanced <- "(" ([^()] / balanced)* ")"
```

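As a quick check, this grammar can be compiled and run against an input like so (a minimal sketch, not part of the original README):

```lua
local pg = require "parser-gen"

local grammar = pg.compile([[
	balanced <- "(" ([^()] / balanced)* ")"
]], {})

local result, errors = pg.parse("(a(b)c)", grammar) -- result is an AST rooted at `balanced`
```
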
For more examples check out the [re](http://www.inf.puc-rio.br/~roberto/lpeg/re.html) page, see the Tiny parser below, or the [Lua parser](https://github.com/vsbenas/parser-gen/blob/master/parsers/lua-parser.lua) written with this tool.

### Error labels

Error labels are provided by relabel using the `%{errorname}` syntax (`errorname` must follow the `[A-Za-z][A-Za-z0-9_]*` format). Usually we use error labels in a syntax like `'a' ('b' / %{errB}) 'c'`, which throws an error label if `'b'` is not matched. Since this syntax is quite verbose, an additional form is allowed: `'a' 'b'^errB 'c'`, which allows a cleaner description of grammars. Note: all errors must be defined in a table using `parser-gen.setlabels()` before compiling and parsing the grammar.

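For instance, both forms below throw the same label when `'b'` is missing (an illustrative sketch; the `errB` description is made up):

```lua
pg.setlabels({ errB = "expected 'b'" })

-- equivalent ways of throwing errB when 'b' does not match after 'a'
local g1 = pg.compile([[ rule <- 'a' ('b' / %{errB}) 'c' ]], {})
local g2 = pg.compile([[ rule <- 'a' 'b'^errB 'c' ]], {})
```
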
### Tokens

Non-terminals with names in all capital letters, i.e. `[A-Z]+`, are considered tokens and are treated as a single object in parsing. That is, the whole string matched by a token is captured in a single AST entry and space characters are not consumed. Consider two examples:
```lua
-- a token non-terminal
grammar = pg.compile [[
	WORD <- [A-Z]+
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", "AA"}
```
```lua
-- a non-token non-terminal
grammar = pg.compile [[
	word <- [A-Z]+
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="word", "A", "A", "A"}
```

### Fragments

If a token definition is prefixed with the `fragment` keyword, then the parser does not build an AST entry for that token. Essentially, these rules are used to simplify grammars without building unnecessarily complicated ASTs. Example of `fragment` usage:
```lua
grammar = pg.compile [[
	WORD <- LETTER+
	fragment LETTER <- [A-Z]
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", "AA"}
```
Without using `fragment`:
```lua
grammar = pg.compile [[
	WORD <- LETTER+
	LETTER <- [A-Z]
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", {rule="LETTER", "A"}, {rule="LETTER", "A"}}
```

### Nodes

When node mode is enabled using `pg.usenodes(true)` only rules prefixed with a `node` keyword will generate AST entries:
```lua
grammar = pg.compile [[
	node WORD <- LETTER+
	LETTER <- [A-Z]
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", "AA"}
```

### Special rules

There are two special rules used by the grammar:

#### SKIP

The `SKIP` rule identifies which characters to skip in a grammar. For example, most programming languages do not take into account any space or newline characters. By default, SKIP is set to:
```lua
SKIP <- %s / %nl
```
This rule can be extended to contain semicolons `';'`, comments, or any other patterns that the parser can safely ignore.

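For example, a SKIP rule that also ignores semicolons and single-line `--` comments might look like this (a sketch, not taken from the original README):

```lua
SKIP <- %s / %nl / ';' / COMMENT
COMMENT <- '--' [^%nl]*
```
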
Character skipping can be disabled by using:
```lua
SKIP <- ''
```

#### SYNC

This rule specifies the general recovery expression, both for custom errors and automatically generated ones. By default:
```lua
SYNC <- .? (!SKIP .)*
```
The default SYNC rule consumes any characters until the next character matched by SKIP, usually a space or a newline. That means that if some statement in a program is invalid, the parser will continue parsing after a space or a newline character.

For some programming languages it might be useful to skip to a semicolon or a keyword, since they usually indicate the end of a statement, so SYNC could be something like:
```lua
HELPER <- ';' / 'end' / SKIP -- etc.
SYNC <- (!HELPER .)* SKIP* -- we can consume the spaces after syncing with them as well
```

Recovery grammars can be disabled by using:
```lua
SYNC <- ''
```

# Example: Tiny parser

Below is the full code from *parsers/tiny-parser.lua*:
```lua
local pg = require "parser-gen"
local peg = require "peg-parser"
local errs = {errMissingThen = "Missing Then"} -- one custom error
pg.setlabels(errs)

--warning: experimental error generation function is enabled. If the grammar isn't LL(1), set errorgen to false
local errorgen = true

local grammar = pg.compile([[

	program <- stmtsequence !.
	stmtsequence <- statement (';' statement)*
	statement <- ifstmt / repeatstmt / assignstmt / readstmt / writestmt
	ifstmt <- 'if' exp 'then'^errMissingThen stmtsequence elsestmt? 'end'
	elsestmt <- ('else' stmtsequence)
	repeatstmt <- 'repeat' stmtsequence 'until' exp
	assignstmt <- IDENTIFIER ':=' exp
	readstmt <- 'read' IDENTIFIER
	writestmt <- 'write' exp
	exp <- simpleexp (COMPARISONOP simpleexp)*
	COMPARISONOP <- '<' / '='
	simpleexp <- term (ADDOP term)*
	ADDOP <- [+-]
	term <- factor (MULOP factor)*
	MULOP <- [*/]
	factor <- '(' exp ')' / NUMBER / IDENTIFIER

	NUMBER <- '-'? [0-9]+
	KEYWORDS <- 'if' / 'repeat' / 'read' / 'write' / 'then' / 'else' / 'end' / 'until'
	RESERVED <- KEYWORDS ![a-zA-Z]
	IDENTIFIER <- !RESERVED [a-zA-Z]+
	HELPER <- ';' / %nl / %s / KEYWORDS / !.
	SYNC <- (!HELPER .)*

]], _, errorgen)

local errors = 0
local function printerror(desc,line,col,sfail,trec)
	errors = errors+1
	print("Error #"..errors..": "..desc.." on line "..line.."(col "..col..")")
end

local function parse(input)
	errors = 0
	result, errors = pg.parse(input,grammar,printerror)
	return result, errors
end

if arg[1] then
	-- argument must be in quotes if it contains spaces
	res, errs = parse(arg[1])
	peg.print_t(res)
	peg.print_r(errs)
end
local ret = {parse=parse}
return ret
```

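Used as a module rather than from the command line, the returned table exposes `parse` (an illustrative sketch; the module name is assumed from the file path above):

```lua
local tiny = require "tiny-parser"

local ast, errs = tiny.parse("read x; write x")
```
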
For the input `lua tiny-parser-nocap.lua "if a b:=1"` we get:

```lua
Error #1: Missing Then on line 1(col 6)
Error #2: Expected stmtsequence on line 1(col 9)
Error #3: Expected 'end' on line 1(col 9)
-- ast:
rule='program',
pos=1,
{
	rule='stmtsequence',
	pos=1,
	{
		rule='statement',
		pos=1,
		{
			rule='ifstmt',
			pos=1,
			'if',
			{
				rule='exp',
				pos=4,
				{
					rule='simpleexp',
					pos=4,
					{
						rule='term',
						pos=4,
						{
							rule='factor',
							pos=4,
							{
								rule='IDENTIFIER',
								pos=4,
								'a',
							},
						},
					},
				},
			},
		},
	},
},
-- error table:
[1] => {
	[msg] => 'Missing Then' -- custom error is used over the automatically generated one
	[line] => '1'
	[col] => '6'
	[label] => 'errMissingThen'
}
[2] => {
	[msg] => 'Expected stmtsequence' -- automatically generated errors
	[line] => '1'
	[col] => '9'
	[label] => 'errorgen6'
}
[3] => {
	[msg] => 'Expected 'end''
	[line] => '1'
	[col] => '9'
	[label] => 'errorgen4'
}
```