Fixed some diagnostics warnings

Moved examples to tofix because fixing them is beside the point right
now.
Dawid Sobczak 2023-09-19 11:42:10 +01:00
parent 52164c82e3
commit 858fe11666
166 changed files with 68 additions and 264 deletions


@ -0,0 +1,397 @@
# parser-gen
A Lua parser generator that makes it possible to describe grammars in a [PEG](https://en.wikipedia.org/wiki/Parsing_expression_grammar) syntax. The tool parses a given input using a provided grammar and, if the match succeeds, produces an AST of the captured values using [LPeg](http://www.inf.puc-rio.br/~roberto/lpeg/). If the match fails, labelled errors can be used in the grammar to indicate the failure position, and recovery grammars are generated to continue parsing the input using [LPegLabel](https://github.com/sqmedeiros/lpeglabel). The tool can also automatically generate error labels and recovery grammars for LL(1) grammars.
parser-gen is a [GSoC 2017](https://developers.google.com/open-source/gsoc/) project, and was completed with the help of my mentor [@sqmedeiros](https://github.com/sqmedeiros) from [LabLua](http://www.lua.inf.puc-rio.br/). A blog documenting the progress of the project can be found [here](https://parsergen.blogspot.com/2017/08/parser-generator-based-on-lpeglabel.html).
---
# Table of contents
* [Requirements](#requirements)
* [Syntax](#syntax)
* [Grammar Syntax](#grammar-syntax)
* [Example: Tiny Parser](#example-tiny-parser)
# Requirements
```
lua >= 5.1
lpeglabel >= 1.2.0
```
# Syntax
### compile
This function generates a PEG parser from the grammar description.
```lua
local pg = require "parser-gen"
grammar = pg.compile(input,definitions [, errorgen, noast])
```
*Arguments*:
`input` - A string containing a PEG grammar description. For complete PEG syntax see the grammar section of this document.
`definitions` - a table of custom functions and definitions used inside the grammar, for example `{equals=equals}`, where `equals` is a function.
`errorgen` - **EXPERIMENTAL** optional boolean parameter (default: `false`). When enabled, error labels are generated automatically. Works well only on LL(1) grammars. Custom error labels take precedence over automatically generated ones.
`noast` - optional boolean parameter (default: `false`). When enabled, no AST is generated for the parse.
*Output*:
`grammar` - a compiled grammar on success; throws an error on failure.
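Because `compile` throws on failure, code that loads grammars at run time may want to wrap the call in `pcall`. The following is a minimal sketch, not part of parser-gen: `safe_compile` is a hypothetical helper, and `fake_compile` is a stand-in that only mimics the "throws on failure" behaviour so the snippet runs without parser-gen installed.

```lua
-- Hypothetical helper (not part of parser-gen): call a compile function
-- and turn a thrown error into a `nil, message` return pair.
local function safe_compile(compile, input, definitions, errorgen, noast)
  local ok, result = pcall(compile, input, definitions, errorgen, noast)
  if ok then
    return result
  end
  return nil, tostring(result)
end

-- Stand-in for pg.compile (an assumption for illustration only).
local function fake_compile(input)
  if input == "" then error("empty grammar") end
  return { source = input }
end

local grammar = safe_compile(fake_compile, "rule <- 'a'", {})
local failed, message = safe_compile(fake_compile, "", {})
print(grammar ~= nil, failed, message)
```

With parser-gen loaded as `pg`, the call would presumably look like `safe_compile(pg.compile, input, definitions)`.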
### setlabels
If custom error labels are used, the function *setlabels* allows setting their description (and custom recovery pattern):
```lua
pg.setlabels(t)
```
Example table of a simple error and one with a custom recovery expression:
```lua
-- grammar rule: " ifexp <- 'if' exp 'then'^missingThen stmt 'end'^missingEnd "
local t = {
	missingEnd = "Missing 'end' in if expression",
	missingThen = {"Missing 'then' in if expression", " (!stmt .)* "} -- a custom recovery pattern
}
pg.setlabels(t)
```
If the recovery pattern is not set, then the one specified by the rule SYNC will be used. It is by default set to:
```lua
SKIP <- %s / %nl -- a space ' ' or newline '\n' character
SYNC <- .? (!SKIP .)*
```
Learn more about special rules in the grammar section.
### parse
This operation attempts to match a grammar to the given input.
```lua
result, errors = pg.parse(input, grammar [, errorfunction])
```
*Arguments*:
`input` - an input string that the tool will attempt to parse.
`grammar` - a compiled grammar.
`errorfunction` - an optional function that will be called if an error is encountered, with the arguments `desc` for the error description set using `setlabels()`; location indicators `line` and `col`; the remaining string before failure `sfail` and a custom recovery expression `trec` if available.
Example:
```lua
local errs = 0
local function printerror(desc,line,col,sfail,trec)
	errs = errs+1
	print("Error #"..errs..": "..desc.." before '"..sfail.."' on line "..line.."(col "..col..")")
end
result, errors = pg.parse(input,grammar,printerror)
```
*Output*:
If the parse is successful, the function returns `result`, an abstract syntax tree containing the captures, and `errors`, a table of any errors encountered. If the parse is unsuccessful, `result` is **nil**.
Also, if the `noast` option is enabled when compiling the grammar, the function will then produce the longest match length or any custom captures used.
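The two return values can be combined into a simple status check. The sketch below is illustrative only: `summarize` is a hypothetical helper, and it assumes each entry of `errors` carries `msg`, `line` and `col` fields, as in the sample error table printed in the Tiny parser example. Mock data is used in place of a real parse.

```lua
-- Hypothetical helper: turn the two return values of pg.parse into a
-- status string plus a list of formatted messages.
-- Assumes each error entry has `msg`, `line` and `col` fields.
local function summarize(result, errors)
  local lines = {}
  for i, e in ipairs(errors or {}) do
    lines[i] = string.format("%s:%s: %s", e.line, e.col, e.msg)
  end
  local status
  if result == nil then
    status = "failed"      -- no AST was produced
  elseif #lines > 0 then
    status = "recovered"   -- an AST was produced despite errors
  else
    status = "ok"
  end
  return status, lines
end

-- Mock data standing in for a real parse result.
local status, lines = summarize({ rule = "program" }, {
  { msg = "Missing Then", line = 1, col = 6, label = "errMissingThen" },
})
print(status, lines[1])
```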
### calcline
Calculates the line and column of a given position in the subject string (exported from the relabel module).
```lua
line, col = pg.calcline(subject, position)
```
*Arguments*:
`subject` - subject string
`position` - position inside the string, for example, the one given by automatic AST generation.
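To make the mapping from a position to a line and column concrete, here is a pure-Lua sketch of the idea. It is not the relabel implementation, `calcline_sketch` is a hypothetical name, and edge-case behaviour may differ from `pg.calcline`.

```lua
-- Illustrative sketch only: count the newlines before `position` to get
-- the line, and measure the distance from the last newline to get the column.
local function calcline_sketch(subject, position)
  local prefix = subject:sub(1, position - 1)
  local _, newlines = prefix:gsub("\n", "")
  -- position of the last newline in the prefix, or 0 if there is none
  local last_newline = prefix:match(".*()\n") or 0
  return newlines + 1, position - last_newline
end

-- Position 9 is the 'c' in "second".
print(calcline_sketch("first\nsecond", 9)) -- line 2, col 3
```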
### usenodes
When AST generation is enabled, this function will enable the "node" mode, where only rules tagged with a `node` prefix will generate AST entries. Must be used before compiling the grammar.
```lua
pg.usenodes(value)
```
*Arguments*:
`value` - a boolean value that enables or disables node mode.
# Grammar Syntax
The grammar used for this tool is described using a PEG-like syntax that is identical to the one provided by the [re](http://www.inf.puc-rio.br/~roberto/lpeg/re.html) module, extended with the labelled failures provided by the [relabel](https://github.com/sqmedeiros/lpeglabel) module (except numbered labels). That is, all grammars that work with relabel should work with parser-gen as long as numbered error labels are not used, as they are not supported by parser-gen.
Since a parser generated with parser-gen automatically consumes space characters, builds ASTs and generates errors, additional extensions have been added based on the [ANTLR](http://www.antlr.org/) syntax.
### Basic syntax
The syntax of parser-gen grammars is somewhat similar to regex syntax. The next table summarizes the tool's syntax. A `p` represents an arbitrary pattern; `num` represents a number (`[0-9]+`); `name` represents an identifier (`[a-zA-Z][a-zA-Z0-9_]*`). `defs` is the definitions table provided when compiling the grammar. Note that error names must be set using `setlabels()` before compiling the grammar. Constructions are listed in order of decreasing precedence.
<table border="1">
<tbody><tr><td><b>Syntax</b></td><td><b>Description</b></td></tr>
<tr><td><code>( p )</code></td> <td>grouping</td></tr>
<tr><td><code>'string'</code></td> <td>literal string</td></tr>
<tr><td><code>"string"</code></td> <td>literal string</td></tr>
<tr><td><code>[class]</code></td> <td>character class</td></tr>
<tr><td><code>.</code></td> <td>any character</td></tr>
<tr><td><code>%name</code></td>
<td>pattern <code>defs[name]</code> or a pre-defined pattern</td></tr>
<tr><td><code>name</code></td><td>non terminal</td></tr>
<tr><td><code>&lt;name&gt;</code></td><td>non terminal</td></tr>
<tr><td><code>%{name}</code></td> <td>error label</td></tr>
<tr><td><code>{}</code></td> <td>position capture</td></tr>
<tr><td><code>{ p }</code></td> <td>simple capture</td></tr>
<tr><td><code>{: p :}</code></td> <td>anonymous group capture</td></tr>
<tr><td><code>{:name: p :}</code></td> <td>named group capture</td></tr>
<tr><td><code>{~ p ~}</code></td> <td>substitution capture</td></tr>
<tr><td><code>{| p |}</code></td> <td>table capture</td></tr>
<tr><td><code>=name</code></td> <td>back reference
</td></tr>
<tr><td><code>p ?</code></td> <td>optional match</td></tr>
<tr><td><code>p *</code></td> <td>zero or more repetitions</td></tr>
<tr><td><code>p +</code></td> <td>one or more repetitions</td></tr>
<tr><td><code>p^num</code></td> <td>exactly <code>num</code> repetitions</td></tr>
<tr><td><code>p^+num</code></td>
<td>at least <code>num</code> repetitions</td></tr>
<tr><td><code>p^-num</code></td>
<td>at most <code>num</code> repetitions</td></tr>
<tr><td><code>p^name</code></td> <td>match p or throw error label name.</td></tr>
<tr><td><code>p -&gt; 'string'</code></td> <td>string capture</td></tr>
<tr><td><code>p -&gt; "string"</code></td> <td>string capture</td></tr>
<tr><td><code>p -&gt; num</code></td> <td>numbered capture</td></tr>
<tr><td><code>p -&gt; name</code></td> <td>function/query/string capture
equivalent to <code>p / defs[name]</code></td></tr>
<tr><td><code>p =&gt; name</code></td> <td>match-time capture
equivalent to <code>lpeg.Cmt(p, defs[name])</code></td></tr>
<tr><td><code>&amp; p</code></td> <td>and predicate</td></tr>
<tr><td><code>! p</code></td> <td>not predicate</td></tr>
<tr><td><code>p1 p2</code></td> <td>concatenation</td></tr>
<tr><td><code>p1 //{name [, name, ...]} p2</code></td> <td>specifies recovery pattern p2 for p1
when one of the labels is thrown</td></tr>
<tr><td><code>p1 / p2</code></td> <td>ordered choice</td></tr>
<tr><td>(<code>name &lt;- p</code>)<sup>+</sup></td> <td>grammar</td></tr>
</tbody></table>
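The `p => name` row delegates to `lpeg.Cmt`, so the function stored in `defs[name]` follows LPeg's match-time convention: it receives the whole subject, the current position and any captures produced by `p`, and its first return value decides whether the match succeeds. The sketch below shows such a function being called directly; the name `equals` mirrors the `{equals=equals}` example above, and the grammar rule in the note after the code is hypothetical.

```lua
-- A match-time function in the shape lpeg.Cmt expects:
-- (subject, current position, captures of p...).
local function equals(subject, position, left, right)
  -- Returning true lets the match succeed without consuming input;
  -- returning false makes the match fail.
  return left == right
end

-- The table that would be passed as `definitions` when compiling.
local definitions = { equals = equals }

-- Direct calls, standing in for what the matcher would do.
print(definitions.equals("ab=ab", 6, "ab", "ab"))
print(definitions.equals("ab=cd", 6, "ab", "cd"))
```

A rule such as `pair <- ({[a-z]+} '=' {[a-z]+}) => equals` would be one way to use it, assuming the two simple captures reach the function unchanged; how this interacts with automatic AST generation is not covered here.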
The grammar below matches balanced parentheses:
```lua
balanced <- "(" ([^()] / balanced)* ")"
```
For more examples, check out the [re](http://www.inf.puc-rio.br/~roberto/lpeg/re.html) page, see the Tiny parser below, or see the [Lua parser](https://github.com/vsbenas/parser-gen/blob/master/parsers/lua-parser.lua) written with this tool.
### Error labels
Error labels use the relabel syntax `%{errorname}`, where `errorname` must match `[A-Za-z][A-Za-z0-9_]*`. A label is typically thrown with a construction such as `'a' ('b' / %{errB}) 'c'`, which throws `errB` if `'b'` is not matched. Because this form is verbose, parser-gen also accepts the shorthand `'a' 'b'^errB 'c'`, which keeps grammars cleaner. Note: all error labels must be defined in a table passed to `setlabels()` before compiling and parsing the grammar.
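Since label names have to follow the identifier format above, it can help to check a labels table before handing it to `setlabels()`. The following is a small sketch under that assumption; `check_labels` is a hypothetical helper, not a parser-gen function.

```lua
-- Hypothetical helper: report label names that do not match
-- [A-Za-z][A-Za-z0-9_]*.
local function check_labels(labels)
  local bad = {}
  for name in pairs(labels) do
    if type(name) ~= "string"
       or not name:match("^[A-Za-z][A-Za-z0-9_]*$") then
      bad[#bad + 1] = tostring(name)
    end
  end
  table.sort(bad) -- deterministic order for reporting
  return #bad == 0, bad
end

local ok, bad = check_labels({
  missingEnd = "Missing 'end' in if expression",
  ["missing-then"] = "Missing 'then' in if expression", -- invalid: contains '-'
})
print(ok, table.concat(bad, ", "))
```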
### Tokens
Non-terminals with names in all capital letters, i.e. `[A-Z]+`, are considered tokens and are treated as a single object in parsing. That is, the whole string matched by a token is captured in a single AST entry and space characters are not consumed. Consider two examples:
```lua
-- a token non-terminal
grammar = pg.compile [[
WORD <- [A-Z]+
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", "AA"}
```
```lua
-- a non-token non-terminal
grammar = pg.compile [[
word <- [A-Z]+
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="word", "A", "A", "A"}
```
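The naming rule can be checked mechanically. The sketch below scans a grammar string and reports which rule names consist only of capital letters and would therefore be treated as tokens. It is a rough illustration: `classify_rules` is a hypothetical helper, it assumes one rule per line, and parser-gen's own detection happens inside the library.

```lua
-- Hypothetical helper: classify rule names by the [A-Z]+ token convention.
-- Assumes each rule starts on its own line as `name <- ...`, optionally
-- preceded by a keyword such as `fragment` or `node`.
local function classify_rules(grammar_text)
  local kinds = {}
  for line in grammar_text:gmatch("[^\n]+") do
    local head = line:match("^(.-)<%-")          -- text before the first '<-'
    local name = head and head:match("([A-Za-z][A-Za-z0-9_]*)%s*$")
    if name then
      kinds[name] = name:match("^[A-Z]+$") and "token" or "non-token"
    end
  end
  return kinds
end

local kinds = classify_rules([[
WORD <- LETTER+
fragment LETTER <- [A-Z]
word <- [a-z]+
]])
print(kinds.WORD, kinds.LETTER, kinds.word)
```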
### Fragments
If a token definition is preceded by the `fragment` keyword, the parser does not build an AST entry for that token. Essentially, these rules are used to simplify grammars without building unnecessarily complicated ASTs. Example of `fragment` usage:
```lua
grammar = pg.compile [[
WORD <- LETTER+
fragment LETTER <- [A-Z]
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", "AA"}
```
Without using `fragment`:
```lua
grammar = pg.compile [[
WORD <- LETTER+
LETTER <- [A-Z]
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", {rule="LETTER", "A"}, {rule="LETTER", "A"}}
```
### Nodes
When node mode is enabled using `pg.usenodes(true)`, only rules prefixed with the `node` keyword will generate AST entries:
```lua
grammar = pg.compile [[
node WORD <- LETTER+
LETTER <- [A-Z]
]]
res, _ = pg.parse("AA A", grammar) -- outputs {rule="WORD", "AA"}
```
### Special rules
There are two special rules used by the grammar:
#### SKIP
The `SKIP` rule identifies which characters to skip in a grammar. For example, most programming languages do not take into account any space or newline characters. By default, SKIP is set to:
```lua
SKIP <- %s / %nl
```
This rule can be extended to contain semicolons `';'`, comments, or any other patterns that the parser can safely ignore.
Character skipping can be disabled by using:
```lua
SKIP <- ''
```
#### SYNC
This rule specifies the general recovery expression both for custom errors and automatically generated ones. By default:
```lua
SYNC <- .? (!SKIP .)*
```
The default SYNC rule consumes any characters until the next character matched by SKIP, usually a space or a newline. That means, if some statement in a program is invalid, the parser will continue parsing after a space or a newline character.
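The effect of the default rule can be approximated with a Lua string pattern. This is only an emulation for illustration: `default_sync_end` is a hypothetical helper, and it treats SKIP as a space or a newline, as described above. It returns the position at which parsing would resume.

```lua
-- Emulates `SYNC <- .? (!SKIP .)*` with SKIP taken as ' ' or '\n':
-- consume one optional character, then everything up to the next space
-- or newline. Returns the position right after the consumed text.
local function default_sync_end(subject, position)
  local _, last = subject:find("^.?[^ \n]*", position)
  if not last then
    return position -- position is past the end of the subject
  end
  return last + 1
end

-- The failure is at position 5, the first '@' of the invalid "@@@".
local resume = default_sync_end("x = @@@ 1", 5)
print(resume) -- 8, the space after "@@@"
```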
For some programming languages it might be useful to skip to a semicolon or a keyword, since they usually indicate the end of a statement, so SYNC could be something like:
```lua
HELPER <- ';' / 'end' / SKIP -- etc
SYNC <- (!HELPER .)* SKIP* -- we can consume the spaces after syncing with them as well
```
Recovery grammars can be disabled by using:
```lua
SYNC <- ''
```
# Example: Tiny parser
Below is the full code from *parsers/tiny-parser.lua*:
```lua
local pg = require "parser-gen"
local peg = require "peg-parser"
local errs = {errMissingThen = "Missing Then"} -- one custom error
pg.setlabels(errs)
--warning: experimental error generation function is enabled. If the grammar isn't LL(1), set errorgen to false
local errorgen = true
local grammar = pg.compile([[
program <- stmtsequence !.
stmtsequence <- statement (';' statement)*
statement <- ifstmt / repeatstmt / assignstmt / readstmt / writestmt
ifstmt <- 'if' exp 'then'^errMissingThen stmtsequence elsestmt? 'end'
elsestmt <- ('else' stmtsequence)
repeatstmt <- 'repeat' stmtsequence 'until' exp
assignstmt <- IDENTIFIER ':=' exp
readstmt <- 'read' IDENTIFIER
writestmt <- 'write' exp
exp <- simpleexp (COMPARISONOP simpleexp)*
COMPARISONOP <- '<' / '='
simpleexp <- term (ADDOP term)*
ADDOP <- [+-]
term <- factor (MULOP factor)*
MULOP <- [*/]
factor <- '(' exp ')' / NUMBER / IDENTIFIER
NUMBER <- '-'? [0-9]+
KEYWORDS <- 'if' / 'repeat' / 'read' / 'write' / 'then' / 'else' / 'end' / 'until'
RESERVED <- KEYWORDS ![a-zA-Z]
IDENTIFIER <- !RESERVED [a-zA-Z]+
HELPER <- ';' / %nl / %s / KEYWORDS / !.
SYNC <- (!HELPER .)*
]], _, errorgen)
local errors = 0
local function printerror(desc,line,col,sfail,trec)
	errors = errors+1
	print("Error #"..errors..": "..desc.." on line "..line.."(col "..col..")")
end
local function parse(input)
	errors = 0 -- reset the error counter used by printerror
	-- keep the returned error table separate from the counter
	local result, errlist = pg.parse(input,grammar,printerror)
	return result, errlist
end
if arg[1] then
-- argument must be in quotes if it contains spaces
res, errs = parse(arg[1])
peg.print_t(res)
peg.print_r(errs)
end
local ret = {parse=parse}
return ret
```
For input: `lua tiny-parser-nocap.lua "if a b:=1"` we get:
```lua
Error #1: Missing Then on line 1(col 6)
Error #2: Expected stmtsequence on line 1(col 9)
Error #3: Expected 'end' on line 1(col 9)
-- ast:
rule='program',
pos=1,
{
rule='stmtsequence',
pos=1,
{
rule='statement',
pos=1,
{
rule='ifstmt',
pos=1,
'if',
{
rule='exp',
pos=4,
{
rule='simpleexp',
pos=4,
{
rule='term',
pos=4,
{
rule='factor',
pos=4,
{
rule='IDENTIFIER',
pos=4,
'a',
},
},
},
},
},
},
},
},
-- error table:
[1] => {
[msg] => 'Missing Then' -- custom error is used over the automatically generated one
[line] => '1'
[col] => '6'
[label] => 'errMissingThen'
}
[2] => {
[msg] => 'Expected stmtsequence' -- automatically generated errors
[line] => '1'
[col] => '9'
[label] => 'errorgen6'
}
[3] => {
[msg] => 'Expected 'end''
[line] => '1'
[col] => '9'
[label] => 'errorgen4'
}
```