# parser-gen

A Lua parser generator that makes it possible to describe grammars in a [PEG](https://en.wikipedia.org/wiki/Parsing_expression_grammar) syntax. The tool will parse a given input using a provided grammar and, if the matching is successful, produce an AST as an output with the captured values using [Lpeg](http://www.inf.puc-rio.br/~roberto/lpeg/). If the matching fails, labelled errors can be used in the grammar to indicate failure position, and recovery grammars are generated to continue parsing the input using [LpegLabel](https://github.com/sqmedeiros/lpeglabel). The tool can also automatically generate error labels and recovery grammars for LL(1) grammars.

parser-gen is a [GSoC 2017](https://developers.google.com/open-source/gsoc/) project, and was completed with the help of my mentor [@sqmedeiros](https://github.com/sqmedeiros) from [LabLua](http://www.lua.inf.puc-rio.br/). A blog documenting the progress of the project can be found [here](https://parsergen.blogspot.com/2017/08/parser-generator-based-on-lpeglabel.html).

---

# Table of contents

* [Requirements](#requirements)
* [Syntax](#syntax)
* [Grammar Syntax](#grammar-syntax)
* [Example: Tiny Parser](#example-tiny-parser)

# Requirements

```
lua >= 5.1
lpeglabel >= 1.2.0
```

# Syntax

### compile

This function generates a PEG parser from the grammar description.

```lua
local pg = require "parser-gen"
grammar = pg.compile(input,definitions [, errorgen, noast])
```

*Arguments*:

`input` - A string containing a PEG grammar description. For complete PEG syntax see the grammar section of this document.

`definitions` - table of custom functions and definitions used inside the grammar, for example `{equals=equals}`, where `equals` is a function.

`errorgen` - **EXPERIMENTAL** optional boolean parameter (default: false), when enabled generates error labels automatically. Works well only on LL(1) grammars. Custom error labels have precedence over automatically generated ones.

`noast` - optional boolean parameter (default: false), when enabled does not generate an AST for the parse.

*Output*:

`grammar` - a compiled grammar on success, throws error on failure.

### setlabels

If custom error labels are used, the function *setlabels* allows setting their description (and custom recovery pattern):

```lua
pg.setlabels(t)
```

Example table of a simple error and one with a custom recovery expression:

```lua
-- grammar rule: " ifexp <- 'if' exp 'then'^missingThen stmt 'end'^missingEnd "
local t = {
	missingEnd = "Missing 'end' in if expression",
	missingThen = {"Missing 'then' in if expression", " (!stmt .)* "} -- a custom recovery pattern
}
pg.setlabels(t)
```

If the recovery pattern is not set, then the one specified by the rule SYNC will be used. It is by default set to:

```lua
SKIP <- %s / %nl -- a space ' ' or newline '\n' character
SYNC <- .? (!SKIP .)*
```

Learn more about special rules in the grammar section.

### parse

This operation attempts to match a grammar to the given input.

```lua
result, errors = pg.parse(input, grammar [, errorfunction])
```

*Arguments*:

`input` - an input string that the tool will attempt to parse.

`grammar` - a compiled grammar.

`errorfunction` - an optional function that will be called if an error is encountered, with the arguments `desc` for the error description set using `setlabels()`; location indicators `line` and `col`; the remaining string before failure `sfail` and a custom recovery expression `trec` if available.

Example:

```lua
local errs = 0
local function printerror(desc,line,col,sfail,trec)
	errs = errs+1
	print("Error #"..errs..": "..desc.." before '"..sfail.."' on line "..line.."(col "..col..")")
end

result, errors = pg.parse(input,grammar,printerror)
```

*Output*:

If the parse is successful, the function returns an abstract syntax tree containing the captures `result` and a table of any encountered `errors`. If the parse is unsuccessful, `result` is **nil**.
Also, if the `noast` option is enabled when compiling the grammar, the function will instead produce the longest match length or any custom captures used.

### calcline

Calculates line and column information for a given position of the subject (exported from the relabel module).

```lua
line, col = pg.calcline(subject, position)
```

*Arguments*:

`subject` - subject string

`position` - position inside the string, for example, the one given by automatic AST generation.

### usenodes

When AST generation is enabled, this function will enable the "node" mode, where only rules tagged with a `node` prefix will generate AST entries. Must be used before compiling the grammar.

```lua
pg.usenodes(value)
```

*Arguments*:

`value` - a boolean value that enables or disables this mode

# Grammar Syntax

The grammar used for this tool is described using a PEG-like syntax, which is identical to the one provided by the [re](http://www.inf.puc-rio.br/~roberto/lpeg/re.html) module, extended with the labelled failures provided by the [relabel](https://github.com/sqmedeiros/lpeglabel) module (except numbered labels). That is, all grammars that work with relabel should work with parser-gen as long as numbered error labels are not used, as they are not supported by parser-gen.

Since a parser generated with parser-gen automatically consumes space characters, builds ASTs and generates errors, additional extensions have been added based on the [ANTLR](http://www.antlr.org/) syntax.

### Basic syntax

The syntax of parser-gen grammars is somewhat similar to regex syntax. The following table summarizes the tool's syntax. `p` represents an arbitrary pattern; `num` represents a number (`[0-9]+`); `name` represents an identifier (`[a-zA-Z][a-zA-Z0-9_]*`). `defs` is the definitions table provided when compiling the grammar. Note that error names must be set using `setlabels()` before compiling the grammar. Constructions are listed in order of decreasing precedence.


| Syntax | Description |
| --- | --- |
| `( p )` | grouping |
| `'string'` | literal string |
| `"string"` | literal string |
| `[class]` | character class |
| `.` | any character |
| `%name` | pattern `defs[name]` or a pre-defined pattern |
| `name` | non terminal |
| `<name>` | non terminal |
| `%{name}` | error label |
| `{}` | position capture |
| `{ p }` | simple capture |
| `{: p :}` | anonymous group capture |
| `{:name: p :}` | named group capture |
| `{~ p ~}` | substitution capture |
| `{\| p \|}` | table capture |
| `=name` | back reference |
| `p ?` | optional match |
| `p *` | zero or more repetitions |
| `p +` | one or more repetitions |
| `p^num` | exactly n repetitions |
| `p^+num` | at least n repetitions |
| `p^-num` | at most n repetitions |
| `p^name` | match p or throw error label name |
| `p -> 'string'` | string capture |
| `p -> "string"` | string capture |
| `p -> num` | numbered capture |
| `p -> name` | function/query/string capture, equivalent to `p / defs[name]` |
| `p => name` | match-time capture, equivalent to `lpeg.Cmt(p, defs[name])` |
| `& p` | and predicate |
| `! p` | not predicate |
| `p1 p2` | concatenation |
| `p1 //{name [, name, ...]} p2` | specifies recovery pattern p2 for p1 when one of the labels is thrown |
| `p1 / p2` | ordered choice |
| `(name <- p )+` | grammar |
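
As a rough illustration of a few constructions from the table (literals, character classes, repetition, grouping and `p^name`), the following sketch compiles a two-rule grammar and parses an input with a missing operand. The grammar, the label name `missingNumber`, the helper `collect` and the input string are all hypothetical; only `pg.setlabels`, `pg.compile` and `pg.parse` come from the API described in this document. The sketch assumes that upper-case rule names such as `NUMBER` are treated as tokens, and it makes no claim about the exact shape of the returned AST.

```lua
-- Hypothetical error callback: stores reports in a table instead of printing.
-- The argument list follows the errorfunction signature of pg.parse.
local reports = {}
local function collect(desc, line, col, sfail, trec)
	reports[#reports + 1] = desc .. " (line " .. line .. ", col " .. col .. ")"
end

-- Hypothetical grammar: a sum of numbers; a '+' with no number after it
-- throws the custom label missingNumber (the p^name construction).
local grammar_text = [[
	sum    <- NUMBER ('+' NUMBER^missingNumber)*
	NUMBER <- [0-9]+
]]

local labels = { missingNumber = "Expected a number after '+'" }

-- Guarded require, so the sketch still loads when parser-gen is not installed.
local ok, pg = pcall(require, "parser-gen")
if ok then
	pg.setlabels(labels) -- labels must be set before compiling
	local grammar = pg.compile(grammar_text, {})
	local result, errors = pg.parse("1 + 2 +", grammar, collect)
	print(result, errors, #reports)
end
```

Collecting the reports in a table rather than printing them is only one possible design; it keeps the callback easy to inspect after the parse has finished.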