A Modular SGLR Parsing Architecture for Systematic Performance Optimization
LR parsing and its deterministic variants only support a subset of the class of context-free grammars. A more powerful algorithm is required, for example, to support composable syntax definitions (for grammar modularity and reusability), because only the full class of context-free grammars is closed under composition. GLR parsing is such an algorithm. It is an extension of LR parsing that builds on LR parser generation and can handle the full class of context-free grammars, including ambiguous ones and grammars that require unbounded lookahead.
To support language composition, conflicts in the interface between the lexical and context-free syntax need to be avoided. Scannerless parsing solves this issue. Traditional parsing techniques work with separate lexical and context-free analysis phases: in the lexical analysis phase, a scanner strips layout and transforms characters into tokens, which are then processed by a context-free parser. Scannerless parsing uses a syntax definition in which the lexical part of the grammar is integrated into the context-free part, increasing declarativeness and grammar maintainability. SGLR parsing combines GLR with scannerless parsing, discarding the scanner and leading to a simpler parser architecture. In the resulting parse trees, lexical tokens appear as expanded subtrees with characters at the leaves; imploding parse forests to abstract syntax trees therefore also incorporates tokenization.
While SGLR is an advanced and flexible approach to parsing, it suffers from performance issues compared to other approaches such as ANTLR. Multiple performance improvements have been proposed for GLR parsing that can be applied to SGLR as well. To systematically evaluate such improvements, a modular architecture is desired in which components can be replaced by improved variants without affecting the rest of the parser. Those variants can include algorithmic, data structure, and general software engineering improvements.
This work presents a modular architecture for SGLR parsing, an implementation of this architecture with several variants in the Spoofax Language Workbench, and a systematic benchmarking approach to compare the variants.
We present an SGLR decomposition into four components: stack representation and operations, parse forest representation and operations, reducing, and the parser itself, which depends on compatible implementations of the first three components. This architecture, including imploding and tokenization, is implemented in Java in Spoofax as JSGLR2, the successor of the current parser implementation JSGLR. It uses parse tables generated from the syntax definition formalism SDF.
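The four-component decomposition can be pictured as a parser parameterized by the other three components. The following is an illustrative sketch only; the interface and method names are assumptions for exposition, not the actual JSGLR2 API:

```java
// Hypothetical sketch of the decomposition: the parser is generic over the
// stack, parse forest, and reducing components, so a variant can swap one
// component without touching the others.
interface StackManager<StackNode> {
    StackNode initialStack(int initialState);
    // further stack operations: adding links, finding reduction paths, ...
}

interface ParseForestManager<ParseForest> {
    ParseForest characterNode(int character);
    // further operations: creating parse nodes, adding derivations, ...
}

interface Reducer<StackNode, ParseForest> {
    void reduce(StackNode stack, ParseForest tree);
}

final class Parser<StackNode, ParseForest> {
    final StackManager<StackNode> stacks;
    final ParseForestManager<ParseForest> forests;
    final Reducer<StackNode, ParseForest> reducer;

    // A parser instance requires compatible implementations of all three
    // components; variants are created by substituting one of them.
    Parser(StackManager<StackNode> stacks,
           ParseForestManager<ParseForest> forests,
           Reducer<StackNode, ParseForest> reducer) {
        this.stacks = stacks;
        this.forests = forests;
        this.reducer = reducer;
    }
}
```

A benchmarked variant then corresponds to one particular choice of component implementations passed to the parser constructor.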
The architecture enables the systematic benchmarking of several implementations by parsing inputs for multiple real-world programming languages and artificial grammars with specific properties. Variants are evaluated by replacing only the components they impact, leaving the others intact. Currently, two variants are implemented that achieve better performance compared to a naive implementation of the original SGLR algorithm, which serves as the baseline.
First, reducing is improved by implementing a hybrid LR/GLR approach as presented in Elkhound. It detects which parts of the input are deterministic and switches to regular LR parsing where possible, discarding the overhead involved with GLR. Determinism is detected by maintaining the deterministic depth of stacks. When reducing a production, regular LR can be used if the deterministic depth is at least as large as the number of symbols on the right-hand side of the production. Our architecture enables this variant by replacing the stack representation and by using a specialized reducing component, independently of the parse forest representation.
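The deterministic depth mechanism can be sketched as follows. This is a hedged illustration of the Elkhound idea, not JSGLR2's actual stack implementation; the class and field names, and the convention that the initial stack has depth 1, are assumptions:

```java
// Sketch of Elkhound-style determinism detection: each stack node tracks
// its deterministic depth, i.e. the length of the unambiguous chain of
// links below it. A reduction can use plain LR when that depth covers the
// whole right-hand side of the production being reduced.
final class StackNode {
    final StackNode parent;        // single link; null for the initial stack
    final int deterministicDepth;  // would be reset when a node gains multiple links

    StackNode(StackNode parent, int deterministicDepth) {
        this.parent = parent;
        this.deterministicDepth = deterministicDepth;
    }

    static StackNode push(StackNode below) {
        // a new node reached over a single link is deterministic one level deeper
        return new StackNode(below, below.deterministicDepth + 1);
    }

    boolean canReduceDeterministically(int rhsLength) {
        return deterministicDepth >= rhsLength;
    }
}
```

In the full algorithm, the depth is reset to 0 whenever a stack node acquires a second link, since the paths below it are then no longer unique.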
Second, a more efficient parse forest representation is implemented that includes software engineering improvements. The original parse forest representation used lists to store references to derivations, even though there often is only one. The improved variant embeds the first derivation as a field directly in the parse node and only initializes a list in the case of ambiguity. This variant only impacts the parse forest component.
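The optimization amounts to making the common, unambiguous case allocation-free. The sketch below uses assumed names to illustrate the idea; it is not the actual JSGLR2 class:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the improved representation: the first derivation lives in a
// plain field, and the list of further derivations is allocated lazily,
// only when a second derivation makes the node ambiguous.
final class ParseNode<Derivation> {
    private final Derivation firstDerivation;
    private List<Derivation> otherDerivations; // null in the common, unambiguous case

    ParseNode(Derivation firstDerivation) {
        this.firstDerivation = firstDerivation;
    }

    void addDerivation(Derivation derivation) {
        if (otherDerivations == null)
            otherDerivations = new ArrayList<>();
        otherDerivations.add(derivation);
    }

    boolean isAmbiguous() {
        return otherDerivations != null;
    }

    Derivation onlyDerivation() {
        return firstDerivation;
    }
}
```

Since most parse nodes are unambiguous, this avoids allocating a list object per node and saves an indirection when reading the single derivation.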
Early results show a parsing time decrease on lexical inputs of 16% with the improved parse forest representation, 12% with the Elkhound approach, and 20% with both improvements combined.
Sun 22 Oct