allstar techreport

Transcript

Adaptive LL(*) Parsing: The Power of Dynamic Analysis

Terence Parr, University of San Francisco ([email protected])
Sam Harwell, University of Texas at Austin ([email protected])
Kathleen Fisher, Tufts University ([email protected])

Abstract

Despite the advances made by modern parsing strategies such as PEG, LL(*), GLR, and GLL, parsing is not a solved problem. Existing approaches suffer from a number of weaknesses, including difficulties supporting side-effecting embedded actions, slow and/or unpredictable performance, and counterintuitive matching strategies. This paper introduces the ALL(*) parsing strategy that combines the simplicity, efficiency, and predictability of conventional top-down LL(k) parsers with the power of a GLR-like mechanism to make parsing decisions. The critical innovation is to move grammar analysis to parse-time, which lets ALL(*) handle any non-left-recursive context-free grammar. ALL(*) is O(n^4) in theory but consistently performs linearly on grammars used in practice, outperforming general strategies such as GLL and GLR by orders of magnitude. ANTLR 4 generates ALL(*) parsers and supports direct left-recursion through grammar rewriting. Widespread ANTLR 4 use (5000 downloads/month in 2013) provides evidence that ALL(*) is effective for a wide variety of applications.

1. Introduction

Computer language parsing is still not a solved problem in practice, despite the sophistication of modern parsing strategies and a long history of academic study. When machine resources were scarce, it made sense to force programmers to contort their grammars to fit the constraints of deterministic(1) LALR(k) or LL(k) parser generators. As machine resources grew, researchers developed more powerful, but more costly, nondeterministic parsing strategies following both "bottom-up" (LR-style) and "top-down" (LL-style) approaches. Strategies include GLR [26], Parser Expression Grammar (PEG) [9], LL(*) [20] from ANTLR 3, and recently, GLL [25], a fully general top-down strategy.

Although these newer strategies are much easier to use than LALR(k) and LL(k) parser generators, they suffer from a variety of weaknesses. First, nondeterministic parsers sometimes have unanticipated behavior. GLL and GLR return multiple parse trees (forests) for ambiguous grammars because they were designed to handle natural language grammars, which are often intentionally ambiguous. For computer languages, ambiguity is almost always an error. One can certainly walk the constructed parse forest to disambiguate it, but that approach costs extra time, space, and machinery for the uncommon case. PEGs are unambiguous by definition but have a quirk where rule A → a | ab (meaning "A matches either a or ab") can never match ab since PEGs choose the first alternative that matches a prefix of the remaining input. Nested backtracking makes debugging PEGs difficult.

Second, side-effecting programmer-supplied actions (mutators) like print statements should be avoided in any strategy that continuously speculates (PEG) or supports multiple interpretations of the input (GLL and GLR) because such actions may never really take place [17]. (Though DParser [24] supports "final" actions when the programmer is certain a reduction is part of an unambiguous final parse.) Without side effects, actions must buffer data for all interpretations in immutable data structures or provide undo actions. The former mechanism is limited by memory size and the latter is not always easy or possible. The typical approach to avoiding mutators is to construct a parse tree for post-parse processing, but such artifacts fundamentally limit parsing to input files whose trees fit in memory. Parsers that build parse trees cannot analyze large data files or infinite streams, such as network traffic, unless they can be processed in logical chunks.

Third, our experiments (Section 7) show that GLL and GLR can be slow and unpredictable in time and space. Their complexities are, respectively, O(n^3) and O(n^(p+1)) where p is the length of the longest production in the grammar [14]. (GLR is typically quoted as O(n^3) because Kipps [15] gave such an algorithm, albeit with a constant so high as to be impractical.) In theory, general parsers should handle deterministic grammars in near-linear time. In practice, we found GLL and GLR to be ~135x slower than ALL(*) on a corpus of 12,920 Java 6 library source files (123M) and 6 orders of magnitude slower on a single 3.2M Java file, respectively.

LL(*) addresses these weaknesses by providing a mostly deterministic parsing strategy that uses regular expressions, represented as deterministic finite automata (DFA), to potentially examine the entire remaining input rather than the fixed k-sequences of LL(k). Using DFA for lookahead limits LL(*) decisions to distinguishing alternatives with regular lookahead languages, even though lookahead languages (the set of all possible remaining input phrases) are often context-free. But the main problem is that the LL(*) grammar condition is statically undecidable and grammar analysis sometimes fails to find regular expressions that distinguish between alternative productions. ANTLR 3's static analysis detects and avoids potentially-undecidable situations, failing over to backtracking parsing decisions instead. This gives LL(*) the same a | ab quirk as PEGs for such decisions. Backtracking decisions that choose the first matching alternative also cannot detect obvious ambiguities such as A → α | α where α is some sequence of grammar symbols that makes α | α non-LL(*).

(1) We use the term deterministic in the way that deterministic finite automata (DFA) differ from nondeterministic finite automata (NFA): the next symbol(s) uniquely determine action.
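To make the a | ab quirk above concrete, here is a minimal, self-contained sketch (ours, not from the paper) of PEG-style ordered choice in Java; the class and method names are hypothetical:

public class PegQuirk {
    // Rule A -> 'a' / 'a' 'b' with PEG semantics: try alternatives in
    // order and commit to the first one that matches a prefix.
    // Returns the number of characters matched, or -1 on failure.
    static int matchA(String input) {
        if (input.startsWith("a"))  return 1; // first alternative wins...
        if (input.startsWith("ab")) return 2; // ...so this is unreachable on "ab"
        return -1;
    }

    public static void main(String[] args) {
        // Prints "matched 1 of 2 chars": A can never match all of "ab".
        System.out.println("matched " + matchA("ab") + " of 2 chars");
    }
}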

1.1 Dynamic grammar analysis

In this paper, we introduce Adaptive LL(*), or ALL(*), parsers that combine the simplicity of deterministic top-down parsers with the power of a GLR-like mechanism to make parsing decisions. Specifically, LL parsing suspends at each prediction decision point (nonterminal) and then resumes once the prediction mechanism has chosen the appropriate production to expand. The critical innovation is to move grammar analysis to parse-time; no static grammar analysis is needed. This choice lets us avoid the undecidability of static LL(*) grammar analysis and lets us generate correct parsers (Theorem 6.1) for any non-left-recursive context-free grammar (CFG). While static analysis must consider all possible input sequences, dynamic analysis need only consider the finite collection of input sequences actually seen.

The idea behind the ALL(*) prediction mechanism is to launch subparsers at a decision point, one per alternative production. The subparsers operate in pseudo-parallel to explore all possible paths. Subparsers die off as their paths fail to match the remaining input. The subparsers advance through the input in lockstep so analysis can identify a sole survivor at the minimum lookahead depth that uniquely predicts a production. If multiple subparsers coalesce together or reach the end of file, the predictor announces an ambiguity and resolves it in favor of the lowest production number associated with a surviving subparser. (Productions are numbered to express precedence as an automatic means of resolving ambiguities like PEGs; Bison also resolves conflicts by choosing the production specified first.) Programmers can also embed semantic predicates [22] to choose between ambiguous interpretations.

ALL(*) parsers memoize analysis results, incrementally and dynamically building up a cache of DFA that map lookahead phrases to predicted productions. (We use the term analysis in the sense that ALL(*) analysis yields lookahead DFA like static LL(*) analysis.) The parser can make future predictions at the same parser decision and lookahead phrase quickly by consulting the cache. Unfamiliar input phrases trigger the grammar analysis mechanism, simultaneously predicting an alternative and updating the DFA. DFA are suitable for recording prediction results, despite the fact that the lookahead language at a given decision typically forms a context-free language. Dynamic analysis only needs to consider the finite context-free language subsets encountered during a parse, and any finite set is regular.

To avoid the exponential nature of nondeterministic subparsers, prediction uses a graph-structured stack (GSS) [25] to avoid redundant computations. GLR uses essentially the same strategy except that ALL(*) only predicts productions with such subparsers whereas GLR actually parses with them. Consequently, GLR must push terminals onto the GSS but ALL(*) does not.

ALL(*) parsers handle the task of matching terminals and expanding nonterminals with the simplicity of LL but have O(n^4) theoretical time complexity (Theorem 6.3) because, in the worst case, the parser must make a prediction at each input symbol and each prediction must examine the entire remaining input; examining an input symbol can cost O(n^2). O(n^4) is in line with the complexity of GLR. In Section 7, we show that ALL(*) parsers for common languages are efficient and exhibit linear behavior in practice.

The advantages of ALL(*) stem from moving grammar analysis to parse time, but this choice places an additional burden on grammar functional testing. As with all dynamic approaches, programmers must cover as many grammar position and input sequence combinations as possible to find grammar ambiguities. Standard source code coverage tools can help programmers measure grammar coverage for ALL(*) parsers. High coverage in the generated code corresponds to high grammar coverage.

The ALL(*) algorithm is the foundation of the ANTLR 4 parser generator (ANTLR 3 is based upon LL(*)). ANTLR 4 was released in January 2013 and gets about 5000 downloads/month (source, binary, or ANTLRworks2 development environment, counting non-robot entries in web logs with unique IP addresses to get a lower bound). Such activity provides evidence that ALL(*) is useful and usable.

The remainder of this paper is organized as follows. We begin by introducing the ANTLR 4 parser generator (Section 2) and discussing the ALL(*) parsing strategy (Section 3). Next, we define predicated grammars, their augmented transition network representation, and lookahead DFA (Section 4). Then, we describe ALL(*) grammar analysis and present the parsing algorithm itself (Section 5). Finally, we support our claims regarding ALL(*) correctness (Section 6) and efficiency (Section 7) and examine related work (Section 8). Appendix A has proofs for key ALL(*) theorems, Appendix B discusses algorithm pragmatics, and Appendix C has left-recursion elimination details.
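Returning to the lookahead-DFA cache described in Section 1.1, the memoization pattern can be sketched in a few lines of Java (our simplification; the String lookahead key and the fullAnalysis stub are hypothetical stand-ins for ANTLR's actual structures):

import java.util.HashMap;
import java.util.Map;

public class PredictionCache {
    // Per-decision cache mapping a lookahead phrase to the production
    // it predicts, playing the role of the incrementally built DFA.
    private final Map<String, Integer> cache = new HashMap<>();

    // Stand-in for full ATN simulation on a cache miss.
    private int fullAnalysis(String lookahead) {
        return lookahead.startsWith("f") ? 2 : 1; // placeholder logic
    }

    int predict(String lookahead) {
        Integer alt = cache.get(lookahead);
        if (alt != null) return alt;      // hit: skip analysis entirely
        alt = fullAnalysis(lookahead);    // miss: analyze once...
        cache.put(lookahead, alt);        // ...and memoize for next time
        return alt;
    }
}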

2. ANTLR 4

ANTLR 4 accepts as input any context-free grammar that does not contain indirect or hidden left-recursion.(2) From the grammar, ANTLR 4 generates a recursive-descent parser that uses an ALL(*) production prediction function (Section 3). ANTLR currently generates parsers in Java or C#. ANTLR 4 grammars use yacc-like syntax with extended BNF (EBNF) operators such as Kleene star (*) and token literals in single quotes. Grammars contain both lexical and syntactic rules in a combined specification for convenience. ANTLR 4 generates both a lexer and a parser from the combined specification. By using individual characters as input symbols, ANTLR 4 grammars can be scannerless and composable because ALL(*) languages are closed under union (Theorem 6.2), providing the benefits of modularity described by Grimm [10]. (We will henceforth refer to ANTLR 4 as ANTLR and explicitly mark earlier versions.)

grammar Ex;                           // generates class ExParser
// action defines ExParser member: enum_is_keyword
@members {boolean enum_is_keyword = true;}
stat: expr '=' expr ';'               // production 1
    | expr ';'                        // production 2
    ;
expr: expr '*' expr
    | expr '+' expr
    | expr '(' expr ')'               // f(x)
    | id
    ;
id  : ID | {!enum_is_keyword}? 'enum' ;
ID  : [A-Za-z]+ ;                     // match id with upper, lowercase
WS  : [ \t\r\n]+ -> skip ;            // ignore whitespace

Figure 1. Sample left-recursive ANTLR 4 predicated grammar Ex

Programmers can embed side-effecting actions (mutators), written in the host language of the parser, in the grammar. The actions have access to the current state of the parser. The parser ignores mutators during speculation to prevent actions from "launching missiles" speculatively. Actions typically extract information from the input stream and create data structures. ANTLR also supports semantic predicates, which are side-effect-free Boolean-valued expressions written in the host language that determine the semantic viability of a particular production. Semantic predicates that evaluate to false during the parse render the surrounding production nonviable, dynamically altering the language generated by the grammar at parse-time.(3) Predicates significantly increase the strength of a parsing strategy because predicates can examine the parse stack and surrounding input context to provide an informal context-sensitive parsing capability. Semantic actions and predicates typically work together to alter the parse based upon previously-discovered information. For example, a C grammar could have embedded actions to define type symbols from constructs, like typedef int i32;, and predicates to distinguish type names from other identifiers in subsequent definitions like i32 x;.

2.1 Sample grammar

Figure 1 illustrates ANTLR's yacc-like metalanguage by giving the grammar for a simple programming language with assignment and expression statements terminated by semicolons. There are two grammar features that render this grammar non-LL(*) and, hence, unacceptable to ANTLR 3. First, rule expr is left recursive. ANTLR 4 automatically rewrites the rule to be non-left-recursive and unambiguous, as described in Section 2.2. Second, the alternative productions of rule stat have a common recursive prefix (expr), which is sufficient to render stat undecidable from an LL(*) perspective. ANTLR 3 would detect recursion on production left edges and fail over to a backtracking decision at runtime.

Predicate {!enum_is_keyword}? in rule id allows or disallows enum as a valid identifier according to the predicate at the moment of prediction. When the predicate is false, the parser treats id as just id : ID ;, disallowing enum as an id, as the lexer matches enum as a separate token from ID. This example demonstrates how predicates allow a single grammar to describe subsets or variations of the same language.

2.2 Left-recursion removal

The ALL(*) parsing strategy itself does not support left-recursion, but ANTLR supports direct left-recursion through grammar rewriting prior to parser generation. Direct left-recursion covers the most common cases, such as arithmetic expression productions, like E → E . id, and C declarators. We made an engineering decision not to support indirect or hidden left-recursion because these forms are much less common and removing all left recursion can lead to exponentially-big transformed grammars. For example, the C11 language specification grammar contains lots of direct left-recursion but no indirect or hidden recursion. See Appendix C for more details.

2.3 Lexical analysis with ALL(*)

ANTLR uses a variation of ALL(*) for lexing that fully matches tokens instead of just predicting productions like ALL(*) parsers do. After warm-up, the lexer will have built a DFA similar to what regular-expression-based tools such as lex would create statically. The key difference is that ALL(*) lexers are predicated context-free grammars, not just regular expressions, so they can recognize context-free tokens such as nested comments and can gate tokens in and out according to semantic context. This design is possible because ALL(*) is fast enough to handle lexing as well as parsing.

ALL(*) is also suitable for scannerless parsing because of its recognition power, which comes in handy for context-sensitive lexical problems like merging C and SQL languages. Such a union has no clear lexical sentinels demarcating lexical regions:

int next = select ID from users where name='Raj'+1;
int from = 1, select = 2;
int x = select * from;

See grammar code/extras/CSQL in [19] for a proof of concept.

(2) Indirectly left-recursive rules call themselves through another rule; e.g., A → B, B → A. Hidden left-recursion occurs when an empty production exposes left recursion; e.g., A → BA, B → ε.

(3) Previous versions of ANTLR supported syntactic predicates to disambiguate cases where static grammar analysis failed; this facility is not needed in ANTLR 4 because of ALL(*)'s dynamic analysis.
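To illustrate the kind of rewrite Section 2.2 refers to, here is our sketch of the standard direct left-recursion elimination applied to expr from Figure 1, in ANTLR notation; ANTLR's actual transformation is more elaborate because it also encodes operator precedence and associativity (Appendix C):

// Directly left-recursive, as the grammar author writes it:
expr: expr '*' expr
    | expr '+' expr
    | expr '(' expr ')'   // f(x)
    | id
    ;

// An equivalent non-left-recursive loop form (our sketch):
expr: id expr_suffix* ;
expr_suffix: '*' expr | '+' expr | '(' expr ')' ;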

3. Introduction to ALL(*) parsing

In this section, we explain the ideas and intuitions behind ALL(*) parsing. Section 5 will then present the algorithm more formally. The strength of a top-down parsing strategy is related to how the strategy chooses which alternative production to expand for the current nonterminal. Unlike LL(k) and LL(*) parsers, ALL(*) parsers always choose the first alternative that leads to a valid parse. All non-left-recursive grammars are therefore ALL(*).

Instead of relying on static grammar analysis, an ALL(*) parser adapts to the input sentences presented to it at parse-time. The parser analyzes the current decision point (a nonterminal with multiple productions) using a GLR-like mechanism to explore all possible decision paths with respect to the current "call" stack of in-process nonterminals and the remaining input on-demand. The parser incrementally and dynamically builds a lookahead DFA per decision that records a mapping from lookahead sequence to predicted production number. If the DFA constructed to date matches the current lookahead, the parser can skip analysis and immediately expand the predicted alternative. Experiments in Section 7 show that ALL(*) parsers usually get DFA cache hits and that DFA are critical to performance.

Because ALL(*) differs from deterministic top-down methods only in the prediction mechanism, we can construct conventional LL recursive-descent parsers but with an important twist. ALL(*) parsers call a special prediction function, adaptivePredict, that analyzes the grammar to construct lookahead DFA instead of simply comparing the lookahead to a statically-computed token set. Function adaptivePredict takes a nonterminal and parser call stack as parameters and returns the predicted production number or throws an exception if there is no viable production. For example, rule stat from the example in Section 2.1 yields a parsing procedure similar to Figure 2.

void stat() {                        // parse according to rule stat
  switch (adaptivePredict("stat", call stack)) {
    case 1 :                         // predict production 1
      expr(); match('='); expr(); match(';'); break;
    case 2 :                         // predict production 2
      expr(); match(';'); break;
  }
}

Figure 2. Recursive-descent code for stat in grammar Ex

[Figure 3, a diagram, shows the ATN submachine for rule stat in grammar Ex: from the entry state, ε edges fan out to the left edges of production 1 (expr '=' expr ';') and production 2 (expr ';'), which rejoin at the rule stop state.]

ALL(*) prediction has a structure similar to the well-known NFA-to-DFA subset construction algorithm. The goal is to discover the set of states the parser could reach after having seen some or all of the remaining input relative to the current decision. As in subset construction, an ALL(*) DFA state is the set of parser configurations possible after matching the input leading to that state. Instead of an NFA, however, ALL(*) simulates the actions of an augmented recursive transition network (ATN) [27] representation of the grammar since ATNs closely mirror grammar structure. (ATNs look just like syntax diagrams that can have actions and semantic predicates.) LL(*)'s static analysis also operates on an ATN for the same reason. Figure 3 shows the ATN submachine for rule stat. An ATN configuration represents the execution state of a subparser and tracks the ATN state, predicted production number, and ATN subparser call stack: tuple (p, i, γ).(4) Configurations include production numbers so prediction can identify which production matches the current lookahead. Unlike static LL(*) analysis, ALL(*) incrementally builds DFA considering just the lookahead sequences it has seen instead of all possible sequences.

When parsing reaches a decision for the first time, adaptivePredict initializes the lookahead DFA for that decision by creating a DFA start state, D0. D0 is the set of ATN subparser configurations reachable without consuming an input symbol starting at each production left edge. For example, construction of D0 for nonterminal stat in Figure 3 would first add ATN configurations (p, 1, []) and (q, 2, []) where p and q are ATN states corresponding to production 1 and 2's left edges and [] is the empty subparser call stack (stat is the start symbol).

Analysis next computes a new DFA state indicating where ATN simulation could reach after consuming the first lookahead symbol and then connects the two DFA states with an edge labeled with that symbol. Analysis continues, adding new DFA states, until all ATN configurations in a newly-created DFA state predict the same production: (−, i, −). Function adaptivePredict marks that state as an accept state and returns to the parser with that production number. Figure 4a shows the lookahead DFA for decision stat after adaptivePredict has analyzed input sentence x=y;. The DFA does not look beyond = because = is sufficient to uniquely distinguish expr's productions. (Notation :1 means "predict production 1.")

[Figure 4, a diagram, shows the prediction DFA for decision stat: (a) after x=y;, the path D0 -ID-> s1 -'='-> :1; (b) after f(x); and then x=y;, the same D0 -ID-> s1 plus a path from s1 over '(' ID ')' ';' to :2.]

In the typical case, adaptivePredict finds an existing DFA for a particular decision. The goal is to find or build a path through the DFA to an accept state. If adaptivePredict reaches a (non-accept) DFA state without an edge for the current lookahead symbol, it reverts to ATN simulation to extend the DFA (without rewinding the input). For example, to analyze a second input phrase for stat, such as f(x);, adaptivePredict finds an existing ID edge from D0 and jumps to s1 without ATN simulation. There is no existing edge from s1 for the left parenthesis, so analysis simulates the ATN to complete a path to an accept state, which predicts the second production, as shown in Figure 4b. Note that because sequence ID(ID) predicts both productions, analysis continues until the DFA has edges for the = and ; symbols.

If ATN simulation computes a new target state that already exists in the DFA, simulation adds a new edge targeting the existing state and switches back to DFA simulation mode starting at that state. Targeting existing states is how cycles can appear in the DFA. Extending the DFA to handle unfamiliar phrases empirically decreases the likelihood of future ATN simulation, thereby increasing parsing speed (Section 7).

(4) Component i does not exist in the machine configurations of GLL, GLR, or Earley [8].
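The find-or-extend control flow just described can be sketched as a self-contained Java fragment (ours; the char-keyed DFA and the ATN stub are hypothetical simplifications of ANTLR's runtime):

import java.util.HashMap;
import java.util.Map;

public class DfaFirstPredict {
    static class State {
        final Map<Character, State> edges = new HashMap<>();
        int acceptAlt = -1;                 // > 0 marks an accept state :i
    }

    // Stand-in for ATN simulation, which would add the missing
    // states/edges and return the predicted production.
    static int simulateAtnAndExtend(State from, String rest) {
        throw new UnsupportedOperationException("ATN simulation goes here");
    }

    static int predict(State d0, String lookahead) {
        State d = d0;
        for (int i = 0; ; i++) {
            if (d.acceptAlt > 0) return d.acceptAlt;   // DFA path hit
            State next = i < lookahead.length()
                       ? d.edges.get(lookahead.charAt(i)) : null;
            if (next == null)                          // unfamiliar phrase:
                return simulateAtnAndExtend(d, lookahead.substring(i));
            d = next;                                  // keep walking the DFA
        }
    }
}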

(Java interface methods cannot have bodies.) Here is a simplified grammar that exhibits a stack-sensitive decision in nonterminal A:

S → xB | yC
B → Aa
C → Aba
A → b | ε

Without the parser stack, no amount of lookahead can uniquely distinguish between A's productions. Lookahead ba predicts A → b when B invokes A but predicts A → ε when C invokes A. If prediction ignores the parser call stack, there is a prediction conflict upon ba.

Parsers that ignore the parser call stack for prediction are called Strong LL (SLL) parsers. The recursive-descent parsers programmers build by hand are in the SLL class. By convention, the literature refers to SLL as LL but we distinguish the terms since "real" LL is needed to handle all grammars. The above grammar is LL(2) but not SLL(k) for any k, though duplicating A for each call site renders the grammar SLL(2).

Creating a different lookahead DFA for each possible parser call stack is not feasible since the number of stack permutations is exponential in the stack depth. Instead, we take advantage of the fact that most decisions are not stack-sensitive and build lookahead DFA ignoring the parser call stack. If SLL ATN simulation finds a prediction conflict (Section 5.3), it cannot be sure if the lookahead phrase is ambiguous or stack-sensitive. In this case, adaptivePredict must re-examine the lookahead using the parser stack γ0. This hybrid or optimized LL mode improves performance by caching stack-insensitive prediction results in lookahead DFA when possible while retaining full stack-sensitive prediction power. Optimized LL mode applies on a per-decision basis, but two-stage parsing, described next, can often avoid LL simulation completely. (We henceforth use SLL to indicate stack-insensitive parsing and LL to indicate stack-sensitive.)

3.2 Two-stage ALL(*) parsing

SLL is weaker but faster than LL. Since we have found that most decisions are SLL in practice, it makes sense to attempt parsing entire inputs in "SLL-only mode," which is stage one of the two-stage ALL(*) parsing algorithm. If, however, SLL mode finds a syntax error, it might have found an SLL weakness or a real syntax error, so we have to retry the entire input using optimized LL mode, which is stage two. This counter-intuitive strategy, which potentially parses entire inputs twice, can dramatically increase speed over the single-stage optimized LL mode. For example, two-stage parsing with the Java grammar (Section 7) is 8x faster than one-stage optimized LL mode when parsing a 123M corpus. The two-stage strategy relies on the fact that SLL either behaves like LL or gets a syntax error (Theorem 6.5). For invalid sentences, there is no derivation for the input regardless of how the parser chooses productions. For valid sentences, SLL chooses productions as LL would or picks a production that ultimately leads to a syntax error (LL finds that choice nonviable). Even in the presence of ambiguities, SLL often resolves conflicts as LL would. For example, despite a few ambiguities in our Java grammar, SLL mode correctly parses all inputs we have tried without failing over to LL. Nonetheless, the second (LL) stage must remain to ensure correctness.
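The two-stage strategy amounts to a try/retry wrapper, sketched here in Java with hypothetical types (ANTLR realizes the same idea inside its generated parsers):

public class TwoStage {
    enum Mode { SLL, LL }
    static class SyntaxError extends Exception {}
    interface Parser { Object parse(Mode mode) throws SyntaxError; }

    static Object parseTwoStage(Parser parser) throws SyntaxError {
        try {
            return parser.parse(Mode.SLL); // stage 1: fast, stack-insensitive
        } catch (SyntaxError e) {
            // Either a real error or an SLL weakness; by Theorem 6.5,
            // retrying with full LL prediction settles which.
            return parser.parse(Mode.LL);  // stage 2: stack-sensitive
        }
    }
}

(The sketch assumes the Parser rewinds its input at the start of each parse call.)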

4. Predicated grammars, ATNs, and DFA

To formalize ALL(*) parsing, we first need to recall some background material, specifically the formal definitions of predicated grammars, ATNs, and lookahead DFA.

4.1 Predicated grammars

We first formally define the predicated grammars from which ALL(*) parsers are derived. A predicated grammar G = (N, T, P, S, Π, M) has elements:

• N is the set of nonterminals (rule names)
• T is the set of terminals (tokens)
• P is the set of productions
• S ∈ N is the start symbol
• Π is a set of side-effect-free semantic predicates
• M is a set of actions (mutators)

ALL(*) predicated grammars differ from those of LL(*) [20] only in that ALL(*) grammars do not need or support syntactic predicates. Predicated grammars in the formal sections of this paper use the notation shown in Figure 5.

A ∈ N               Nonterminal
a, b, c, d ∈ T      Terminal
X ∈ (N ∪ T)         Production element
α, β, δ ∈ X*        Sequence of grammar symbols
u, v, w, x, y ∈ T*  Sequence of terminals
ε                   Empty string
$                   End-of-file "symbol"
π ∈ Π               Predicate in host language
μ ∈ M               Action in host language
λ ∈ (N ∪ Π ∪ M)     Reduction label
~λ = λ1..λn         Sequence of reduction labels

Production rules:
A → αi              ith context-free production of A
A → {πi}? αi        ith production predicated on semantics
A → {μi}            ith production with mutator

Figure 5. Predicated Grammar Notation

The derivation rules in Figure 6 define the meaning of a predicated grammar. To support semantic predicates and mutators, the rules reference state S, which abstracts user state during parsing. The judgment form (S, α) ⇒ (S′, β) may be read: "In machine state S, grammar sequence α reduces in one step to modified state S′ and grammar sequence β." The judgment (S, α) ⇒* (S′, β) denotes repeated applications of the one-step reduction rule. These reduction rules specify a leftmost derivation. A production with semantic predicate πi is viable only if πi is true of the current state S. Finally, an action production uses the specified mutator μi to update the state.

Prod:    (S, uAδ) ⇒ (S, uαiδ)            for A → αi
Sem:     (S, uAδ) ⇒ (S, uαiδ) if πi(S)   for A → {πi}? αi
Action:  (S, uAδ) ⇒ (μ(S), uδ)           for A → {μ}
Closure: if (S, α) ⇒ (S′, β) and (S′, β) ⇒* (S″, α′) then (S, α) ⇒* (S″, α′)

Figure 6. Predicated Grammar Leftmost Derivation Rules

Formally, the language generated by grammar sequence α in user state S is L(S, α) = {w | (S, α) ⇒* (S′, w)} and the language of grammar G is L(G) = {w | (S0, S) ⇒* (S, w)} for initial user state S0 (S0 can be empty). If w is a prefix of u or equal to u, we write w ≼ u. Language L is ALL(*) iff there exists an ALL(*) grammar for L. Theoretically, the language class of L(G) is recursively enumerable because each mutator could be a Turing machine. In reality, grammar writers do not use this generality, so it is standard practice to consider the language class to be the context-sensitive languages instead. The class is context-sensitive rather than context-free because predicates can examine the call stack and terminals to the left and right.

This formalism has various syntactic restrictions not present in actual ANTLR grammars, for example, forcing mutators into their own rules and disallowing the common Extended BNF (EBNF) notation such as α* and α+ closures. We can make these restrictions without loss of generality because any grammar in the general form can be translated into this more restricted form.

4.2 Resolving ambiguity

An ambiguous grammar is one in which the same input sequence can be recognized in multiple ways. The rules in Figure 6 do not preclude ambiguity. However, for a practical programming language parser, each input should correspond to a unique parse. To that end, ALL(*) uses the order of the productions in a rule to resolve ambiguities in favor of the production with the lowest number. This strategy is a concise way for programmers to resolve ambiguities automatically and resolves the well-known if-then-else ambiguity in the usual way by associating the else with the most recent if. PEGs and Bison parsers have the same resolution policy.

To resolve ambiguities that depend on the current state S, programmers can insert semantic predicates but must make them mutually exclusive for all potentially ambiguous input sequences to render such productions unambiguous. Statically, mutual exclusion cannot be enforced because predicates are written in a Turing-complete language. At parse-time, however, ALL(*) evaluates predicates and dynamically reports input phrases for which multiple predicated productions are viable. If the programmer fails to satisfy mutual exclusivity, ALL(*) uses production order to resolve the ambiguity.

4.3 Augmented transition networks

Given predicated grammar G = (N, T, P, S, Π, M), the corresponding ATN MG = (Q, Σ, ∆, E, F) has elements [20]:

• Q is the set of states
• Σ is the edge alphabet N ∪ T ∪ Π ∪ M
• ∆ is the transition relation mapping Q × (Σ ∪ ε) → Q
• E = {pA | A ∈ N} is the set of submachine entry states
• F = {p′A | A ∈ N} is the set of submachine final states

ATNs resemble the syntax diagrams used to document programming languages, with an ATN submachine for each nonterminal. Figure 7 shows how to construct the set of states Q and edges ∆ from grammar productions.

Input grammar element    Resulting ATN transitions
A → αi                   pA -ε-> pA,i -αi-> p′A
A → {πi}? αi             pA -ε-> pA,i -πi-> · -αi-> p′A
A → {μi}                 pA -ε-> pA,i -μi-> p′A
A → ε                    pA -ε-> pA,i -ε-> p′A
for α = X1 X2 ... Xm     p0 -X1-> p1 -X2-> ... -Xm-> pm, Xj ∈ N ∪ T, j = 1..m

Figure 7. Predicated Grammar to ATN transformation

The start state for A is pA ∈ Q and targets pA,i, created from the left edge of αi, with an edge in ∆. The last state created from αi targets p′A. Nonterminal edges p -A-> q are like function calls. They transfer control of the ATN to A's submachine, pushing return state q onto a state call stack so it can continue from q after reaching the stop state for A's submachine, p′A. Figure 8 gives the ATN for a simple grammar. The language matched by the ATN is the same as the language of the original grammar.

[Figure 8, a diagram, shows the ATN for G with P = {S → Ac | Ad, A → aA | b}: submachine S with ε edges from pS through pS,1 and pS,2 onto paths A c and A d (return states p1 and p3), and submachine A with ε edges from pA through pA,1 and pA,2 onto paths a A and b.]

4.4 Lookahead DFA

ALL(*) parsers record prediction results obtained from ATN simulation with lookahead DFA, which are DFA augmented with accept states that yield predicted production numbers. There is one accept state per production of a decision.

Definition 4.1. Lookahead DFA are DFA with augmented accept states that yield predicted production numbers. For predicated grammar G = (N, T, P, S, Π, M), DFA M = (Q, Σ, ∆, D0, F) where:

• Q is the set of states
• Σ = T is the edge alphabet
• ∆ : Q × Σ → Q is the transition function
• D0 ∈ Q is the start state
• F = {f1, f2, ..., fn} ⊆ Q are the final states, one fi per production i

A transition in ∆ from state p to state q on symbol a ∈ Σ has the form p -a-> q, and we require that p -a-> q′ implies q = q′.
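Definition 4.1 maps directly onto a small data structure; the following Java sketch (ours, not ANTLR's runtime classes) is one way to realize a lookahead DFA with production-numbered accept states:

import java.util.HashMap;
import java.util.Map;

public class LookaheadDfa {
    static class DfaState {
        // Deterministic transition function: token type -> next state.
        final Map<Integer, DfaState> edges = new HashMap<>();
        // Accept states f_i carry the predicted production i (> 0).
        int acceptProduction = -1;
    }

    final DfaState d0 = new DfaState(); // start state D0

    // Walk the DFA over token types; -1 means "no path yet", and the
    // caller would fall back to ATN simulation.
    int predict(int[] tokenTypes) {
        DfaState d = d0;
        for (int t : tokenTypes) {
            if (d.acceptProduction > 0) return d.acceptProduction;
            d = d.edges.get(t);
            if (d == null) return -1;
        }
        return d.acceptProduction;
    }
}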

5. ALL(*) parsing algorithm

With the definitions of grammars, ATNs, and lookahead DFA formalized, we can present the key functions of the ALL(*) parsing algorithm. This section starts with a summary of the functions and how they fit together, then discusses a critical graph data structure before presenting the functions themselves. We finish with an example of how the algorithm works.

Parsing begins with function parse, which behaves like a conventional top-down LL(k) parse function except that ALL(*) parsers predict productions with a special function called adaptivePredict, instead of the usual "switch on next k token types" mechanism. Function adaptivePredict simulates an ATN representation of the original predicated grammar to choose an αi production to expand for decision point A → α1 | ... | αn. Conceptually, prediction is a function of the current parser call stack γ0, remaining input wr, and user state S if A has predicates. For efficiency, prediction ignores γ0 when possible (Section 3.1) and uses the minimum lookahead from wr.

To avoid repeating ATN simulations for the same input and nonterminal, adaptivePredict assembles DFA that memoize input-to-predicted-production mappings, one DFA per nonterminal. Recall that each DFA state, D, is the set of ATN configurations possible after matching the lookahead symbols leading to that state. Function adaptivePredict calls startState to create the initial DFA state, D0, and then SLLpredict to begin simulation.

Function SLLpredict adds paths to the lookahead DFA that match some or all of wr through repeated calls to target. Function target computes DFA target state D′ from current state D using move and closure operations similar to those found in subset construction. Function move finds all ATN configurations reachable on the current input symbol, and closure finds all configurations reachable without traversing a terminal edge. The primary difference from subset construction is that closure simulates the call and return of ATN submachines associated with nonterminals.

If SLL simulation finds a conflict (Section 5.3), SLLpredict rewinds the input and calls LLpredict to retry prediction, this time considering γ0. Function LLpredict is similar to SLLpredict but does not update a nonterminal's DFA because DFA must be stack-insensitive to be applicable in all stack contexts. Conflicts within ATN configuration sets discovered by LLpredict represent ambiguities. Both prediction functions use getConflictSetsPerLoc to detect conflicts, which are configurations representing the same parser location but different productions.

To avoid failing over to LLpredict unnecessarily, SLLpredict uses getProdSetsPerState to see if a potentially non-conflicting DFA path remains when getConflictSetsPerLoc reports a conflict. If so, it is worth continuing with SLLpredict on the chance that more lookahead will resolve the conflict without recourse to full LL parsing.

Before describing these functions in detail, we review a fundamental graph data structure that they use to efficiently manage multiple call stacks a la GLL and GLR.

5.1 Graph-structured call stacks

The simplest way to implement ALL(*) prediction would be a classic backtracking approach, launching a subparser for each αi. The subparsers would consume all remaining input because backtracking subparsers do not know when to stop parsing—they are unaware of other subparsers' status. The independent subparsers would also lead to exponential time complexity. We address both issues by having the prediction subparsers advance in lockstep through wr. Prediction terminates after consuming prefix u ≼ wr when all subparsers but one die off or when prediction identifies a conflict. Operating in lockstep also provides an opportunity for subparsers to share call stacks, thus avoiding redundant computations.

Two subparsers at ATN state p that share the same ATN stack top, qγ1 and qγ2, will mirror each other's behavior until simulation pops q from their stacks. Prediction can treat those subparsers as a single subparser by merging stacks. We merge stacks for all configurations in a DFA state of the form (p, i, γ1) and (p, i, γ2), forming a general configuration (p, i, Γ) with Γ = γ1 ⊎ γ2, where ⊎ means graph-structured stack (GSS) [25] merge. Γ can be the empty stack [], a special stack # used for SLL prediction (addressed shortly), an individual stack, or a graph of stack nodes. Merging individual stacks into a GSS reduces the potential size from exponential to linear complexity (Theorem 6.4). To represent a GSS, we use an immutable graph data structure with maximal sharing of nodes. [Two small stack-graph diagrams are omitted here; in each, distinct tops p and q share a common suffix ending in the parser stack γ0 at the bottom.] In the functions that follow, all additions to configuration sets, such as with operator +=, implicitly merge stacks.

There is a special case related to the stack condition at the start of prediction. Γ must distinguish between an empty stack and no stack information. For LL prediction, the initial ATN simulation stack is the current parser call stack γ0. The initial stack is only empty, γ0 = [], when the decision entry rule is the start symbol. Stack-insensitive SLL prediction, on the other hand, ignores the parser call stack and uses an initial stack of #, indicating no stack information. This distinction is important when computing the closure (Function 7) of configurations representing submachine stop states. Without parser stack information, a subparser that returns from decision entry rule A must consider all possible invocation sites; i.e., closure sees configuration (p′A, −, #).

The empty stack [] is treated like any other node for LL prediction: Γ ⊎ [] yields the graph equivalent of set {Γ, []}, meaning that both Γ and the empty stack are possible. Pushing state p onto [] yields p[] not p, because popping p must leave the [] empty-stack symbol. For SLL prediction, Γ ⊎ # = # for any graph Γ because # acts like a wildcard and represents the set of all stacks. The wildcard therefore contains any Γ. Pushing state p onto # yields p#.
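A minimal Java sketch of the immutable, maximally shared stack nodes described above (our rendering; ANTLR's actual prediction contexts are more elaborate):

// Immutable GSS node: pushing shares the whole existing stack as a
// suffix; a node with several parents represents merged stacks that
// happen to have the same top return state.
public final class StackNode {
    final int returnState;
    final StackNode[] parents;  // length 0 = bottom of stack

    StackNode(int returnState, StackNode... parents) {
        this.returnState = returnState;
        this.parents = parents;
    }

    // O(1) push; the receiver is shared, never copied.
    StackNode push(int state) { return new StackNode(state, this); }

    // Merge two stacks with the same top: one node, both suffixes kept.
    static StackNode mergeSameTop(StackNode a, StackNode b) {
        assert a.returnState == b.returnState;
        StackNode[] ps = new StackNode[a.parents.length + b.parents.length];
        System.arraycopy(a.parents, 0, ps, 0, a.parents.length);
        System.arraycopy(b.parents, 0, ps, a.parents.length, b.parents.length);
        return new StackNode(a.returnState, ps);
    }
}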

5.2 ALL(*) parsing functions

We can now present the key ALL(*) functions, which we have highlighted in boxes and interspersed within the text of this section. Our discussion follows a top-down order and assumes that the ATN corresponding to grammar G, the semantic state S, the DFA under construction, and the input are in scope for all functions of the algorithm, and that semantic predicates and actions can directly access S.

Function parse. The main entry point is function parse (shown in Function 1), which initiates parsing at the start symbol, argument S. The function begins by initializing a simulation call stack γ to an empty stack and setting ATN state "cursor" p to pS,i, the ATN state on the left edge of S's production number i predicted by adaptivePredict. The function loops until the cursor reaches p′S, the submachine stop state for S. If the cursor reaches another submachine stop state, p′B, parse simulates a "return" by popping the return state q off the call stack and moving p to q.

Function 1: parse(S)
  γ := []; i := adaptivePredict(S, γ); p := pS,i;
  while true do
    if p = p′B (i.e., p is a rule stop state) then
      if B = S then return;                  // finished matching start rule S
      else let γ = qγ′ in p := q; γ := γ′;
    else
      switch t, where p -t-> q, do
        case b:                              // terminal symbol transition
          if b = input.curr() then p := q; input.advance();
          else parse error;
        case B: γ := qγ; i := adaptivePredict(B, γ); p := pB,i;
        case μ: S := μ(S); p := q;
        case π: if π(S) then p := q else parse error;
        case ε: p := q;
      endsw

For p not at a stop state, parse processes ATN transition p -t-> q. There can be only one transition from p because of the way ATNs are constructed. If t is a terminal edge and matches the current input symbol, parse transitions the edge and moves to the next symbol. If t is a nonterminal edge referencing some B, parse simulates a submachine call by pushing return state q onto the stack and choosing the appropriate production left edge in B by calling adaptivePredict and setting the cursor appropriately. For action edges, parse updates the state according to the mutator μ and transitions to q. For predicate edges, parse transitions to q only if predicate π evaluates to true. During the parse, failed predicates behave like mismatched tokens. Upon an ε edge, parse moves to q. Function parse does not explicitly check that parsing stops at end-of-file because applications like development environments need to parse input subphrases.

Function adaptivePredict. To predict a production, parse calls adaptivePredict (Function 2), which is a function of the decision nonterminal A and the current parser stack γ0. Because prediction only evaluates predicates during full LL simulation, adaptivePredict delegates to LLpredict if at least one of the productions is predicated.(5) For decisions that do not yet have a DFA, adaptivePredict creates DFA dfaA with start state D0 in preparation for SLLpredict to add DFA paths. D0 is the set of ATN configurations reachable without traversing a terminal edge. Function adaptivePredict also constructs the set of final states FDFA, which contains one final state fi for each production of A. The set of DFA states, QDFA, is the union of D0, FDFA, and the error state Derror. The vocabulary ΣDFA of the DFA is the set of grammar terminals T. For unpredicated decisions with existing DFA, adaptivePredict calls SLLpredict to obtain a prediction from the DFA, possibly extending the DFA through ATN simulation in the process. Finally, since adaptivePredict is looking ahead, not parsing, it must undo any changes made to the input cursor, which it does by capturing the input index upon entry and rewinding to start before returning.

Function 2: adaptivePredict(A, γ0) returns int alt
  start := input.index();                    // checkpoint input
  if ∃ A → {πi}? αi then
    alt := LLpredict(A, start, γ0);
    input.seek(start);                       // undo stream position changes
    return alt;
  if ∄ dfaA then
    D0 := startState(A, #);
    FDFA := {fi | fi := DFAState(i)} ∀ production i of A;
    QDFA := D0 ∪ FDFA ∪ Derror;
    dfaA := (QDFA, ΣDFA = T, ∆DFA = ∅, D0, FDFA);
  alt := SLLpredict(A, D0, start, γ0);
  input.seek(start);                         // undo stream position changes
  return alt;

Function startState. To create DFA start state D0, startState (Function 3) adds configurations (pA,i, i, γ) for each A → αi and each A → {πi}? αi whose πi evaluates to true. When called from adaptivePredict, call stack argument γ is the special symbol #, needed by SLL prediction, indicating "no parser stack information." When called from LLpredict, γ is the initial parser stack γ0. Computing the closure of the configurations completes D0.

Function 3: startState(A, γ) returns DFA state D0
  D0 := ∅;
  foreach pA -ε-> pA,i ∈ ∆ATN do
    if pA,i -πi-> p then π := πi else π := ε;
    if π = ε or eval(πi) then D0 += closure({}, (pA,i, i, γ));
  return D0;

Function SLLpredict. Function SLLpredict (Function 4) performs both DFA and SLL ATN simulation, incrementally adding paths to the DFA. In the best case, there is already a DFA path from D0 to an accept state, fi, for prefix u ≼ wr and some production number i. In the worst case, ATN simulation is required for every a in sequence u. The main loop in SLLpredict finds an existing edge emanating from DFA state cursor D upon a or computes a new one via target. It is possible that target will compute a target state that already exists in the DFA, D′, in which case function target returns D′ because D′ may already have outgoing edges computed; it would be inefficient to discard that work by replacing D′. At the next iteration, SLLpredict will consider edges from D′, effectively switching back to DFA simulation.

Function 4: SLLpredict(A, D0, start, γ0) returns int prod
  a := input.curr(); D := D0;
  while true do
    let D′ be DFA target D -a-> D′;
    if ∄ D′ then D′ := target(D, a);
    if D′ = Derror then parse error;
    if D′ stack sensitive then
      input.seek(start); return LLpredict(A, start, γ0);
    if D′ = fi ∈ FDFA then return i;
    D := D′; a := input.next();

(5) SLL prediction does not incorporate predicates for clarity in this exposition, but in practice ANTLR incorporates predicates into DFA accept states (Section B.2). ANTLR 3 DFA used predicated edges, not predicated accept states.

Once SLLpredict acquires a target state, D′, it checks for errors, stack sensitivity, and completion. If target marked D′ as stack-sensitive, prediction requires full LL simulation and SLLpredict calls LLpredict. If D′ is accept state fi, as determined by target, SLLpredict returns i. In this case, all the configurations in D′ predicted the same production i; further analysis is unnecessary and the algorithm can stop. For any other D′, the algorithm sets D to D′, gets the next symbol, and repeats.

Function target. Using a combined move-closure operation, target discovers the set of ATN configurations reachable from D upon a single terminal symbol a ∈ T. Function move computes the configurations reachable directly upon a by traversing a terminal edge:

move(D, a) = {(q, i, Γ) | (p, i, Γ) ∈ D, p -a-> q}

Those configurations and their closure form D′. If D′ is empty, no alternative is viable because none can match a from the current state, so target returns error state Derror. If all configurations in D′ predict the same production number i, target adds edge D -a-> fi and returns accept state fi. If D′ has conflicting configurations, target marks D′ as stack-sensitive. The conflict could be an ambiguity or a weakness stemming from SLL's lack of parser stack information. (Conflicts, along with getConflictSetsPerLoc and getProdSetsPerState, are described in Section 5.3.) The function finishes by adding state D′, if an equivalent state is not already in the DFA, and adding edge D -a-> D′.

Function 5: target(D, a) returns DFA state D′
  mv := move(D, a);
  D′ := ⋃ over c ∈ mv of closure({}, c);
  if D′ = ∅ then ∆DFA += D -a-> Derror; return Derror;
  if {j | (−, j, −) ∈ D′} = {i} then
    ∆DFA += D -a-> fi; return fi;            // predict rule i
  // Look for a conflict among configurations of D′
  a_conflict := ∃ alts ∈ getConflictSetsPerLoc(D′) : |alts| > 1;
  viablealt  := ∃ alts ∈ getProdSetsPerState(D′) : |alts| = 1;
  if a_conflict and not viablealt then mark D′ as stack sensitive;
  if ∃ D″ ∈ QDFA : D″ = D′ then D′ := D″ else QDFA += D′;
  ∆DFA += D -a-> D′;
  return D′;

Function LLpredict. Upon SLL simulation conflict, SLLpredict rewinds the input and calls LLpredict (Function 6) to get a prediction based upon LL ATN simulation, which considers the full parser stack γ0. Function LLpredict is similar to SLLpredict. It uses DFA state D as a cursor and D′ as the transition target for consistency, but it does not update A's DFA, so SLL prediction can continue to use the DFA. LL prediction continues until either D′ = ∅, D′ uniquely predicts an alternative, or D′ has a conflict. If D′ from LL simulation has a conflict as SLL did, the algorithm reports the ambiguous phrase (input from start to the current index) and resolves to the minimum production number among the conflicting configurations. (Section 5.3 explains ambiguity detection.) Otherwise, cursor D moves to D′ and considers the next input symbol.

Function 6: LLpredict(A, start, γ0) returns int alt
  D := D0 := startState(A, γ0);
  while true do
    mv := move(D, input.curr());
    D′ := ⋃ over c ∈ mv of closure({}, c);
    if D′ = ∅ then parse error;
    if {j | (−, j, −) ∈ D′} = {i} then return i;
    /* If all p, Γ pairs predict more than one alt and all such
       production sets are the same, the input is ambiguous. */
    altsets := getConflictSetsPerLoc(D′);
    if ∀ x, y ∈ altsets : x = y and |x| > 1 then
      x := any set in altsets;
      report ambiguous alts x at start..input.index();
      return min(x);
    D := D′; input.advance();

Function closure. The closure operation (Function 7) chases through all ε edges reachable from p, the ATN state projected from configuration parameter c, and also simulates the call and return of submachines. Function closure treats μ and π edges as ε edges because mutators should not be executed during prediction and predicates are only evaluated during start state computation. From parameter c = (p, i, Γ) and edge p -ε-> q, closure adds (q, i, Γ) to local working set C. For submachine call edge p -B-> q, closure adds the closure of (pB, i, qΓ). Returning from a submachine stop state p′B adds the closure of configuration (q, i, Γ′), in which case c would have been of the form (p′B, i, qΓ′). In general, a configuration stack Γ is a graph representing multiple individual stacks. Function closure must simulate a return from each of the Γ stack tops. The algorithm uses notation q ∈ Γ to represent all stack tops q of Γ. To avoid non-termination due to right recursion and ε edges in subrules such as ()+, closure uses a busy set shared among all closure operations used to compute the same D′.

Function 7: closure(busy, c = (p, i, Γ)) returns set C
  if c ∈ busy then return ∅; else busy += c;
  C := {c};
  if p = p′B (i.e., p is any stop state, including p′A) then
    if Γ = # then                            // stack is SLL wildcard
      C += closure(busy, (p2, i, #)) ∀ p2 : p1 -B-> p2 ∈ ∆ATN;  // call-site closure
    else                                     // nonempty SLL or LL stack
      for q ∈ Γ do                           // each stack top q in graph Γ
        C += closure(busy, (q, i, Γ′));      // "return" to q, popping it off Γ
    return C;
  foreach edge p -edge-> q do
    switch edge do
      case B:       C += closure(busy, (pB, i, qΓ));
      case π, μ, ε: C += closure(busy, (q, i, Γ));
  return C;

When closure reaches stop state p′A for decision entry rule A, SLL and LL predictions behave differently. LL prediction pops γ0 from the parser call stack and "returns" to the state that invoked A's submachine. SLL prediction, on the other hand, has no access to the parser call stack and must consider all possible A invocation sites. Function closure finds Γ = # (and p = p′A) in this situation because startState will have set the initial stack as #, not γ0. The return behavior at the decision entry rule is what differentiates SLL from LL parsing.

5.3 Conflict and ambiguity detection

The notion of conflicting configurations is central to ALL(*) analysis. Conflicts trigger failover to full LL prediction during SLL prediction and signal an ambiguity during LL prediction. A sufficient condition for a conflict between configurations is when they differ only in the predicted alternative: (p, i, Γ) and (p, j, Γ). Detecting conflicts is aided by two functions. The first, getConflictSetsPerLoc (Function 8), collects the sets of production numbers associated with all (p, −, Γ) configurations. If a p, Γ pair predicts more than a single production, a conflict exists. Here is a sample configuration set and the associated set of conflict sets:

{(p, 1, Γ), (p, 2, Γ), (p, 3, Γ), (p, 1, Γ′), (p, 2, Γ′), (r, 2, Γ″)}
 yields conflict sets {1, 2, 3} for location p, Γ; {1, 2} for p, Γ′; and {2} for r, Γ″

These conflict sets indicate that location p, Γ is reachable from productions {1, 2, 3}, location p, Γ′ is reachable from productions {1, 2}, and location r, Γ″ is reachable from production {2}.

Function 8: getConflictSetsPerLoc(D) returns set of sets
  s := ∅;
  for (p, i, Γ) ∈ D do
    prods := {i | (p, i, Γ) ∈ D};            // all alts predicted at location p, Γ
    s := s ∪ {prods};
  return s;
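Function 8 is essentially a group-by; the following self-contained Java sketch (our types; stack comparison here uses plain equals, echoing the equality heuristic discussed shortly) shows one way to compute it:

import java.util.*;

public class ConflictSets {
    // An ATN configuration (state, production, stack); the stack is an
    // opaque, non-null object here.
    record Config(int state, int production, Object stack) {}

    // Group configurations by (state, stack) location and collect the
    // production numbers seen at each; any set of size > 1 is a conflict.
    static Collection<Set<Integer>> conflictSetsPerLoc(List<Config> d) {
        Map<List<Object>, Set<Integer>> byLoc = new HashMap<>();
        for (Config c : d) {
            byLoc.computeIfAbsent(List.of(c.state(), c.stack()),
                                  k -> new HashSet<>())
                 .add(c.production());
        }
        return byLoc.values();
    }
}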

The second function, getProdSetsPerState (Function 9), is similar but collects the production numbers associated with just ATN state p. For the same configuration set, getProdSetsPerState computes these conflict sets:

{(p, 1, Γ), (p, 2, Γ), (p, 3, Γ), (p, 1, Γ′), (p, 2, Γ′), (r, 2, Γ″)}
 yields {1, 2, 3} for state p and {2} for state r

Function 9: getProdSetsPerState(D) returns set of sets
  s := ∅;
  for (p, −, −) ∈ D do
    prods := {i | (p, i, −) ∈ D};            // all alts predicted at ATN state p
    s := s ∪ {prods};
  return s;

A sufficient condition for failing over to LL prediction (LLpredict) from SLL would be when there is at least one set of conflicting configurations: getConflictSetsPerLoc returns at least one set with more than a single production number, e.g., configurations (p, i, Γ) and (p, j, Γ) exist in parameter D. However, our goal is to continue using SLL prediction as long as possible because SLL prediction updates the lookahead DFA cache. To that end, SLL prediction continues if there is at least one nonconflicting configuration (when getProdSetsPerState returns at least one set of size 1). The hope is that more lookahead will lead to a configuration set that predicts a unique production via that nonconflicting configuration. For example, the decision for S → a | a | ab is ambiguous upon a between productions 1 and 2 but is unambiguous upon ab. (Location p is the ATN state between a and b in the third production.) After matching input a, the configuration set would be {(p′S, 1, []), (p′S, 2, []), (p, 3, [])}. Function getConflictSetsPerLoc returns {{1, 2}, {3}}. The next move-closure upon b leads to nonconflicting configuration set {(p′S, 3, [])} from (p, 3, []), bypassing the conflict. If all sets returned from getConflictSetsPerLoc predict more than one alternative, no amount of lookahead will lead to a unique prediction, and analysis must try again with call stack γ0 via LLpredict.

Conflicts during LL simulation are ambiguities and occur when each conflict set from getConflictSetsPerLoc contains more than one production—every location in D is reachable from more than one production. Once multiple subparsers reach the same (p, −, Γ), all future simulation derived from (p, −, Γ) will behave identically. More lookahead will not resolve the ambiguity. Prediction could terminate at this point and report lookahead prefix u as ambiguous, but LLpredict continues until it is sure for which productions u is ambiguous. Consider conflict sets {1, 2, 3} and {2, 3}. Because both have degree greater than one, the sets represent an ambiguity, but additional input will identify whether w ≼ u is ambiguous upon {1, 2, 3} or {2, 3}. Function LLpredict continues until all conflict sets that identify ambiguities are equal; the condition x = y and |x| > 1 ∀ x, y ∈ altsets embodies this test.

To detect conflicts, the algorithm compares graph-structured stacks frequently. Technically, a conflict occurs when configurations (p, i, Γ) and (p, j, Γ′) occur in the same configuration set with i ≠ j and at least one stack trace γ in common to both Γ and Γ′. Because checking for graph intersection is expensive, the algorithm uses equality, Γ = Γ′, as a heuristic. Equality is much faster because of the shared subgraphs. The graph equality algorithm can often check node identity to compare two entire subgraphs. In the worst case, the equality-versus-subset heuristic delays conflict detection until the GSS between conflicting configurations are simple linear stacks, where graph intersection is the same as graph equality. The cost of this heuristic is deeper lookahead.
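LLpredict's termination condition above (all conflict sets equal, each with more than one alternative) translates directly into code; a small Java sketch (ours):

import java.util.Collection;
import java.util.Set;

public class AmbiguityTest {
    // True iff every conflict set has > 1 alternative and all sets are
    // equal, i.e., more lookahead cannot separate the alternatives.
    static boolean isAmbiguous(Collection<Set<Integer>> altsets) {
        Set<Integer> first = null;
        for (Set<Integer> s : altsets) {
            if (s.size() < 2) return false;          // some location is unambiguous
            if (first == null) first = s;
            else if (!first.equals(s)) return false; // sets differ: keep looking
        }
        return first != null;
    }
}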

Figure 9. Comparing Java parse times for 10 tools and 8 strategies on Java 6 Library and compiler source code (smaller is faster): 12,920 files, 3.6M lines, size 123M. Tool descriptors comprise: “tool name version [strategy].” Javac 7 [handbuilt recursive-descent and precedence parser for expressions]; JavaCC 5.0 [LL(k)]; Elkhound 2005.08.22b [GLR] (tested on Windows); SableCC 3.7 [LALR(1)]; ANTLR3 3.5 [LL(*)]; Rats! 2.3.1 [PEG]; JSGLR (from Spoofax) 1.2 [GLR]; DParser 1.1 [GLR]; Rascal 0.6.1 [GLL]; ANTLR4 4.1 [ALL(*)]. Tests build trees when marked †. Tests run 10x with generous memory; average/stddev computed on last 8 runs to avoid JIT cost. Error bars are negligible but show some variability due to garbage collection. To avoid a log scale, we use a separate graph for GLR and GLL parse times. [Bar charts omitted; representative values appear in the text.]

6. Theoretical results

This section identifies the key ALL(*) theorems and shows parser time complexity. See Appendix A for detailed proofs.

Theorem 6.1. (Correctness). The ALL(*) parser for non-left-recursive G recognizes sentence w iff w ∈ L(G).

Theorem 6.2. ALL(*) languages are closed under union.

Theorem 6.3. ALL(*) parsing of n symbols has O(n^4) time.

Theorem 6.4. A GSS has O(n) nodes for n input symbols.

Theorem 6.5. Two-stage parsing for non-left-recursive G recognizes sentence w iff w ∈ L(G).

7. Empirical results

We performed experiments to compare the performance of ALL(*) Java parsers with other strategies, to examine ALL(*) throughput for a variety of other languages, to highlight the effect of the lookahead DFA cache on parsing speed, and to provide evidence of linear ALL(*) performance in practice.

7.1 Comparing ALL(*)'s speed to other parsers

Our first experiment compared Java parsing speed across 10 tools and 8 parsing strategies: hand-tuned recursive-descent with precedence parsing, LL(k), LL(*), ALL(*), LALR(1), PEG, GLR, and GLL. Figure 9 shows the time for each tool to parse the 12,920 source files of the Java 6 library and compiler.
We chose Java because it was the most commonly available grammar among tools and sample Java source is plentiful. The Java grammars used in this experiment came directly from the associated tool except for DParser and Elkhound, which did not offer suitable Java grammars. We ported ANTLR's Java grammar to the meta-syntax of those tools using unambiguous arithmetic expression rules. We also embedded merge actions in the Elkhound grammar to disambiguate during the parse to mimic ANTLR's ambiguity resolution. All input files were loaded into RAM before parsing, and times reflect the average time measured over 10 complete corpus passes, skipping the first two to ensure JIT compiler warm-up. For ALL(*), we used the two-stage parse from Section 3.2. The test machine was a 6-core 3.33Ghz 16G RAM Mac OS X 10.7 desktop running the Java 7 virtual machine. The Elkhound and DParser parsers were implemented in C/C++, which does not have a garbage collector running concurrently. Elkhound was last updated in 2005 and no longer builds on Linux or OS X, but we were able to build it on Windows 7 (4-core 2.67Ghz 24G RAM). Elkhound also can only read from a file, so raw Elkhound parse times are not directly comparable. In an effort to control for machine speed differences and RAM vs SSD, we computed the time ratio of our Java test rig on our OS X machine reading from RAM to the test rig running on Windows pulling from SSD. Our reported Elkhound times are the Windows times multiplied by that OS X to Windows ratio.

For this experiment, ALL(*) outperforms the other parser generators and is only about 20% slower than the handbuilt parser in the Java compiler itself. When comparing runs with tree construction (marked with † in Figure 9), ANTLR 4 is about 4.4x faster than Elkhound, the fastest GLR tool we tested, and 135x faster than GLL (Rascal). ANTLR 4's nondeterministic ALL(*) parser was slightly faster than JavaCC's deterministic LL(k) parser and about 2x faster than Rats!'s PEG. In a separate test, we found that ALL(*) outperforms Rats! on its own PEG grammar converted to ANTLR syntax (8.77s vs 12.16s).

The LALR(1) SableCC parser did not perform well against the LL tools, but that could be SableCC's implementation rather than a deficiency of LALR(1). (The LALR(1) Java grammar from JavaCUP, another tool, was incomplete and unable to parse the corpus.) When reparsing the corpus, ALL(*) lookahead gets cache hits at each decision and parsing is 30% faster at 3.73s. When reparsing with tree construction (time not shown), ALL(*) outperforms handbuilt Javac (4.4s vs 4.73s). Reparsing speed matters to tools such as development environments.

The GLR parsers we tested are up to two orders of magnitude slower at Java parsing than ALL(*). Of the GLR tools, Elkhound has the best performance, primarily because it relies on a linear LR(1) stack instead of a GSS whenever possible. Further, we allowed Elkhound to disambiguate during the parse like ALL(*). Elkhound uses a separate lexer, unlike JSGLR and DParser, which are scannerless. A possible explanation for the observed performance difference with ALL(*) is that the Java grammar we ported to Elkhound and DParser is biased towards ALL(*), but this objection is not well-founded: GLR should also benefit from highly-deterministic and unambiguous grammars. GLL has the slowest speed in this test, perhaps because Rascal's team ported SDF's GLR Java grammar, which is not optimized for GLL. (Grammar variations can affect performance.) Rascal is also scannerless and is currently the only available GLL tool.
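The two-stage parse used in the methodology above is exposed directly by the ANTLR 4 runtime API; a minimal sketch of the usual idiom follows, assuming a generated JavaLexer/JavaParser with entry rule compilationUnit and a CharStream named input:

    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.atn.PredictionMode;
    import org.antlr.v4.runtime.misc.ParseCancellationException;

    // Stage 1: optimistic SLL with a bail-out error strategy; stage 2: full LL.
    void parseTwoStage(CharStream input) {
        CommonTokenStream tokens = new CommonTokenStream(new JavaLexer(input));
        JavaParser parser = new JavaParser(tokens);
        parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
        parser.removeErrorListeners();              // silence stage-one reports
        parser.setErrorHandler(new BailErrorStrategy());
        try {
            parser.compilationUnit();               // stage 1: SLL only
        }
        catch (ParseCancellationException e) {      // rare: SLL failed, retry
            tokens.seek(0);                         // rewind token stream
            parser.reset();
            parser.addErrorListener(ConsoleErrorListener.INSTANCE);
            parser.setErrorHandler(new DefaultErrorStrategy());
            parser.getInterpreter().setPredictionMode(PredictionMode.LL);
            parser.compilationUnit();               // stage 2: full ALL(*)
        }
    }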

Figure 10. Time and space to parse and optionally build trees for a 3.2M Java file. Space is the median reported after GC during the parse using the -XX:+PrintGC option (process monitoring for C++). Times include lexing; all input preloaded. † Building trees. (a) Disambiguating during the parse, no trees, estimated time.

    Tool              Time         RAM (M)
    Javac†            89 ms        7
    ANTLR4            201 ms       8
    JavaCC            206 ms       7
    ANTLR4†           360 ms       8
    ANTLR3            1048 ms      143
    SableCC†          1,174 ms     201
    Rats!†            1,365 ms     22
    JSGLR†            15.4 sec     1,030
    Rascal†           24.9 sec     2,622
    ANTLR4 (no DFA)   42.5 sec     27
    elkhound(a)       3.35 min     3
    DParser†          10.5 hours   100+
    elkhound†         out of mem   5390+

Figure 11. Throughput in KByte/sec. Lexing+parsing; all input preloaded into RAM.

    Grammar        KB/sec
    XML            45,993
    Java           24,972
    JSON           17,696
    DOT            16,152
    Lua            5,698
    C              4,238
    Verilog2001    1,994
    Erlang         751

Figure 12. DFA growth rate vs. number of files parsed; files parsed in disk order. [Plot omitted: number of DFA states (thousands) versus files parsed for the four slowest parsers from Figure 11.]

The biggest issue with general algorithms is that they are highly unpredictable in time and space, which can make them unsuitable for some commercial applications. Figure 10 summarizes the performance of the same tools against a single 3.2M Java file. Elkhound took 7.65s to parse the 123M Java corpus, but took 3.35 minutes to parse the 3.2M Java file. It crashed (out of memory) with parse forest construction on. DParser's time jumped from a corpus time of 98s to 10.5 hours on the 3.2M file. The speeds of Rascal and JSGLR scale reasonably well to the 3.2M file, but they use 2.6G and 1G RAM, respectively. In contrast, ALL(*) parses the 3.2M file in 360ms with tree construction using 8M. ANTLR 3 is fast but is slower and uses more memory (due to backtracking memoization) than ANTLR 4.

7.2 ALL(*) performance across languages

Figure 11 gives the bytes-per-second throughput of ALL(*) parsers for 8 languages, including Java for comparison. The number of test files and the file sizes vary greatly (according to the input we could reasonably collect); smaller files yield higher parse-time variance.

• C: Derived from the C11 specification; has no indirect left-recursion; altered stack-sensitive rule to render SLL (see text below). 813 preprocessed files, 159.8M source from the postgres database.
• Verilog2001: Derived from the Verilog 2001 spec; removed indirect left-recursion. 385 files, 659k from [3] and the web.
• JSON: Derived from the spec. 4 files, 331k from twitter.
• DOT: Derived from the spec. 48 files, 19.5M collected from the web.
• Lua: Derived from the Lua 5.2 spec. 751 files, 123k from github.
• XML: Derived from the spec. 1 file, 117M from an XML benchmark.
• Erlang: Derived from an LALR(1) grammar. 500 preprocessed files, 8M.

Some of these grammars yield reasonable but much slower parse times compared to Java and XML but demonstrate that programmers can convert a language specification to ANTLR's meta-syntax and get a working grammar without major modifications. (In our experience, grammar specifications are rarely tuned to a particular tool or parsing strategy and are often ambiguous.) Later, programmers can use ANTLR's profiling and diagnostics to improve performance, as with any programming task. For example, the C11 specification grammar is not SLL because of rule declarationSpecifiers, which we altered to be SLL in our C grammar (getting a 7x speed boost).
7.3 Effect of lookahead DFA on performance

The lookahead DFA cache is critical to ALL(*) performance. To demonstrate the cache's effect on parsing speed, we disabled the DFA and repeated our Java experiments. Consider the 3.73s parse time from Figure 9 to reparse the Java corpus with pure cache hits. With the lookahead DFA cache disabled completely, the parser took 12 minutes (717.6s). Figure 10 shows that disabling the cache increases parse time from 203ms to 42.5s on the 3.2M file. This performance is in line with the high cost of GLL and GLR parsers, which also do not reduce parser speculation by memoizing parsing decisions. As an intermediate value, clearing the DFA cache before parsing each corpus file yields a total time of 34s instead of 12 minutes. This isolates cache use to a single file and demonstrates that cache warm-up occurs quickly even within a single file.

DFA size increases linearly as the parser encounters new lookahead phrases. Figure 12 shows the growth in the number of DFA states as the (slowest four) parsers from Figure 11 encounter new files. Languages like C that have constructs with common left-prefixes require deep lookahead in LL parsers to distinguish phrases; e.g., struct {...} x; and struct {...} f(); share a large left-prefix. In contrast, the Verilog2001 parser uses very few DFA states (but runs slower due to a non-SLL rule). Similarly, after seeing the entire 123M Java corpus, the Java parser uses just 31,626 DFA states, adding an average of ~2.5 states per file parsed. DFA size does, however, continue to grow as the parser encounters unfamiliar input. Programmers can clear the cache and ALL(*) will adapt to subsequent input.

7.4 Empirical parse-time complexity

Given the wide range of throughput in Figure 11, one could suspect nonlinear behavior for the slower parsers. To investigate, we plotted parse time versus file size in Figure 13 and drew least-squares regression and LOWESS [6] data-fitting curves. LOWESS curves are parametrically unconstrained (not required to be a line or any particular polynomial) and they virtually mirror each regression line, providing strong evidence that the relationship between parse time and input size is linear.
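For reference, the shape of a per-file timing harness behind such plots might look like the following; this is our sketch, not the authors' exact rig, and JavaLexer/JavaParser are assumed generated classes:

    import org.antlr.v4.runtime.*;
    import java.nio.file.*;
    import java.util.List;

    // Emit (file size in KB, parse time in ms) pairs for plotting.
    void timeCorpus(List<Path> files) throws Exception {
        for (Path file : files) {
            CharStream cs = CharStreams.fromPath(file);   // preload into RAM
            JavaParser parser = new JavaParser(
                    new CommonTokenStream(new JavaLexer(cs)));
            long start = System.nanoTime();
            parser.compilationUnit();
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%d,%d%n", Files.size(file) / 1024, ms);
        }
    }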

Figure 13. Linear parse time vs. file size. The linear regression (dashed line) and the unconstrained LOWESS curve coincide, giving strong evidence of linearity. Curves computed on 99% of file sizes but zoomed to show detail in the bottom 40% of parse times. [Plots omitted: parse time (ms) versus file size (KB) for 740 C files, 385 Verilog2001 files, 500 Erlang files, and 642 Lua files.]

The same methodology shows that the parser generated from the non-SLL grammar (not shown) taken from the C11 specification is also linear, despite being much slower than our SLL version.

We have yet to see nonlinear behavior in practice, but the theoretical worst-case behavior of ALL(*) parsing is O(n^4). Experimental parse-time data for the following contrived worst-case grammar exhibits quartic behavior for input a, aa, aaa, ..., a^n (with n ≤ 120 symbols, the most we could test in a reasonable amount of time): S → A $, A → aAA | aA | a. The generated parser requires a prediction upon each input symbol and each prediction must examine all remaining input. The closure operation performed for each input symbol must examine the entire depth of the GSS, which could be size n. Finally, merging two GSS can take O(n) in our implementation, yielding O(n^4) complexity.
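For concreteness, the contrived grammar can be written in ANTLR 4's meta-syntax as follows (grammar and rule names are ours; ANTLR parser rule names must be lowercase):

    grammar Quartic;
    s : e EOF ;                   // S → A $
    e : 'a' e e | 'a' e | 'a' ;   // A → aAA | aA | a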
From these experiments, we conclude that shifting grammar analysis to parse-time to get ALL(*) strength is not only practical but yields extremely efficient parsers, competing with the hand-tuned recursive-descent parser of the Java compiler. Memoizing analysis results with DFA is critical to such performance. Despite an O(n^4) theoretical complexity, ALL(*) appears to be linear in practice and does not exhibit the unpredictable performance or large memory footprint of the general algorithms.

8. Related work

For decades, researchers have worked towards increasing the recognition strength of efficient but non-general LL and LR parsers and increasing the efficiency of general algorithms such as Earley's O(n^3) algorithm [8]. Parr [21] and Charles [4] statically generated LL(k) and LR(k) parsers for k > 1. Parr and Fisher's LL(*) [20] and Bermudez and Schimpf's LAR(m) [2] statically computed LL and LR parsers augmented with cyclic DFA that could examine arbitrary amounts of lookahead. These parsers were based upon LL-regular [12] and LR-regular [7] parsers, which have the undesirable property of being undecidable in general. Introducing backtracking dramatically increases recognition strength and avoids static grammar analysis undecidability issues, but it is undesirable because it has embedded mutator issues, reduces performance, and complicates single-step debugging. Packrat parsers (PEGs) [9] try decision productions in order and pick the first that succeeds. PEGs are O(n) because they memoize partial parsing results but suffer from the a | ab quirk where ab is silently unmatchable.

To improve general parsing performance, Tomita [26] introduced GLR, a general algorithm based upon LR(k) that conceptually forks subparsers at each conflicting LR(k) state at parse-time to explore all possible paths. Tomita shows GLR to be 5x-10x faster than Earley. A key component of GLR parsing is the graph-structured stack (GSS) [26] that prevents parsing the same input twice in the same way. (GLR pushes input symbols and LR states on the GSS whereas ALL(*) pushes ATN states.) Elkhound [18] introduced hybrid GLR parsers that use a single stack for all LR(1) decisions and a GSS when necessary to match ambiguous portions of the input. (We found Elkhound's parsers to be faster than those of other GLR tools.) GLL [25] is the LL analog of GLR and also uses subparsers and a GSS to explore all possible paths; GLL uses k = 1 lookahead where possible for efficiency. GLL is O(n^3) and GLR is O(n^(p+1)), where p is the length of the longest grammar production.

Earley parsers scale gracefully from O(n) for deterministic grammars to O(n^3) in the worst case for ambiguous grammars, but their performance is not good enough for general use. LR(k) state machines can improve the performance of such parsers by statically computing as much as possible. LRE [16] is one such example. Despite these optimizations, general algorithms are still very slow compared to deterministic parsers augmented with deep lookahead.

The problem with arbitrary lookahead is that it is impossible to compute statically for many useful grammars (the LL-regular condition is undecidable). By shifting lookahead analysis to parse-time, ALL(*) gains the power to handle any grammar without left recursion because it can launch subparsers to determine which path leads to a valid parse. Unlike GLR, speculation stops when all remaining subparsers are associated with a single alternative production, thus computing the minimum lookahead sequence. To get performance, ALL(*) records a mapping from that lookahead sequence to the predicted production using a DFA for use by subsequent decisions. The context-free language subsets encountered during a parse are finite and, therefore, ALL(*) lookahead languages are regular. Ancona et al. [1] also performed parse-time analysis, but they only computed fixed LR(k) lookahead and did not adapt to the actual input as ALL(*) does. Perlin [23] operated on an RTN like ALL(*) and computed k = 1 lookahead during the parse.

ALL(*) is similar to Earley in that both are top-down and operate on a representation of the grammar at parse-time, but Earley is parsing, not computing lookahead DFA. In that sense, Earley is not performing grammar analysis. Earley also does not manage an explicit GSS during the parse. Instead, items in Earley states have “parent pointers” that refer to other states that, when threaded together, form a GSS. Earley's SCANNER operation corresponds to ALL(*)'s move function.

The PREDICTOR and COMPLETER operations correspond to the push and pop operations in ALL(*)'s closure function. An Earley state is the set of all parser configurations reachable at some absolute input depth, whereas an ALL(*) DFA state is a set of configurations reachable from a lookahead depth relative to the current decision. Unlike the completely general algorithms, ALL(*) seeks a single parse of the input, which allows the use of an efficient LL stack during the parse.

Parsing strategies that continuously speculate or support ambiguity have difficulty with mutators because they are hard to undo. A lack of mutators reduces the generality of semantic predicates that alter the parse, as they cannot test arbitrary state computed previously during the parse. Rats! [10] supports restricted semantic predicates and Yakker [13] supports semantic predicates that are functions of previously-parsed terminals. Because ALL(*) does not speculate during the actual parse, it supports arbitrary mutators and semantic predicates. Space considerations preclude a more detailed discussion of related work here; a more detailed analysis can be found in reference [20].

9. Conclusion

ANTLR 4 generates an ALL(*) parser for any CFG without indirect or hidden left-recursion. ALL(*) combines the simplicity, efficiency, and predictability of conventional top-down LL(k) parsers with the power of a GLR-like mechanism to make parsing decisions. The critical innovation is to shift grammar analysis to parse-time, caching analysis results in lookahead DFA for efficiency. Experiments show ALL(*) outperforms general (Java) parsers by orders of magnitude, exhibiting linear time and space behavior for 8 languages. The speed of the ALL(*) Java parser is within 20% of the Java compiler's hand-tuned recursive-descent parser. In theory, ALL(*) is O(n^4), in line with the low polynomial bound of GLR. ANTLR is widely used in practice, indicating that ALL(*) provides practical parsing power without sacrificing the flexibility and simplicity of the recursive-descent parsers desired by programmers.

10. Acknowledgments

We thank Elizabeth Scott, Adrian Johnstone, and Mark Johnson for discussions on parsing algorithm complexity. Eelco Visser and Jurgen Vinju provided code to test isolated parsers from JSGLR and Rascal. Etienne Gagnon generated SableCC parsers.

References

[1] Ancona, M., Dodero, G., Gianuzzi, V., and Morgavi, M. Efficient construction of LR(k) states and tables. ACM Trans. Program. Lang. Syst. 13, 1 (Jan. 1991), 150–178.
[2] Bermudez, M. E., and Schimpf, K. M. Practical arbitrary lookahead LR parsing. Journal of Computer and System Sciences 41, 2 (1990).
[3] Brown, S., and Vranesic, Z. Fundamentals of Digital Logic with Verilog Design. McGraw-Hill Series in ECE, 2003.
[4] Charles, P. A Practical Method for Constructing Efficient LALR(k) Parsers with Automatic Error Recovery. PhD thesis, New York University, New York, NY, USA, 1991.
[5] Clarke, K. The top-down parsing of expressions. Unpublished technical report, Dept. of Computer Science and Statistics, Queen Mary College, London, June 1986.
[6] Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74 (1979), 829–836.
[7] Cohen, R., and Culik, K. LR-regular grammars—an extension of LR(k) grammars. In SWAT '71 (Washington, DC, USA, 1971), IEEE Computer Society, pp. 153–165.
[8] Earley, J. An efficient context-free parsing algorithm. Communications of the ACM 13, 2 (1970), 94–102.
[9] Ford, B. Parsing expression grammars: A recognition-based syntactic foundation. In POPL (2004), ACM Press, pp. 111–122.
[10] Grimm, R. Better extensibility through modular syntax. In PLDI (2006), ACM Press, pp. 38–51.
[11] Hopcroft, J., and Ullman, J. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Massachusetts, 1979.
[12] Jarzabek, S., and Krawczyk, T. LL-regular grammars. Information Processing Letters 4, 2 (1975), 31–37.
[13] Jim, T., Mandelbaum, Y., and Walker, D. Semantics and algorithms for data-dependent grammars. In POPL (2010).
[14] Johnson, M. The computational complexity of GLR parsing. In Generalized LR Parsing, M. Tomita, Ed. Kluwer, Boston, 1991, pp. 35–42.
[15] Kipps, J. Generalized LR parsing. Springer, 1991, pp. 43–59.
[16] McLean, P., and Horspool, R. N. A faster Earley parser. In CC (1996), Springer, pp. 281–293.
[17] McPeak, S. Elkhound: A fast, practical GLR parser generator. Tech. rep., University of California, Berkeley (EECS), Dec. 2002.
[18] McPeak, S., and Necula, G. C. Elkhound: A fast, practical GLR parser generator. In CC (2004), pp. 73–88.
[19] Parr, T. The Definitive ANTLR Reference: Building Domain-Specific Languages. The Pragmatic Programmers, 2013. ISBN 978-1-93435-699-9.
[20] Parr, T., and Fisher, K. LL(*): The foundation of the ANTLR parser generator. In PLDI (2011), pp. 425–436.
[21] Parr, T. J. Obtaining practical variants of LL(k) and LR(k) for k > 1 by splitting the atomic k-tuple. PhD thesis, Purdue University, West Lafayette, IN, USA, 1993.
[22] Parr, T. J., and Quong, R. W. Adding semantic and syntactic predicates to LL(k): pred-LL(k). In CC (1994).
[23] Perlin, M. LR recursive transition networks for Earley and Tomita parsing. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (1991), ACL '91, pp. 98–105.
[24] Plevyak, J. DParser: GLR parser generator. Visited Oct. 2013.
[25] Scott, E., and Johnstone, A. GLL parsing. Electron. Notes Theor. Comput. Sci. 253, 7 (Sept. 2010), 177–189.
[26] Tomita, M. Efficient Parsing for Natural Language. Kluwer Academic Publishers, 1986.
[27] Woods, W. A. Transition network grammars for natural language analysis. Comm. of the ACM 13, 10 (1970), 591–606.

A. Correctness and complexity analysis

Theorem A.1. ALL(*) languages are closed under union.

Proof. Let predicated grammars G1 = (N1, T, P1, S1, Π1, M1) and G2 = (N2, T, P2, S2, Π2, M2) describe L(G1) and L(G2), respectively. For applicability to both parsers and scannerless parsers, assume that the terminal space T is the set of valid characters. Assume N1 ∩ N2 = ∅ by renaming nonterminals if necessary. Assume that the predicates and mutators of G1 and G2 operate in disjoint environments, S1 and S2. Construct

    G′ = (N1 ∪ N2, T, P1 ∪ P2, S′, Π1 ∪ Π2, M1 ∪ M2)

with S′ → S1 | S2. Then L(G′) = L(G1) ∪ L(G2).

Lemma A.1. The ALL(*) parser for non-left-recursive G with the lookahead DFA deactivated recognizes sentence w iff w ∈ L(G).

Proof. The ATN for G recognizes w iff w ∈ L(G). Therefore, we can equivalently prove that ALL(*) is a faithful implementation of an ATN. Without lookahead DFA, prediction is a straightforward ATN simulator: a top-down parser that makes accurate parsing decisions using GLR-like subparsers that can examine the entire remaining input and the ATN submachine call stack.

Theorem A.2. (Correctness). The ALL(*) parser for non-left-recursive G recognizes sentence w iff w ∈ L(G).

Proof. Lemma A.1 shows that an ALL(*) parser correctly recognizes w without the DFA cache. The essence of the proof then is to show that ALL(*)'s adaptive lookahead DFA do not break the parse by giving different prediction decisions than straightforward ATN simulation.
We only need to consider the case of unpredicated SLL parsing, as ALL(*) only caches decision results in this case.

if case: By induction on the state of the lookahead DFA for any given decision A. Base case. The first prediction for A begins with an empty DFA and must activate ATN simulation to choose alternative αi using prefix u ≼ wr. As ATN simulation yields proper predictions, the ALL(*) parser correctly predicts αi from a cold start and then records the mapping from u to i in the DFA. If there is a single viable alternative, i is the associated production number. If ATN simulation finds multiple viable alternatives, i is the minimum production number associated with alternatives from that set.

Induction step. Assume the lookahead DFA correctly predicts productions for every prefix u of wr seen by the parser at decision A. We must show that, starting with an existing DFA, ALL(*) properly adds a path through the DFA for an unfamiliar prefix u of w′r. There are several cases:

1. u ≼ wr and u ≼ w′r for a previous wr. The lookahead DFA gives the correct answer for u by the induction assumption. The DFA is not updated.

2. wr = bx and all previous w′r = ay for some a ≠ b. This case reduces to the cold-start base case because there is no b edge from D0. ATN simulation predicts αi and adds the path for u ≼ wr from D0 to fi.

3. wr = vax and w′r = vby for some previously seen w′r with common prefix v and a ≠ b. DFA simulation reaches D from D0 for input v, but D has no edge for a. ATN simulation predicts αi and augments the DFA, starting with an edge on a from D and leading eventually to fi.

only if case: The ALL(*) parser reports a syntax error for w ∉ L(G). Assume the opposite, that the parser successfully parses w. That would imply that there exists an ATN configuration derivation sequence (S, p_S, [], w) ↦* (S, p′_S, [], ε) for w through G's corresponding ATN. But that would require w ∈ L(G), by Definition ??. Therefore the ALL(*) parser reports a syntax error for w. The accuracy of the lookahead DFA is irrelevant because there is no possible path through the ATN or parser.

Lemma A.2. The set of viable productions for LL is always a subset of SLL's viable productions for a given decision A and remaining input string wr.

Proof. If the key move-closure analysis operation does not reach stop state p′_A for submachine A, SLL and LL behave identically and so they share the same set of viable productions.

If closure reaches the stop state for the decision entry rule, p′_A, there are configurations of the form (p′_A, −, γ) where, for convenience, the usual GSS Γ is split into single stacks, γ. In LL prediction mode, γ = γ0, which is either a single stack or empty if A = S. In SLL mode, γ = #, signaling no stack information. Function closure must then consider all possible γ0 parser call stacks. Since any single stack must be contained within the set of all possible call stacks, closure operations in LL mode consider at most the same number of paths through the ATN as in SLL mode.

Lemma A.3. For w ∉ L(G) and non-left-recursive G, SLL prediction reports a syntax error.

Proof. As in the only if case of the proof of Theorem 6.1, there is no valid ATN configuration derivation for w regardless of how adaptivePredict chooses productions.

Theorem A.3. Two-stage parsing for non-left-recursive G recognizes sentence w iff w ∈ L(G).

Proof. By Lemma A.3, SLL and LL behave the same for w ∉ L(G). It remains to show that, for input w ∈ L(G), SLL prediction either behaves like LL or reports a syntax error, signalling a need for the LL second stage. Let V and V′ be the sets of viable production numbers for A using SLL and LL, respectively. By Lemma A.2, V′ ⊆ V. There are two cases to consider:

1. If min(V) = min(V′), SLL and LL choose the same production. E.g., SLL succeeds for V = {1, 2, 3} and V′ = {1, 3}, or V = {1} and V′ = {1}.

2. If min(V) ∉ V′, then min(V) ≠ min(V′) because LL finds min(V) nonviable, and SLL would report a syntax error. E.g., V = {1, 2, 3} and V′ = {2, 3}, or V = {1, 2} and V′ = {2}.

In all possible combinations of V and V′, SLL behaves like LL for w ∈ L(G) or reports a syntax error.

Theorem A.4. A GSS has O(n) nodes for n input symbols.

Proof. For nonterminals N and ATN states Q, there are |N| × |Q| p −A→ q ATN transitions if every grammar position invokes every nonterminal. That limits the number of new GSS nodes to |Q|^2 for a closure operation (which cannot transition terminal edges). ALL(*) performs n + 1 closures for n input symbols, giving |Q|^2(n + 1) nodes, or O(n), as Q is not a function of the input.

Lemma A.4. Examining a lookahead symbol has O(n^2) time.

Proof. Lookahead is a move-closure operation that computes new target DFA state D′ as a function of the ATN configurations in D. There are |Q| × m ≈ |Q|^2 configurations of the form (p, i, −) ∈ D for |Q| ATN states and m alternative productions in the current decision. The cost of move is not a function of input size n. Closure of D computes closure(c) ∀ c ∈ D, and closure(c) can walk the entire GSS back to the root (the empty stack). That gives a cost of |Q|^2 configurations times |Q|^2(n + 1) GSS nodes (per Theorem A.4), or O(n) add operations to build D′. Adding a configuration is dominated by the graph merge, which (in our implementation) is proportional to the depth of the graph. The total cost for move-closure is O(n^2).

Theorem A.5. ALL(*) parsing of n input symbols has O(n^4) time.

Proof. In the worst case, the parser must examine all remaining input symbols during prediction for each of the n input symbols, giving O(n^2) lookahead operations. The cost of each lookahead operation is O(n^2) by Lemma A.4, giving an overall parsing cost of O(n^4).

B. Pragmatics

This section describes some of the practical considerations associated with implementing the ALL(*) algorithm.

B.1 Reducing warm-up time

Many decisions in a grammar are LL(1) and they are easy to identify statically. Instead of always generating “switch on adaptivePredict” decisions in the recursive-descent parsers, ANTLR generates “switch on token type” decisions whenever possible. This LL(1) optimization does not affect the size of the generated parser but reduces the number of lookahead DFA that the parser must compute.

Originally, we anticipated “training” a parser on a large input corpus and then serializing the lookahead DFA to disk to avoid re-computing DFA for subsequent parser runs. As shown in Section 7, lookahead DFA construction is fast enough that serializing and deserializing the DFA is unnecessary.
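Schematically, the two kinds of generated decision code differ as follows. This is shape only: the token names, rule methods, and decision number are invented for illustration, though _input, _ctx, and adaptivePredict mirror ANTLR's generated-code conventions:

    // LL(1) decision: a plain switch on the current token type suffices.
    switch (_input.LA(1)) {
        case ID  : matchIdAlternative();  break;
        case INT : matchIntAlternative(); break;
        default  : throw new RuntimeException("no viable alternative"); // simplified
    }

    // Non-LL(1) decision: consult, and possibly extend, the lookahead DFA.
    switch (getInterpreter().adaptivePredict(_input, 7, _ctx)) { // decision 7: ours
        case 1 : alternativeOne(); break;
        case 2 : alternativeTwo(); break;
    }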
B.2 Semantic predicate evaluation

For clarity, the algorithm described in this paper uses pure ATN simulation for all decisions that have semantic predicates on production left edges. In practice, ANTLR uses lookahead DFA that track predicates in accept states to handle semantic-context-sensitive prediction. Tracking the predicates in the DFA allows prediction to avoid expensive ATN simulation if predicate evaluation during SLL simulation predicts a unique production. Semantic predicates are not common, but they are critical to solving some context-sensitive parsing problems; e.g., predicates are used internally by ANTLR to encode operator precedence when rewriting left-recursive rules. So it is worth the extra complexity to evaluate predicates during SLL prediction. Consider the predicated rule from Section 2.1:

    id : ID | {!enum_is_keyword}? 'enum' ;

The second production is viable only when !enum_is_keyword evaluates to true. In the abstract, that means the parser would need two lookahead DFA, one per semantic condition. Instead, ANTLR's ALL(*) implementation creates a DFA (via SLL prediction) with edge D0 −enum→ f2, where f2 is an augmented DFA accept state that tests !enum_is_keyword. Function adaptivePredict returns production 2 upon enum if !enum_is_keyword holds, else it throws a no-viable-alternative exception.

The algorithm described in this paper also does not support semantic predicates outside of the decision entry rule. In practice, ALL(*) analysis must evaluate all predicates reachable from the decision entry rule without stepping over a terminal edge in the ATN. For example, the simplified ALL(*) algorithm in this paper considers only predicates π1 and π2 for the productions of S in the following (ambiguous) grammar:

    S → {π1}? Ab | {π2}? Ab
    A → {π3}? a | {π4}? a

Input ab matches either alternative of S and, in practice, ANTLR evaluates “π1 and (π3 or π4)” to test the viability of S's first production, not just π1. After simulating S and A's ATN submachines, the lookahead DFA for S would be D0 −a→ D′ −b→ f1,2. Augmented accept state f1,2 predicts production 1 or 2 depending on semantic contexts π1 ∧ (π3 ∨ π4) and π2 ∧ (π3 ∨ π4), respectively. To keep track of semantic context during SLL simulation, ANTLR ATN configurations contain an extra element π: (p, i, Γ, π). Element π carries along the semantic context, and ANTLR stores predicate-to-production pairs in the augmented DFA accept states.
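A sketch (ours, not ANTLR's classes) of how an augmented accept state such as f1,2 could resolve its prediction from stored predicate-to-production pairs:

    import java.util.*;
    import java.util.function.BooleanSupplier;

    class PredicatedAcceptState {
        // Semantic context paired with the production it selects, e.g.,
        // (π1 ∧ (π3 ∨ π4), 1) and (π2 ∧ (π3 ∨ π4), 2) for f1,2 above.
        record PredPair(BooleanSupplier semanticContext, int production) {}

        final List<PredPair> pairs = new ArrayList<>();

        // The first production whose semantic context holds wins; no viable
        // pair mirrors adaptivePredict's no-viable-alternative exception.
        int predict() {
            for (PredPair p : pairs)
                if (p.semanticContext().getAsBoolean()) return p.production();
            throw new IllegalStateException("no viable alternative");
        }
    }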

B.3 Error reporting and recovery

ALL(*) prediction can scan arbitrarily far ahead, so erroneous lookahead sequences can be long. By default, ANTLR-generated parsers print the entire sequence. To recover, parsers consume tokens until a token appears that could follow the current rule. ANTLR provides hooks to override both the reporting and the recovery strategies.

ANTLR parsers issue error messages for invalid input phrases and attempt to recover. For mismatched tokens, ANTLR attempts single-token insertion and deletion to resynchronize. If the remaining input is not consistent with any production of the current nonterminal, the parser consumes tokens until it finds a token that could reasonably follow the current nonterminal. Then the parser continues parsing as if the current nonterminal had succeeded. ANTLR improves error recovery over ANTLR 3 for EBNF subrules by inserting synchronization checks at the start and at the “loop” continuation test to avoid prematurely exiting the subrule. For example, consider the following class definition rule.

    classdef : 'class' ID '{' member+ '}' ;
    member : 'int' ID ';' ;

An extra semicolon in the member list, such as int i;; int j;, should not force the surrounding rule classdef to abort. Instead, the parser ignores the extra semicolon and looks for another member. To reduce cascading error messages, the parser issues no further messages until it correctly matches a token.

B.4 Multi-threaded execution

Applications often require parallel execution of multiple parser instances, even for the same language. For example, web-based application servers parse multiple incoming XML or JSON data streams using multiple instances of the same parser. For memory efficiency, all ALL(*) parser instances for a given language must share lookahead DFA. The Java code that ANTLR generates uses a shared memory model and threads for concurrency, which means parsers must update the shared DFA in a thread-safe manner. Multiple threads can be simulating the DFA while other threads are adding states and edges to it. Our goal is thread safety, but concurrency also provides a small speed-up for lookahead DFA construction (observed empirically).

The key to thread safety in Java while maintaining high throughput lies in avoiding excessive locking (synchronized blocks). There are only two data structures that require locking: Q, the set of DFA states, and ∆, the set of edges. Our implementation factors state addition, Q += D′, into an addDFAState function that waits on a lock for Q before testing a DFA state for membership or adding a state. This is not a bottleneck, as DFA simulation can proceed during DFA construction without locking since it traverses edges to visit existing DFA states without examining Q.

Adding DFA edges to an existing state requires fine-grained locking, but only on that specific DFA state, as our implementation maintains an edge array for each DFA state. We allow multiple readers but a single writer. A lock on testing the edges is unnecessary even if another thread is racing to set that edge. If edge D −a→ D′ exists, the simulation simply transitions to D′. If simulation does not find an existing edge, it launches ATN simulation starting from D to compute D′ and then sets element edge[a] for D. Two threads could find a missing edge on a and both launch ATN simulation, racing to add edge D −a→ D′. D′ would be the same in either case, so there is no hazard as long as that specific edge array is updated safely using synchronization. To encounter a contested lock, two or more ATN simulation threads must try to add an edge to the same DFA state.
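A sketch (ours; the real ANTLR runtime differs in detail) of the per-state edge update just described:

    class LookaheadDFA {
        static final int MAX_TOKEN = 256;   // assumption: dense token codes

        static class DFAState {
            final DFAState[] edges = new DFAState[MAX_TOKEN];
        }

        // Readers take no lock: a stale null read merely repeats ATN simulation.
        DFAState transition(DFAState d, int a) {
            DFAState target = d.edges[a];
            if (target == null) {
                target = atnSimulate(d, a);      // compute D' (elided)
                synchronized (d) {               // fine-grained: lock only this state
                    if (d.edges[a] == null) d.edges[a] = target;
                    else target = d.edges[a];    // another thread won the race; same D'
                }
            }
            return target;
        }

        DFAState atnSimulate(DFAState d, int a) {
            throw new UnsupportedOperationException("ATN simulation elided");
        }
    }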
C. Left-recursion elimination

ANTLR supports directly left-recursive rules by rewriting them to a non-left-recursive version that also removes any ambiguities. For example, the natural grammar for describing arithmetic expression syntax is one of the most common (ambiguous) left-recursive rules. The following grammar supports simple modulo and additive expressions.

    E → E % E | E + E | id

E is directly left-recursive because at least one production begins with E (∃ E ⇒ Eα), which is a problem for top-down parsers.

Grammars meant for top-down parsers must instead use a cumbersome non-left-recursive equivalent that has a separate nonterminal for each level of operator precedence:

    E → M (+ M)*    Additive, lower precedence
    M → P (% P)*    Modulo, higher precedence
    P → id          Primary (id means identifier)

The deeper the rule invocation, the higher the precedence. At parse-time, matching a single identifier, a, requires l rule invocations for l precedence levels.

E is easier to read than E′, but the left-recursive version is ambiguous, as there are two interpretations of input a+b+c: (a+b)+c and a+(b+c). Bottom-up parser generators such as bison use operator precedence specifiers (e.g., %left '%') to resolve such ambiguities. The non-left-recursive grammar E′ is unambiguous because it implicitly encodes precedence rules according to the nonterminal nesting depth.

Ideally, a parser generator would support left-recursion and provide a way to resolve ambiguities implicitly with the grammar itself, without resorting to external precedence specifiers. ANTLR does this by rewriting nonterminals with direct left-recursion and inserting semantic predicates to resolve ambiguities according to the order of productions. The rewriting process leads to generated parsers that mimic Clarke's [5] technique.

We chose to eliminate just direct left-recursion because general left-recursion elimination can result in transformed grammars orders of magnitude larger than the original [11] and yields parse trees only loosely related to those of the original grammar. ANTLR automatically constructs parse trees appropriate for the original left-recursive grammar, so the programmer is unaware of the internal restructuring. Direct left-recursion also covers the most common grammar cases (from long experience building grammars). This discussion focuses on grammars for arithmetic expressions, but the transformation rules work just as well for other left-recursive constructs such as C declarators: D → *D, D → D [ ], D → D ( ), D → id.
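With this rewriting in place, an ANTLR 4 user simply writes the left-recursive rule directly (rule and token names here are ours) and the transformation described below happens internally:

    e : e '%' e
      | e '+' e
      | ID
      ;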

Eliminating direct left-recursion without concern for ambiguity is straightforward [11]. Let A → αj for j = 1..s be the non-left-recursive productions and A → Aβk for k = 1..r be the directly left-recursive productions, where αj, βk ≠⇒* ε. Replace those productions with:

    A → α1 A′ | ... | αs A′
    A′ → β1 A′ | ... | βr A′ | ε

The transformation is easier to see using EBNF:

    A → A′ A″*
    A′ → α1 | ... | αs
    A″ → β1 | ... | βr

or just A → (α1 | ... | αs)(β1 | ... | βr)*. For example, the left-recursive E rule becomes:

    E → id (% E | + E)*

This non-left-recursive version is still ambiguous because there are two derivations for a+b+c. The default ambiguity resolution policy chooses to match input as soon as possible, resulting in interpretation (a+b)+c.

Figure 14. Parse trees for a%b+c under E → id (% E | + E)*: (a) (a%b)+c, (b) a%(b+c). [Trees omitted.]

The difference in associativity does not matter for expressions using a single operator, but expressions with a mixture of operators must associate operands and operators according to operator precedence. For example, the parser must recognize a%b+c as (a%b)+c, not a%(b+c). The two interpretations are shown in Figure 14.

To choose the appropriate interpretation, the generated parser must compare the previous operator's precedence to the current operator's precedence in the (% E | + E)* “loop.” In Figure 14, the nested E is the critical expansion. It must match just id and return immediately, allowing the invoking E to match the + to form the parse tree in (a) as opposed to (b).

To support such comparisons, productions get precedence numbers that are the reverse of production numbers. The precedence of the ith production is n − i + 1 for n original productions of E. That assigns precedence 3 to E → E % E, precedence 2 to E → E + E, and precedence 1 to E → id.

Next, each nested invocation of E needs information about the operator precedence of the invoking E. The simplest mechanism is to pass a precedence parameter, pr, to E and require: An expansion of E[pr] can match only those subexpressions whose precedence meets or exceeds pr.
To enforce this, the left-recursion elimination procedure inserts predicates into the (% E | + E)* loop. Here is the transformed unambiguous and non-left-recursive rule:

    E[pr] → id ({3 ≥ pr}? % E[4] | {2 ≥ pr}? + E[3])*

References to E elsewhere in the grammar become E[0]; e.g., S → E becomes S → E[0]. Input a%b+c yields the parse tree for E[0] shown in (a) of Figure 15.

Figure 15. Nonterminal expansion trees for E[pr] → id ({3 ≥ pr}? % E[4] | {2 ≥ pr}? + E[3])*: (a) a%b+c assoc (a%b)+c; (b) a=b=c assoc a=(b=c). [Trees omitted.]

Production “{3 ≥ pr}? % E[4]” is viable when the precedence of the modulo operation, 3, meets or exceeds parameter pr. The first invocation of E has pr = 0 and, since 3 ≥ 0, the parser expands “% E[4]” in E[0].

When parsing invocation E[4], predicate {2 ≥ pr}? fails because the precedence of the + operator is too low: 2 ≱ 4. Consequently, E[4] does not match the + operator, deferring to the invoking E[0] “loop.”

A key element of the transformation is the choice of E parameters, E[3] and E[4] in this grammar. For left-associative operators like % and +, the right operand gets one more precedence level than the operator itself. This guarantees that the invocation of E for the right operand matches only operations of higher precedence. For right-associative operations, the right operand gets the same precedence as the current operator. Here is a variation on the expression grammar that has a right-associative assignment operator instead of the addition operator:

    E → E % E | E =right E | id

where notation =right is a shorthand for the actual ANTLR syntax “<assoc=right> E '=' E.” The interpretation of a=b=c should be right associative, a=(b=c). To get that associativity, the transformed rule need differ only in the right operand, E[2] versus E[3]:

    E[pr] → id ({3 ≥ pr}? % E[4] | {2 ≥ pr}? = E[2])*

The E[2] expansion can match an assignment, as shown in (b) of Figure 15, since predicate 2 ≥ 2 is true.
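In full ANTLR 4 notation, the right-associative variant reads as follows (rule and token names are ours):

    e : e '%' e
      | <assoc=right> e '=' e
      | ID
      ;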

Unary prefix and suffix operators are hardwired as right- and left-associative, respectively. Consider the following E with negation prefix and “not” suffix operators:

    E → −E | E ! | E % E | id

Prefix operators are not left recursive and so they go into the first subrule, whereas left-recursive suffix operators go into the predicated loop like binary operators:

    E[pr] → (id | − E[4]) ({3 ≥ pr}? ! | {2 ≥ pr}? % E[3])*

Figure 16 illustrates the rule invocation tree (a record of the call stacks) and associated parse trees resulting from an ANTLR-generated parser. Unary operations in contiguous productions all have the same relative precedence and are, therefore, “evaluated” in the order specified. E.g., E → −E | +E | id must interpret -+a as -(+a), not +(-a).

Figure 16. Nonterminal call trees and parse trees for productions E → −E | E ! | E % E | id: --a!! parses as ((-(-a))!)! and -a%b! parses as (-a)%(b!). [Trees omitted.]

Nonconforming left-recursive productions such as E → E or E → ε are rewritten without concern for ambiguity, using the typical elimination technique. Because of the need to resolve ambiguities with predicates and compute A parameters, ANTLR looks only for specific left-recursive patterns, described next.

C.1 Left-Recursion Elimination Rules

To eliminate direct left-recursion in nonterminals and resolve ambiguities, ANTLR looks for the four patterns:

    A_i → A α_i A    (binary and ternary operator)
    A_i → A α_i      (suffix operator)
    A_i → α_i A      (prefix operator)
    A_i → α_i        (primary or “other”)

The subscript on productions, A_i, captures the production number in the original grammar when needed. Hidden and indirect left-recursion results in static left-recursion errors from ANTLR. The transformation procedure from G to G′ is:

1. Strip away directly left-recursive nonterminal references.
2. Collect the prefix and primary productions into newly-created A′.
3. Collect the binary, ternary, and suffix productions into newly-created A″.
4. Prefix the productions in A″ with a precedence-checking semantic predicate {pr(i) ≥ pr}?, where pr(i) = n − i + 1.
5. Rewrite A references among binary, ternary, and prefix productions as A[nextpr(i, assoc)], where nextpr(i, assoc) = (assoc == left ? i + 1 : i).
6. Rewrite any other A references within any production in P (including A′ and A″) as A[0].
7. Rewrite the original A rule as A[pr] → A′ A″*. (In practice, ANTLR uses the EBNF A″* loop form rather than recursive A″ productions.)
