Chapter 4 Top-Down Parsing

Chapter 4 Top-Down Parsing Dr. Frank Lee

Parsing and Parsers • Once we have described the syntax of our programming language using a context-free grammar, the next step is to determine if a string of tokens returned from the lexical analyzer could be derived from that context-free grammar • Determining if a sequence of tokens is syntactically correct is called parsing • Two main strategies: • top-down parsing: from the initial symbol and working down (modern parsers: e.g. JavaCC) • bottom-up parsing: from the leaves and working up (earlier parsers: e.g. yacc)

4.1 Recursive Descent Parsers • Top-down parsers are usually implemented as a suite of mutually recursive functions that descend through a parse tree for the string, and as such are called “recursive descent parsers” (RDP). • Recursive descent parsers fall into a class of parsers known as LL(k) parsers • LL(k) stands for Left-to-right, Leftmost-derivation, k-symbol lookahead parsers • We first examine LL(1) parsers – LL parsers with one symbol lookahead • See Fig.4.1 (a CFG, p49) and its RDP in Fig. 4.2 (p50) • To build the RDP, we first need to create the “First” and “Follow” sets of the non-terminals in the CFG.

4.1.1 First Sets and Follow Sets
For the following context-free grammar (Fig. 4.4, p52), $ is an end-of-file marker and ε means the empty string:
T = {e, f, g, h, i}
N = {S’, S, A, B, C, D}
R =
(0) S’ → S$
(1) S → AB | Cf
(2) A → ef | ε
(3) B → hg
(4) C → DD | fi
(5) D → g
Start Symbol = S’

4.1.1 First Sets and Follow Sets (cont.) • First Sets • The First set of a non-terminal A is the set of all terminals that can begin a string derived from A • If the empty string ε can be derived from A, then ε is also in the First set of A • For instance, consider again the context-free grammar in Figure 4.4 (p52 or previous slide) • In this grammar, the set of all strings derivable from the non-terminal S’ (ignoring the trailing $) is {efhg, fif, ggf, hg} • Thus, First(S’) = {e,f,g,h}, where e, f, g and h are the first terminals of the strings in the above set, respectively • Similarly, we can derive the First sets of S, A, B, C and D as in Figure 4.5 (p52)

4.1.1 First Sets and Follow Sets (cont.) • Follow Sets • For each non-terminal in a grammar, we can also create a Follow set • The Follow set for a non-terminal A in a grammar is the set of all terminals that could appear right after A in a derivation • Take another look at the CFG in Fig. 4.4 (p52). What terminals can follow A in a derivation? • Consider the derivation S’ ⇒ S$ ⇒ AB$ ⇒ Ahg$. Since h follows A in this derivation, h is in the Follow set of A. Note: $ is the end-of-file marker. • What about the non-terminal D? Consider the partial derivation: S’ ⇒ S$ ⇒ Cf$ ⇒ DDf$ ⇒ Dgf$ • Since both f and g follow a D in this derivation, f and g are in the Follow set of D • Fig. 4.6 (p53) lists the Follow sets of all non-terminals in the CFG in Fig. 4.4

4.1.2 Finding First and Follow Sets • To calculate the First set of a non-terminal A, we need to calculate the First set of a string of terminals and non-terminals, since a non-terminal can derive strings which contain terminals and non-terminals • See the two algorithms on page 54 • We now use the algorithms above to find the First set of each non-terminal in the CFG in Fig. 4.4 (p52) as below:

4.1.2.1 Finding First Sets
• Assume the CFG in Fig. 4.4 (p52) is G.
• At the beginning, for each non-terminal A in G, set First(A) = { }, the empty set
(0) S’ → S$. Add { } to First(S’) = { } (no change)
(1) S → AB. Add { } to First(S) = { } (no change)
(2) S → Cf. Add { } to First(S) = { } (no change)
(3) A → ef. Add e to First(A) = {e}
(4) A → ε. Add ε to First(A) = {e, ε}
(5) B → hg. Add h to First(B) = {h}
(6) C → DD. Add { } to First(C) = { } (no change)
(7) C → fi. Add f to First(C) = {f}
(8) D → g. Add g to First(D) = {g}
• Note: Since there were 5 changes, we need another iteration

4.1.2.1 Finding First Sets
(0) S’ → S$. Add { } to First(S’) = { } (no change)
(1) S → AB. Add e,h to First(S) = {e, h}
(2) S → Cf. Add f to First(S) = {e, h, f}
(3) A → ef. Add e to First(A) = {e, ε} (no change)
(4) A → ε. Add ε to First(A) = {e, ε} (no change)
(5) B → hg. Add h to First(B) = {h} (no change)
(6) C → DD. Add g to First(C) = {f, g}
(7) C → fi. Add f to First(C) = {f, g} (no change)
(8) D → g. Add g to First(D) = {g} (no change)
• Note: Since there were 3 changes, we need another iteration

4.1.2.1 Finding First Sets
(0) S’ → S$. Add e,f,h to First(S’) = {e, f, h}
(1) S → AB. Add e,h to First(S) = {e, f, h} (no change)
(2) S → Cf. Add f,g to First(S) = {e, h, f, g}
(3) A → ef. Add e to First(A) = {e, ε} (no change)
(4) A → ε. Add ε to First(A) = {e, ε} (no change)
(5) B → hg. Add h to First(B) = {h} (no change)
(6) C → DD. Add g to First(C) = {f, g} (no change)
(7) C → fi. Add f to First(C) = {f, g} (no change)
(8) D → g. Add g to First(D) = {g} (no change)
• Note: Since there were 2 changes, we need another iteration

4.1.2.1 Finding First Sets
(0) S’ → S$. Add e,f,g,h to First(S’) = {e, h, f, g}
(1) S → AB. Add e,h to First(S) = {e, h, f, g} (no change)
(2) S → Cf. Add f,g to First(S) = {e, h, f, g} (no change)
(3) A → ef. Add e to First(A) = {e, ε} (no change)
(4) A → ε. Add ε to First(A) = {e, ε} (no change)
(5) B → hg. Add h to First(B) = {h} (no change)
(6) C → DD. Add g to First(C) = {f, g} (no change)
(7) C → fi. Add f to First(C) = {f, g} (no change)
(8) D → g. Add g to First(D) = {g} (no change)
• Note: Since there was 1 change, we need another iteration

4.1.2.1 Finding First Sets
(0) S’ → S$. Add e,f,g,h to First(S’) (no change)
(1) S → AB. Add e,h to First(S) (no change)
(2) S → Cf. Add f,g to First(S) (no change)
(3) A → ef. Add e to First(A) (no change)
(4) A → ε. Add ε to First(A) (no change)
(5) B → hg. Add h to First(B) (no change)
(6) C → DD. Add g to First(C) (no change)
(7) C → fi. Add f to First(C) (no change)
(8) D → g. Add g to First(D) (no change)
• Since there were no changes, we stop
• Note: if we examine the rules in a different order (such as 8,7,6,5,4,3,2,1,0) in each iteration, we will still get the same result
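The iteration traced on the preceding slides can be sketched in Java as a fixed-point loop. This is only an illustration, not the algorithm text from page 54: the class name FirstSets, the rule encoding (left-hand side followed by the right-hand-side symbols) and the marker string "eps" for ε are all assumptions made for this sketch. Any symbol that never appears as a left-hand side is treated as a terminal.

```java
import java.util.*;

class FirstSets {
    static final String EPS = "eps";   // stands for the empty string (assumed marker)

    // Grammar of Fig. 4.4; each rule is {lhs, rhs symbols...}; an empty rhs means epsilon.
    static final String[][] RULES = {
        {"S'", "S", "$"}, {"S", "A", "B"}, {"S", "C", "f"},
        {"A", "e", "f"}, {"A"}, {"B", "h", "g"},
        {"C", "D", "D"}, {"C", "f", "i"}, {"D", "g"}
    };

    // First set of the symbol string rhs[from..], using the First sets found so far.
    static Set<String> firstOfString(String[] rhs, int from, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (int i = from; i < rhs.length; i++) {
            Set<String> f = first.get(rhs[i]);
            if (f == null) {                 // terminal: it begins the string
                result.add(rhs[i]);
                return result;
            }
            for (String t : f) if (!t.equals(EPS)) result.add(t);
            if (!f.contains(EPS)) return result;   // this symbol cannot vanish
        }
        result.add(EPS);                     // every symbol can derive epsilon
        return result;
    }

    static Map<String, Set<String>> compute(String[][] rules) {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        for (String[] r : rules) first.putIfAbsent(r[0], new TreeSet<>());
        boolean changed = true;
        while (changed) {                    // repeat until a full pass adds nothing
            changed = false;
            for (String[] r : rules) {
                changed |= first.get(r[0]).addAll(firstOfString(r, 1, first));
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(compute(RULES));
    }
}
```

Because the loop only stops after a pass with no additions, the order in which rules are visited affects the number of passes but not the final sets.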

4.1.2.2 Finding Follow Sets • If the grammar contains the rule: S → Aa, then a is in the Follow set of A, since it appears immediately after A • If the grammar contains the rules: S → AB; B → a | b, then both a and b are in the Follow set of A. Why? • Consider the following two partial derivations: S ⇒ AB ⇒ Aa and S ⇒ AB ⇒ Ab. So both a and b are in the Follow set of A. • If the grammar contains the rules: S → ABC; B → a | b | ε; C → c | d, then a, b, c and d are all in the Follow set of A. Why? Because B can derive ε, anything that can begin C can also appear right after A.

4.1.2.2 Finding Follow Sets • Consider the grammar: S → Ax; A → C. Then x is in the Follow sets of A and C. Why? Because C is the last symbol on the right-hand side of a rule for A, anything that follows A also follows C. • Consider another grammar: S → Ax; A → CD; D → ε. Then x is in the Follow sets of A, C and D. Why? Because D can derive ε, anything that follows A can also follow C. • We now use the algorithms on page 56 to find the Follow set of each non-terminal in the CFG in Fig. 4.4 as below: • Assume the CFG in Fig. 4.4 is G. At the beginning, for each non-terminal A in G, set Follow(A) = { }, the empty set.

4.1.2.2 Finding Follow Sets
• We can calculate the Follow sets from the First sets by using the recursive algorithm on page 56.
• For example, consider the following CFG, for which we have calculated the First sets of all non-terminals:
Start Symbol = S’
Non-terminals = {S’, S, T, U, V}
Terminals = {a, b, c, d}
Rules =
(0) S’ → S$
(1) S → TU
(2) T → aVa
(3) T → ε
(4) U → bVT
(5) V → Ub
(6) V → d
• The First sets of non-terminals are: S’ = {a,b}; S = {a,b}; T = {a, ε}; U = {b}; and V = {b,d}

Algorithm to Find Follow Sets (p56)
1. Find First(A) for all non-terminals A in G.
2. Set Follow(A) = { } for all non-terminals A in G.
3. For each rule A → γ in G, where γ is a string of terminals and non-terminals: for each non-terminal C in γ, write the rule in the form A → α C β, where α and β are (possibly empty) strings of terminals and non-terminals, and:
3.1 If First(β) does not contain ε, then add all elements of First(β) to Follow(C).
3.2 If First(β) contains ε (in particular, if β is empty), then add all elements of First(β) except ε and all elements of Follow(A) to Follow(C).
4. If any changes were made in step 3, go back to step 3 and repeat.
• Note: The First sets of non-terminals are (see previous slide): S’ = {a,b}; S = {a,b}; T = {a, ε}; U = {b}; and V = {b,d}

4.1.2.2 Finding Follow Sets
• At the beginning, we set Follow(A) = { } for all non-terminals A, and then go through each rule in the CFG on page 56
(0) S’ → S$: Add {$} to Follow(S) = {$}
(1) S → TU: Add First(U), {b}, to Follow(T) = {b}; add Follow(S), {$}, to Follow(U) = {$}
(2) T → aVa: Add {a} to Follow(V) = {a}
(3) T → ε (no change)
(4) U → bVT: Add First(T) except ε, {a}, to Follow(V) = {a}; add Follow(U), {$}, to Follow(T) = {b, $}; add Follow(U), {$}, to Follow(V) = {a, $} (since T can derive ε)
(5) V → Ub: Add {b} to Follow(U) = {b, $}
(6) V → d (no change)
• The Follow sets of non-terminals are: S’ = { }; S = {$}; T = {b, $}; U = {b, $}; and V = {a, $}
• Note: Since there were some changes, we need another iteration

4.1.2.2 Finding Follow Sets
(0) S’ → S$: Add $ to Follow(S) = {$} (no change)
(1) S → TU: Add First(U), {b}, to Follow(T) = {b, $} (no change); add Follow(S), {$}, to Follow(U) = {b, $} (no change)
(2) T → aVa: Add {a} to Follow(V) = {a, $} (no change)
(3) T → ε (no change)
(4) U → bVT: Add First(T) except ε, {a}, to Follow(V) = {a, $} (no change); add Follow(U), {b, $}, to Follow(T) = {b, $} (no change); add Follow(U), {b, $}, to Follow(V) = {a, b, $} (since T can derive ε)
(5) V → Ub: Add b to Follow(U) = {b, $} (no change)
(6) V → d (no change)
• The Follow sets of non-terminals are: S’ = { }; S = {$}; T = {b, $}; U = {b, $}; and V = {a, b, $}
• Note: Since there was 1 change, we need another iteration

4.1.2.2 Finding Follow Sets
(0) S’ → S$: Add $ to Follow(S) = {$} (no change)
(1) S → TU: Add First(U), {b}, to Follow(T) = {b, $} (no change); add Follow(S), {$}, to Follow(U) = {b, $} (no change)
(2) T → aVa: Add a to Follow(V) = {a, b, $} (no change)
(3) T → ε (no change)
(4) U → bVT: Add First(T) except ε, {a}, to Follow(V) = {a, b, $} (no change); add Follow(U), {b, $}, to Follow(T) = {b, $} (no change); add Follow(U), {b, $}, to Follow(V) = {a, b, $} (no change)
(5) V → Ub: Add b to Follow(U) = {b, $} (no change)
(6) V → d (no change)
• The Follow sets of non-terminals are: S’ = { }; S = {$}; T = {b, $}; U = {b, $}; and V = {a, b, $}
• Note: Since there were no changes, we stop.
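The Follow-set iteration traced above can be sketched in Java as follows. This is a hedged illustration rather than the code from page 56: the class name FollowSets, the rule encoding and the "eps" marker are assumptions, and the First sets are simply copied from the slide instead of being recomputed.

```java
import java.util.*;

class FollowSets {
    static final String EPS = "eps";   // stands for the empty string (assumed marker)

    // Rules of the example grammar: {lhs, rhs symbols...}; an empty rhs means epsilon.
    static final String[][] RULES = {
        {"S'", "S", "$"}, {"S", "T", "U"}, {"T", "a", "V", "a"}, {"T"},
        {"U", "b", "V", "T"}, {"V", "U", "b"}, {"V", "d"}
    };

    // First sets of the non-terminals, copied from the slide.
    static final Map<String, Set<String>> FIRST = new HashMap<>();
    static {
        FIRST.put("S'", Set.of("a", "b"));
        FIRST.put("S", Set.of("a", "b"));
        FIRST.put("T", Set.of("a", EPS));
        FIRST.put("U", Set.of("b"));
        FIRST.put("V", Set.of("b", "d"));
    }

    // First(beta) for beta = rhs[from..]; an empty beta yields {eps}.
    static Set<String> firstOf(String[] rhs, int from) {
        Set<String> out = new TreeSet<>();
        for (int i = from; i < rhs.length; i++) {
            Set<String> f = FIRST.get(rhs[i]);
            if (f == null) { out.add(rhs[i]); return out; }   // terminal
            for (String t : f) if (!t.equals(EPS)) out.add(t);
            if (!f.contains(EPS)) return out;
        }
        out.add(EPS);
        return out;
    }

    static Map<String, Set<String>> compute() {
        Map<String, Set<String>> follow = new TreeMap<>();
        for (String nt : FIRST.keySet()) follow.put(nt, new TreeSet<>());
        boolean changed = true;
        while (changed) {                                  // repeat while a pass adds something
            changed = false;
            for (String[] r : RULES) {
                for (int i = 1; i < r.length; i++) {
                    if (!FIRST.containsKey(r[i])) continue;   // only non-terminals get Follow sets
                    Set<String> beta = firstOf(r, i + 1);
                    Set<String> target = follow.get(r[i]);
                    boolean nullable = beta.remove(EPS);
                    changed |= target.addAll(beta);           // First(beta) without eps
                    if (nullable) {                           // beta can vanish: add Follow(lhs)
                        changed |= target.addAll(follow.get(r[0]));
                    }
                }
            }
        }
        return follow;
    }

    public static void main(String[] args) {
        System.out.println(compute());
    }
}
```

Run on the example grammar, this sketch ends with the same sets as the final slide of the trace, for example Follow(V) = {a, b, $}.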

4.1.3 LL(1) Parse Tables ($) • Once we have First and Follow sets, we can create a Parse Table • A parse table is a blueprint for the creation of a recursive descent parser (RDP) • The rows in the parse table are labeled with non-terminals and the columns are labeled with terminals • Each entry in the parse table is either empty or contains a grammar rule • The rule located in row S, column a of a parse table tells us which rule to apply when we are trying to parse the non-terminal S and the next symbol in the input is an a • For instance, for the grammar in Fig. 4.1 (p49), the parse table is:

4.1.3 LL(1) Parse Tables • Find the First sets and Follow sets of all non-terminals. • Place each rule of the form S → γ in row S in each column in First(γ), where γ is a string of terminals and non-terminals. • Place each rule of the form S → γ in row S in each column in Follow(S), where First(γ) contains ε. • Self-practice #1: Find the First and Follow sets (step by step) for the CFG in Figure 4.1 (p49), then use the CFG and the two sets to create the parse table as on page 57.
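The three construction steps above can be sketched in Java. As an assumption for illustration, the sketch reuses the example grammar from the Follow-set slides (rules (0) to (6) over S', S, T, U, V) with the First and Follow sets copied from those slides; the class name ParseTableBuilder, the "eps" marker and the table layout (row, then column, then a list of rules) are all hypothetical choices.

```java
import java.util.*;

class ParseTableBuilder {
    static final String EPS = "eps";   // stands for the empty string (assumed marker)

    static final String[][] RULES = {
        {"S'", "S", "$"}, {"S", "T", "U"}, {"T", "a", "V", "a"}, {"T"},
        {"U", "b", "V", "T"}, {"V", "U", "b"}, {"V", "d"}
    };
    // First and Follow sets copied from the slides.
    static final Map<String, Set<String>> FIRST = Map.of(
        "S'", Set.of("a", "b"), "S", Set.of("a", "b"), "T", Set.of("a", EPS),
        "U", Set.of("b"), "V", Set.of("b", "d"));
    static final Map<String, Set<String>> FOLLOW = Map.of(
        "S'", Set.<String>of(), "S", Set.of("$"), "T", Set.of("b", "$"),
        "U", Set.of("b", "$"), "V", Set.of("a", "b", "$"));

    // First set of the symbol string rhs[from..]; an empty string yields {eps}.
    static Set<String> firstOf(String[] rhs, int from) {
        Set<String> out = new TreeSet<>();
        for (int i = from; i < rhs.length; i++) {
            Set<String> f = FIRST.get(rhs[i]);
            if (f == null) { out.add(rhs[i]); return out; }   // terminal
            for (String t : f) if (!t.equals(EPS)) out.add(t);
            if (!f.contains(EPS)) return out;
        }
        out.add(EPS);
        return out;
    }

    // table.get(row).get(column) is the list of rules placed in that entry.
    static Map<String, Map<String, List<String>>> build() {
        Map<String, Map<String, List<String>>> table = new TreeMap<>();
        for (String[] r : RULES) {
            Set<String> cols = firstOf(r, 1);
            if (cols.remove(EPS)) cols.addAll(FOLLOW.get(r[0]));   // rhs can derive epsilon
            String rhs = r.length == 1 ? EPS : String.join(" ", Arrays.copyOfRange(r, 1, r.length));
            for (String c : cols) {
                table.computeIfAbsent(r[0], k -> new TreeMap<>())
                     .computeIfAbsent(c, k -> new ArrayList<>())
                     .add(r[0] + " -> " + rhs);
            }
        }
        return table;
    }

    // The grammar is LL(1) exactly when no entry holds more than one rule.
    static boolean isLL1(Map<String, Map<String, List<String>>> table) {
        for (Map<String, List<String>> row : table.values())
            for (List<String> cell : row.values())
                if (cell.size() > 1) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

Storing a list in each entry, rather than a single rule, is a deliberate choice in this sketch: it makes duplicate entries visible instead of silently overwriting one rule with another.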

4.1.3 LL(1) Parse Tables • Consider again the CFG in Fig. 4.4. (p52) We have the First and Follow sets of each non-terminal:

4.1.3 LL(1) Parse Tables • We can now create the parse table for this grammar as below:

4.1.4 Creating LL(1) Parsers -- Example • We can create a sample LL(1) parser by using the CFG in Fig. 4.7 to generate the First set, Follow set, and parse table as on page 59. • Once we have the parse table, creating a suite of recursive functions is straightforward. For instance, the function to parse an S based on the above parse table is shown on page 60. • Self-Practice #2: Use the CFG in Figure 4.7 to generate the First set and Follow set (step by step), and create the parse table as on page 59.
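Since Fig. 4.7 is not reproduced on these slides, here is a minimal sketch of the same idea for the grammar of Fig. 4.4 instead: one method per non-terminal, each choosing a rule from the next token. The class name Fig44Parser and the one-character-per-token input are assumptions for illustration. The rule choices follow the First sets derived earlier, and A → ε is chosen on h because h is in Follow(A).

```java
class Fig44Parser {
    private final String input;   // one character per token, with '$' appended as end marker
    private int pos = 0;

    Fig44Parser(String tokens) { this.input = tokens + "$"; }

    private char peek() { return input.charAt(pos); }

    private void match(char expected) {
        if (peek() != expected)
            throw new IllegalStateException("expected " + expected + " at position " + pos);
        pos++;
    }

    void parseSPrime() { parseS(); match('$'); }                  // S' -> S $

    void parseS() {
        char t = peek();
        if (t == 'e' || t == 'h') { parseA(); parseB(); }         // S -> A B
        else if (t == 'f' || t == 'g') { parseC(); match('f'); }  // S -> C f
        else throw new IllegalStateException("unexpected " + t + " at position " + pos);
    }

    void parseA() {
        if (peek() == 'e') { match('e'); match('f'); }            // A -> e f
        else if (peek() != 'h')                                   // A -> eps only when h follows
            throw new IllegalStateException("unexpected " + peek() + " at position " + pos);
    }

    void parseB() { match('h'); match('g'); }                     // B -> h g

    void parseC() {
        if (peek() == 'g') { parseD(); parseD(); }                // C -> D D
        else { match('f'); match('i'); }                          // C -> f i
    }

    void parseD() { match('g'); }                                 // D -> g

    // Returns true when the whole token string is derivable from S'.
    static boolean accepts(String tokens) {
        Fig44Parser p = new Fig44Parser(tokens);
        try {
            p.parseSPrime();
            return p.pos == p.input.length();   // nothing may remain after the end marker
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("efhg") + " " + accepts("ef"));
    }
}
```

The sketch accepts exactly the four strings listed earlier for this grammar (efhg, fif, ggf, hg) and rejects anything else.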

4.2 Grammars That Are Not LL(1) • If the LL(1) parse table built for a grammar has no duplicate entries, then we say that grammar is LL(1) • Unfortunately, not all grammars are LL(1). For instance, the grammar in Fig. 4.8 yields the First and Follow sets and the parse table on page 60 • The parse table includes only one non-terminal E, but it has 5 entries in the id column. Because of these duplicate entries, the grammar in Fig. 4.8 is not LL(1) and we cannot create an LL(1) parser for it. The underlying problem is that the grammar is ambiguous, and no ambiguous grammar is LL(1) • If we wish to be able to parse the language described by Fig. 4.8, we need to create an equivalent grammar that is not ambiguous

4.2.1 Removing Ambiguity • There are four ways in which ambiguity can creep into (get into) a CFG for a programming language: • Defining expressions: the straightforward definition of expressions will often lead to ambiguity, such as Fig. 4.8. • Defining complex variables: complex variables, such as instance variables in classes, fields in records or structs, array subscripts and pointer references, can also lead to ambiguity • Overlap between specific and general cases: For example, in Fig. 4.9 (p61), the terminal id has several leftmost derivations (and hence several parse trees): E ⇒ T, E ⇒ T ⇒ id, E ⇒ T ⇒ F ⇒ id. • Nesting statements: For instance, the nested “if e then if e then a else a” statement on page 62 has two parse trees • There are some languages that are inherently ambiguous • There is no algorithm that will always remove ambiguity from a context-free grammar

4.2.2 Removing Left Recursion • An unambiguous grammar may still not be LL(1) • A rule S → α (where S is a non-terminal and α is a string of terminals and non-terminals) is left-recursive if the first symbol in α is S • The rules (1), (2), (4) and (5) in Fig. 4.10 (p63) are left-recursive • No left-recursive grammar is LL(1)

4.2.2 Removing Left Recursion • For example: S → Sα; S → β • Consider the following partial derivation: S ⇒ Sα ⇒ Sαα ⇒ Sααα ⇒ βααα • Using EBNF notation, we have: S → β(α)* • Using CFG notation, we have: S → βA; A → αA; A → ε • We have removed the left-recursion in the above example!

4.2.2 Removing Left Recursion
• In general, the set of rules of the form:
S → Sα1; S → Sα2; S → Sα3; ... ; S → Sαn
S → β1; S → β2; S → β3; ... ; S → βm
• Can be rewritten as:
S → BA
B → β1 | β2 | β3 | ... | βm
A → α1A | α2A | α3A | ... | αnA
A → ε

4.2.2 Removing Left Recursion
• Let’s take a closer look at the expression grammar:
E → E + T
E → E – T
E → T
• Using the above transformation, we get the following CFG, which has no left-recursion:
E → TE’
E’ → +TE’
E’ → -TE’
E’ → ε
• Using EBNF notation, we have: E → T((+T) | (-T))*
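The transformation can also be sketched as a small Java routine. This is an illustration under assumptions: the class name LeftRecursion, the representation of each alternative as a list of symbols, the "eps" marker and the primed name for the new non-terminal are all hypothetical. The sketch handles only immediate left recursion for one non-terminal, and it appends the new non-terminal directly to each β instead of introducing a separate B, which generates the same strings.

```java
import java.util.*;

class LeftRecursion {
    // Rewrites  S -> S a1 | ... | S an | b1 | ... | bm
    // as        S -> b1 S' | ... | bm S'   and   S' -> a1 S' | ... | an S' | eps.
    static List<String> remove(String nt, List<List<String>> alternatives) {
        String tail = nt + "'";                       // new non-terminal, assumed unused
        List<String> baseRules = new ArrayList<>();
        List<String> tailRules = new ArrayList<>();
        boolean recursive = false;
        for (List<String> alt : alternatives) {
            if (!alt.isEmpty() && alt.get(0).equals(nt)) {
                recursive = true;
                List<String> rhs = new ArrayList<>(alt.subList(1, alt.size()));
                rhs.add(tail);                        // S -> S alpha  becomes  S' -> alpha S'
                tailRules.add(tail + " -> " + String.join(" ", rhs));
            } else {
                List<String> rhs = new ArrayList<>(alt);
                rhs.add(tail);                        // S -> beta  becomes  S -> beta S'
                baseRules.add(nt + " -> " + String.join(" ", rhs));
            }
        }
        List<String> out = new ArrayList<>();
        if (!recursive) {                             // nothing to do: keep the rules as they are
            for (List<String> alt : alternatives) {
                out.add(nt + " -> " + (alt.isEmpty() ? "eps" : String.join(" ", alt)));
            }
            return out;
        }
        out.addAll(baseRules);
        out.addAll(tailRules);
        out.add(tail + " -> eps");
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> e = List.of(
            List.of("E", "+", "T"), List.of("E", "-", "T"), List.of("T"));
        for (String rule : remove("E", e)) System.out.println(rule);
    }
}
```

Applied to the expression grammar above, the routine yields the same four rules as the slide, with E' playing the role of A.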

4.2.3 Left Factoring
• Even if a CFG is unambiguous and has no left-recursion, it still may not be LL(1).
• Consider the following two Fortran do statements:
do var = initial, final
    loop body
end do
and
do var = initial, final, inc
    loop body
end do

4.2.3 Left Factoring
• We can describe the Fortran do statement with the following context free grammar fragment:
S → do LS
L → id = exp, exp
L → id = exp, exp, exp
• This CFG is not LL(1). Why? Because both rules for L begin with the same token, id, so we cannot tell which rule to use by looking only at the next input symbol.
• We can fix this problem by left-factoring the common prefix of the two rules as follows:
S → do LS
L → id = exp, exp L’
L’ → , exp
L’ → ε
• Using EBNF notation, the Fortran do statement can also be written as follows:
S → do LS
L → id = exp, exp (, exp)? // single rule for L

4.2.3 Left Factoring • In general, if we have the following context-free grammar, where α and βk stand for strings of terminals and non-terminals: S → αβ1; S → αβ2; S → αβ3; ... ; S → αβn • We could left-factor it to get the CFG: S → αB; B → β1; B → β2; B → β3; ... ; B → βn • Using EBNF notation, we get: S → α(β1 | β2 | β3 | ... | βn)
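A minimal Java sketch of this general transformation follows. The names are assumptions for illustration (class LeftFactoring, alternatives as lists of symbols, "eps" for ε, a primed name for the new non-terminal B), and the sketch only factors a prefix shared by all alternatives of one non-terminal, which is the case shown above.

```java
import java.util.*;

class LeftFactoring {
    // Longest sequence of symbols that starts every alternative.
    static List<String> commonPrefix(List<List<String>> alts) {
        List<String> prefix = new ArrayList<>(alts.get(0));
        for (List<String> alt : alts) {
            int n = 0;
            while (n < prefix.size() && n < alt.size() && prefix.get(n).equals(alt.get(n))) n++;
            prefix = new ArrayList<>(prefix.subList(0, n));
        }
        return prefix;
    }

    // Rewrites  S -> alpha b1 | ... | alpha bn  as  S -> alpha S'  and  S' -> b1 | ... | bn.
    static List<String> factor(String nt, List<List<String>> alts) {
        List<String> alpha = commonPrefix(alts);
        List<String> out = new ArrayList<>();
        if (alpha.isEmpty()) {                        // no shared prefix: keep the rules as they are
            for (List<String> alt : alts) {
                out.add(nt + " -> " + (alt.isEmpty() ? "eps" : String.join(" ", alt)));
            }
            return out;
        }
        String tail = nt + "'";                       // new non-terminal, assumed unused
        List<String> head = new ArrayList<>(alpha);
        head.add(tail);
        out.add(nt + " -> " + String.join(" ", head));
        for (List<String> alt : alts) {
            List<String> beta = alt.subList(alpha.size(), alt.size());
            out.add(tail + " -> " + (beta.isEmpty() ? "eps" : String.join(" ", beta)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> l = List.of(
            List.of("id", "=", "exp", ",", "exp"),
            List.of("id", "=", "exp", ",", "exp", ",", "exp"));
        for (String rule : factor("L", l)) System.out.println(rule);
    }
}
```

For the two Fortran rules for L, the shared prefix is the whole shorter rule, so the new non-terminal gets one ε rule and one rule for the optional increment.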

4.3 LL(K) Parsers ($) • An LL(1) parser needs to decide which rule to apply after looking at only one token • If more than a single token is required to determine which rule to apply, then the grammar is not LL(1) • For instance, consider the following simple CFG: Terminals = {a, b, c}; Non-terminals = {S}; Rules = (1) S → abc (2) S → acb; Start symbol = S • This grammar is not LL(1), since the LL(1) parse table has duplicate entries (both rules land in row S, column a):

4.3 LL(K) Parsers • We could left-factor the grammar to make it LL(1) • We also could modify our parser so that it examines the first two elements in the string to determine which rule to apply • The resulting parse table would be much larger as follows:

4.3 LL(K) Parsers • An LL(k) parser examines the first k symbols in the input before determining which rule to apply • In order to create an LL(k) parser, we need to generalize the definitions of First and Follow sets • Our definitions of generalized First and Follow sets will use the concept of k-prefix • Definition 9 k-prefix (p66): The k-prefix of a string of terminals w is a string consisting of the first k terminals in w. If |w| ≤ k, then the k-prefix of w is w. The k-prefix of a set of strings is the set of the k-prefixes of its members. • For example, given the terminals {a, b, c, d}, the 3-prefix of the set {abcd, abcc, abdd, ab, ε} is the set {abc, abd, ab, ε}

4.3 LL(K) Parsers • Definition 10 Firstk: The Firstk set of a non-terminal S is the k-prefix of the set of all strings of terminals derivable from S. The Firstk set of a string of terminals and non-terminals γ is the k-prefix of the set of all strings of terminals derivable from γ. • Definition 11 Followk: The Followk set of a non-terminal S is the k-prefix of the set of all strings of terminals that follow S in a partial derivation.

4.3 LL(K) Parsers
Algorithm to calculate Firstk for non-terminals
1. For each non-terminal S in G, set Firstk(S) = { }
2. For each rule S → γ in G, add all elements of k-prefix(Firstk(γ)) to Firstk(S)
3. If any changes were made in step 2, go back to step 2 and repeat

4.3 LL(K) Parsers Algorithm to calculate Firstk for a string of terminals and non-terminals: • For any terminal a, Firstk(a) = {a} • For any string of terminals and non-terminals γ = γ1γ2γ3…γn, Firstk(γ) = k-prefix( Firstk(γ1) ○ Firstk(γ2) ○ Firstk(γ3) ○ … Firstk(γn))
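The k-prefix and the concatenation operator ○ used above can be sketched in Java. The encoding is an assumption for illustration: each terminal is one character, ε is the empty Java string, and the class and method names (KPrefix, kPrefix, concat) are hypothetical.

```java
import java.util.*;

class KPrefix {
    // k-prefix of one string of terminals (one character per terminal).
    static String kPrefix(String w, int k) {
        return w.length() <= k ? w : w.substring(0, k);
    }

    // k-prefix of a set of strings: the set of the k-prefixes of its members.
    static Set<String> kPrefix(Set<String> strings, int k) {
        Set<String> out = new TreeSet<>();
        for (String w : strings) out.add(kPrefix(w, k));
        return out;
    }

    // k-prefix of the concatenation of two sets: every x in xs followed by every y in ys.
    static Set<String> concat(Set<String> xs, Set<String> ys, int k) {
        Set<String> out = new TreeSet<>();
        for (String x : xs)
            for (String y : ys)
                out.add(kPrefix(x + y, k));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(kPrefix(Set.of("abcd", "abcc", "abdd", "ab", ""), 3));
    }
}
```

Truncating after each concatenation, as concat does, keeps the intermediate sets small; longer strings could never contribute more than their first k terminals anyway.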

4.3 LL(K) Parsers
Algorithm to calculate Followk for non-terminals
1. Calculate Firstk(S) for all non-terminals S in G
2. Set Followk(S) = { } for all non-terminals S in G
3. For each rule S → γ in G: for each non-terminal S1 in γ, where γ = α S1 β, add [k-prefix(Firstk(β) ○ Followk(S))] to Followk(S1). If Followk(S) = { }, add [k-prefix(Firstk(β))] to Followk(S1).
4. If any changes were made in step 3, go back to step 3 and repeat

4.3 LL(K) Parsers Algorithm to create an LL(k) parse table • For each rule of the form S → γ, place the rule in row S, in all columns labeled with an element of k-prefix(Firstk(γ) ○ Followk(S))

4.3.1 Local Lookahead in LL(k) Parsers • If a CFG has t terminals and n non-terminals, then the number of entries in the LL(k) parse table can be n × t^k. This makes for some pretty big parse tables • Rather than trying to create an LL(k) parser by creating a complete LL(k) parse table, we can create an LL(1) parse table, which may have duplicate entries • When creating the parser from this parse table, we will use a local lookahead greater than 1 to disambiguate multiple parse table entries • For instance, consider the following CFG fragment for simple Java, where S generates statements and E generates expressions (Note: GETS means “=”):
S → <IDENTIFIER> <GETS> E <SEMICOLON>
S → <IDENTIFIER> <IDENTIFIER> <SEMICOLON>

4.3.1 Local Lookahead in LL(k) Parsers • The parse table for this grammar will have a duplicate entry in row S, column <IDENTIFIER> • We can leave this duplicate entry in the parse table, and we can add an extra check when we build the parser • See the ParserS() program on pages 67-68 • JavaCC allows for both local (for the rules of one non-terminal) and global (for the rules of all non-terminals) lookahead greater than 1.
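The extra check can be sketched in Java as follows. This is not the ParserS() program from pages 67-68, only a hedged illustration of the idea: the class name LocalLookahead, the method chooseRule and the representation of the token stream as a list of token-kind names are assumptions.

```java
import java.util.*;

class LocalLookahead {
    // Picks the rule for S by looking one token past the leading <IDENTIFIER>.
    // Returns 1 for S -> <IDENTIFIER> <GETS> E <SEMICOLON>
    // and     2 for S -> <IDENTIFIER> <IDENTIFIER> <SEMICOLON>.
    static int chooseRule(List<String> tokens, int pos) {
        if (pos < tokens.size() && tokens.get(pos).equals("IDENTIFIER")) {
            // The first token cannot separate the two rules, so peek at the second one.
            String second = pos + 1 < tokens.size() ? tokens.get(pos + 1) : "EOF";
            if (second.equals("GETS")) return 1;
            if (second.equals("IDENTIFIER")) return 2;
        }
        throw new IllegalArgumentException("no rule for S at token " + pos);
    }

    public static void main(String[] args) {
        System.out.println(chooseRule(List.of("IDENTIFIER", "GETS", "IDENTIFIER", "SEMICOLON"), 0));
    }
}
```

Only this one decision point pays for the second token of lookahead; every other rule in the parser can keep using a single token.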

4.4 JavaCC – An LL(k) Parser Generator • JavaCC can be used to build lexical analyzers as well as recursive descent parsers • JavaCC takes the EBNF grammar for a language as input, and creates a suite of mutually recursive methods that implement an LL(k) parser for that language

4.4.1 JavaCC File Format Revisited
• JavaCC .jj files have the following format:
options {
    /* Code to set various option flags */
}
PARSER_BEGIN(foo)
public class foo {
    /* Extra parser method definitions go here, */
    /* often a main program which drives the parser. */
}
PARSER_END(foo)
TOKEN_MGR_DECLS : {
    /* Declarations used by the lexical analyzer */
}
/* Token rules and actions */
/* JavaCC rules and actions */

4.4.2 JavaCC Rules • JavaCC uses EBNF rules to describe the grammar of the language to be parsed • Each JavaCC rule has the form: void nonTerminalName() : { /* Java Declarations */ } { /* Rule Declarations */ } • The “Java Declarations” block will be used for building abstract syntax trees; we will leave this block blank for now • The “Rule Declarations” block defines the right-hand side of the rules that we are writing • Non-terminals in these rules represent function calls, and so are followed by (). Terminal names are enclosed in < and >.

4.4.2 JavaCC Rules
• For instance, the rules S → while (E) S and S → V = E;
• Would be represented by the JavaCC rule:
void statement() :
{ } /* the Java Declarations block is blank */
{
    <WHILE> <LPAREN> expression() <RPAREN> statement()
  | variable() <GETS> expression() <SEMICOLON>
}
• Note: by convention, terminal names are written in uppercase letters and non-terminal names in lowercase letters
• All terminals are inside < and >
• Non-terminals represent method calls in the generated parser. Hence the () after each non-terminal in the rule

4.4.2 JavaCC Rules • Consider the sample prefix expressions in the table on page 69; a CFG for this language is in Fig. 4.11 (p70) • The rule for non-terminal expression() could also be written as on page 70 • A JavaCC .jj file for this language is in Fig. 4.12 (p71) • When we run JavaCC on prefix.jj, the following files will be created: • Utility files: ParseException.java and SimpleCharStream.java • Token management files: Token.java, TokenMgrError.java, prefixTokenManager.java and prefixConstants.java • Parser implementation: prefix.java

4.4.2 JavaCC Rules • The class prefix, defined in prefix.java, contains the code that implements the parser • A public method is defined for each non-terminal, which will parse any string of tokens that can be derived from that non-terminal • A main program that calls the generated parser is in Fig. 4.13 (p72) • We could also include the main method in prefix.java, by placing its definition within the “public class prefix” section of prefix.jj • Another parser for infix expressions can be found in Fig. 4.14 (p73) • We can also mix and match anonymous tokens with standard named tokens. A version of infix.jj that uses some anonymous tokens can be found in Fig. 4.15 (p74)

4.4.4 Using LOOKAHEAD Directives
• By default, JavaCC creates an LL(1) parser for the grammar described in the .jj file.
• What if the grammar is not LL(1)?
• Consider the following JavaCC rule:
void S() :
{ }
{
    "a" "b" "c"
  | "a" "d" "c"
}
• The above rule cannot appear in an LL(1) grammar, because both alternatives begin with "a", but it can appear in an LL(2) grammar
• JavaCC not only generates LL(1) parsers, but also generates LL(k) parsers
• We can set the global LOOKAHEAD to 2 instead of 1 by adding the line “LOOKAHEAD = 2;” to the options section of the .jj file. But this will make the generated parser much larger
