Chapter 4: Syntax Analysis Part 1: Grammar Concepts

Chapter 4: Syntax AnalysisPart 1: Grammar Concepts Prof. Steven A. Demurjian Computer Science & Engineering Department The University of Connecticut 371 Fairfield Way, Unit 2155 Storrs, CT 06269-3155 steve@engr.uconn.edu http://www.engr.uconn.edu/~steve (860) 486 - 4818 Material for course thanks to: Laurent Michel Aggelos Kiayias Robert LeBarre

Syntax Analysis - Parsing • An overview of parsing : Functions & Responsibilities • Context Free Grammars: Concepts & Terminology • Writing and Designing Grammars • Resolving Grammar Problems / Difficulties • Top-Down Parsing • Recursive Descent & Predictive LL • Bottom-Up Parsing • LR & LALR • Key Issues in Error Handling • Concluding Remarks/Looking Ahead

An Overview of Parsing • Why are Grammars to formally describe Languages Important ? • Precise, easy-to-understand representations • Compiler-writing tools can take grammar and generate a compiler • allow language to be evolved (new statements, changes to statements, etc.) Languages are not static, but are constantly upgraded to add new features or fix “old” ones • ADA  ADA9x, C++ Adds: Templates, exceptions, How do grammars relate to parsing process ?

Parsing During Compilation errors token lexical analyzer rest of front end source program parse tree parser get next token symbol table regular expressions interm repres • also technically part or parsing • includes augmenting info on tokens in source, type checking, semantic analysis • uses a grammar to check structure of tokens • produces a parse tree • syntactic errors and recovery • recognize correct syntax • report errors

Parsing Responsibilities • Syntax Error Identification / Handling • Recall typical error types: • Lexical : Misspellings • Syntactic : Omission, wrong order of tokens • Semantic : Incompatible types • Logical : Infinite loop / recursive call • Majority of error processing occurs during syntax analysis • NOTE: Not all errors are identifiable !! Which ones?

Key issues in Error Handling • Detection • Actual Detection • Finding position at which they occur • Reporting • Clear / accurate presentation • Recovery • How to skip overto continue and find later errors • Cannot impact compilation of correct programs

What are some Typical Errors ? #include<stdio.h> int f1(int v) { int i,j=0; for (i=1;i<5;i++) { j=v+f2(i) } return j; } int f2(int u) { int j; j=u+f1(u*u) return j; } int main() { int i,j=0; for (i=1;i<10;i++) { j=j+i*I; printf(“%d\n”,i); printf("%d\n",f1(j)); return 0; } As reported by MS VC++ 'f2' undefined; assuming extern returning intsyntax error : missing ';' before ‘}‘syntax error : missing ';' before ‘return‘fatal error : unexpected end of file found Which are “easy” to recover from? Which are “hard” ?

Error Recovery Strategies Panic Mode – Discard tokens until a “synchro” token is found ( end, “;”, “}”, etc. ) -- Decision of designer -- Problems: skip input miss declaration – causing more errors miss errors in skipped material -- Advantages: simple suited to 1 error per statement Phase Level – Local correction on input -- “,” ”;” – Delete “,” – insert “;” -- Also decision of designer -- Not suited to all situations -- Used in conjunction with panic mode to allow less input to be skipped

Error Recovery Strategies – (2) Error Productions: -- Augment grammar with rules -- Augment grammar used for parser construction / generation -- Add a rule for := in C assignment statements Report error but continue compile -- Self correction + diagnostic messages -- What are other rules ? (supported by yacc) Global Correction: -- Adding / deleting / replacing symbols is chancy – may do many changes ! -- Algorithms available to minimize changes costly - key issues What do you think of each approach? What approach do compilers you’ve used take ?

Motivating Grammars Reg. Lang. CFLs • Regular Expressions •  Basis of lexical analysis •  Represent regular languages • Context Free Grammars •  Basis of parsing •  Represent language constructs •  Characterize context free languages EXAMPLE: anbn , n  1 : Is it regular ?

Context Free Languages Context Free Languages Regular Languages Context Free Languages Reg. Languages Token Structure Sentence Structure Parser Generation Scanner Generation

Context Free Grammars : Concepts & Terminology Definition: A Context Free Grammar, CFG, is described by T, NT, S, PR, where: T: Terminals / tokens of the language NT: Non-terminals to denote sets of strings generatable by the grammar & in the language S: Start symbol, SNT, which defines all strings of the language PR: Production rules to indicate how T and NT are combines to generate valid strings of the language. PR: NT  (T | NT)* Like a Regular Expression / DFA / NFA, a Context Free Grammar is a mathematical model !

Context Free Grammars : A First Look assign_stmt id := expr ; expr  term operator term term id term real term integer operator + operator - What do “BLUE” symbols represent? What do “BLACK” symbols represent? Derivation: A sequence of grammar rule applications and substitutions that transform a starting non-term into a collection of terminals / tokens. Simply stated: Grammars / production rules allow us to “rewrite” and “identify” correct syntax.

How is Grammar Used ? Given the rules on the previous slide, suppose id := real + int; is input. Is it syntactically correct? How do we know? expr is represented as: expr term operator term Is this accurate / complete? expr  expr operator term expr  term How does this affect the derivations which are possible?

Derivation and Parse Tree Let’s derive: id := id + real – integer ; What’s its parse tree ?

Derivation Example • Let’s derive: id := id + real – integer ; assign_stmt assign_stmt → id := expr ; → id := expr ; expr → expr operator term → id := expr operator term; expr → expr operator term → id := expr operator term operator term; expr → term → id := term operator term operator term; term → id → id := id operator term operator term; operator → + → id := id + term operator term; term → real → id := id + real operator term; operator → - → id := id + real - term; term → integer → id := id + real - integer; assign_stmt → id := expr ; expr → expr operator term expr → term term → id term → real term → integer operator → + operator → -

Example Grammar expr  expr op expr expr ( expr ) expr - expr expr id op + op - op * op / op  Black : NT Blue : T expr : S 9 Production rules To simplify / standardize notation, we offer a synopsis of terminology.

Example Grammar - Terminology Terminals: a,b,c,+,-,punc,0,1,…,9, blue strings Non Terminals: A,B,C,S, black strings T or NT: X,Y,Z Strings of Terminals: u,v,…,z in T* Strings of T / NT:  ,  in ( T  NT)* Alternatives of production rules: A 1; A 2; …; A k;  A  1 | 2 | … | 1 First NT on LHS of 1st production rule is designated as start symbol ! E  E A E | ( E ) | -E | id A  + | - | * | / | 

Grammar Concepts A step in a derivation is zero or one action that replaces a NT with the RHS of a production rule. EXAMPLE: E  -E (the  means “derives” in one step) using the production rule: E  -E EXAMPLE: E  E A E  E * E  E * ( E ) DEFINITION:  derives in one step  derives in  one step  derives in  zero steps + * EXAMPLES:  A      if A  is a production rule 1  2 …  n  1  n ;    for all  If    and    then    * * * *

How does this relate to Languages? + Let G be a CFG with start symbol S. Then S W (where W has no non-terminals) represents the language generated by G, denoted L(G). So WL(G)  S W. + W : is a sentence of G When S  (and  may have NTs) it is called a sentential form of G. EXAMPLE: id * id is a sentence Here’s the derivation: E  E A E E * E  id * E  id * id E id * id Sentential forms *

Leftmost and Rightmost Derivations   lm rm Leftmost: Replace the leftmost non-terminal symbol E  E A E  id A E  id * E  id * id Rightmost: Replace the leftmost non-terminal symbol E  E A E  E A id  E *id  id * id lm lm lm lm rm rm rm rm Important Notes: A   If A    , what’s true about  ? If A    , what’s true about  ? Derivations: Actions to parse input can be represented pictorially in a parse tree.

Examples of LM / RM Derivations E  E A E | ( E ) | -E | id A  + | - | * | / |  A leftmost derivation of : id + id * id A rightmost derivation of : id + id * id

Derivations & Parse Tree E E E E * E E E E A A A A E E E E id * id id * E  E A E  E * E  id * E  id * id

Parse Trees and Derivations E E E E E + E  id+ E E E E E + + * + E E E E id+ E  id+ E * E id id Consider the expression grammar: E  E+E | E*E | (E) | -E | id Leftmost derivations of id + id * id E  E + E

Parse Tree & Derivations - continued id+ E * E  id+id* E id id id id id E E E E E E E E + * * + E E E E id+id* E  id+id* id

Alternative Parse Tree & Derivation E E E * E E + E id id id E  E * E  E + E * E  id + E * E  id + id * E  id + id * id WHAT’S THE ISSUE HERE ?

Example Pascal Grammar in Yacc %term T_AND T_ARRAY T_BEGIN T_CASE T_CONST T_DIV T_DO T_RANGE T_TO T_ELSE T_END T_FILE T_FOR T_FUNCTION /* T_GOTO */ T_IDENTIFIER T_IF T_IN T_INTEGER T_LABEL T_MOD T_NOT T_REAL T_OF T_OR /* T_PACKED */ T_NIL T_PROCEDURE T_PROGRAM T_RECORD T_REPEAT T_SET T_STRING T_THEN T_DOWNTO T_TYPE T_UNTIL T_VAR T_WHILE /* T_WITH */ T_COMMA T_PERIOD T_ASSIGN T_COLON T_LPAREN T_RPAREN T_LBRACK T_RBRACK T_UPARROW T_SEMI %binary T_LT T_EQ T_GT T_GE T_LE T_NE T_IN %left T_PLUS T_MINUS T_OR %left UNARYSIGN %left T_MULT T_RDIV T_DIV T_MOD T_AND %left T_NOT %%

Example Pascal Grammar in Yacc Start : Program_Header Declarations Block T_PERIOD Program_Header : T_PROGRAM T_IDENTIFIER T_LPAREN Id_List T_RPAREN T_SEMI Block : T_BEGIN Stt_List T_END Declarations : Const_Def_Part Type_Def_Part Var_Decl_Part Routine_Decl_Part Const_Def_Part : T_CONST Const_Def_List | /* epsilon */ ; Const_Def_List : Const_Def | Const_Def_List Const_Def ; Const_Def : T_IDENTIFIER T_EQ Constant T_SEMI Type_Def_Part : T_TYPE Type_Def_List | /* epsilon */ Type_Def_List : Type_Def | Type_Def_List Type_Def

Example Pascal Grammar in Yacc Type_Def : T_IDENTIFIER T_EQ Type T_SEMI Var_Decl_Part : T_VAR Var_Decl_List | /* epsilon */ Var_Decl_List : Var_Decl_List Var_Decl | Var_Decl Var_Decl : Id_List T_COLON Type T_SEMI Routine_Decl_Part : Routine_Decl_Part Routine_Decl | /* epsilon */ Routine_Decl : Routine_Head T_IDENTIFIER T_SEMI | Routine_Head Declarations Block T_SEMI Routine_Head : T_PROCEDURE T_IDENTIFIER Params T_SEMI | T_FUNCTION T_IDENTIFIER Params Function_Type T_SEMI Params : T_LPAREN Params_List T_RPAREN | /* epsilon */ Param_Group : Id_List T_COLON Type | T_VAR Id_List T_COLON Type

Example Pascal Grammar in Yacc Function_Type : T_COLON Type | /* epsilon */ Params_List : Param_Group | Params_List T_SEMI Param_Group Constant : T_STRING | Number | T_PLUS Number | T_MINUS Number Number : T_IDENTIFIER | T_INTEGER | T_REAL Const_List : Constant | Const_List T_COMMA Constant Type : Simple_Type | T_UPARROW T_IDENTIFIER | Structured_Type Simple_Type : T_IDENTIFIER | T_LPAREN Id_List T_RPAREN | Constant T_RANGE Constant Structured_Type : T_ARRAY T_LBRACK Simple_Type_List T_RBRACK T_OF Type | T_SET T_OF Simple_Type | T_RECORD Field_List T_END Simple_Type_List : Simple_Type | Simple_Type_List T_COMMA Simple_Type

Example Pascal Grammar in Yacc Field_List : Fixed_Part /* Variant_part */ Fixed_Part : Field | Fixed_Part T_SEMI Field Field : /* epsilon */ | Id_List T_COLON Type Variant_part : / * epsilon * / | T_CASE T_IDENTIFIER T_OF Variant_List | T_CASE T_IDENTIFIER T_COLON T_IDENTIFIER T_OF Variant_List Variant_List : Variant | Variant_List T_SEMI Variant Variant : / * epsilon * / | Const_List T_COLON T_LPAREN Field_List T_RPAREN Stt_List : Statement | Stt_List T_SEMI Statement Case_Stt_List : Case_Statement | Case_Stt_List T_SEMI Case_Statement Case_Statement : Const_List T_COLON Statement | /* epsilon */

Example Pascal Grammar in Yacc Statement : /* epsilon */ | T_IDENTIFIER | T_IDENTIFIER T_LPAREN Expr_List T_RPAREN | Variable T_ASSIGN Expression | T_BEGIN Stt_List T_END | T_CASE Expression T_OF Case_Stt_List T_END | T_WHILE Expression T_DO Statement | T_REPEAT Stt_List T_UNTIL Expression | T_FOR Variable T_ASSIGN Expression T_TO Expression T_DO Statement | T_FOR Variable T_ASSIGN Expression T_DOWNTO Expression T_DO Statement | T_IF Expression T_THEN Statement | T_IF Expression T_THEN Statement T_ELSE Statement Expression : Expression relop Expression %prec T_LT | T_PLUS Expression %prec UNARYSIGN | T_MINUS Expression %prec UNARYSIGN | Expression addop Expression %prec T_PLUS | Expression divop Expression %prec T_MULT | T_NIL | T_STRING | T_INTEGER | T_REAL | Variable | T_IDENTIFIER T_LPAREN Expr_List T_RPAREN | T_LPAREN Expression T_RPAREN | negop Expression %prec T_NOT | T_LBRACK Element_List T_RBRACK | T_LBRACK T_RBRACK

Resolving Grammar Problems/Difficulties Reg. Lang. CFLs Regular Expressions : Basis of Lexical Analysis Reg. Expr.  generate/represent regular languages Reg. Languages  smallest, most well defined class of languages Context Free Grammars: Basis of Parsing CFGs  represent context free languages CFLs  contain more powerful languages anbn – CFL that’s not regular! Try to write DFA/NFA for it !

Resolving Problems/Difficulties – (2) a start a b b 0 2 1 3 b Since Reg. Lang.  Context Free Lang., it is possible to go from reg. expr. to CFGs via NFA. Recall: (a | b)*abb

Resolving Problems/Difficulties – (3) a b i j i j Construct CFG as follows: • Each State I has non-terminal Ai : A0, A1, A2, A3 • If then Aia Aj • If then Ai Aj • If I is an accepting state, Ai  : A3   • If I is a starting state, Ai is the start symbol : A0 : A0 aA0, A0 aA1 : A0 bA0,A1 bA2 : A2 bA3 T={a,b}, NT={A0, A1, A2, A3}, S = A0 PR ={ A0 aA0 | aA1 | bA0 ; A1  bA2 ; A2  3A3 ; A3   }

How Does This CFG Derive Strings ? a start a b b 0 2 1 3 b vs. A0 aA0, A0 aA1 A0 bA0, A1 bA2 A2 bA3, A3  How is abaabb derived in each ?

Regular Expressions vs. CFGs • Regular expressions for lexical syntax • CFGs are overkill, lexical rules are quite simple and straightforward • REs – concise / easy to understand • More efficient lexical analyzer can be constructed • RE for lexical analysis and CFGs for parsing promotes modularity, low coupling & high cohesion. Why? CFGs : Match tokens “(“ “)”, begin / end, if-then-else, whiles, proc/func calls, … Intended for structural associations between tokens ! Are tokens in correct order ?

Resolving Grammar Difficulties : Motivation • ambiguity • -moves • cycles • left recursion • left factoring • Humans write / develop grammars • Different parsing approaches have different needs LL(k) Recursive LR(k) LALR(k) Top-Down vs. Bottom-Up • For: 1  remove “errors” • For: 2  put / redesign grammar Grammar Problems

Resolving Problems: Ambiguous Grammars Consider the following grammar segment: stmt if exprthen stmt | if exprthen stmtelse stmt | other (any other statement) What’s problem here ? Consider the Program: if e1 then if e2 then s1 else s2 Else must match to previous then. Structure indicates parse subtree for expression.

Resulting Parse Tree • Easy case • Else must match to previous then. if e1 then s1 else if e2 then s2 else s3

Example : What Happens with this string? If E1then if E2then S1else S2 How is this parsed ? if E1then if E2then S1 else S2 if E1then if E2then S1 else S2 vs. What’s the issue here ?

Parse Trees for Example if e1 then if e2 then s1 else s2 Form 1: Form 2: What’s the issue here ?

Removing Ambiguity Take Original Grammar: stmt if exprthen stmt | if exprthen stmtelse stmt | other (any other statement) Revise to remove ambiguity: stmt  matched_stmt | unmatched_stmt matched_stmt if exprthen matched_stmt else matched_stmt | other unmatched_stmt if exprthen stmt | if exprthen matched_stmt else unmatched_stmt How does this grammar work ?

Resolving Difficulties : Left Recursion A left recursive grammar has rules that support the derivation : A  A, for some . + Top-Down parsing can’t reconcile this type of grammar, since it could consistently make choice which wouldn’t allow termination. A  A  A  A … etc. A A |  Take left recursive grammar: A  A |  To the following: A’  A’ A’  A’ | 

Why is Left Recursion a Problem ? Derive : id + id + id E  E + T  Consider: E  E + T | T T  T * F | F F  ( E ) | id How can left recursion be removed ? E  E + T | T What does this generate? E  E + T  T + T E  E + T  E + T + T  T + T + T  How does this build strings ? What does each string have to start with ?

Resolving Difficulties : Left Recursion (2) For our example: E  E + T | T T  T * F | F F  ( E ) | id E  TE’ E’  + TE’ |  T  FT’ T’  * FT’ |  F  ( E ) | id Informal Discussion: Take all productions for A and order as: A  A1 | A2 | … | Am | 1 | 2 | … | n Where no i begins with A. Now apply concepts of previous slide: A  1A’ | 2A’ | … | nA’ A’  1A’ | 2A’| … | m A’ | 

Resolving Difficulties : Left Recursion (3) S  Aa | b A  Ac | Sd |  S  Aa  Sda Problem: If left recursion is two-or-more levels deep, this isn’t enough Algorithm: • Input: Grammar G with no cycles or -productions • Output: An equivalent grammar with no left recursion • Arrange the non-terminals in some order A1,A2,…An • for i := 1 to n do begin • for j := 1 to i – 1 do begin • replace each production of the form Ai  Aj • by the productions Ai  1 | 2 | … | k • where Aj  1|2|…|k are all current Aj productions; • end • eliminate the immediate left recursion among Ai productions • end

Using the Algorithm Apply the algorithm to: A1  A2a | b A2  A2c | A1d |  i = 1 For A1 there is no left recursion i = 2 for j=1 to 1 do Take productions: A2 A1 and replace with A2  1  | 2  | … | k | where A1 1 | 2 | … | k are A1 productions in our case A2  A1d becomes A2  A2ad | bd What’s left: A1 A2a | b A2  A2 c | A2ad | bd |  Are we done ?

Using the Algorithm (2) No ! We must still remove A2 left recursion ! A1 A2a | b A2  A2 c | A2ad | bd |  Recall: A  A1 | A2 | … | Am | 1 | 2 | … | n A  1A’ | 2A’ | … | nA’ A’  1A’ | 2A’| … | m A’ |  Apply to above case. What do you get ?

Chapter 4: Syntax Analysis Part 1: Grammar Concepts

Chapter 4: Syntax Analysis Part 1: Grammar Concepts

Presentation Transcript

Part 1: Basic Fashion and Business Concepts

Basic Pacing Concepts Part II

Chapter 2 Pharmacokinetics

CHAPTER 6

Techniques in grammar instruction

Basic Pacing Concepts Part III

SOC4044 Sociological Theory Review of Basic Sociological Concepts Part I

Abstract Syntax

Analysis of Diction and Syntax

Semantic Analysis

Motion Analysis Summer Course

Formal Semantics

Chapter 25: Advanced Transaction Processing

Chapter 1 Diodes

Chapter 15 – Analysis and Impact of Leverage

Data Mining: Concepts and Techniques Cluster Analysis Li Xiong

How to teach grammar?

Grammar Rule:

Chapter 4: Predictive Modeling

DNA sequencing “the technology lecture”