
Chapter 4: Syntax Analysis Part 1: Grammar Concepts

Chapter 4: Syntax Analysis Part 1: Grammar Concepts. Prof. Steven A. Demurjian Computer Science & Engineering Department The University of Connecticut 371 Fairfield Way, Unit 2155 Storrs, CT 06269-3155. steve@engr.uconn.edu http://www.engr.uconn.edu/~steve (860) 486 - 4818.


Presentation Transcript


  1. Chapter 4: Syntax Analysis Part 1: Grammar Concepts. Prof. Steven A. Demurjian, Computer Science & Engineering Department, The University of Connecticut, 371 Fairfield Way, Unit 2155, Storrs, CT 06269-3155. steve@engr.uconn.edu http://www.engr.uconn.edu/~steve (860) 486 - 4818. Material for course thanks to: Laurent Michel, Aggelos Kiayias, Robert LeBarre

  2. Syntax Analysis - Parsing • An overview of parsing : Functions & Responsibilities • Context Free Grammars: Concepts & Terminology • Writing and Designing Grammars • Resolving Grammar Problems / Difficulties • Top-Down Parsing • Recursive Descent & Predictive LL • Bottom-Up Parsing • LR & LALR • Key Issues in Error Handling • Concluding Remarks/Looking Ahead

  3. An Overview of Parsing • Why are grammars that formally describe languages important? • Precise, easy-to-understand representations • Compiler-writing tools can take a grammar and generate a compiler • Allow a language to be evolved (new statements, changes to statements, etc.): languages are not static, but are constantly upgraded to add new features or fix “old” ones • ADA → ADA9x; C++ adds templates, exceptions, … • How do grammars relate to the parsing process?

  4. Parsing During Compilation [Figure: source program → lexical analyzer (regular expressions) → token → parser → parse tree → rest of front end → intermediate representation; the parser asks the lexical analyzer to “get next token”; other labels: errors, symbol table] Parser: • uses a grammar to check structure of tokens • produces a parse tree • syntactic errors and recovery • recognize correct syntax • report errors. Rest of front end: • also technically part of parsing • includes augmenting info on tokens in source, type checking, semantic analysis

  5. Parsing Responsibilities • Syntax Error Identification / Handling • Recall typical error types: • Lexical : Misspellings • Syntactic : Omission, wrong order of tokens • Semantic : Incompatible types • Logical : Infinite loop / recursive call • Majority of error processing occurs during syntax analysis • NOTE: Not all errors are identifiable !! Which ones?

  6. Key issues in Error Handling • Detection • Actual detection • Finding the position at which errors occur • Reporting • Clear / accurate presentation • Recovery • How to skip over the error to continue and find later errors • Cannot impact compilation of correct programs

  7. What are some Typical Errors ? #include<stdio.h> int f1(int v) { int i,j=0; for (i=1;i<5;i++) { j=v+f2(i) } return j; } int f2(int u) { int j; j=u+f1(u*u) return j; } int main() { int i,j=0; for (i=1;i<10;i++) { j=j+i*I; printf("%d\n",i); printf("%d\n",f1(j)); return 0; } As reported by MS VC++: • 'f2' undefined; assuming extern returning int • syntax error : missing ';' before '}' • syntax error : missing ';' before 'return' • fatal error : unexpected end of file found. Which are “easy” to recover from? Which are “hard” ?

  8. Error Recovery Strategies • Panic Mode: discard tokens until a “synchro” token is found ( end, “;”, “}”, etc. ) -- Which tokens synchronize is a decision of the designer -- Problems: skipped input may miss a declaration, causing more errors; errors in the skipped material are missed -- Advantages: simple; suited to 1 error per statement • Phrase Level: local correction on input -- e.g. “,” → “;” : delete “,” and insert “;” -- Also a decision of the designer -- Not suited to all situations -- Used in conjunction with panic mode to allow less input to be skipped
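
Panic mode can be sketched in a few lines. The toy statement form `id = num ;`, the function name `parse_statements`, and the choice of `;` and `}` as synchronizing tokens are all assumptions made for illustration; they are not taken from the slides.

```python
# A minimal sketch of panic-mode recovery, assuming a toy language whose
# only statement is:  stmt -> id = num ;
SYNC_TOKENS = {";", "}"}  # synchronizing tokens: an assumed designer choice


def parse_statements(tokens):
    """Check a token list against repeated `id = num ;` statements.

    Returns a list of (position, message) errors. On an error, panic mode
    discards tokens up to and including the next synchronizing token.
    """
    expected = ["id", "=", "num", ";"]
    errors = []
    pos = 0
    while pos < len(tokens):
        for want in expected:
            if pos < len(tokens) and tokens[pos] == want:
                pos += 1
                continue
            found = tokens[pos] if pos < len(tokens) else "end of input"
            errors.append((pos, f"expected {want!r}, found {found!r}"))
            # Panic mode: skip ahead to a synchronizing token, then resume.
            while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
                pos += 1
            pos += 1  # consume the synchronizing token itself
            break
    return errors


if __name__ == "__main__":
    # The second statement is missing "="; parsing resumes after its ";".
    print(parse_statements("id = num ; id num ; id = num ;".split()))
```

If the missing token is the `;` itself, the whole next statement is discarded while searching for a synchronizing token, which is the "skip input" drawback listed on the slide.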

  9. Error Recovery Strategies – (2) • Error Productions: -- Augment the grammar with rules for common errors -- The augmented grammar is used for parser construction / generation -- Example: add a rule for := in C assignment statements; report the error but continue the compile -- Self correction + diagnostic messages -- What are other rules ? (supported by yacc) • Global Correction: -- Adding / deleting / replacing symbols is chancy: it may make many changes ! -- Algorithms are available to minimize changes, but they are costly; cost is the key issue. What do you think of each approach? What approach do compilers you’ve used take ?

  10. Motivating Grammars [Figure: Reg. Lang. nested inside CFLs] • Regular Expressions → Basis of lexical analysis → Represent regular languages • Context Free Grammars → Basis of parsing → Represent language constructs → Characterize context free languages. EXAMPLE: a^n b^n, n ≥ 1 : Is it regular ?

  11. Context Free Languages [Figure: Regular Languages nested inside Context Free Languages. Regular languages: token structure, scanner generation. Context free languages: sentence structure, parser generation.]

  12. Context Free Grammars : Concepts & Terminology Definition: A Context Free Grammar, CFG, is described by T, NT, S, PR, where: T: Terminals / tokens of the language. NT: Non-terminals, which denote sets of strings generated by the grammar and in the language. S: Start symbol, S ∈ NT, which defines all strings of the language. PR: Production rules, which indicate how T and NT are combined to generate valid strings of the language. PR: NT → (T | NT)*. Like a Regular Expression / DFA / NFA, a Context Free Grammar is a mathematical model !
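
The four-part definition maps directly onto a small data structure. The sketch below is illustrative only: the class name `CFG`, its field names, and the `is_well_formed` check are assumptions, and the sample grammar is the assignment-statement grammar used in this chapter.

```python
from dataclasses import dataclass


@dataclass
class CFG:
    """A context free grammar (T, NT, S, PR); all names here are hypothetical."""
    terminals: set        # T
    nonterminals: set     # NT
    start: str            # S, must be a member of NT
    productions: dict     # PR: non-terminal -> list of right-hand sides (tuples)

    def is_well_formed(self):
        """Check S in NT, T and NT disjoint, and PR: NT -> (T | NT)*."""
        if self.start not in self.nonterminals:
            return False
        if self.terminals & self.nonterminals:
            return False
        symbols = self.terminals | self.nonterminals
        for lhs, alternatives in self.productions.items():
            if lhs not in self.nonterminals:
                return False
            for rhs in alternatives:
                if any(sym not in symbols for sym in rhs):
                    return False
        return True


assign_grammar = CFG(
    terminals={"id", ":=", ";", "real", "integer", "+", "-"},
    nonterminals={"assign_stmt", "expr", "term", "operator"},
    start="assign_stmt",
    productions={
        "assign_stmt": [("id", ":=", "expr", ";")],
        "expr": [("expr", "operator", "term"), ("term",)],
        "term": [("id",), ("real",), ("integer",)],
        "operator": [("+",), ("-",)],
    },
)

if __name__ == "__main__":
    print(assign_grammar.is_well_formed())
```

An empty tuple as a right-hand side would stand for an ε-production under this representation.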

  13. Context Free Grammars : A First Look
      assign_stmt → id := expr ;
      expr → term operator term
      term → id
      term → real
      term → integer
      operator → +
      operator → -
      What do “BLUE” symbols represent? What do “BLACK” symbols represent? Derivation: A sequence of grammar rule applications and substitutions that transform a starting non-terminal into a collection of terminals / tokens. Simply stated: Grammars / production rules allow us to “rewrite” and “identify” correct syntax.

  14. How is Grammar Used ? Given the rules on the previous slide, suppose id := real + integer ; is input. Is it syntactically correct? How do we know? expr is represented as: expr → term operator term. Is this accurate / complete? Alternative: expr → expr operator term; expr → term. How does this affect the derivations which are possible?

  15. Derivation and Parse Tree Let’s derive: id := id + real – integer ; What’s its parse tree ?

  16. Derivation Example • Let’s derive: id := id + real – integer ; (rule applied at each step shown in parentheses)
      assign_stmt
      ⇒ id := expr ; (assign_stmt → id := expr ;)
      ⇒ id := expr operator term ; (expr → expr operator term)
      ⇒ id := expr operator term operator term ; (expr → expr operator term)
      ⇒ id := term operator term operator term ; (expr → term)
      ⇒ id := id operator term operator term ; (term → id)
      ⇒ id := id + term operator term ; (operator → +)
      ⇒ id := id + real operator term ; (term → real)
      ⇒ id := id + real - term ; (operator → -)
      ⇒ id := id + real - integer ; (term → integer)
      Grammar used: assign_stmt → id := expr ; expr → expr operator term; expr → term; term → id; term → real; term → integer; operator → +; operator → -
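
The derivation of id := id + real - integer ; can be replayed mechanically. This is a minimal sketch, assuming the grammar is stored as a dictionary and each step is given as the index of the alternative to apply to the leftmost non-terminal; `GRAMMAR` and `leftmost_derivation` are hypothetical names.

```python
# Left-recursive assignment grammar: non-terminal -> list of alternatives.
GRAMMAR = {
    "assign_stmt": [["id", ":=", "expr", ";"]],
    "expr": [["expr", "operator", "term"], ["term"]],
    "term": [["id"], ["real"], ["integer"]],
    "operator": [["+"], ["-"]],
}


def leftmost_derivation(start, choices):
    """Rewrite the leftmost non-terminal once per entry in `choices`.

    choices[i] is the index of the alternative used at step i.
    Returns every sentential form, beginning with [start].
    """
    form = [start]
    forms = [form]
    for choice in choices:
        # Position of the leftmost symbol that is a non-terminal.
        idx = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
        rhs = GRAMMAR[form[idx]][choice]
        form = form[:idx] + rhs + form[idx + 1:]
        forms.append(form)
    return forms


if __name__ == "__main__":
    steps = [0, 0, 0, 1, 0, 0, 1, 1, 2]
    for sentential_form in leftmost_derivation("assign_stmt", steps):
        print(" ".join(sentential_form))
```

Each printed line is one sentential form; the last one contains only terminals, so it is a sentence of the grammar.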

  17. Example Grammar
      expr → expr op expr
      expr → ( expr )
      expr → - expr
      expr → id
      op → +
      op → -
      op → *
      op → /
      op → ↑
      Black : NT. Blue : T. expr : S. 9 production rules. To simplify / standardize notation, we offer a synopsis of terminology.

  18. Example Grammar - Terminology Terminals: a, b, c, +, -, punc, 0, 1, …, 9, blue strings. Non Terminals: A, B, C, S, black strings. T or NT: X, Y, Z. Strings of Terminals: u, v, …, z in T*. Strings of T / NT: α, β, γ in (T ∪ NT)*. Alternatives of production rules: A → α1; A → α2; …; A → αk is written as A → α1 | α2 | … | αk. First NT on LHS of 1st production rule is designated as start symbol ! E → E A E | ( E ) | -E | id; A → + | - | * | / | ↑

  19. Grammar Concepts A step in a derivation is one action that replaces a NT with the RHS of a production rule. EXAMPLE: E ⇒ -E (the ⇒ means “derives” in one step) using the production rule: E → -E. EXAMPLE: E ⇒ E A E ⇒ E * E ⇒ E * ( E ). DEFINITION: ⇒ derives in one step; ⇒+ derives in ≥ one step; ⇒* derives in ≥ zero steps. EXAMPLES: αAβ ⇒ αγβ if A → γ is a production rule; α1 ⇒ α2 ⇒ … ⇒ αn gives α1 ⇒* αn; α ⇒* α for all α; if α ⇒* β and β ⇒ γ then α ⇒* γ

  20. How does this relate to Languages? Let G be a CFG with start symbol S. Then S ⇒+ W (where W has no non-terminals) represents the language generated by G, denoted L(G). So W ∈ L(G) ⇔ S ⇒+ W. W is a sentence of G. When S ⇒* α (and α may have NTs), α is called a sentential form of G. EXAMPLE: id * id is a sentence. Here’s the derivation: E ⇒ E A E ⇒ E * E ⇒ id * E ⇒ id * id, so E ⇒* id * id. Each intermediate string is a sentential form.

  21. Leftmost and Rightmost Derivations (⇒lm and ⇒rm) Leftmost: Replace the leftmost non-terminal symbol: E ⇒lm E A E ⇒lm id A E ⇒lm id * E ⇒lm id * id. Rightmost: Replace the rightmost non-terminal symbol: E ⇒rm E A E ⇒rm E A id ⇒rm E * id ⇒rm id * id. Important Notes: given A → γ, if αAβ ⇒lm αγβ, what’s true about α ? If αAβ ⇒rm αγβ, what’s true about β ? Derivations: Actions to parse input can be represented pictorially in a parse tree.

  22. Examples of LM / RM Derivations E → E A E | ( E ) | -E | id; A → + | - | * | / | ↑. A leftmost derivation of : id + id * id. A rightmost derivation of : id + id * id

  23. Derivations & Parse Tree [Figure: the parse tree grown one step at a time alongside the derivation] E ⇒ E A E ⇒ E * E ⇒ id * E ⇒ id * id

  24. Parse Trees and Derivations Consider the expression grammar: E → E+E | E*E | (E) | -E | id. Leftmost derivation of id + id * id: E ⇒ E + E ⇒ id + E ⇒ id + E * E [Figure: the parse tree after each of these steps]

  25. Parse Tree & Derivations - continued id + E * E ⇒ id + id * E ⇒ id + id * id [Figure: the parse tree after each of these steps, with + at the root and the * subtree on the right]

  26. Alternative Parse Tree & Derivation [Figure: parse tree with * at the root and the + subtree on the left] E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id. WHAT’S THE ISSUE HERE ?
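
One way to see the issue is to count parse trees. The sketch below assumes the expression grammar E → E+E | E*E | (E) | -E | id and counts, by memoized recursion over token spans, how many distinct parse trees derive a token list; the function name `count_parse_trees` is hypothetical. A count above 1 means the grammar is ambiguous for that input.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E+E | E*E | (E) | -E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of distinct parse trees deriving tokens[i:j] from E."""
        if j - i < 1:
            return 0
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        # E -> ( E )
        if tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)
        # E -> - E
        if tokens[i] == "-":
            total += count(i + 1, j)
        # E -> E + E | E * E : try every operator position as the root.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


if __name__ == "__main__":
    print(count_parse_trees("id + id * id".split()))
    print(count_parse_trees("( id + id ) * id".split()))
```

The two trees found for id + id * id correspond to the two derivations on these slides: one with + at the root and one with * at the root. Parenthesizing the input leaves only one tree.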

  27. Example Pascal Grammar in Yacc
      %term T_AND T_ARRAY T_BEGIN T_CASE T_CONST T_DIV T_DO T_RANGE T_TO T_ELSE T_END T_FILE T_FOR T_FUNCTION /* T_GOTO */ T_IDENTIFIER T_IF T_IN T_INTEGER T_LABEL T_MOD T_NOT T_REAL T_OF T_OR /* T_PACKED */ T_NIL T_PROCEDURE T_PROGRAM T_RECORD T_REPEAT T_SET T_STRING T_THEN T_DOWNTO T_TYPE T_UNTIL T_VAR T_WHILE /* T_WITH */ T_COMMA T_PERIOD T_ASSIGN T_COLON T_LPAREN T_RPAREN T_LBRACK T_RBRACK T_UPARROW T_SEMI
      %binary T_LT T_EQ T_GT T_GE T_LE T_NE T_IN
      %left T_PLUS T_MINUS T_OR
      %left UNARYSIGN
      %left T_MULT T_RDIV T_DIV T_MOD T_AND
      %left T_NOT
      %%

  28. Example Pascal Grammar in Yacc
      Start : Program_Header Declarations Block T_PERIOD
      Program_Header : T_PROGRAM T_IDENTIFIER T_LPAREN Id_List T_RPAREN T_SEMI
      Block : T_BEGIN Stt_List T_END
      Declarations : Const_Def_Part Type_Def_Part Var_Decl_Part Routine_Decl_Part
      Const_Def_Part : T_CONST Const_Def_List | /* epsilon */ ;
      Const_Def_List : Const_Def | Const_Def_List Const_Def ;
      Const_Def : T_IDENTIFIER T_EQ Constant T_SEMI
      Type_Def_Part : T_TYPE Type_Def_List | /* epsilon */
      Type_Def_List : Type_Def | Type_Def_List Type_Def

  29. Example Pascal Grammar in Yacc
      Type_Def : T_IDENTIFIER T_EQ Type T_SEMI
      Var_Decl_Part : T_VAR Var_Decl_List | /* epsilon */
      Var_Decl_List : Var_Decl_List Var_Decl | Var_Decl
      Var_Decl : Id_List T_COLON Type T_SEMI
      Routine_Decl_Part : Routine_Decl_Part Routine_Decl | /* epsilon */
      Routine_Decl : Routine_Head T_IDENTIFIER T_SEMI | Routine_Head Declarations Block T_SEMI
      Routine_Head : T_PROCEDURE T_IDENTIFIER Params T_SEMI | T_FUNCTION T_IDENTIFIER Params Function_Type T_SEMI
      Params : T_LPAREN Params_List T_RPAREN | /* epsilon */
      Param_Group : Id_List T_COLON Type | T_VAR Id_List T_COLON Type

  30. Example Pascal Grammar in Yacc
      Function_Type : T_COLON Type | /* epsilon */
      Params_List : Param_Group | Params_List T_SEMI Param_Group
      Constant : T_STRING | Number | T_PLUS Number | T_MINUS Number
      Number : T_IDENTIFIER | T_INTEGER | T_REAL
      Const_List : Constant | Const_List T_COMMA Constant
      Type : Simple_Type | T_UPARROW T_IDENTIFIER | Structured_Type
      Simple_Type : T_IDENTIFIER | T_LPAREN Id_List T_RPAREN | Constant T_RANGE Constant
      Structured_Type : T_ARRAY T_LBRACK Simple_Type_List T_RBRACK T_OF Type | T_SET T_OF Simple_Type | T_RECORD Field_List T_END
      Simple_Type_List : Simple_Type | Simple_Type_List T_COMMA Simple_Type

  31. Example Pascal Grammar in Yacc
      Field_List : Fixed_Part /* Variant_part */
      Fixed_Part : Field | Fixed_Part T_SEMI Field
      Field : /* epsilon */ | Id_List T_COLON Type
      Variant_part : /* epsilon */ | T_CASE T_IDENTIFIER T_OF Variant_List | T_CASE T_IDENTIFIER T_COLON T_IDENTIFIER T_OF Variant_List
      Variant_List : Variant | Variant_List T_SEMI Variant
      Variant : /* epsilon */ | Const_List T_COLON T_LPAREN Field_List T_RPAREN
      Stt_List : Statement | Stt_List T_SEMI Statement
      Case_Stt_List : Case_Statement | Case_Stt_List T_SEMI Case_Statement
      Case_Statement : Const_List T_COLON Statement | /* epsilon */

  32. Example Pascal Grammar in Yacc
      Statement : /* epsilon */
          | T_IDENTIFIER
          | T_IDENTIFIER T_LPAREN Expr_List T_RPAREN
          | Variable T_ASSIGN Expression
          | T_BEGIN Stt_List T_END
          | T_CASE Expression T_OF Case_Stt_List T_END
          | T_WHILE Expression T_DO Statement
          | T_REPEAT Stt_List T_UNTIL Expression
          | T_FOR Variable T_ASSIGN Expression T_TO Expression T_DO Statement
          | T_FOR Variable T_ASSIGN Expression T_DOWNTO Expression T_DO Statement
          | T_IF Expression T_THEN Statement
          | T_IF Expression T_THEN Statement T_ELSE Statement
      Expression : Expression relop Expression %prec T_LT
          | T_PLUS Expression %prec UNARYSIGN
          | T_MINUS Expression %prec UNARYSIGN
          | Expression addop Expression %prec T_PLUS
          | Expression divop Expression %prec T_MULT
          | T_NIL
          | T_STRING
          | T_INTEGER
          | T_REAL
          | Variable
          | T_IDENTIFIER T_LPAREN Expr_List T_RPAREN
          | T_LPAREN Expression T_RPAREN
          | negop Expression %prec T_NOT
          | T_LBRACK Element_List T_RBRACK
          | T_LBRACK T_RBRACK

  33. Example Pascal Grammar in Yacc
      Element_List : Element | Element_List T_COMMA Element
      Element : Expression | Expression T_RANGE Expression
      Variable : T_IDENTIFIER | Variable T_LBRACK Expr_List T_RBRACK | Variable T_PERIOD T_IDENTIFIER | Variable T_UPARROW
      Expr_List : Expression | Expr_List T_COMMA Expression
      relop : T_EQ | T_LT | T_GT | T_NE | T_LE | T_GE | T_IN
      addop : T_PLUS | T_MINUS | T_OR
      divop : T_MULT | T_RDIV | T_DIV | T_MOD | T_AND ;
      negop : T_NOT

  34. Resolving Grammar Problems/Difficulties [Figure: Reg. Lang. nested inside CFLs] Regular Expressions : Basis of Lexical Analysis. Reg. Expr. → generate/represent regular languages. Reg. Languages → smallest, most well defined class of languages. Context Free Grammars: Basis of Parsing. CFGs → represent context free languages. CFLs → contain more powerful languages. a^n b^n is a CFL that’s not regular! Try to write a DFA/NFA for it !

  35. Resolving Problems/Difficulties – (2) Since Reg. Lang. ⊂ Context Free Lang., it is possible to go from reg. expr. to CFGs via NFA. Recall: (a | b)*abb [Figure: NFA with start state 0 looping on a and b; 0 goes to 1 on a; 1 goes to 2 on b; 2 goes to accepting state 3 on b]

  36. Resolving Problems/Difficulties – (3) Construct CFG as follows: • Each state i has a non-terminal Ai : A0, A1, A2, A3 • If i goes to j on input a, then Ai → aAj • If i goes to j on ε, then Ai → Aj • If i is an accepting state, Ai → ε : A3 → ε • If i is a starting state, Ai is the start symbol : A0. Transitions on a: A0 → aA0, A0 → aA1. Transitions on b: A0 → bA0, A1 → bA2, A2 → bA3. T = {a, b}, NT = {A0, A1, A2, A3}, S = A0, PR = { A0 → aA0 | aA1 | bA0 ; A1 → bA2 ; A2 → bA3 ; A3 → ε }
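
The construction rules are mechanical enough to sketch in code. This sketch assumes the NFA is given as a list of `(state, symbol, next_state)` triples with `None` standing for ε; the function name `nfa_to_cfg` and that encoding are assumptions, and the transition list below is the NFA for (a | b)*abb.

```python
def nfa_to_cfg(transitions, start, accepting):
    """Build a CFG from an NFA using the four rules on the slide.

    transitions: list of (state, symbol, next_state); symbol None means epsilon.
    Returns (start_symbol, productions), where productions maps each
    non-terminal "Ai" to a list of right-hand sides (lists of symbols).
    An empty right-hand side stands for epsilon.
    """
    states = {start} | set(accepting)
    for i, _, j in transitions:
        states |= {i, j}
    # Rule 1: each state i has a non-terminal Ai.
    productions = {f"A{s}": [] for s in sorted(states)}
    for i, symbol, j in transitions:
        if symbol is None:
            productions[f"A{i}"].append([f"A{j}"])          # Ai -> Aj
        else:
            productions[f"A{i}"].append([symbol, f"A{j}"])  # Ai -> a Aj
    for s in sorted(accepting):
        productions[f"A{s}"].append([])                     # Ai -> epsilon
    return f"A{start}", productions                         # start state gives S


if __name__ == "__main__":
    nfa = [(0, "a", 0), (0, "a", 1), (0, "b", 0), (1, "b", 2), (2, "b", 3)]
    start_symbol, rules = nfa_to_cfg(nfa, start=0, accepting={3})
    print(start_symbol)
    for lhs, alternatives in rules.items():
        print(lhs, "->", " | ".join(" ".join(rhs) or "epsilon" for rhs in alternatives))
```

Every production produced this way has at most one non-terminal, and it sits at the right end, which is why the grammar mirrors the automaton move for move.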

  37. How Does This CFG Derive Strings ? [Figure: the NFA for (a | b)*abb with states 0, 1, 2, 3] vs. A0 → aA0, A0 → aA1, A0 → bA0, A1 → bA2, A2 → bA3, A3 → ε. How is abaabb derived in each ?

  38. Regular Expressions vs. CFGs • Regular expressions for lexical syntax • CFGs are overkill, lexical rules are quite simple and straightforward • REs – concise / easy to understand • More efficient lexical analyzer can be constructed • RE for lexical analysis and CFGs for parsing promotes modularity, low coupling & high cohesion. Why? CFGs : Match tokens “(“ “)”, begin / end, if-then-else, whiles, proc/func calls, … Intended for structural associations between tokens ! Are tokens in correct order ?

  39. Resolving Grammar Difficulties : Motivation Grammar Problems: • ambiguity • ε-moves • cycles • left recursion • left factoring. Why they arise: 1. Humans write / develop grammars 2. Different parsing approaches have different needs: Top-Down (LL(k), Recursive Descent) vs. Bottom-Up (LR(k), LALR(k)) • For 1 → remove “errors” • For 2 → put / redesign grammar

  40. Resolving Problems: Ambiguous Grammars Consider the following grammar segment: stmt → if expr then stmt | if expr then stmt else stmt | other (any other statement). What’s the problem here ? Consider the program: if e1 then if e2 then s1 else s2. Else must match to previous then. Structure indicates parse subtree for expression.

  41. Resulting Parse Tree • Easy case • Else must match to previous then. if e1 then s1 else if e2 then s2 else s3

  42. Example : What Happens with this string? if E1 then if E2 then S1 else S2. How is this parsed ? Two groupings of the same tokens: if E1 then [ if E2 then S1 else S2 ] vs. if E1 then [ if E2 then S1 ] else S2. What’s the issue here ?

  43. Parse Trees for Example if e1 then if e2 then s1 else s2 Form 1: Form 2: What’s the issue here ?

  44. Removing Ambiguity Take the original grammar: stmt → if expr then stmt | if expr then stmt else stmt | other (any other statement). Revise to remove ambiguity:
      stmt → matched_stmt | unmatched_stmt
      matched_stmt → if expr then matched_stmt else matched_stmt | other
      unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt
      How does this grammar work ?

  45. Resolving Difficulties : Left Recursion A left recursive grammar has rules that support the derivation A ⇒+ Aα, for some α. Top-down parsing can’t handle this type of grammar, since it could consistently make a choice that never allows termination: for A → Aα | β, A ⇒ Aα ⇒ Aαα ⇒ Aααα … etc. Take the left recursive grammar A → Aα | β and convert it to the following: A → βA’; A’ → αA’ | ε

  46. Why is Left Recursion a Problem ? Consider: E → E + T | T; T → T * F | F; F → ( E ) | id. Derive : id + id + id: E ⇒ E + T ⇒ … How can left recursion be removed ? E → E + T | T: what does this generate? E ⇒ E + T ⇒ T + T; E ⇒ E + T ⇒ E + T + T ⇒ T + T + T; … How does this build strings ? What does each string have to start with ?

  47. Resolving Difficulties : Left Recursion (2) For our example: E → E + T | T; T → T * F | F; F → ( E ) | id becomes E → TE’; E’ → + TE’ | ε; T → FT’; T’ → * FT’ | ε; F → ( E ) | id. Informal Discussion: Take all productions for A and order as: A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn, where no βi begins with A. Now apply concepts of previous slide: A → β1A’ | β2A’ | … | βnA’; A’ → α1A’ | α2A’ | … | αmA’ | ε
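
The transformation for immediate left recursion can be written as a short function. This is a sketch under assumptions: productions are lists of symbol lists, the new non-terminal is named by appending a prime, and `eliminate_immediate_left_recursion` is a hypothetical name. It also assumes there is no cycle of the form A → A.

```python
def eliminate_immediate_left_recursion(nt, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn as
    A -> b1 A' | ... | bn A'  and  A' -> a1 A' | ... | am A' | epsilon.

    alternatives: list of right-hand sides (lists of symbols) for `nt`.
    Returns a dict of the resulting productions; [] stands for epsilon.
    """
    alphas = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == nt]
    betas = [rhs for rhs in alternatives if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: alternatives}  # nothing to do
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],
    }


if __name__ == "__main__":
    print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
    print(eliminate_immediate_left_recursion("T", [["T", "*", "F"], ["F"]]))
```

Applied to E → E + T | T this yields E → T E' and E' → + T E' | ε, matching the rewritten grammar on the slide.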

  48. Resolving Difficulties : Left Recursion (3) S → Aa | b; A → Ac | Sd | ε. S ⇒ Aa ⇒ Sda. Problem: If left recursion is two-or-more levels deep, this isn’t enough.
      Algorithm:
      • Input: Grammar G with no cycles or ε-productions
      • Output: An equivalent grammar with no left recursion
      • Arrange the non-terminals in some order A1, A2, …, An
      • for i := 1 to n do begin
      •   for j := 1 to i – 1 do begin
      •     replace each production of the form Ai → Aj γ
      •     by the productions Ai → δ1 γ | δ2 γ | … | δk γ
      •     where Aj → δ1 | δ2 | … | δk are all current Aj productions;
      •   end
      •   eliminate the immediate left recursion among Ai productions
      • end
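
The algorithm translates almost line for line into code. The following is a minimal sketch, assuming the grammar is a dictionary from non-terminal to a list of right-hand sides (lists of symbols, with `[]` for ε) and that new non-terminals are named with a trailing prime; `eliminate_left_recursion` is a hypothetical name.

```python
def eliminate_left_recursion(grammar, order):
    """Remove direct and indirect left recursion.

    grammar: dict non-terminal -> list of right-hand sides (lists of symbols).
    order:   the chosen ordering A1, A2, ..., An of the non-terminals.
    Returns a new grammar dict; the input is not modified.
    """
    g = {nt: [list(rhs) for rhs in alts] for nt, alts in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for rhs in g[ai]:
                if rhs and rhs[0] == aj:
                    # Ai -> Aj gamma becomes Ai -> delta gamma for each Aj -> delta
                    expanded.extend(delta + rhs[1:] for delta in g[aj])
                else:
                    expanded.append(rhs)
            g[ai] = expanded
        # Eliminate the immediate left recursion among the Ai productions.
        alphas = [rhs[1:] for rhs in g[ai] if rhs and rhs[0] == ai]
        betas = [rhs for rhs in g[ai] if not rhs or rhs[0] != ai]
        if alphas:
            new_nt = ai + "'"
            g[ai] = [beta + [new_nt] for beta in betas]
            g[new_nt] = [alpha + [new_nt] for alpha in alphas] + [[]]
    return g


if __name__ == "__main__":
    example = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
    for lhs, alternatives in eliminate_left_recursion(example, ["S", "A"]).items():
        print(lhs, "->", " | ".join(" ".join(rhs) or "epsilon" for rhs in alternatives))
```

For the slide's grammar with order S, A, the sketch produces S → A a | b, A → b d A' | A', A' → c A' | a d A' | ε. The sample grammar contains an ε-production even though the stated input condition excludes them; the sketch does not check that condition.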

  49. Using the Algorithm Apply the algorithm to: A1 → A2a | b; A2 → A2c | A1d | ε. i = 1: For A1 there is no left recursion. i = 2: for j = 1 to 1 do: take productions A2 → A1γ and replace with A2 → δ1γ | δ2γ | … | δkγ, where A1 → δ1 | δ2 | … | δk are the A1 productions. In our case A2 → A1d becomes A2 → A2ad | bd. What’s left: A1 → A2a | b; A2 → A2c | A2ad | bd | ε. Are we done ?

  50. Using the Algorithm (2) No ! We must still remove A2 left recursion ! A1 → A2a | b; A2 → A2c | A2ad | bd | ε. Recall: A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn becomes A → β1A’ | β2A’ | … | βnA’; A’ → α1A’ | α2A’ | … | αmA’ | ε. Apply to above case. What do you get ?
