4. Phase 2 : Syntax Analysis Part II • The unit directory. • What you must do. • Example run. • syner.cxx • The lookahead convention. • Error detection and recovery. • Symbol table lookup. • Parsing declarations. • Parsing statements. • Parsing expressions.
The Unit Directory • The unit directory for phase 2 is : /usr/users/staff/aosc/cm049icp/phase2 • Among other things, it contains the following : • synprog.cxx : the test bed program for phase 2. • syner.template : A template file for your phase 2 program. • syner.h : header file for phase2. • AST : Abstract syntax tree data structure. • SymTab : Symbol table data structure. • syntax : Array of syntax error messages. • statics : Array of static semantic error messages. • type : Array of type error messages. • makefile : The makefile for phase 2. • syner : An executable for my phase 2 program.
The Unit Directory II
• tests/test*.c-- : Testing programs for the demo.
• printers.cxx : Printing subprograms.
    void printAST(AST *ast,     // AST
                  int &line,    // Line number
                  int indent)   // Indentation
    void printST(SymTab *st)    // Symbol Table
• utilities.cxx : General utilities
    void error(int number,          // Error no.
               LexToken lexToken)   // Token
    bool lookup(LexToken lexToken,  // Token
                SymTab *st,         // Sym. Tab.
                SymTab *&match)     // Entry
synprog.cxx
• The test bed program is as follows :
    #include ".../phase2/syner.h"

    void main()
    {
      SymTab *st = NULL ;
      AST *ast = NULL ;
      int line = 1 ;
      int indent = 0 ;
      int label = 0 ;

      synAnal(st, ast, label) ;   // You must write this subprogram
      printAST(ast, line, indent) ;
      printST(st) ;
    }
• printAST : Pretty prints the AST.
• printST : Pretty prints the symbol table.
• They ensure that your program’s output format is the same as mine.
What You Must Do • Your implementation of synAnal must be in a file called syner.cxx in your directory. • Take a copy of makefile and syner.template. • Print out a copy of syner.h. • Print out a copy of utilities.cxx. • Print out a copy of printers.cxx. • Useful commands : testphase2, demophase2. • Shell scripts for running the phase 2 demo. • They rely on your synAnal using the printing subprograms from printers.cxx. • You must #include printers.cxx and utilities.cxx in your syner.cxx file.
Example Run
• Assuming we have the following valid C-- program in a file called prog.c-- :
    int i = 10 ;
    {
      while (i > 0)
      {
        cout << i ;
      } ;
    }
• Make and run syner :
    jaguar> make syner
    jaguar> syner < prog.c--
    1 -- {
    2 -- while (i > 0) /* Start label : 0 */
    3 -- {
    4 -- cout << i ;
    5 -- } /* End label : 1 */ ;
    6 -- }
    Name  Type  Constant  Initial Value
    i     int   False     10
    jaguar>
• Output format may change. Make sure printers.cxx is #included in your syner.cxx.
Example Run II
• Assuming we have the following invalid C-- program in a file called prog.c-- :
    bool b = true ;
    {
      cin >> b ;
    }
• Run syner :
    jaguar> syner < prog.c--
    Type error 218. Attempt to input to a non-integer variable.
    IDENTIFIER : b
    jaguar>
Top Level Structure For syner.cxx
• Contents of syner.template :
    #include <iostream.h>
    #include <iomanip.h>
    #include <ctype.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include ".../lib/cstring.h"
    #include ".../phase2/syner.h"
    #include ".../phase1/lexer.h"
    #include ".../phase2/printers.cxx"
    #include ".../phase2/utilities.cxx"

    void synDec(SymTab *&st, LexToken &lexToken)
    { cout << "synDec\n" ; }

    void synDeclarations(SymTab *&st, LexToken &lexToken)
    { cout << "synDeclarations\n" ; }

Top Level Structure For syner.cxx II
    // Forward declaration.
    void synExpression(SymTab *st, Expression *&expr,
                       LexToken &lexToken, DataType &type) ;

    void synFactor(SymTab *st, Factor *&fact, LexToken &lexToken)
    { cout << "synFactor\n" ; }

    void synTerm(SymTab *st, Term *&term,
                 LexToken &lexToken, DataType &type)
    { cout << "synTerm\n" ; }

    void synBasicExp(SymTab *st, BasicExp *&bexp,
                     LexToken &lexToken, DataType &type)
    { cout << "synBasicExp\n" ; }

Top Level Structure For syner.cxx III
    void synExpression(SymTab *st, Expression *&expr,
                       LexToken &lexToken, DataType &type)
    { cout << "synExpression\n" ; }

    // Forward declaration.
    void synStatements(SymTab *st, AST *&ast,
                       int &label, LexToken &lexToken) ;

    void synIfSt(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    { cout << "synIfSt\n" ; }

    void synWhileSt(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    { cout << "synWhileSt\n" ; }

Top Level Structure For syner.cxx IV
    void synCinSt(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    { cout << "synCinSt\n" ; }

    void synCoutSt(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    { cout << "synCoutSt\n" ; }

    void synAssignSt(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    { cout << "synAssignSt\n" ; }

Top Level Structure For syner.cxx V
    void synStatement(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    { cout << "synStatement\n" ; }

    void synStatements(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    { cout << "synStatements\n" ; }
Top Level Structure For syner.cxx VI
    void synAnal(SymTab *&st, AST *&ast, int &label)
    {
      LexToken lexToken ;   // Current token

      skipWhiteComments() ;
      lexAnal(lexToken) ;
      st = NULL ;
      ast = NULL ;
      synDeclarations(st, lexToken) ;
      synStatements(st, ast, label, lexToken) ;
    }
• Note the strange way pointers are passed as reference parameters, e.g. SymTab *&st and AST *&ast.
• label holds the number of the next M68K label to be used. All labels are of the form L0, L1, etc.
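The `*&` parameters are references to pointers: the subprogram can change which node the caller's pointer refers to. The sketch below contrasts that with passing the pointer by value. `Node`, `pushByValue`, `pushByReference` and `visibleNodes` are hypothetical names used only for illustration; the real SymTab and AST types are declared in syner.h.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for SymTab; the real structure is declared in syner.h.
struct Node {
    int value;
    Node *next;
};

// Pointer passed by value: only the local copy 'list' is changed,
// so the caller never sees the new node.
void pushByValue(Node *list, int value) {
    Node *fresh = new Node;
    fresh->value = value;
    fresh->next = list;
    list = fresh;          // changes the local copy only
    delete fresh;          // nobody else can reach it, so free it here
}

// Pointer passed by reference (Node *&): the caller's pointer is updated.
void pushByReference(Node *&list, int value) {
    Node *fresh = new Node;
    fresh->value = value;
    fresh->next = list;
    list = fresh;
}

// Returns how many nodes the caller can see after one push of each kind.
int visibleNodes() {
    Node *st = NULL;
    pushByValue(st, 1);
    assert(st == NULL);        // the by-value push had no effect on st
    pushByReference(st, 2);
    int count = 0;
    while (st != NULL) {       // count and free the list
        Node *dead = st;
        st = st->next;
        delete dead;
        count++;
    }
    return count;
}
```

This is why synDeclarations and synDec take `SymTab *&st` (they add entries), while the statement parsers take a plain `SymTab *st` (they only read the table).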
Lookahead Conventions • C-- has an LL(1) grammar. • Sometimes we must read one token beyond the end of a syntactic construct. • e.g. when parsing an if statement, to determine whether or not there’s an else part. • Conventions : • 1. All the syntax analysis subprograms assume that they will be passed their first token as a parameter. • 2. All the syntax analysis subprograms will read one token beyond the end of the syntactic construct they are attempting to parse and will pass that token back to their caller. • Exception : synStatements does not follow convention 2. • Otherwise it would read past the end of the input file. • Luckily, synStatements does not need to look ahead.
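The two conventions can be tried out on a toy grammar. The sketch below is not C-- and does not use lexAnal: `Tokens` and `parseIf` are hypothetical, and tokens are plain strings. It parses `if NAME [ else NAME ]` and always leaves the first token after the construct in `tok`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token source: hands out strings from a fixed list,
// then "EOF" for ever.
struct Tokens {
    std::vector<std::string> items;
    std::size_t pos;
    explicit Tokens(const std::vector<std::string> &v) : items(v), pos(0) {}
    std::string next() {
        return pos < items.size() ? items[pos++] : std::string("EOF");
    }
};

// Parses:  if NAME [ else NAME ]
// Convention 1: on entry 'tok' already holds the first token ("if").
// Convention 2: on exit 'tok' holds the first token after the construct.
// Returns true when an else part was present.
bool parseIf(Tokens &in, std::string &tok) {
    assert(tok == "if");
    tok = in.next();            // the then-part name
    tok = in.next();            // lookahead: is it "else"?
    if (tok != "else")
        return false;           // tok already belongs to the caller
    tok = in.next();            // the else-part name
    tok = in.next();            // read one token beyond the construct
    return true;
}
```

Whether or not the else part is present, the caller finds the next unread token in `tok`, so it never needs to know how far the callee had to look.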
Error Detection & Recovery • Detecting errors is one thing that we’ll award marks for. • RTFC for syner.cxx to see what errors you have to detect. • Syntax errors. 27 = A ; • Type errors. bool b ; { cin >> b ; } • Static semantic errors. string s ; • As usual, upon detection of an error simply call error with an error number and the offending lexical token. error(101, lexToken) ; • error prints an error message and calls exit.
Symbol Table Lookup • One of the most common static semantic errors is the use of an undeclared variable. • The following subprogram is in utilities.cxx : bool lookup(LexToken lexToken, SymTab *st, SymTab *&match) • lookup inspects the symbol table to see if the identifier held in lexToken has already been declared. • If not already declared, it returns false and sets match to NULL. • If already declared, it returns true and sets match to a copy of the Entry for the identifier.
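To make the contract concrete, here is a minimal sketch of a lookup over a linked list. It is not the code in utilities.cxx: `Entry` and `lookupSketch` are hypothetical, the identifier is passed as a string rather than a LexToken, and `match` is pointed at the entry itself rather than at a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, cut-down symbol table entry; the real SymTab is in syner.h.
struct Entry {
    std::string ident;
    Entry *next;
};

// Returns true and points 'match' at the entry when 'name' is declared;
// returns false and sets 'match' to NULL otherwise.
bool lookupSketch(const std::string &name, Entry *st, Entry *&match) {
    for (Entry *e = st; e != NULL; e = e->next) {
        if (e->ident == name) {
            match = e;
            return true;
        }
    }
    match = NULL;
    return false;
}
```

A caller checks the result first and only then reads through `match`, in the same way synCinSt calls lookup before inspecting the variable's type.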
Symbol Table Lookup II • lookup is also useful for checking that identifiers have the correct type. • Note that lookup is a value returning subprogram which may also assign a value to a reference parameter. • i.e. it is a side effecting ‘function’. • This is disgusting programming practice. • It’s also standard programming practice in this instance. • Don’t do this kind of thing yourself unless you really want a zero mark.
Code For synDeclarations
    void synDeclarations(SymTab *&st, LexToken &lexToken)
    {
      while (lexToken.tag != LBRACE)
      {
        synDec(st, lexToken) ;
        if (lexToken.tag != TERMINATOR)
          error(3, lexToken) ;
        else
          lexAnal(lexToken) ;
      }
    }
• error is in utilities.cxx.
  • Prints out the error message corresponding to the number.
    • 0..99 : Syntax errors.
    • 100..199 : Static semantic errors.
    • 200..299 : Type errors.
  • Prints out the offending token (using writeToken).
  • Calls exit to terminate the program.
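The numbering scheme can be written down as a small classifier. This is only an illustration of the ranges above: `errorClass` is a hypothetical helper, not part of utilities.cxx, and the returned strings are not the real messages.

```cpp
#include <cassert>
#include <string>

// Maps an error number to its class using the ranges listed above.
// The returned strings are illustrative, not the real error messages.
std::string errorClass(int number) {
    if (number >= 0 && number <= 99)
        return "Syntax error";
    if (number >= 100 && number <= 199)
        return "Static semantic error";
    if (number >= 200 && number <= 299)
        return "Type error";
    return "Unknown error";
}
```

For example, the `error(3, lexToken)` call in synDeclarations falls in the syntax range, which matches its job of reporting a missing terminator.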
Top Level Code For synDec
    void synDec(SymTab *&st, LexToken &lexToken)
    {
      SymTab *newEntry ;    // For this declaration
      SymTab *dummy ;       // For lookup
      LexToken idToken ;    // Var/const name

      // Create new SymTab entry and initialise it (set type to VOIDDATA).
      if (lexToken.tag == CONST)
      {
        newEntry->constFlag = true ;
        // Lex another token into lexToken.
      }
      // lexToken is the type. Set type field.
      // If not valid type raise an error.
      // Lex the identifier token.
      // Copy it to idToken for better error reporting.
      if (lexToken.tag == IDENT)
        newEntry->ident = lexToken.ident ;
      else
        error(1, idToken) ;
      if (lookup(lexToken, st, dummy))
        error(101, idToken) ;

Top Level Code For synDec II
      // Lex next token.
      if (lexToken.tag == ASSIGN)
      {
        newEntry->initialise = new Factor ;
        // Lex next token.
        if (lexToken.tag == BOOLLIT)
        {
          if (newEntry->type != BOOLDATA)
            error(217, idToken) ;
          newEntry->initialise->literal = true ;
          newEntry->initialise->type = BOOLDATA ;
          newEntry->initialise->litBool = lexToken.boolLit ;
        }
        else if (lexToken.tag == STRINGLIT)
        {
          // As above but for a string.
        }
        else if (lexToken.tag == INTLIT)
        {
          // As above but for an int.
        }
        else
          // Not a literal so call error.
        // Lex next token.
      }

Top Level Code For synDec III
      if ((newEntry->constFlag) && (newEntry->initialise == NULL))
        error(103, idToken) ;
      if ((newEntry->type == STRINGDATA) && (!newEntry->constFlag))
        error(104, idToken) ;
      // Add newEntry to the front of st.
    } // synDec
• Note that this builds the symbol table in the reverse order of the declarations.
• This is no problem.
• When parsing some languages (e.g. C, C++) it’s actually an advantage.
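Adding each new entry to the front of the list is what produces the reverse order. A minimal sketch, assuming a hypothetical `Entry` type with only an identifier field (the real SymTab has more fields):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, cut-down symbol table entry.
struct Entry {
    std::string ident;
    Entry *next;
};

// Adds a new entry to the front of the table.
void addFront(Entry *&st, const std::string &ident) {
    Entry *e = new Entry;
    e->ident = ident;
    e->next = st;
    st = e;
}

// Joins the identifiers in table order, separated by spaces.
std::string names(const Entry *st) {
    std::string out;
    for (; st != NULL; st = st->next) {
        if (!out.empty())
            out += " ";
        out += st->ident;
    }
    return out;
}

// Frees every entry and leaves the table empty.
void freeTable(Entry *&st) {
    while (st != NULL) {
        Entry *dead = st;
        st = st->next;
        delete dead;
    }
}
```

Declaring i, j, k in that order gives the table order k, j, i. A front-to-back search therefore meets the most recent declaration first, which is presumably the advantage for languages with nested scopes.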
Top Level Code For synStatements
    void synStatements(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    {
      AST *newast = NULL ;   // Next statement
      AST *temp1 = NULL ;    // For reversing
      AST *temp2 = NULL ;    // the statement
      AST *temp3 = NULL ;    // list

      if (lexToken.tag != LBRACE)
        error(4, lexToken) ;
      // Lex the next token.
      while (lexToken.tag != RBRACE)
      {
        newast = new AST ;
        synStatement(st, newast, label, lexToken) ;
        if (lexToken.tag != TERMINATOR)
          error(8, lexToken) ;

Top Level Code For synStatements II
        newast->next = ast ;
        ast = newast ;
        // Lex the next token.
      }
      temp1 = ast ;
      while (temp1 != NULL)
      {
        temp3 = temp1->next ;
        temp1->next = temp2 ;
        temp2 = temp1 ;
        temp1 = temp3 ;
      }
      ast = temp2 ;
    } // synStatements
• The statement list is built in reverse order, so it must be reversed again after it has been parsed.
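The reversal loop at the end of synStatements is the standard in-place reversal of a singly linked list. Here it is on its own, with a hypothetical `Stmt` node in place of AST and more descriptive names than temp1..temp3:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an AST statement node.
struct Stmt {
    int id;
    Stmt *next;
};

// Reverses a singly linked list in place and returns the new head.
Stmt *reverseList(Stmt *head) {
    Stmt *done = NULL;              // already reversed part (temp2 above)
    while (head != NULL) {
        Stmt *rest = head->next;    // remember the unreversed tail (temp3)
        head->next = done;          // hook the current node onto the front
        done = head;
        head = rest;
    }
    return done;
}
```

Each node is visited once and no nodes are allocated, so the reversal costs one extra pass over the statement list.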
synStatement • synStatement will call the following subprograms : • synIfSt, synWhileSt, synCinSt, synCoutSt, synAssignSt. • synStatement decides which to call via an if statement which examines lexToken.tag. • IF token : call synIfSt. WHILE token : call synWhileSt. CIN token : call synCinSt. COUT token : call synCoutSt. IDENT token : call synAssignSt. None of the above : call error. • synIfSt and synWhileSt will call synStatements to syntax analyse their compound statements. • Recursive Descent parsing : a set of mutually recursive subprograms, one per production rule in the EBNF syntax.
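The dispatch can be sketched as a chain of if statements on the token tag. To keep the example self-contained it returns the name of the subprogram that would be called instead of calling it. The `Tag` enum here is a hypothetical subset; the real tags come from lexer.h.

```cpp
#include <cassert>
#include <string>

// Hypothetical subset of the token tags; the real ones are in lexer.h.
enum Tag { IF, WHILE, CIN, COUT, IDENT, RBRACE };

// Returns the name of the subprogram synStatement would call for 'tag',
// or "error" when no statement can start with that token.
std::string dispatch(Tag tag) {
    if (tag == IF)
        return "synIfSt";
    else if (tag == WHILE)
        return "synWhileSt";
    else if (tag == CIN)
        return "synCinSt";
    else if (tag == COUT)
        return "synCoutSt";
    else if (tag == IDENT)
        return "synAssignSt";
    else
        return "error";
}
```

Because every kind of statement starts with a different token, one token of lookahead is enough to choose the branch.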
Parsing Input Statements
• By far the easiest of the statement parsing subprograms are synCinSt and synCoutSt :
    void synCinSt(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    {
      ast->tag = CINST ;
      ast->cinst = new CinSt ;
      lexAnal(lexToken) ;
      if (lexToken.tag != INOP)
        error(9, lexToken) ;
      lexAnal(lexToken) ;
      if (lexToken.tag != IDENT)
        error(11, lexToken) ;

Parsing Input Statements II
      if (!lookup(lexToken, st, ast->cinst->invar))
        error(102, lexToken) ;
      if (ast->cinst->invar->type != INTDATA)
        error(216, lexToken) ;
      if (ast->cinst->invar->constFlag)
        error(105, lexToken) ;
      lexAnal(lexToken) ;
    }
• Most of the code is to check for syntax, static semantic and type errors.
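The three checks on the input variable run in a fixed order: declared, then integer, then not constant. The sketch below isolates that order. `Var` and `checkCinTarget` are hypothetical, and the error numbers are copied from the code above, so they may not match the current error tables.

```cpp
#include <cassert>
#include <cstddef>

// Data types as named in the slides.
enum DataType { VOIDDATA, INTDATA, BOOLDATA, STRINGDATA };

// Hypothetical, cut-down view of a symbol table entry.
struct Var {
    DataType type;
    bool constFlag;
};

// Returns 0 when 'v' is a valid target for cin, otherwise the error
// number used in synCinSt above. NULL stands for a failed lookup.
int checkCinTarget(const Var *v) {
    if (v == NULL)
        return 102;              // lookup failed: undeclared identifier
    if (v->type != INTDATA)
        return 216;              // not an integer variable
    if (v->constFlag)
        return 105;              // constants cannot be read into
    return 0;
}
```

The order matters: the type and constFlag fields can only be read once the lookup has succeeded.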
Parsing if-else Statements
• This is slightly more complex :
  • Need to use lookahead.
  • Must handle the label field.
  • Must parse the conditional expression.
  • Must parse the enclosed statements.
    void synIfSt(SymTab *st, AST *&ast, int &label, LexToken &lexToken)
    {
      DataType type = VOIDDATA ;
      Expression *expr = NULL ;
      AST *thenpart = NULL ;
      AST *elsepart = NULL ;

Parsing if-else Statements II
      ast->tag = IFST ;
      ast->ifst = new IfSt ;
      ast->ifst->elselabel = label++ ;
      ast->ifst->endlabel = label++ ;
      lexAnal(lexToken) ;
      if (lexToken.tag != LPAREN)
        error(6, lexToken) ;
      lexAnal(lexToken) ;
      synExpression(st, expr, lexToken, type) ;
      if (type != BOOLDATA)
        error(202, lexToken) ;
      if (lexToken.tag != RPAREN)
        error(7, lexToken) ;

Parsing if-else Statements III
      lexAnal(lexToken) ;
      synStatements(st, thenpart, label, lexToken) ;
      lexAnal(lexToken) ;
      if (lexToken.tag == ELSE)
      {
        lexAnal(lexToken) ;
        synStatements(st, elsepart, label, lexToken) ;
        lexAnal(lexToken) ;
      }
      ast->ifst->condition = expr ;
      ast->ifst->thenstats = thenpart ;
      ast->ifst->elsestats = elsepart ;
    } // synIfSt
• synStatements doesn’t lex ahead, so you must lex another token after each call to it.
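Because label is passed by reference, every if statement reserves its own pair of label numbers, even when statements are nested. A minimal sketch of the numbering, where `IfLabels`, `reserveIfLabels` and `labelName` are hypothetical helpers:

```cpp
#include <cassert>
#include <string>

// The two labels an if statement needs, as in the IfSt fields above.
struct IfLabels {
    int elselabel;
    int endlabel;
};

// Reserves two consecutive labels, as synIfSt does, and advances 'label'.
IfLabels reserveIfLabels(int &label) {
    IfLabels l;
    l.elselabel = label++;
    l.endlabel = label++;
    return l;
}

// Formats a label number in the L0, L1, ... form.
std::string labelName(int n) {
    return "L" + std::to_string(n);
}
```

Starting from label 0, an outer if gets L0 and L1 and an if nested inside it gets L2 and L3, because the labels are reserved before the enclosed statements are parsed.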
Expressions
• synIfSt must call synExpression to parse the conditional expression.
• So must synWhileSt and synAssignSt.
• The next lecture covers how to write synExpression.
• For now, assume that all expressions are simply literal constants.
    if (true) { ... } ;
    while (false) { ... } ;
    x = 42 ;
• Code to parse literals is on slide 20.
• Handling expressions can be a bit fiddly.
• Not difficult, just fiddly.
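For the literal-only stage, the part of synExpression that matters to its callers is the type it reports. The sketch below shows only that mapping. `literalType` is a hypothetical helper and `Tag` is a cut-down enum; the tag and type names follow the slides.

```cpp
#include <cassert>

// Hypothetical subset of the token tags; the real ones are in lexer.h.
enum Tag { BOOLLIT, INTLIT, STRINGLIT, IDENT };

// Data types as named in the slides.
enum DataType { VOIDDATA, INTDATA, BOOLDATA, STRINGDATA };

// Returns the data type of a literal token, or VOIDDATA when the
// token is not a literal at all.
DataType literalType(Tag tag) {
    if (tag == BOOLLIT)
        return BOOLDATA;
    else if (tag == INTLIT)
        return INTDATA;
    else if (tag == STRINGLIT)
        return STRINGDATA;
    else
        return VOIDDATA;
}
```

With this in place `if (true)` passes the BOOLDATA check in synIfSt, while `if (42)` is rejected as a type error.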
Summary • Copy syner.template, makefile and syner (renamed dhsyner) into your directory. • Print out syner.h, utilities.cxx and printers.cxx. • Rename syner.template to syner.cxx. • Complete the stubs in syner.cxx in the following order : • synDeclarations, synDec, synStatements, synStatement, synCinSt, synCoutSt, synAssignSt, synWhileSt, synIfSt. • For now, assume all expressions are simply literal constants. • synExpression uses code from slide 20.