2. Introduction To Compilers And Phase 1 • Inside a compiler. • Inside a C-- compiler. • The compilation process. • Example C-- code. • Extended Backus-Naur form. • Lexical analysis. • The syntax of C--. • The unit directory. • Phase 1.
Inside A Compiler
[Diagram: p.cxx → C++ compiler → a.out. Inside the compiler: lexical analyser → tokens → syntax analyser → abstract syntax tree and symbol table → optimiser → code generator. A file p.lst is shown next to the lexical analyser.]
Inside A C-- Compiler
[Diagram: p.c-- → C-- compiler → a.s. Inside the compiler: lexical analyser → tokens → syntax analyser → abstract syntax tree and symbol table → code generator.]
The Compilation Process • There are three main stages to compilation : • Lexical analysis. • Syntax analysis. • Code generation. • Lexical analysis. • Recognising the individual components of the language. • Literals, identifiers, operators etc. • Throwing away irrelevant things like comments and whitespace. • Often called tokenising.
The Compilation Process II • Syntax analysis. • Recognising declarations, statements etc. • Detecting syntactic and static semantic errors. • Building the symbol table and abstract syntax tree (AST). • Code generation. • Generating machine code from the symbol table and the AST. • Most modern compilers also perform optimisation on the code after both syntax analysis (macro optimisation) and code generation (micro optimisation). • There are three phases to writing a compiler. • Phase 1 : write a lexical analyser. • Phase 2 : write a syntax analyser. • Phase 3 : write a code generator.
Example C-- Code
• The factorial program :
    // Computes the factorial of a value read
    // from input.
    int a = 1 ; // Result
    int b = 0 ; // Data
    {
      cin >> b ; // Read data
      // Loop to compute factorial.
      while (b > 0) // Check for termination
      {
        a = a * b ; // Compute new a value
        b = b - 1 ; // Decrement b
      }
      cout << a ; // Output result
    } // End of program
• C-- is Turing Machine Equivalent.
Extended Backus-Naur Form (EBNF)
• A way of formally defining syntax.
• Production rules : non-terminal ::= syntactic_term
• non-terminal : syntactic category.
• terminal : A piece of program text (a.k.a. lexical token).
• Five types of syntactic expression :
    X Y   -- Sequence.
    X | Y -- Alternation.
    [ X ] -- Optional (0 or 1 occurrences).
    { X } -- Repetition (0 or more occurrences).
    ( X ) -- Bracketing.
• Terminals must be distinguished from non-terminals, e.g. by quoting them (‘if’).
EBNF Example
• A simple language :
    sentence    ::= noun_clause verb ( noun_clause | adverb ) ‘.’
    noun_clause ::= article [ adj_list ] noun
    adj_list    ::= adj { ‘,’ adj }
    article     ::= ‘a’ | ‘the’
    noun        ::= ‘cat’ | ‘mouse’
    adj         ::= ‘black’ | ‘white’ | ‘thin’ | ‘fat’
    adverb      ::= ‘quickly’ | ‘slowly’
    verb        ::= ‘eats’ | ‘runs’
EBNF Example II • Any piece of text that can be produced by following the syntactic rules is valid. • Any piece of text that cannot be produced by following the syntactic rules is invalid - it contains a syntax error. • Possible sentences : a black, fat cat eats a white mouse. the mouse runs quickly. • Possible syntax errors : a black thin cat eats the black cat. (no ‘,’ between the adjectives) the quickly, slowly fred. (adverbs where an adjective or noun is required, and ‘fred’ is not a terminal of the language) • Deciding whether a sentence is valid is called parsing. • Two phases to parsing : • Lexical analysis (phase 1). • Syntax analysis (phase 2).
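The validity check described above can be sketched as a small recogniser with one function per production rule. This is only an illustration of the idea, not part of the unit's code: the names splitWords, SentenceParser and validSentence are made up for this example, and it uses the standard <string> and <vector> headers rather than the unit's string library.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: split text into words, treating ',' and '.'
// as separate one-character terminals.
std::vector<std::string> splitWords(const std::string &text) {
    std::vector<std::string> words;
    std::string current;
    for (char ch : text) {
        if (ch == ' ' || ch == ',' || ch == '.') {
            if (!current.empty()) { words.push_back(current); current.clear(); }
            if (ch != ' ') words.push_back(std::string(1, ch));
        } else {
            current += ch;
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

// One member function per production rule of the toy grammar.
struct SentenceParser {
    std::vector<std::string> words;
    std::size_t pos = 0;

    // Consumes the next word if it is one of the given terminals.
    bool accept(const std::set<std::string> &terminals) {
        if (pos < words.size() && terminals.count(words[pos]) > 0) { ++pos; return true; }
        return false;
    }
    bool article() { return accept({"a", "the"}); }
    bool noun()    { return accept({"cat", "mouse"}); }
    bool adj()     { return accept({"black", "white", "thin", "fat"}); }
    bool adverb()  { return accept({"quickly", "slowly"}); }
    bool verb()    { return accept({"eats", "runs"}); }

    // adj_list ::= adj { ',' adj }
    bool adjList() {
        if (!adj()) return false;
        while (accept({","})) {
            if (!adj()) return false;
        }
        return true;
    }
    // noun_clause ::= article [ adj_list ] noun
    bool nounClause() {
        if (!article()) return false;
        std::size_t saved = pos;
        if (!adjList()) pos = saved;   // adj_list is optional
        return noun();
    }
    // sentence ::= noun_clause verb ( noun_clause | adverb ) '.'
    bool sentence() {
        if (!nounClause() || !verb()) return false;
        if (!adverb() && !nounClause()) return false;
        return accept({"."}) && pos == words.size();
    }
};

// True if the text can be produced by the grammar.
bool validSentence(const std::string &text) {
    SentenceParser parser;
    parser.words = splitWords(text);
    return parser.sentence();
}
```

Calling validSentence on the two possible sentences from the slide should accept them, and calling it on the two erroneous ones should reject them. Note that the word splitting plays the role of lexical analysis and the rule functions play the role of syntax analysis.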
Lexical Analysis
• Tokenisation.
• Splits the input characters into a series of structs, each holding a terminal symbol and a designation of its kind.
• Called as a subprogram by the syntax analyser whenever it needs another token.
• Example : A = B + C ; is tokenised as
    IDENT ‘A’   ASSIGN   IDENT ‘B’   ADDOP ‘+’   IDENT ‘C’   TERMINATOR
• Also discards comments and whitespace.
• First step : identify the lexical tokens in C--.
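As a rough illustration of tokenisation, the sketch below splits a string such as A = B + C ; into (kind, text) pairs. It is a simplification made up for this example: the Kind enum has only four values, the Token struct is not the unit's LexToken, and the input is a string rather than cin.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Simplified, hypothetical tag set (the real lexer has many more kinds).
enum Kind { IDENT, ASSIGN, ADDOP, TERMINATOR };

struct Token {
    Kind kind;          // designation of the token's kind
    std::string text;   // the terminal symbol itself
};

std::vector<Token> tokenise(const std::string &input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        char ch = input[i];
        if (std::isalpha(static_cast<unsigned char>(ch))) {
            // Identifier: a letter followed by letters or digits.
            std::string name;
            while (i < input.size() &&
                   std::isalnum(static_cast<unsigned char>(input[i]))) {
                name += input[i++];
            }
            tokens.push_back({IDENT, name});
        } else if (ch == '=') {
            tokens.push_back({ASSIGN, "="});
            ++i;
        } else if (ch == '+') {
            tokens.push_back({ADDOP, "+"});
            ++i;
        } else if (ch == ';') {
            tokens.push_back({TERMINATOR, ";"});
            ++i;
        } else {
            ++i;   // whitespace (and anything else in this sketch) is discarded
        }
    }
    return tokens;
}
```

For the input A = B + C ; this produces six tokens in source order, matching the example on the slide.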
C-- Lexical Tokens • The lexical tokens are the terminal symbols of the grammar. • The lexical tokens in C-- can be split into the following groups : • Identifiers. • Literals. • Punctuation : =, ‘, ,, ;, &, (, ), [, ], {, } and !. • Operators : • Relational : ==, !=, >, <, >= and <=. • Additive : +, - and ||. • Multiplicative : *, /, % and &&. • Reserved words : const, bool, string, int, if, else, while, cin and cout.
The Unit Directory • On Jaguar. Contains lots of useful stuff for this unit : /usr/users/staff/aosc/cm049icp/phase1 • makefile : A makefile to build the phase 1 program. • lexprog.cxx : The test bed program for phase 1. • lexer.h : The header file for phase 1. • skipWhiteComments.cxx, writeToken.cxx : to be included in your program (more later). • lexer : A compiled (and linked) executable for phase 1. • examples/*.c-- : Random C-- source programs. • If you write a good C-- program e-mail it to me and I’ll include it in the unit directory for everyone to admire.
The Unit Directory II • tests/test*.c-- : These are the C-- source programs that I’ll be using to test your lexer program during the demo. • To get maximum marks for phase 1 your program should produce exactly the same output as lexer when run on these source programs. • The C++ string library files are in /usr/users/staff/aosc/cm049icp/lib • They are : cstring.h : String library header file. string.cxx : String library implementation file. string.o : String library object code file. • Useful commands : testphase1, demophase1. Shell scripts for running the demo.
What You Must Do • Get copies of makefile, writeToken.cxx and skipWhiteComments.cxx from the unit directory. • It is probably a good idea to get copies of lexprog.cxx and lexer.h as well, though it’s not necessary. • lexprog.cxx includes lexer.h, the header file which contains the definition of the type LexToken and the prototype for the subprogram which you must write : lexAnal. • You must put your implementation of lexAnal plus the contents of writeToken.cxx and skipWhiteComments.cxx into a file called lexer.cxx in your directory. • The makefile links lexer.cxx into the executable.
Example Run
• Assume that the file prog.c-- contains the following simple program :
    // Simple test program
    int a ;
    const string s = "Input : " ;
    const string endl = "\n" ;
    {
      cin >> a ;
      cout << s ;
      cout << b ;
      cout << endl ;
    }
Example Run II
• Use the makefile to compile and link your lexer into the file lexer. Then run it :
    jaguar> make lexer
    jaguar> lexer < prog.c--
    INT
    IDENTIFIER : ‘a’
    TERMINATOR
    CONST
    STRING
    IDENTIFIER : ‘s’
    STRINGLIT : “Input :”
    TERMINATOR
    CONST
    STRING
    IDENTIFIER : ‘endl’
    STRINGLIT : “\n”
    TERMINATOR
    LBRACE
• lexer reads from cin and writes each token’s kind and, if required, its value to cout.
Example Run III
    CIN
    INOP
    IDENTIFIER : ‘a’
    TERMINATOR
    COUT
    OUTOP
    IDENTIFIER : ‘s’
    TERMINATOR
    COUT
    OUTOP
    IDENTIFIER : ‘b’
    TERMINATOR
    COUT
    OUTOP
    IDENTIFIER : ‘endl’
    TERMINATOR
    RBRACE
    jaguar>
lexprog.cxx
    #include <iostream.h>
    #include ".../phase1/lexer.h"

    void main()
    {
      LexToken lexToken ; // Next lexical token
      skipWhiteComments() ;
      while (cin)
      {
        lexAnal(lexToken) ;
        writeToken(lexToken) ;
        cout << endl ;
      }
    } // main
• lexAnal : you must write this subprogram.
lexer.h
    #ifndef LEXER_H
    #define LEXER_H

    #include ".../lib/cstring.h"

    enum LexTokenTag { ... } ;
    struct LexToken { ... } ;

    void lexAnal(LexToken &lexToken) ;
    void skipWhiteComments() ;
    void writeToken(LexToken lexToken) ;

    #endif
The LexToken Type
    enum LexTokenTag { IDENT, ..., COUT } ;

    struct LexToken
    {
      LexTokenTag tag ; // Tag field
      string ident ;     // Identifier
      string boolLit ;   // Boolean literal
      string stringLit ; // String literal
      int intLit ;       // Integer literal
      string addOp ;     // Add operator
      string relOp ;     // Rel operator
      string mulOp ;     // Mul operator
    } ; // LexToken
lexAnal • Prototype : void lexAnal(LexToken &lexToken) ; • lexAnal reads the next lexical token from input and returns it via the reference parameter lexToken. • Assumes that the next character on cin is the first character of the next lexical token. • This is ensured by calling skipWhiteComments. • Pre-defined C++ member function of the input stream to put a character back onto the start of the stream after we’ve inspected it : istream &putback(char ch) ; (called as cin.putback(ch)). • lexAnal inspects the next input character and uses it to decide what the next lexical token is.
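The inspect-then-put-back idiom can be sketched as follows. The function name peekChar is made up for this example, and it takes a std::istream parameter (standard <sstream>/<iostream> headers, not <iostream.h>) so that it can be tried on an in-memory stream as well as on cin.

```cpp
#include <sstream>   // std::istream, plus std::istringstream for trying it out

// Returns the next character on the stream without consuming it,
// or '\0' if the input is exhausted. Hypothetical helper.
char peekChar(std::istream &in) {
    char next;
    if (!in.get(next)) {
        return '\0';      // nothing left to read
    }
    in.putback(next);     // put it back so the next read sees it again
    return next;
}
```

For example, with a std::istringstream holding ab, two successive calls both return a, because the character is put back each time. The standard library also offers in.peek(), which does much the same thing in one call.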
lexAnal II
• After a token has been read, lexAnal calls skipWhiteComments to read to the start of the next lexical token.
• Top-level code for lexAnal :
    cin.get(next) ;
    if (!cin)
      // EOF encountered. Print error message and call exit.
    else
    {
      if ...
        // Nested if statement to recognise next character,
        // lex the token and return it in lexToken.
      else
        // Character not recognised. Print error
        // message and call exit.
    }
    skipWhiteComments() ;
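To show what reading to the start of the next token involves, here is one possible way to skip whitespace and // comments. It is not the skipWhiteComments.cxx supplied in the unit directory: the name skipWhiteAndComments and the std::istream parameter are assumptions made for this sketch.

```cpp
#include <cctype>
#include <sstream>   // std::istream, plus std::istringstream for trying it out

// Consumes whitespace and // comments so that the next character on
// the stream is the first character of the next token (if any).
void skipWhiteAndComments(std::istream &in) {
    char next;
    while (in.get(next)) {
        if (std::isspace(static_cast<unsigned char>(next))) {
            continue;                         // discard whitespace
        }
        if (next == '/' && in.peek() == '/') {
            // Comment: discard everything up to the end of the line.
            while (in.get(next) && next != '\n') {
            }
            continue;
        }
        in.putback(next);                     // first character of a token
        return;
    }
}
```

A single / that is not followed by another / is put back untouched, so it can later be lexed as the division operator.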
Lookahead • 28 possible kinds of lexical token : one value in the LexTokenTag enum for each kind. • lexAnal can identify the kind of some tokens just by looking at the first character on input (i.e. by one character lookahead). This is true for tokens that start with (or consist of only) the following characters : “, ,, ;, (, ), [, ], {, }, +, -, *, /, %, 0..9. • The following pairs of tokens require lexAnal to inspect the next two input characters (i.e. two character lookahead) to distinguish between them : = and ==, ! and !=, | and ||, > and >=, < and <=, & and &&, / and //.
One Character Lookahead
• The lexAnal code for this is fairly simple.
• If the token kind has no value associated with it :
    if (next == ';')
    {
      lexToken.tag = TERMINATOR ;
    }
• If the token kind has a value associated with it :
    if (isdigit(next))
    {
      cin.putback(next) ;
      lexIntLit(lexToken) ;
    }
• The subprogram lexIntLit reads the integer literal and sets the fields of lexToken appropriately.
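A subprogram like lexIntLit might read its digits along the following lines. This is a sketch under assumptions: readIntLit is a made-up name, it returns the value instead of filling in a LexToken, it reads from a std::istream parameter, and it does not check for overflow.

```cpp
#include <cctype>
#include <sstream>   // std::istream, plus std::istringstream for trying it out

// Assumes the next character on the stream is a digit.
// Reads digits until a non-digit is found, which is put back.
int readIntLit(std::istream &in) {
    int value = 0;
    char next;
    while (in.get(next)) {
        if (!std::isdigit(static_cast<unsigned char>(next))) {
            in.putback(next);             // not part of the literal
            break;
        }
        value = value * 10 + (next - '0'); // accumulate the decimal value
    }
    return value;
}
```

In the real lexer the result would be stored in lexToken.intLit and lexToken.tag set to the integer literal tag.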
Reading String Literals
• When the next input character is " we have a string literal.
    if (next == '\"')
    {
      cin.putback(next) ;
      lexStringLit(lexToken) ;
    }
• lexStringLit uses a while loop to read the string literal character by character and appends each character onto the end of lexToken.stringLit using the C++ string append operator, +.
• The implementation of lexStringLit is in the file lexStringLit.cxx in the phase1 subdirectory of the unit directory.
• The C++ string append operator (+) will also be useful when reading identifiers.
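The character-by-character loop can be sketched as below. This is not the lexStringLit.cxx from the unit directory: readStringLit is a made-up name, it uses std::string, it does no escape processing, and it returns false on an unterminated literal instead of printing a message and calling exit, so that the behaviour is easy to try out.

```cpp
#include <sstream>   // std::istream, plus std::istringstream for trying it out
#include <string>

// Assumes the next character on the stream is the opening quote.
// Stores the text between the quotes in lit. Returns false if the
// input ends before a closing quote is found.
bool readStringLit(std::istream &in, std::string &lit) {
    char next;
    in.get(next);                 // consume the opening quote
    lit = "";
    while (in.get(next)) {
        if (next == '"') {
            return true;          // closing quote: literal complete
        }
        lit = lit + next;         // append using the + operator
    }
    return false;                 // end of input inside the literal
}
```

The false return corresponds to the lexer error for a string literal with no closing quote.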
Two Character Lookahead
• None of the token kinds that require two character lookahead have a value that must be read from the input (an operator’s value is a fixed string), so the lexAnal code is actually fairly simple.
• Example :
    if (next == '<')
    {
      lexToken.tag = RELOP ;
      cin.get(next) ;
      if (next == '=')
        lexToken.relOp = "<=" ;
      else if (next == '<')
        lexToken.tag = OUTOP ;
      else
      {
        cin.putback(next) ;
        lexToken.relOp = "<" ;
      }
    }
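The same pattern applies to the other pairs. The sketch below handles = versus == and | versus ||, treating a single | as an error. The Tag enum, the Tok struct and the name lexTwoChar are simplifications made up for this example, and the function reads from a std::istream parameter.

```cpp
#include <sstream>   // std::istream, plus std::istringstream for trying it out
#include <string>

// Hypothetical subset of token tags, plus a marker for errors.
enum Tag { ASSIGN, RELOP, ADDOP, ERROR_TAG };

struct Tok {
    Tag tag;
    std::string op;   // operator text, where the kind has a value
};

// first has already been read from the stream and is '=' or '|'.
Tok lexTwoChar(std::istream &in, char first) {
    Tok tok = { ERROR_TAG, "" };
    char second = '\0';
    bool gotSecond = static_cast<bool>(in.get(second));
    if (first == '=') {
        if (gotSecond && second == '=') {
            tok.tag = RELOP;              // "=="
            tok.op = "==";
        } else {
            tok.tag = ASSIGN;             // plain "="
            if (gotSecond) in.putback(second);
        }
    } else if (first == '|') {
        if (gotSecond && second == '|') {
            tok.tag = ADDOP;              // "||"
            tok.op = "||";
        } else if (gotSecond) {
            in.putback(second);           // single '|': tag stays ERROR_TAG
        }
    }
    return tok;
}
```

In the real lexer the ERROR_TAG case would instead print an error message and call exit.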
Identifiers, Reserved Words And Boolean Literals
• If the next character on the input is a letter then the token may be an identifier, a reserved word or a boolean literal (false or true).
• The only way to decide is to read the string in character by character until a character which is not a letter, digit or ‘_’ is encountered.
• Use the same method as in lexStringLit.
• Once the string has been read (into str, say) :
    if ((str == "true") || (str == "false"))
      // It’s a boolean.
    else if (str == "const")
      // const reserved word.
    else if ...
    else
      // It’s an identifier.
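Reading the word and then classifying it might look like this. The names readWord, classifyWord and WordKind are made up for this sketch; it lumps all reserved words into one kind, whereas the output shown earlier suggests that the real lexer gives each reserved word its own tag (CONST, INT and so on).

```cpp
#include <cctype>
#include <sstream>   // std::istream, plus std::istringstream for trying it out
#include <string>

enum WordKind { BOOL_LIT, RESERVED, IDENTIFIER_KIND };   // hypothetical

// Reads letters, digits and '_' until some other character is found,
// which is put back on the stream.
std::string readWord(std::istream &in) {
    std::string word;
    char next;
    while (in.get(next)) {
        bool wordChar = std::isalnum(static_cast<unsigned char>(next)) || next == '_';
        if (!wordChar) {
            in.putback(next);
            break;
        }
        word = word + next;
    }
    return word;
}

// Decides what kind of token a complete word is.
WordKind classifyWord(const std::string &word) {
    static const char *const reserved[] = {
        "const", "bool", "string", "int", "if", "else", "while", "cin", "cout"
    };
    if (word == "true" || word == "false") return BOOL_LIT;
    for (const char *r : reserved) {
        if (word == r) return RESERVED;
    }
    return IDENTIFIER_KIND;
}
```

Because the whole word is read before it is classified, a name such as while2 is an identifier even though it starts with a reserved word.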
lexer.cxx • Put your implementation of lexAnal (and all the lexing subprograms) in a file called lexer.cxx in your own directory. • lexer.cxx must also #include the following files : ".../phase1/lexer.h", ".../lib/cstring.h", <iostream.h>, <ctype.h> and <stdlib.h>. • You may not need ctype.h and stdlib.h but you will certainly need all the others.
Lexer Errors • Most syntactic errors will be detected during syntax analysis. • Some syntactic errors can be detected during lexical analysis. These are lexer errors. • Already seen a few : • Unexpected EOF (see slide 2.22). • Unrecognised character (see slide 2.22). • A single | (see slide 2.23). • There is one other error which your lexer must detect : • An opening " without a corresponding closing " at the end of the string literal.
Lexer Errors II • Most compilers attempt to recover from errors in the program by guessing what the programmer actually meant. • This is very difficult to do, which is why compilers produce so many spurious errors and rubbish error messages. • When your lexer detects an error it should simply output an error message to cout and call exit to terminate the program. • See lexStringLit.cxx for an example of how to do this. • Make sure you give exit a different integer parameter for different errors. • Make sure you give exit the same integer parameter for the same error. • e.g. always use exit(1) for unexpected end of file wherever in the lexer it is detected. • This is good programming practice.
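One way to keep exit codes consistent is to name them in a single enum and route every error through one subprogram. In the sketch below only the value 1 for unexpected end of file comes from the slide; the other values, the enum and function names, and the message texts are assumptions made for illustration.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// One exit code per kind of lexer error (values other than 1 are made up).
enum LexErrorCode {
    ERR_UNEXPECTED_EOF = 1,
    ERR_UNRECOGNISED_CHAR = 2,
    ERR_SINGLE_BAR = 3,
    ERR_UNTERMINATED_STRING = 4
};

// Message text for each error code.
std::string lexErrorMessage(LexErrorCode code) {
    switch (code) {
        case ERR_UNEXPECTED_EOF:      return "unexpected end of file";
        case ERR_UNRECOGNISED_CHAR:   return "unrecognised character";
        case ERR_SINGLE_BAR:          return "single |";
        case ERR_UNTERMINATED_STRING: return "unterminated string literal";
    }
    return "unknown error";
}

// Prints the message to cout and terminates with the matching code.
void lexError(LexErrorCode code) {
    std::cout << "Lexer error : " << lexErrorMessage(code) << std::endl;
    std::exit(code);
}
```

Every place in the lexer that detects, say, end of file would then call lexError(ERR_UNEXPECTED_EOF), so the same error always yields the same exit code.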
Summary • Write a compiler from C-- to M68K assembly code. • Phase 1 : lexical analysis (10%). • Phase 2 : syntax analysis (20%). • Phase 3 : code generation (10%). • Write a parser for SCL. • Phase 4 (60%). • Lexical analysis converts program text into a series of lexical tokens. • Discards comments and whitespace. • Requires lookahead to determine token kind.