CS 35000. Fall 2016. Chapter 3 Syntax and Semantics. Peter A. Ng CS 35000 Programming Language Design Indiana University – Purdue University Fort Wayne. Chapter 3 Outline. Introduction The General Terminology of Describing Syntax Grammar: Formal Methods of Describing Syntax
CS 35000 Fall 2016 Chapter 3: Syntax and Semantics Peter A. Ng CS 35000 Programming Language Design Indiana University – Purdue University Fort Wayne
Chapter 3 Outline • Introduction • The General Terminology of Describing Syntax • Grammar: Formal Methods of Describing Syntax • Attribute Grammars • Describing the Meanings of Programs: Dynamic Semantics • Fundamental problem for this chapter: How does a computer understand the representation of a computer program?
Syntax and Semantics • Syntax: the form or structure of expressions, statements, and program units • Defines what is grammatically correct • System.out.println("Grade = " + grade); Correct syntax in Java • System.out.println("Grade = ",grade) Incorrect syntax in Java • Semantics: the meaning of the expressions, statements, and program units • Example: while (bool_expression) statement • Correct syntax is a prerequisite for correct semantics • Syntax and semantics are closely related • Together they provide a language's definition • It is easier to describe syntax than semantics
General Terminology of Describing Syntax • Language: a set of sentences/statements • Statement: a string of characters over some alphabet • An alphabet is a standard set of characters, such as ASCII or Extended ASCII • Lexeme: the lowest-level syntactic unit of a statement/language (e.g., int, testscore, =, 90, ;) • Token: a category of lexemes (e.g., identifier)
Example: index = 2 * count + 17;
Lexeme    Token
index     identifier
=         equal_sign
2         int_literal
*         mult_op
count     identifier
+         plus_op
17        int_literal
;         semicolon (delimiter)
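The lexeme/token pairing above can be reproduced with a small tokenizer. This is only a sketch: the token names come from the slide, while the regular expressions and the function name tokenize are assumptions made for illustration.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("int_literal", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("equal_sign", r"="),
    ("mult_op", r"\*"),
    ("plus_op", r"\+"),
    ("semicolon", r";"),
    ("whitespace", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (lexeme, token) pairs, skipping whitespace."""
    pairs = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "whitespace":
            pairs.append((match.group(), match.lastgroup))
        pos = match.end()
    return pairs


for lexeme, token in tokenize("index = 2 * count + 17;"):
    print(f"{lexeme:<6} {token}")
```

Each lexeme is the matched text and each token is the name of the pattern that matched it, which is the distinction the table draws.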
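Figure: Interactions between the lexical analyzer and the parser. (The lexical analyzer reads the source program; the parser requests tokens with getNextToken and receives a token in return; both consult the symbol table; the parser's output goes on to semantic analysis.)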
• One token for each keyword • Tokens for the operators, either individually or in classes, such as the token comparison • One token representing all identifiers • One or more tokens representing constants, such as numbers and literal strings • Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon
Attributes for Tokens • The token names and associated attribute values for E = M * C ** 2 are written below as a sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign-op>
<id, pointer to symbol-table entry for M>
<mult-op>
<id, pointer to symbol-table entry for C>
<exp-op>
<number, integer value 2>
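Figure: A transition diagram for relational operators (relop).
start → state 0
state 0: '<' → state 1; '=' → state 5; '>' → state 6
state 1: '=' → state 2, return(relop, LE); '>' → state 3, return(relop, NE); other → state 4 *, return(relop, LT)
state 5: return(relop, EQ)
state 6: '=' → state 7, return(relop, GE); other → state 8 *, return(relop, GT)
* Need to retract the forward pointer one position.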
Figure: A transition diagram for id's and keywords. start → state 9; state 9: letter → state 10; state 10: letter or digit → state 10; other → state 11 *, return(getToken(), installID()). • Install the reserved words in the symbol table initially. • When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. (Any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id.) • The function getToken examines the symbol-table entry for the lexeme found and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table.
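The installID/getToken scheme described above can be sketched as follows. The keyword set, the dictionary-based symbol table, and the Python names install_id, get_token, and scan_id are assumptions for illustration; the slide only fixes the behaviour (reserved words are pre-installed, and every other lexeme becomes an id).

```python
# Hypothetical keyword set; a real language would supply its own.
KEYWORDS = ("if", "then", "else", "while")

# The symbol table maps a lexeme to its entry. Reserved words go in first.
symbol_table = {kw: {"lexeme": kw, "token": kw} for kw in KEYWORDS}


def install_id(lexeme):
    """Add the lexeme as an id if it is absent; return its table entry."""
    return symbol_table.setdefault(lexeme, {"lexeme": lexeme, "token": "id"})


def get_token(lexeme):
    """Return the token name recorded in the table: id or a keyword token."""
    return symbol_table[lexeme]["token"]


def scan_id(text, pos=0):
    """Follow states 9 -> 10 -> 11: a letter, then letters or digits.

    Returns (token, entry, next_pos), or None if no letter starts at pos.
    """
    if pos >= len(text) or not text[pos].isalpha():
        return None
    end = pos + 1
    while end < len(text) and text[end].isalnum():
        end += 1  # the loop on state 10
    lexeme = text[pos:end]  # 'other' was seen; the forward pointer stops at end
    entry = install_id(lexeme)
    return get_token(lexeme), entry, end


print(scan_id("while (x)")[0])   # keyword token
print(scan_id("count1 = 2")[0])  # id
```

Because keywords are already in the table when scanning starts, install_id never overwrites them, so one diagram handles both identifiers and reserved words.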
Figure: A transition diagram for whitespace (states 22, 23, and 24, with transitions on delim; * marks the state that retracts the forward pointer).
TOKEN getRelop() {
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a return or failure occurs */
        switch (state) {
        case 0:
            c = nextChar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1: …
        …
        case 8:
            retract();
            retToken.attribute = GT;
            return(retToken);
        }
    }
}
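For comparison, here is one way the whole relop transition diagram could be written out in Python. It is a sketch, not the code the slide elides: the state numbers and attribute names (LE, NE, LT, EQ, GE, GT) follow the diagram, while the function signature, the ACCEPTING table, and the treatment of end of input as "other" are assumptions.

```python
# Accepting states: state -> (attribute, retract one character?)
ACCEPTING = {
    2: ("LE", False),
    3: ("NE", False),
    4: ("LT", True),
    5: ("EQ", False),
    7: ("GE", False),
    8: ("GT", True),
}


def get_relop(text, pos=0):
    """Return (("relop", attribute), next_pos), or None if no relop starts at pos."""

    def peek(i):
        return text[i] if i < len(text) else ""  # "" stands for end of input

    state, forward = 0, pos
    while True:
        if state == 0:
            c = peek(forward)
            forward += 1
            if c == "<":
                state = 1
            elif c == "=":
                state = 5
            elif c == ">":
                state = 6
            else:
                return None  # fail(): lexeme is not a relop
        elif state == 1:
            c = peek(forward)
            forward += 1
            state = 2 if c == "=" else 3 if c == ">" else 4
        elif state == 6:
            c = peek(forward)
            forward += 1
            state = 7 if c == "=" else 8
        else:
            attribute, retract = ACCEPTING[state]
            if retract:
                forward -= 1  # give back the 'other' character
            return ("relop", attribute), forward


print(get_relop("<= 3"))
print(get_relop("< 3"))
```

The starred states 4 and 8 are the only ones that retract, because they are reached by reading a character that is not part of the operator.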
Recognition or Generation of a Language • A language can be formally defined in two ways: recognition or generation • Recognition is implemented as a recognizer R • Generation is implemented as a generator G
Recognition or Generation of a Language • A language can be formally defined in two ways: recognition or generation • Recognition is implemented as a recognizer R • Given an input string, R determines whether the string is in a language L or not • A recognizer does not enumerate all sentences in L, because L is normally a countably infinite set • However, it can be used to analyze the syntax of a program: it is the syntax analyzer part of a compiler • Generation is implemented as a generator G
Recognition or Generation of a Language • A language can be formally defined in two ways: recognition or generation • Recognition is implemented as a recognizer R • Generation is implemented as a generator G • G can be used to generate the sentences of L, of which there may be infinitely many • One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator • A grammar is a language generator
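To make the recognizer/generator contrast concrete, the sketch below uses a toy language that is not from the slides: L = { aⁿbⁿ | n ≥ 1 }. The function names recognize and generate are likewise illustrative assumptions.

```python
from itertools import islice


def recognize(sentence):
    """Recognizer R: decide whether one given string is in L."""
    n = len(sentence) // 2
    return n >= 1 and sentence == "a" * n + "b" * n


def generate():
    """Generator G: produce the sentences of L one at a time, without end."""
    n = 1
    while True:
        yield "a" * n + "b" * n
        n += 1


print(recognize("aabb"), recognize("aab"))
print(list(islice(generate(), 3)))
```

The recognizer answers a yes/no question about a single input and never lists L; the generator can list sentences indefinitely but, on its own, gives no direct answer about a particular string.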
Formal Definition of Languages • A grammar is the formal language generation mechanism • It is commonly used to describe the syntax of a programming language • Grammars are the language of languages • Two closely related formalisms • Context-Free Grammars (CFG) • Backus-Naur Form (BNF)
Formal Definition of Languages • A grammar is the formal language generation mechanism • Context-Free Grammars (CFG) • A context-free grammar (CFG) G is a quadruple (V, Σ, R, S) where • V: a set of non-terminal symbols • Σ: a set of terminals (V ∩ Σ = ∅) • R: a set of rules, each mapping a non-terminal in V to a string in (V ∪ Σ)* • S: a start symbol (S ∈ V)
Context Free Grammar • As an example, consider the classic unambiguous expression grammar:
expr → term + expr
expr → term
term → term ∗ factor
term → factor
factor → ( expr )
factor → const
const → integer
Context Free Grammar • How do we know that 3 * 7 is a valid expression? Based on the given grammar, we can generate it:
expr
⇒ term                 using expr → term
⇒ term * factor        using term → term ∗ factor
⇒ factor * factor      using term → factor
⇒ const * factor       using factor → const
⇒ const * const        using factor → const
⇒ 3 * const            using const → integer
⇒ 3 * 7                using const → integer
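The derivation of 3 * 7 can be replayed mechanically. In the sketch below the grammar is the one from the slide, written as a Python dictionary; the helper apply_rule and the choice to stop at the token integer (of which 3 and 7 are lexemes) are assumptions for illustration.

```python
GRAMMAR = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["term", "*", "factor"], ["factor"]],
    "factor": [["(", "expr", ")"], ["const"]],
    "const": [["integer"]],
}


def apply_rule(form, index, rhs):
    """Rewrite the nonterminal at form[index] with rhs, checking the rule exists."""
    lhs = form[index]
    if rhs not in GRAMMAR.get(lhs, []):
        raise ValueError(f"no rule {lhs} -> {' '.join(rhs)}")
    return form[:index] + rhs + form[index + 1:]


# (position of the nonterminal to rewrite, right-hand side to use)
STEPS = [
    (0, ["term"]),
    (0, ["term", "*", "factor"]),
    (0, ["factor"]),
    (0, ["const"]),
    (2, ["const"]),
    (0, ["integer"]),
    (2, ["integer"]),
]

form = ["expr"]
trace = [form]
for index, rhs in STEPS:
    form = apply_rule(form, index, rhs)
    trace.append(form)

for sentential_form in trace:
    print(" ".join(sentential_form))
```

Every step is checked against the grammar, so a sentence is valid exactly when some sequence of legal rewrites reaches it from expr.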
Formal Definition of Languages • A grammar is the formal language generation mechanism • Context-Free Grammars (CFG) • Backus-Naur Form (BNF) • Every rule in Backus-Naur form has the following structure: name ::= expansion • A name is a non-terminal symbol. Every name in BNF is surrounded by angle brackets < > • An expansion is an expression containing terminal symbols and non-terminal symbols, joined together by sequencing and choice |
For example, in BNF, the classic expression grammar is:
<expr> ::= <term> "+" <expr> | <term>
<term> ::= <term> "*" <factor> | <factor>
<factor> ::= "(" <expr> ")" | <const>
<const> ::= integer
The same grammar as a CFG:
expr → term + expr
expr → term
term → term ∗ factor
term → factor
factor → ( expr )
factor → const
const → integer
Formal Definition of Languages • A grammar is the formal language generation mechanism • It is commonly used to describe the syntax of a programming language • Two closely related formalisms • Context-Free Grammars (CFG) • Backus-Naur Form (BNF) and its variants, Extended Backus-Naur Form (EBNF) and Augmented BNF (ABNF)
Formal Definition of Languages • Backus-Naur Form (BNF) • Extended Backus-Naur Form (EBNF): BNF with • option [...], such as <term> ::= [ "-" ] <factor> • repetition {...}, such as <args> ::= <arg> { "," <arg> } • grouping (...), such as <expr> ::= <term> ("+" | "-") <expr> • concatenation, where some EBNF variants use an explicit "," operator rather than relying on juxtaposition, such as args = arg , { "," , arg } ; in ISO-style EBNF • Augmented BNF (ABNF)
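EBNF repetition maps naturally onto a loop in a hand-written parser. The sketch below parses <args> ::= <arg> { "," <arg> } from a list of lexemes; treating an <arg> as any identifier, and the function name parse_args, are assumptions made to keep the example small.

```python
def parse_args(tokens):
    """Parse <args> ::= <arg> { "," <arg> } and return the list of arguments."""
    pos = 0

    def parse_arg():
        nonlocal pos
        # Assumption: an <arg> is a single identifier lexeme.
        if pos < len(tokens) and tokens[pos].isidentifier():
            pos += 1
            return tokens[pos - 1]
        raise SyntaxError(f"expected <arg> at position {pos}")

    args = [parse_arg()]
    while pos < len(tokens) and tokens[pos] == ",":  # the { "," <arg> } part
        pos += 1
        args.append(parse_arg())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return args


print(parse_args(["x", ",", "y", ",", "z"]))
```

In plain BNF the same list needs a recursive rule; the braces let the grammar, and the parser that follows it, state "zero or more" directly.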
Describing Syntax: BNF and Context-Free Grammars • Context-Free Grammars (CFG) • Developed by Chomsky in the mid-1950s • Define a class of languages called context-free languages • Backus-Naur Form (BNF) • Invented by John Backus to describe Algol 58 • BNF is nearly identical to context-free grammars • BNF is a metalanguage: a language used to describe another language
BNF Fundamentals • Some entities are commonly used by BNF • A terminal symbol represents a lexeme or token • A nonterminal symbol is often enclosed in angle brackets • Examples of BNF rules: <ident_list> → identifier | identifier, <ident_list> and <if_stmt> → if <logic_expr> then <stmt> • A production rule defines one nonterminal in terms of terminals and/or other nonterminals • Left-hand side (LHS): a nonterminal • Right-hand side (RHS): a string of terminals and/or nonterminals • Grammar: a finite non-empty set of rules • A start symbol is the starting element of a grammar; it is a nonterminal
BNF Illustration • LHS and RHS of a rule • An abstraction (or nonterminal symbol) can have more than one RHS: <stmt> → <single_stmt> | begin <stmt_list> end • Terminals: begin, end • Nonterminals (written <…>): <stmt>, <single_stmt>, <stmt_list> • Applying rules produces a derivation
Production Rule Example • The rule that defines the syntactic class of a while statement: <while_stmt> → while ( <logic_expr> ) <stmt> • LHS: the syntactic class being defined • RHS: the definition of the LHS • How many nonterminals and terminals are in the RHS? • The RHS above consists of 3 terminals (tokens) and 2 nonterminals (syntactic classes) • Terminals: while, (, and ) • Nonterminals: <logic_expr> and <stmt>
Describing Lists • What can we learn from the following rule? <ident_list> → ident | ident, <ident_list> • A nonterminal may have multiple distinct definitions, separated by the vertical bar '|', which is read as "OR" • Syntactic lists can be described using recursion
A Sample Grammar
<program> → <stmts>
<stmts> → <stmt> | <stmt> ; <stmts>
<stmt> → <var> = <expr>
<var> → a | b | c | d
<expr> → <term> + <term> | <term> - <term>
<term> → <var> | integer
• Any nonterminal appearing on the RHS of a production rule needs to be defined by some production rule, and thus must appear on a LHS elsewhere • In this grammar, integer is a token representing any integer lexeme. Integer and constant are interchangeable.
A Formal Definition of Grammar • A grammar G = ( T, N, P, S ), where • T is a finite nonempty set of terminal symbols • N is a finite nonempty set of non-terminal symbols • P is a finite nonempty set of production rules • S ∈ N is a start symbol representing a complete sentence • The start symbol is typically named <program> • Generation of a sentence is called a derivation • A derivation is a repeated application of rules • Starting from the start non-terminal symbol, and • Ending with a sentence • A complete sentence consists of terminal symbols only • The language generated by G is L(G) = { w ∈ T* | S ⇒* w }, where w is a sentence made up only of terminal symbols from T
A Sample Derivation (=> is read as "derives")
<program> => <stmts>
          => <stmt>
          => <var> = <expr>
          => a = <expr>
          => a = <term> + <term>
          => a = <var> + <term>
          => a = b + <term>
          => a = b + const
Derivations • Every string of symbols in a derivation is a sentential form; its symbols may be nonterminals and terminals • A sentence is a sentential form that contains only terminal symbols • A leftmost derivation always chooses the leftmost nonterminal in a sentential form to expand • A derivation may be leftmost, rightmost, or neither • Derivation order does not affect the language that is generated by a grammar
Leftmost vs. Rightmost Derivation • Here is the definition of a grammar:
<expr> → <expr> + <term> | <term>
<term> → <term> / integer | integer
• Given a statement: integer + integer / integer • The tokens integer and const (i.e., constant) are interchangeable
Leftmost derivation
<expr> => <expr> + <term>
       => <term> + <term>
       => integer + <term>
       => integer + <term> / integer
       => integer + integer / integer
Rightmost derivation
<expr> => <expr> + <term>
       => <expr> + <term> / integer
       => <expr> + integer / integer
       => <term> + integer / integer
       => integer + integer / integer
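Both derivations come from the same parse tree; only the order in which nonterminals are expanded differs. The sketch below encodes that tree by hand and reads off either derivation. The nested-tuple tree encoding and the function names derivation and render are assumptions for illustration.

```python
# A node is (nonterminal, [children]); a terminal is a plain string.
TREE = (
    "<expr>",
    [
        ("<expr>", [("<term>", ["integer"])]),
        "+",
        ("<term>", [("<term>", ["integer"]), "/", "integer"]),
    ],
)


def render(form):
    """Show a sentential form: node labels for nonterminals, text for terminals."""
    return " ".join(n if isinstance(n, str) else n[0] for n in form)


def derivation(tree, leftmost=True):
    """List the sentential forms obtained by always expanding one end first."""
    form = [tree]
    steps = [render(form)]
    while True:
        pending = [i for i, n in enumerate(form) if not isinstance(n, str)]
        if not pending:
            return steps  # only terminals remain: this is the sentence
        i = pending[0] if leftmost else pending[-1]
        form[i:i + 1] = form[i][1]  # replace the node by its children
        steps.append(render(form))


print("\n=> ".join(derivation(TREE, leftmost=True)))
print()
print("\n=> ".join(derivation(TREE, leftmost=False)))
```

Both runs end in the same sentence and use the same five rule applications, which is why derivation order does not change the language or the parse tree.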
Parse Tree • A parse tree is a hierarchical representation of a derivation • Each internal node is labeled with a non-terminal symbol • Each leaf node is labeled with a terminal symbol • A grammar is ambiguous if and only if some sentence it generates has two or more distinct parse trees • Parse tree for a = b + const:
<program>
  <stmts>
    <stmt>
      <var>
        a
      =
      <expr>
        <term>
          <var>
            b
        +
        <term>
          const
Exercise (I) Given the following grammar • Draw the parse tree for each of these statements • A = B + C * A • B = C * (A * C + B) • Show the leftmost derivation and the corresponding parse tree
Exercise (II) Given the following grammar • Draw the parse tree for each of these statements • A = B + C * A • B = C * (A * C + B) • Show the rightmost derivation and the corresponding parse tree • Is this parse tree identical to the one in Exercise (I)?
Exercise (III) Given the following grammar • Draw the parse tree for the statement • A = B + C * A • Show the leftmost derivation and the corresponding parse tree • Is this parse tree identical to the ones in Exercises (I) and (II)?
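An Ambiguous Grammar
<expr> → <expr> <op> <expr> | integer
<op> → / | -
Sentence: 3 - 4 / 5 (that is, integer - integer / integer)
Parse tree 1: the root <expr> expands to <expr> <op> <expr> with <op> = -; the left <expr> is integer (3), and the right <expr> expands to <expr> <op> <expr> for integer (4) / integer (5).
Parse tree 2: the root <expr> expands to <expr> <op> <expr> with <op> = /; the left <expr> expands to <expr> <op> <expr> for integer (3) - integer (4), and the right <expr> is integer (5).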
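The two parse trees can be enumerated by brute force, which also shows why the ambiguity matters: the trees evaluate to different values. This is a sketch under the assumption that the input is a well-formed list alternating operands and operators; the names parse_trees and evaluate are illustrative.

```python
from fractions import Fraction


def parse_trees(tokens):
    """All parse trees under <expr> -> <expr> <op> <expr> | integer.

    A tree is either an integer leaf or a tuple (left, op, right).
    """
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    for i in range(1, len(tokens), 2):  # any operator may be the root <op>
        for left in parse_trees(tokens[:i]):
            for right in parse_trees(tokens[i + 1:]):
                trees.append((left, tokens[i], right))
    return trees


def evaluate(tree):
    """Evaluate a tree exactly, using fractions to avoid rounding."""
    if not isinstance(tree, tuple):
        return Fraction(tree)
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a - b if op == "-" else a / b


for tree in parse_trees([3, "-", 4, "/", 5]):
    print(tree, "=", evaluate(tree))
```

One sentence, two trees, two meanings (11/5 and -1/5): a compiler that derives meaning from the parse tree cannot accept such a grammar without an extra disambiguating rule.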
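Exercise (I) Prove the following grammar is ambiguous
<S> → <A>
<A> → <A> + <A> | <id>
<id> → a | b | c
Hint: two parse-tree skeletons are shown. In both, <S> expands to <A> and <A> expands to <A> + <A>; they differ in which of the two <A> nodes is expanded again to <A> + <A>.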
Ambiguity Example: "Dangling-else" • Consider the grammar
<stmt> → … | <if_stmt> | …
<if_stmt> → if <logic_expr> then <stmt>
          | if <logic_expr> then <stmt> else <stmt>
• How to derive the following statement? if <logic_expr> then if <logic_expr> then <stmt> else <stmt> • Draw the parse tree for the above statement. Ambiguous or not? • The else can attach to either if:
if <logic_expr> then ( if <logic_expr> then <stmt> else <stmt> )
if <logic_expr> then ( if <logic_expr> then <stmt> ) else <stmt>
Conquering the "Dangling-else" Problem • if <logic_expr> then if <logic_expr> then <stmt> else <stmt> • By default, most languages match each else with the nearest preceding else-less if • This is an ad hoc rule outside the grammar that the compiler uses to resolve the ambiguity • A programmer cannot know how this is resolved simply by studying the grammar • The ambiguity can be eliminated in the grammar by distinguishing else-less ifs from ifs with else clauses • Ada uses a different syntax for if-then-else to eliminate the ambiguity • It requires end if to mark the end of the if
Ambiguity Problems • Syntactic ambiguity may cause problems • A compiler translates a statement by generating its parse tree • If the parse tree is not unique, the meaning (semantics) of the statement cannot be determined uniquely • Often, an ambiguous grammar can be rewritten into an unambiguous grammar
Two Important Issues of BNF • The BNF grammars shown so far do not address two important issues, both of which are common sources of ambiguity in a grammar • Operator precedence • Associativity
Operator Precedence and Associativity • Operator precedence • Determines the order in which operators are evaluated; operators with higher precedence are evaluated first • 3 + 4 * 5 // returns 23, rather than 35 • The multiplication operator (*) has higher precedence than the addition operator (+) • Associativity • Determines the order in which operators of the same precedence are processed • a OP b OP c is grouped as (a OP b) OP c if OP is left-associative, and as a OP (b OP c) if OP is right-associative
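Precedence and associativity can both be encoded in the shape of a parser. The sketch below is a small recursive-descent evaluator with one function per precedence level; the grammar layering, the choice of operators (+, -, *, and a right-associative **), and all function names are assumptions for illustration rather than the grammar used in the lecture.

```python
import re


def evaluate(text):
    """Evaluate integer expressions with +, -, *, ** and parentheses."""
    tokens = re.findall(r"\d+|\*\*|[-+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expr():  # lowest precedence; the loop makes + and - left-associative
        value = term()
        while peek() in ("+", "-"):
            op = advance()
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():  # * binds tighter than + because expr() calls term()
        value = power()
        while peek() == "*":
            advance()
            value *= power()
        return value

    def power():  # recursion on the right operand makes ** right-associative
        base = primary()
        if peek() == "**":
            advance()
            return base ** power()
        return base

    def primary():
        tok = advance()
        if tok == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        return int(tok)

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result


print(evaluate("3 + 4 * 5"))    # precedence
print(evaluate("10 - 4 - 3"))   # left associativity
print(evaluate("2 ** 3 ** 2"))  # right associativity
```

Operators handled deeper in the call chain bind tighter, a loop yields left associativity, and right recursion yields right associativity; an unambiguous grammar expresses the same decisions with one nonterminal per precedence level.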
Operator Precedence • Given the following grammar • Does this grammar indicate any precedence for operators? • Why? • Hint: what are the parse trees of the leftmost and rightmost derivations of A = B + C * A?
The Improved Version • To parse A = B + C * A • Figure: the rightmost derivation and the leftmost derivation, side by side