Two issues in lexical analysis Specifying tokens (regular expression) – last lecture

Two issues in lexical analysis • Specifying tokens (regular expression) – last lecture • Identifying tokens specified by regular expression – today’s topic

How to recognize tokens specified by regular expressions? • A recognizer for a language is a program that takes a string x as input and answers “yes” if x is a sentence of the language and “no” otherwise. • In the context of lexical analysis, given a string and a regular expression, a recognizer of the language specified by the regular expression answer “yes” if the string is in the language. • How to recognize regular exp int? What about int | for? 2 3 1 0 2 3 1 i n t 0 i n t All other characters error

How to recognize tokens specified by regular expressions? • How to recognize regular expression int | for? • A regular expression can be compiled into a recognizer (automatically) by constructing a finite automata which can be deterministic or non-deterministic. 2 3 1 0 i n t f 5 6 4 o r

Non-deterministic finite automata (NFA) • A non-deterministic finite automata (NFA) is a mathematical model that consists of: (a 5-tuple • a set of states Q • a set of input symbols • a transition function that maps state-symbol pairs to sets of states. • A state q0 that is distinguished as the start (initial) state • A set of states F distinguished as accepting (final) states. • An NFA accepts an input string x if and only if there is some path in the transition graph from the start state to some accepting state (after consuming x).

Finite State Machines = Regular Expression Recognizers relop < |<= |<> |> |>= |= start < = 0 1 2 return(relop, LE) > 3 return(relop, NE) other * 4 return(relop, LT) = return(relop, EQ) 5 > = 6 7 return(relop, GE) other * 8 return(relop, GT) id letter ( letter | digit )* letter or digit start letter other * 9 10 11 return(gettoken(),install_id())

An NFA is non-deterministic in that (1) same character can label two or more transitions out of one state (2) empty string can label transitions. • For example, here is an NFA that recognizes the language ? • An NFA can easily implemented using a transition table, which can be used in the language recognizer. State a b 0 {0, 1} {0} 1 - {2} 2 - {3} a 2 3 1 0 a b b b

From a Regular Expression to an NFA  start  i f start a a i f   N(r1) start r1r2 i f   N(r2) start r1r2 i N(r1) N(r2) f    start r* i N(r) f 

Examples: • a • a | b • ab • a*b • (a|b)*abb. • NFA can be converted into deterministic finite state automata (DFA) to improve efficiency.

Two issues in lexical analysis Specifying tokens (regular expression) – last lecture