
Lexical Analysis - Scanner

Lexical Analysis - Scanner. Lexical Analyzer reads source text and produces tokens, which are the basic lexical units of the language. Example: System.out.println(“Hello Class”);

zuri


Presentation Transcript


  1. Lexical Analysis - Scanner • The lexical analyzer reads source text and produces tokens, which are the basic lexical units of the language. • Example: System.out.println("Hello Class"); has the tokens System, dot, out, dot, println, left parenthesis, string literal "Hello Class", right parenthesis, and semicolon.
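The tokenization of the example statement can be sketched with a small regex-driven scanner. This is only an illustration: the token names (IDENT, DOT, and so on) and the `tokenize` function are assumptions made for this sketch, not part of the slides.

```python
import re

# Hypothetical token names; the slide only lists the tokens informally.
TOKEN_SPEC = [
    ("STRING", r'"[^"\n]*"'),
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("DOT",    r"\."),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMI",   r";"),
    ("WS",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (kind, lexeme) pairs, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = tokenize('System.out.println("Hello Class");')
for kind, lexeme in tokens:
    print(kind, lexeme)
```

The nine tokens produced correspond one-to-one to the list on the slide.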

  2. Role of Lexical Analyzer/Scanner (1) • The lexical analyzer also keeps track of the source coordinates of each token: file name, line number, and position. This is useful for debugging purposes. • The lexical analyzer is the only part of a compiler that looks at each character of the source text.

  3. Role of Lexical Analyzer/Scanner (2) • The scanner also removes white space. • White space: characters that are ignored between tokens • Ex: spaces, tabs, newlines • Definitions of tokens and white space vary among languages. • The lexical analyzer also filters out comments.
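A minimal sketch of the two roles described on slides 2 and 3: skipping white space and comments while recording the source coordinates of each lexeme. The `//` comment syntax, the 1-based line and column convention, and the function name `scan_words` are assumptions chosen for the illustration; lexemes here are simply runs of non-blank characters.

```python
def scan_words(text):
    """Return (lexeme, line, column) triples, skipping blanks and '//' comments."""
    result = []
    line, col, i = 1, 1, 0
    while i < len(text):
        ch = text[i]
        if ch == "\n":                      # newline: next line, column restarts
            line, col, i = line + 1, 1, i + 1
        elif ch in " \t":                   # white space between tokens is ignored
            col, i = col + 1, i + 1
        elif text.startswith("//", i):      # comment: skip to end of line
            while i < len(text) and text[i] != "\n":
                i += 1
        else:                               # a lexeme: run of non-blank characters
            start = i
            while (i < len(text) and text[i] not in " \t\n"
                   and not text.startswith("//", i)):
                i += 1
            result.append((text[start:i], line, col))
            col += i - start
    return result

words = scan_words("x = 1 // set x\n  y = 2\n")
print(words)
```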

  4. Why is the scanner a separate phase? • Simpler design, based on: • Token description using regular grammars or regular definitions • Elimination of white space and comments. What if they were handled by the parser instead? • Distinguishing between keywords and identifiers • Faster: lexical analysis is time-consuming in many compilers (40-60% ?), so by restricting the job of the lexical analyzer, a faster implementation is usually feasible using buffering techniques • Portability is enhanced: the representation of nonstandard symbols can be isolated in this phase

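  5. Difficulties of lexical analysis • Blanks inside statements, for example in FORTRAN, where blanks are not significant: • DO 5 I = 1.25 Is it a DO loop or an assignment? With a period it assigns 1.25 to the variable DO5I; with a comma (DO 5 I = 1,25) it is a DO loop, so the scanner cannot decide until it reaches the period or comma. • Fixed-format source lines, as in COBOL: almost all programming languages are now free-format. • Using keywords as identifiers, for example in PL/I: if then then then = else; else else = then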

  6. Lexical Errors • Lexical errors are violations of token definitions, for example using a$ as an identifier in languages other than BASIC. • Not all errors are detected by the scanner: • In fi (x), the scanner cannot tell whether fi is a misspelled if or a function name. • There are different strategies for handling lexical errors: • Deleting characters until a token is recognized • Inserting a missing character • Replacing an incorrect character with a correct one
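The first recovery strategy, deleting characters until a token is recognized, can be sketched as follows. The token pattern and the function name `scan_with_recovery` are assumptions for this illustration; a real scanner would use the token definitions of its own language.

```python
import re

# Hypothetical token set: identifiers, integers, and a few punctuation marks.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|[=+();]")

def scan_with_recovery(text):
    """Return (tokens, errors); each error is (offset, offending character)."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Recovery: report the character, delete it, and keep scanning.
            errors.append((pos, text[pos]))
            pos += 1
    return tokens, errors

tokens, errors = scan_with_recovery("a$ = fi(x);")
print(tokens, errors)
```

The stray `$` is reported and skipped, while `fi` passes through as an ordinary identifier, which matches the slide's point that the scanner cannot detect every error.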

  7. Recognizing Tokens • How are tokens defined and recognized? By using regular expressions, which define each token as a regular language. • What are regular expressions and how are they defined? There are simple and structured regular expressions. Each regular expression E denotes a language L(E). What is a language? • Let us look at some basic definitions.

  8. Formal Language concepts • Alphabet - a finite set of symbols; ASCII is a computer alphabet. • String - a finite sequence of symbols from the alphabet. • Empty string (ε) - the special string of length 0. • Language - a set of strings over a given alphabet (e.g., the set of all programs).

  9. Language operations • Let L1 and L2 be two languages over some alphabet Σ. The following are the basic language operations: • L1|L2 (union) = {s such that s ∈ L1 or s ∈ L2} • L1L2 (concatenation) = {st such that s ∈ L1 and t ∈ L2}. We use the notation L^2 to denote LL, and L^0 to denote the language {ε} consisting of just the empty string. • L* (Kleene closure) = L^0 ∪ L ∪ L^2 ∪ … ∪ L^i ∪ … • L+ (positive closure) = L ∪ L^2 ∪ … ∪ L^i ∪ …
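For finite languages these operations can be tried out directly with Python sets of strings. This is a sketch under the assumption that the empty string `""` stands for ε; since L* is infinite whenever L contains a non-empty string, the helper `star_up_to` (a name invented here) only computes the union of the first n powers.

```python
def concat(l1, l2):
    """L1L2 = {st | s in L1 and t in L2}."""
    return {s + t for s in l1 for t in l2}

def power(l, n):
    """L^n, with L^0 = {""} (the language containing only the empty string)."""
    result = {""}
    for _ in range(n):
        result = concat(result, l)
    return result

def star_up_to(l, n):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(l, i)
    return out

L1 = {"a", "ab"}
L2 = {"b", ""}
print(L1 | L2)                  # union
print(concat(L1, L2))           # concatenation
print(power({"a", "b"}, 2))     # L^2
print(star_up_to({"a"}, 3))     # first terms of L*
```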

  10. Regular Expressions • 1. Simple regular expressions: • An alphabet symbol a is a regular expression denoting {a}. • The empty string ε is also a regular expression, denoting {ε}. • 2. Structured regular expressions: if E1 and E2 are regular expressions denoting languages L(E1) and L(E2), then • E1 | E2 is a regular expression denoting L(E1) ∪ L(E2). • E1 E2 is a regular expression denoting L(E1)L(E2), the strings of L(E1) followed by strings of L(E2). • E* (E star) is a regular expression denoting (L(E))*, the Kleene closure of L(E).

  11. Names of regular expressions • A regular expression can be given a name, for example: • If the alphabet Σ = {0, 1, 2, …, 9}, then each digit is a RE. • To specify any digit we use 0|1|…|9, or we can give it a name such as digit → 0|1|…|9, so digit can be used as a RE. • Also two_digits → digit digit, and so on.

  12. Examples • Specify the set of unsigned numbers as a regular expression. Examples: 1997, 19.97 • Solution: note the use of regular definitions as intermediate names for regular subexpressions. • digit → 0 | 1 | 2 | 3 | … | 9 • digits → digit digit* (often written as digit+). This is the positive closure: one or more digits.

  13. Example Contd • optional_fraction → . digits | ε • Num → digits optional_fraction • Note that we have used all the constructs in the definition of regular expressions. • One can define similar regular expressions for identifiers, comments, strings, operators, and delimiters. • Question: how do we write a regular expression for identifiers? (An identifier is a letter followed by zero or more letters or digits.)

  14. Identifiers contd • letter → a|A|b|B| … |z|Z • digit → 0|1|2| … |9 • letter_or_digit → letter | digit • identifier → letter | letter letter_or_digit* (equivalently, letter letter_or_digit*)
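The regular definitions for unsigned numbers and identifiers can be transcribed almost literally into Python's `re` syntax by building each named pattern from the ones before it. The helper names `is_num` and `is_identifier` are assumptions for this sketch, and the letter class is restricted to ASCII as on the slide.

```python
import re

# Regular definitions, written as strings so later ones can reuse earlier ones.
DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"
DIGITS = f"{DIGIT}{DIGIT}*"                    # digit digit*  (same as digit+)
OPTIONAL_FRACTION = rf"(?:\.{DIGITS})?"        # . digits | epsilon
LETTER_OR_DIGIT = f"(?:{LETTER}|{DIGIT})"

NUM = re.compile(f"{DIGITS}{OPTIONAL_FRACTION}")
IDENTIFIER = re.compile(f"{LETTER}{LETTER_OR_DIGIT}*")

def is_num(s):
    """True if the whole string is an unsigned number."""
    return NUM.fullmatch(s) is not None

def is_identifier(s):
    """True if the whole string is an identifier."""
    return IDENTIFIER.fullmatch(s) is not None

for text in ["1997", "19.97", "19.", "x1", "1x"]:
    print(text, is_num(text), is_identifier(text))
```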

  15. Building a recognizer: A General Approach • Build a Nondeterministic Finite Automaton (NFA) from the regular expression E. • Simulate execution of the NFA to determine whether an input string belongs to L(E). • The simulation is much simpler if the NFA is first converted to a Deterministic Finite Automaton (DFA).

  16. NFA Definition • An NFA is a system consisting of five components: • An input alphabet Σ. • A set of states S. • A designated start state i in S. • A subset F of S containing the final states. • A transition function δ: (Σ ∪ {ε}) × S → 2^S that maps a symbol (or ε) and a state to a subset of S.

  17. NFA • A transition graph represents an NFA. • Nodes represent states. There is a distinguished start state and one or more final states. • Edges represent state transitions. • An edge can be labeled by an alphabet symbol or by ε.

  18. NFA cont. • From a state (node), there may be more than one edge labeled with the same alphabet symbol, and there may be no edge labeled with a given input symbol. • An NFA accepts an input string iff (if and only if) there is a path in the transition graph from the start state to some final state such that the labels along the edges spell out the input string.

  19. Nondeterministic automata [Figure: NFA with states s0, s1, s2, s3 and edges labeled 1, 0,1, 0,1, and 0,1] • Allows more than a single transition from a state with the same label. • There does not have to be a transition from every state with every label. • Allows multiple initial states. • Allows ε transitions.

  20. Nondeterministic runs [Figure: the same NFA with states s0, s1, s2, s3] • Input: 0100 • Run 1: s0 0 s0 1 s0 0 s0 0 s0. Does not accept. • Run 2: s0 0 s0 1 s1 0 s2 0 s3. Accepts. • The automaton accepts when there exists an accepting run.
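Rather than trying runs one at a time, a simulator can track the set of all states reachable after each input symbol; the word is accepted when that set contains a final state. The transition table below is an assumption inferred from the two runs and the edge labels on the slide (s0 loops on 0 and 1, s0 goes to s1 on 1, and s1 to s2 to s3 on either symbol); the figure itself is not available in the transcript.

```python
# Transition relation inferred from the slide's runs (an assumption).
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}
START, FINAL = "s0", {"s3"}

def accepts(word):
    """True if some run of the NFA over `word` ends in a final state."""
    current = {START}
    for symbol in word:
        # Follow every possible transition from every current state.
        current = set().union(*(NFA.get((s, symbol), set()) for s in current))
    return bool(current & FINAL)

print(accepts("0100"))   # an accepting run exists (Run 2 on the slide)
print(accepts("0000"))   # no accepting run
```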

  21. Deterministic Finite Automaton (DFA) A finite automaton is deterministic if • It has no edges/transitions labeled with epsilon. • For each state and for each symbol in the alphabet, there is at most one edge leaving that state & labeled with that symbol. Such a transition graph is called a state graph.

  22. DFA’s Cont. • NFAs are quicker to build but slower to simulate. • DFAs are slower to build but quicker to simulate. • The number of states in a DFA may be exponential in the number of states of the equivalent NFA.

  23. Deterministic Finite Automata • Includes: • States {s1,s2,…,s5}. • Initial states {s1}. • Accepting states {s3,s5}. • Alphabet {a, b, c}. • Transitions: {(s1,a,s2), (s2, a, s3), …}. [Figure: transition graph over states s1 to s5 with edges labeled a, b, c] • Deterministic?

  24. Automaton. What is the language? [Figure: automaton with states s0 and s1 and edges labeled a and b] • An input is a word over the alphabet Σ. • A run over a word is an alternating sequence of states and letters, starting from the initial state. • Accepting run: ends with an accepting state.

  25. Example [Figure: automaton with states s0 and s1 and edges labeled a and b] • Input: aabbb Run: s0 a s0 a s0 b s1 b s1 b s1. Accepts. • Input: aba Run: s0 a s0 b s1 a s0. Does not accept.
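The two runs can be reproduced with a table-driven DFA. The transition table is an assumption reconstructed from the runs shown (s0 stays in s0 on a and moves to s1 on b; s1 stays in s1 on b and returns to s0 on a, with s1 accepting), since the figure is not part of the transcript.

```python
# Transition table reconstructed from the two runs on the slide (an assumption).
DFA = {
    ("s0", "a"): "s0",
    ("s0", "b"): "s1",
    ("s1", "a"): "s0",
    ("s1", "b"): "s1",
}
START, ACCEPTING = "s0", {"s1"}

def run(word):
    """Return (list of visited states, whether the run is accepting)."""
    state = START
    trace = [state]
    for letter in word:
        state = DFA[(state, letter)]   # exactly one successor: deterministic
        trace.append(state)
    return trace, state in ACCEPTING

print(run("aabbb"))
print(run("aba"))
```

Under this table the automaton accepts exactly the words that end in b.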

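  26. Automaton. What is the language? [Figure: automaton with states s0 and s1 and edges labeled a and b]

  27. Automaton. What is the language? [Figure: automaton with states s0 and s1 and edges labeled a and b]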

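  28. Identifying tokens [Figure: automata whose edges spell the keywords I F, T H E N, and E L S E, and an identifier automaton with an edge labeled letter followed by a loop labeled letter|digit]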

  29. Determinizing Automata [Figure: NFA N with states s0, s1, s2, s3] • Each state of D is a set of states of N. • (S,a) → T when T = {t | s ∈ S and (s,a) → t}. • The initial state of D is the set of initial states of N. • The accepting states of D are the sets that include at least one accepting state of N.
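The determinization rules can be sketched as a worklist algorithm over sets of NFA states (the subset construction). The NFA table is again an assumption inferred from the earlier slides, the function name `determinize` is invented for this sketch, and ε transitions are not handled here.

```python
from collections import deque

# NFA transitions inferred from the earlier slides (an assumption).
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}

def determinize(nfa, start, finals, alphabet):
    """Subset construction: returns (DFA states, DFA transitions, accepting states)."""
    start_set = frozenset([start])
    transitions, accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        S = queue.popleft()
        if S & finals:                       # contains an accepting NFA state
            accepting.add(S)
        for a in alphabet:
            # T = {t | s in S and (s, a) -> t}
            T = frozenset(t for s in S for t in nfa.get((s, a), ()))
            transitions[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return seen, transitions, accepting

states, transitions, accepting = determinize(NFA, "s0", {"s3"}, "01")
for S in sorted(states, key=sorted):
    print(sorted(S), "accepting" if S in accepting else "")
```

For this four-state NFA the construction reaches eight DFA states, the same sets that label the determinized automaton on the next slide.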

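  30. Determinization [Figure: the NFA with states s0, s1, s2, s3 and the resulting DFA with states {s0}, {s0,s1}, {s0,s1,s2}, {s0,s1,s2,s3}, {s0,s2}, {s0,s1,s3}, {s0,s2,s3}, {s0,s3}, with edges labeled 0 and 1]

  31. Determinization [Figure: the same DFA with its states relabeled as the 3-bit strings 000, 001, 011, 111, 010, 101, 110, 100, with edges labeled 0 and 1]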

  32. Examples of Constructing an NFA from a R.E. • An NFA for a regular expression can be constructed by the following cases: • 1. For a simple regular expression r, the NFA is constructed by: • Constructing two states, the start state and the final state. • Adding one edge/transition from the start state to the final state, labeled with the alphabet symbol r (this includes the epsilon symbol).

  33. NFA Cont. • 2. For E1E2, with NFA(E1) and NFA(E2) already constructed for each expression: • Construct a new start state and a new final state. • From the new start state, add an edge labeled with epsilon to the start state of NFA(E1). • From the final state of NFA(E1), add an epsilon transition to the start state of NFA(E2). • Add an epsilon transition from the final state of NFA(E2) to the new final state.

  34. NFA Cont. • 3. For E1|E2: • Construct a new start state i and a new final state f. • Add transitions from the start state i to the start states of NFA(E1) and NFA(E2). These transitions are labeled with the epsilon symbol. • Add transitions from the final states of NFA(E1) and NFA(E2) to the final state f. These transitions are labeled with the epsilon symbol.

  35. NFA Cont. • 4. For E*: • Construct a new start state i and a new final state f. • Add an epsilon transition from the start state i to the start state of NFA(E), an epsilon transition from the final state of NFA(E) to f, and an epsilon transition from the final state f back to the start state i. • Also add an epsilon transition from i to f, so that the empty string (zero repetitions of E) is accepted.
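The four construction cases can be sketched as functions that combine NFA fragments, together with a small simulator to check the result. The `Fragment` class, the function names, and the edge representation (a label of `None` meaning epsilon) are assumptions for this sketch. The star case includes an epsilon edge from i to f so that the empty string is accepted.

```python
import itertools

_ids = itertools.count()   # fresh state numbers

class Fragment:
    """An NFA piece with one start state, one final state, and a list of edges."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges

def _new(edges_fn):
    i, f = next(_ids), next(_ids)
    return Fragment(i, f, edges_fn(i, f))

def symbol(a):                       # case 1: single symbol (None would mean epsilon)
    return _new(lambda i, f: [(i, a, f)])

def concat(e1, e2):                  # case 2: E1 E2
    return _new(lambda i, f: e1.edges + e2.edges + [
        (i, None, e1.start), (e1.final, None, e2.start), (e2.final, None, f)])

def union(e1, e2):                   # case 3: E1 | E2
    return _new(lambda i, f: e1.edges + e2.edges + [
        (i, None, e1.start), (i, None, e2.start),
        (e1.final, None, f), (e2.final, None, f)])

def star(e):                         # case 4: E*
    return _new(lambda i, f: e.edges + [
        (i, None, e.start), (e.final, None, f), (f, None, i), (i, None, f)])

def closure(states, edges):
    """All states reachable from `states` using only epsilon edges."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(frag, word):
    current = closure({frag.start}, frag.edges)
    for ch in word:
        moved = {dst for src, label, dst in frag.edges
                 if src in current and label == ch}
        current = closure(moved, frag.edges)
    return frag.final in current

two_letters = concat(union(symbol("a"), symbol("b")), union(symbol("a"), symbol("b")))
a_star = star(symbol("a"))
print(accepts(two_letters, "ab"), accepts(two_letters, "aba"), accepts(a_star, ""))
```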

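  36. Translating regular expressions into automata [Figure: ε-transition constructions combining the automata for L1 and L2 into automata for L1 ∪ L2 and L1.L2, and the automaton for L into one for L*]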

  37. Automatic translation • (a|b).(a|b) = (a∪b)(a∪b) = (a+b).(a+b) [Figure: NFA assembled from the sub-automata for a and b, joined with ε transitions]

  38. Determinization with ε transitions • Add to each set the states reachable using ε transitions. [Figure: NFA with states s0 to s11, edges labeled a, b, and ε, and the resulting DFA with states {s0,s1,s2}, {s3,s5,s6,s7,s8}, {s4,s5,s6,s7,s8}, {s9,s11}, {s10,s11}, with edges labeled a and b]

  39. Example • Let Σ be the ASCII character set. Construct NFAs that recognize each of the following: • Identifiers • Integers • Floating constants.

  40. Simulation of NFA • The epsilon closure of a state x is the set of states that can be reached from x (including x itself) using only transitions labeled with epsilon. • We want to get the next token from the input stream. Properties: • 1. The token is the longest sequence of characters starting at the current position that matches a regular expression for a token. • 2. The input buffer is repositioned to the first character following the token. • 3. Nothing gets read after the end-of-file.

  41. Algorithm (page 126 of text, alg. 3.3)
      getNextToken() {
          t.error = true;               // t is the token that will be found
          S = epsilon_closure({start});
          while (true) {
              if (S is empty) break;
              if (S contains a final state) {
                  t.error = false;      // fill in t.line and other attributes
              }
              if (end_of_file) break;
              c = getchar();
              T = move(S, c);
              S = epsilon_closure(T);
          }
          reset_inputbuffer(t.line, t.lastcol + 1);
          return t;
      }
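A runnable sketch of the same loop, using a small hand-built NFA for identifiers and integers. Everything specific here is an assumption for illustration: the state numbers, the use of predicate functions such as `str.isalpha` as edge labels (with `None` for epsilon), and returning the next input position instead of resetting a buffer. It is not the textbook's implementation.

```python
# Edges are (source, test, destination); test is None for an epsilon edge,
# otherwise a predicate applied to the input character. Hypothetical NFA:
# 0 -eps-> 1 -letter-> 2 (loop on letter/digit), 0 -eps-> 3 -digit-> 4 (loop on digit).
EDGES = [
    (0, None, 1), (0, None, 3),
    (1, str.isalpha, 2), (2, str.isalnum, 2),
    (3, str.isdigit, 4), (4, str.isdigit, 4),
]
FINAL = {2: "IDENT", 4: "INT"}
START = 0

def epsilon_closure(states):
    """States reachable from `states` using only epsilon edges."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for src, test, dst in EDGES:
            if src == s and test is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def move(states, c):
    """States reachable from `states` by one edge whose test accepts c."""
    return {dst for src, test, dst in EDGES
            if src in states and test is not None and test(c)}

def get_next_token(text, pos):
    """Longest match starting at pos: (kind, lexeme, next_pos), or None on error."""
    S = epsilon_closure({START})
    best = None
    i = pos
    while True:
        if not S:
            break
        kinds = [FINAL[s] for s in sorted(S) if s in FINAL]
        if kinds:
            best = (kinds[0], text[pos:i], i)   # remember the longest match so far
        if i >= len(text):                      # end of input: read nothing more
            break
        S = epsilon_closure(move(S, text[i]))
        i += 1
    return best   # next_pos plays the role of repositioning the input buffer

print(get_next_token("count42+1", 0))
print(get_next_token("42abc", 0))
print(get_next_token("+1", 0))
```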
