Topic #3: Lexical Analysis

EE 456 – Compiling Techniques. Prof. Carl Sable. Fall 2003.

Presentation Transcript


  1. Topic #3: Lexical Analysis EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003

  2. Lexical Analyzer and Parser

  3. Why Separate? • Reasons to separate lexical analysis from parsing: • Simpler design • Improved efficiency • Portability • Tools exist to help implement lexical analyzers and parsers independently

  4. Tokens, Lexemes, and Patterns • Tokens include keywords, operators, identifiers, constants, literal strings, punctuation symbols • A lexeme is a sequence of characters in the source program representing a token • A pattern is a rule describing a set of lexemes that can represent a particular token

  5. Attributes • Attributes provide additional information about tokens • Technically speaking, lexical analyzers usually provide a single attribute per token (might be pointer into symbol table)

  6. Buffer • Most lexical analyzers use a buffer • Often buffers are divided into two N character halves • Two pointers used to indicate start and end of lexeme • If pointer walks past end of either half of buffer, other half of buffer is reloaded • A sentinel character can be used to decrease number of checks necessary
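
The two-half scheme with sentinels might be sketched in C as follows. This is a minimal illustration under stated assumptions, not the course's code: the input comes from an in-memory string rather than a file, '\0' serves as the sentinel (so the source must not contain NUL), the half size N is kept tiny so that reloads actually happen, and the names buf_init, load_half, and next_char are invented for this example. Only the forward pointer is modeled; tracking the start of the lexeme is omitted.

```c
#include <string.h>

#define N 4                      /* tiny half size, so reloads are exercised */
#define SENTINEL '\0'            /* assumption: the source contains no NUL   */

static char buf[2 * N + 2];      /* layout: [half 0][S][half 1][S]            */
static const char *src;          /* unread input (stand-in for a file)        */
static char *forward;            /* scanning pointer                          */

/* Copy up to N characters into one half and terminate it with a sentinel. */
static void load_half(int half)
{
    char *p = buf + half * (N + 1);
    size_t left = strlen(src);
    size_t n = left < (size_t)N ? left : (size_t)N;
    memcpy(p, src, n);
    p[n] = SENTINEL;             /* at the half boundary if n == N, else at end of input */
    src += n;
}

void buf_init(const char *text)
{
    src = text;
    load_half(0);
    forward = buf;
}

/* Return the next character, or SENTINEL once the input is exhausted. */
char next_char(void)
{
    char c = *forward;
    if (c != SENTINEL) {         /* common case: a single test per character */
        forward++;
        return c;
    }
    if (forward == buf + N) {                  /* end of first half  */
        load_half(1);
        forward = buf + N + 1;
    } else if (forward == buf + 2 * N + 1) {   /* end of second half */
        load_half(0);
        forward = buf;
    } else {
        return SENTINEL;         /* sentinel inside a half: real end of input */
    }
    c = *forward;
    if (c != SENTINEL)
        forward++;
    return c;
}
```

In the common case next_char performs one comparison per character; the two boundary checks run only after a sentinel has been seen, which is the saving the slide refers to.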

  7. Strings and Languages • Alphabet – any finite set of symbols (e.g. ASCII, binary alphabet, or a set of tokens) • String – A finite sequence of symbols drawn from an alphabet • Language – A set of strings over a fixed alphabet • Other terms relating to strings: prefix; suffix; substring; proper prefix, suffix, or substring (non-empty, not entire string); subsequence

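  8. Operations on Languages • Union: L ∪ M = {s | s is in L or s is in M} • Concatenation: LM = {st | s is in L and t is in M} • Kleene closure: L* = L⁰ ∪ L¹ ∪ L² ∪ … • Zero or more concatenations of L • Positive closure: L+ = L¹ ∪ L² ∪ … • One or more concatenations of L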

  9. Regular Expressions • Defined over an alphabet Σ • ε represents {ε}, the set containing the empty string • If a is a symbol in Σ, then a is a regular expression denoting {a}, the set containing the string a • If r and s are regular expressions denoting the languages L(r) and L(s), then: • (r)|(s) is a regular expression denoting L(r) ∪ L(s) • (r)(s) is a regular expression denoting L(r)L(s) • (r)* is a regular expression denoting (L(r))* • (r) is a regular expression denoting L(r) • Precedence: * (left associative), then concatenation (left associative), then | (left associative)

  10. Regular Definitions • Can give “names” to regular expressions • Convention: names in boldface (to distinguish them from symbols)
      letter → A|B|…|Z|a|b|…|z
      digit → 0|1|…|9
      id → letter (letter | digit)*

  11. Notational Shorthands • One or more instances: r+ denotes rr* • Zero or one instance: r? denotes r|ε • Character classes: [a-z] denotes a|b|…|z
      digit → [0-9]
      digits → digit+
      optional_fraction → (. digits)?
      optional_exponent → (E(+|-)? digits)?
      num → digits optional_fraction optional_exponent
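
As a quick way to experiment with the num definition, it can be written as a POSIX extended regular expression and checked with the regcomp/regexec interface from <regex.h>. This is only a sketch: it assumes a POSIX system, the function name is_num is invented here, and the anchors ^ and $ are added so that the whole string must match. Note that the dot has to be escaped, since an unescaped dot matches any character in this syntax.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* num -> digit+ (. digit+)? (E(+|-)? digit+)?   written as a POSIX ERE */
bool is_num(const char *lexeme)
{
    regex_t re;
    if (regcomp(&re, "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return false;                 /* pattern failed to compile */
    bool ok = regexec(&re, lexeme, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

Under this pattern "6.02E+23" and "1E5" are accepted, while "1." and ".5" are rejected because both the integer part and any fraction require at least one digit.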

  12. Limitations • Cannot describe balanced or nested constructs • Example: all valid strings of balanced parentheses • This can be done with a CFG • Cannot describe repeated strings • Example: {wcw | w is a string of a’s and b’s} • Cannot be denoted with a CFG either!

  13. Grammar Fragment (Pascal)
      stmt → if expr then stmt | if expr then stmt else stmt | ε
      expr → term relop term | term
      term → id | num

  14. Related Regular Definitions
      if → if
      then → then
      else → else
      relop → < | <= | = | <> | > | >=
      id → letter ( letter | digit )*
      num → digit+ (. digit+ )? (E(+|-)? digit+ )?
      delim → blank | tab | newline
      ws → delim+

  15. Tokens and Attributes

  16. Transition Diagrams • A stylized flowchart • Transition diagrams consist of states connected by edges • Edges leaving a state s are labeled with input characters that may occur after reaching state s • Assumed to be deterministic • There is one start state and at least one accepting (final) state • Some states may have associated actions • At some final states, need to retract a character

  17. Transition Diagram for “relop”
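
The diagram itself did not survive in this transcript, so the following C sketch assumes the usual textbook layout for relop: after < or > the scanner looks one character ahead, and the "other" edges lead to accepting states that retract that character. The enum values and the function name scan_relop are invented for illustration; retraction is modeled by simply not counting the lookahead character in the lexeme length.

```c
typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } relop_kind;

/* Recognize a relop at the start of the NUL-terminated string s.
   On return, *len is the lexeme length (0 if no relop starts here). */
relop_kind scan_relop(const char *s, int *len)
{
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1;                 /* "other" edge: lookahead is retracted */
        return LT;
    case '=':
        *len = 1;
        return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1;                 /* "other" edge: lookahead is retracted */
        return GT;
    default:
        *len = 0;                 /* no edge leaves the start state */
        return NOT_RELOP;
    }
}
```

For the input "<5" this returns LT with a length of 1, leaving the 5 for the next token.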

  18. Identifiers and Keywords • Share a transition diagram • After reaching accepting state, code determines if lexeme is keyword or identifier • Easier than encoding exceptions in diagram • Simple technique is to appropriately initialize symbol table with keywords
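
A minimal C sketch of the shared-diagram idea: scan letter (letter | digit)* first, then decide afterwards whether the lexeme is a keyword. The small keywords array stands in for a symbol table pre-loaded with keywords; the token names and the function scan_word are assumptions made for this example.

```c
#include <ctype.h>
#include <string.h>

typedef enum { TOK_IF, TOK_THEN, TOK_ELSE, TOK_ID, TOK_NONE } token;

/* Stand-in for a symbol table initialized with the keywords. */
static const struct { const char *lexeme; token tok; } keywords[] = {
    { "if", TOK_IF }, { "then", TOK_THEN }, { "else", TOK_ELSE },
};

/* Scan letter (letter | digit)* at the start of s; *len gets the lexeme length. */
token scan_word(const char *s, size_t *len)
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) {
        *len = 0;
        return TOK_NONE;                  /* not an identifier or keyword */
    }
    while (isalnum((unsigned char)s[n]))
        n++;
    *len = n;

    /* Accepting state reached: look the lexeme up among the keywords. */
    for (size_t k = 0; k < sizeof keywords / sizeof keywords[0]; k++)
        if (strlen(keywords[k].lexeme) == n &&
            strncmp(s, keywords[k].lexeme, n) == 0)
            return keywords[k].tok;
    return TOK_ID;
}
```

Because the lookup happens only after the longest lexeme has been scanned, "iffy" comes back as an identifier rather than as the keyword if followed by "fy".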

  19. Numbers

  20. Order of Transition Diagrams • Transition diagrams tested in order • Diagrams with low numbered start states tried before diagrams with high numbered start states • Order influences efficiency of lexical analyzer

  21. Trying Transition Diagrams
      int next_td(void)
      {
          switch (start) {
          case 0:  start = 9;  break;
          case 9:  start = 12; break;
          case 12: start = 20; break;
          case 20: start = 25; break;
          case 25: recover();  break;
          default: error("invalid start state");
          }
          /* Possibly additional actions here */
          return start;
      }

  22. Finding the Next Token
      token nexttoken(void)
      {
          while (1) {
              switch (state) {
              case 0:
                  c = nextchar();
                  if (c == ' ' || c == '\t' || c == '\n') {
                      state = 0;              /* skip whitespace */
                      lexeme_beginning++;
                  }
                  else if (c == '<') state = 1;
                  else if (c == '=') state = 5;
                  else if (c == '>') state = 6;
                  else state = next_td();     /* try the next diagram */
                  break;
              … /* 27 other cases here */

  23. The End of a Token
      token nexttoken(void)
      {
          while (1) {
              switch (state) {
              … /* First 19 cases */
              case 19:
                  retract();
                  install_num();
                  return(NUM);
                  break;
              … /* Final 8 cases */

  24. Finite Automata • Generalized transition diagrams that act as “recognizer” for a language • Can be nondeterministic (NFA) or deterministic (DFA) • NFAs can have ε-transitions, DFAs can not • NFAs can have multiple edges with same symbol leaving a state, DFAs can not • Both can recognize exactly what regular expressions can denote

  25. NFAs • A set of states S • A set of input symbols Σ (input alphabet) • A transition function move that maps state, symbol pairs to a set of states • A single start state s0 • A set of accepting (or final) states F • An NFA accepts a string s if and only if there exists a path from the start state to an accepting state such that the edge labels spell out s

  26. Transition Tables

  27. DFAs • No state has an ε-transition • For each state s and input symbol a, there is at most one edge labeled a leaving s

  28. Thompson’s Construction • Method of converting a regular expression into an NFA • Start with two simple rules • For ε, construct NFA: • For each a in Σ, construct NFA: • Next, inductively apply the more complex rules until we obtain an NFA for the entire expression

  29. Complex Rule, Part 1 • For the regular expression s|t, such that N(s) and N(t) are NFAs for s and t, construct the following NFA N(s|t):

  30. Complex Rule, Part 2 • For the regular expression st, construct the composite NFA N(st):

  31. Complex Rule, Part 3 • For the regular expression s*, construct the composite NFA N(s*):

  32. Complex Rule, Part 4 • For the parenthesized regular expression (s), use N(s) itself as the NFA

  33. Example: r = (a|b)*abb

  34. Functions ε-closure and move • ε-closure(s) is the set of NFA states reachable from NFA state s on ε-transitions alone • ε-closure(T) is the set of NFA states reachable from any NFA state s in T on ε-transitions alone • move(T,a) is the set of NFA states to which there is a transition on input a from any NFA state s in T

  35. Computing ε-closure
      push all states in T onto stack
      initialize ε-closure(T) to T
      while stack is not empty
          pop t from top of stack
          for each state u with an ε-transition from t
              if u is not in ε-closure(T) then
                  add u to ε-closure(T)
                  push u onto stack
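
The stack-based procedure translates to C fairly directly if sets of states are stored as bitmasks. The sketch below hard-codes the ε-edges of a Thompson-style NFA for (a|b)*abb with states 0 to 10; that numbering is an assumption that follows the common textbook figure, and the names eps and eps_closure are invented for this example.

```c
#define NSTATES 11                 /* NFA states 0..10 */
#define BIT(s)  (1u << (s))

/* eps[t]: bitmask of states reachable from t by a single ε-edge
   (assumed numbering of the NFA for (a|b)*abb). */
static const unsigned eps[NSTATES] = {
    [0] = BIT(1) | BIT(7),
    [1] = BIT(2) | BIT(4),
    [3] = BIT(6),
    [5] = BIT(6),
    [6] = BIT(1) | BIT(7),
};

/* Return ε-closure(T); T and the result are bitmasks of NFA states. */
unsigned eps_closure(unsigned T)
{
    int stack[NSTATES];            /* each state is pushed at most once */
    int top = 0;
    unsigned closure = T;          /* initialize ε-closure(T) to T */

    for (int s = 0; s < NSTATES; s++)      /* push all states in T */
        if (T & BIT(s))
            stack[top++] = s;

    while (top > 0) {
        int t = stack[--top];              /* pop t */
        for (int u = 0; u < NSTATES; u++) {
            if ((eps[t] & BIT(u)) && !(closure & BIT(u))) {
                closure |= BIT(u);         /* add u to the closure */
                stack[top++] = u;          /* explore from u later  */
            }
        }
    }
    return closure;
}
```

With this table, eps_closure(BIT(0)) yields the set {0, 1, 2, 4, 7}.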

  36. Subset Construction (NFA to DFA)
      initialize Dstates to contain only ε-closure(s0), unmarked
      while there is an unmarked state T in Dstates
          mark T
          for each input symbol a
              U := ε-closure(move(T,a))
              if U is not in Dstates
                  add U as unmarked state to Dstates
              Dtran[T,a] := U
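
A compact C sketch of the subset construction, again using bitmasks for sets of NFA states. It assumes the same Thompson-style NFA for (a|b)*abb (states 0 to 10, with 10 accepting), computes ε-closure by simple fixed-point iteration instead of an explicit stack, and treats "marked" as "index already processed". All identifiers and the capacity MAXD are choices made for this example.

```c
#define NSTATES 11      /* NFA states 0..10 */
#define NSYMS   2       /* input symbols: 0 = 'a', 1 = 'b' */
#define MAXD    32      /* capacity for DFA states (assumed ample here) */
#define BIT(s)  (1u << (s))

/* Assumed NFA for (a|b)*abb: ε-edges and labeled edges as bitmasks. */
static const unsigned eps[NSTATES] = {
    [0] = BIT(1) | BIT(7), [1] = BIT(2) | BIT(4),
    [3] = BIT(6), [5] = BIT(6), [6] = BIT(1) | BIT(7),
};
static const unsigned delta[NSTATES][NSYMS] = {
    [2] = { BIT(3), 0 }, [4] = { 0, BIT(5) },
    [7] = { BIT(8), 0 }, [8] = { 0, BIT(9) }, [9] = { 0, BIT(10) },
};

static unsigned eps_closure(unsigned T)
{
    unsigned closure = T, prev;
    do {                                   /* repeat until nothing new is added */
        prev = closure;
        for (int s = 0; s < NSTATES; s++)
            if (closure & BIT(s))
                closure |= eps[s];
    } while (closure != prev);
    return closure;
}

static unsigned move(unsigned T, int a)
{
    unsigned out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & BIT(s))
            out |= delta[s][a];
    return out;
}

unsigned Dstates[MAXD];        /* each DFA state is a set of NFA states */
int Dtran[MAXD][NSYMS];        /* DFA transition table */

/* Build Dstates and Dtran; return the number of DFA states, or -1 if MAXD is exceeded. */
int subset_construction(void)
{
    int n = 0;
    Dstates[n++] = eps_closure(BIT(0));        /* start state, unmarked */
    for (int t = 0; t < n; t++) {              /* indices below t are marked */
        for (int a = 0; a < NSYMS; a++) {
            unsigned U = eps_closure(move(Dstates[t], a));
            int j = 0;
            while (j < n && Dstates[j] != U)   /* is U already in Dstates? */
                j++;
            if (j == n) {
                if (n == MAXD)
                    return -1;
                Dstates[n++] = U;              /* add U as an unmarked state */
            }
            Dtran[t][a] = j;
        }
    }
    return n;
}
```

For this NFA the construction produces five DFA states, and the only one containing NFA state 10 is the accepting one.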

  37. Constructed DFA

  38. Simulating a DFA
      s := s0
      c := nextchar
      while c != eof do
          s := move(s, c)
          c := nextchar
      end
      if s is in F then return “yes” else return “no”
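
The loop above becomes a table lookup per character in C. The sketch below assumes a five-state DFA for (a|b)*abb, with states numbered 0 to 4 and state 4 accepting; the table and the name dfa_accepts are illustrative, and characters outside {a, b} are simply rejected.

```c
#include <stdbool.h>

/* Assumed DFA for (a|b)*abb: dtran[state][symbol], symbol 0 = 'a', 1 = 'b'. */
static const int dtran[5][2] = {
    {1, 2},   /* state 0 (start) */
    {1, 3},   /* state 1 */
    {1, 2},   /* state 2 */
    {1, 4},   /* state 3 */
    {1, 2},   /* state 4 (accepting) */
};

/* Return true if the DFA accepts the whole NUL-terminated input. */
bool dfa_accepts(const char *input)
{
    int s = 0;                                  /* s := s0 */
    for (const char *c = input; *c != '\0'; c++) {
        if (*c != 'a' && *c != 'b')
            return false;                       /* symbol outside the alphabet */
        s = dtran[s][*c - 'a'];                 /* s := move(s, c) */
    }
    return s == 4;                              /* is s in F? */
}
```

Each input character costs one array access, which is why DFA simulation runs in time proportional to the input length.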

  39. Simulating an NFA
      S := ε-closure({s0})
      a := nextchar
      while a != eof do
          S := ε-closure(move(S,a))
          a := nextchar
      end
      if S ∩ F != Ø then return “yes” else return “no”
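
Direct NFA simulation keeps a set of current states instead of a single state. In the C sketch below the set is a bitmask, the NFA is an assumed Thompson-style machine for (a|b)*abb (states 0 to 10, with 10 accepting), and ε-closure is computed by fixed-point iteration; the helper names are invented for this example.

```c
#include <stdbool.h>

#define NSTATES 11
#define BIT(s)  (1u << (s))

/* Assumed NFA for (a|b)*abb: ε-edges and labeled edges as bitmasks. */
static const unsigned eps[NSTATES] = {
    [0] = BIT(1) | BIT(7), [1] = BIT(2) | BIT(4),
    [3] = BIT(6), [5] = BIT(6), [6] = BIT(1) | BIT(7),
};
static const unsigned delta[NSTATES][2] = {     /* column 0 = 'a', 1 = 'b' */
    [2] = { BIT(3), 0 }, [4] = { 0, BIT(5) },
    [7] = { BIT(8), 0 }, [8] = { 0, BIT(9) }, [9] = { 0, BIT(10) },
};

static unsigned eps_closure(unsigned T)
{
    unsigned closure = T, prev;
    do {                                   /* repeat until nothing new is added */
        prev = closure;
        for (int s = 0; s < NSTATES; s++)
            if (closure & BIT(s))
                closure |= eps[s];
    } while (closure != prev);
    return closure;
}

static unsigned move(unsigned T, char c)
{
    unsigned out = 0;
    if (c != 'a' && c != 'b')
        return 0;                          /* no edges on other symbols */
    for (int s = 0; s < NSTATES; s++)
        if (T & BIT(s))
            out |= delta[s][c - 'a'];
    return out;
}

/* Return true if some path spelling out the input ends in the accepting state. */
bool nfa_accepts(const char *input)
{
    unsigned S = eps_closure(BIT(0));          /* S := ε-closure({s0}) */
    for (; *input != '\0'; input++)
        S = eps_closure(move(S, *input));      /* S := ε-closure(move(S,a)) */
    return (S & BIT(10)) != 0;                 /* S ∩ F != Ø ? */
}
```

Compared with the DFA version, no table has to be built in advance, but every input character triggers a pass over the current state set.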

  40. Space/Time Tradeoff (Worst Case)

  41. Simulating a Regular Expression • First use Thompson’s Construction to convert RE to NFA • Then there are two choices: • Use subset construction to convert NFA to DFA, then simulate the DFA • Simulate the NFA directly • You won’t have to worry about any of this while programming; Lex will take care of it!
