
Lexical Analysis (4.2)

Lexical Analysis (4.2). Programming Languages Hiram College Ellen Walker. Lexical Analysis is Pattern Matching. From a sequence of characters to a sequence of lexemes, e.g. “public static void main(char[] args)” -> <id> <id> <id> <id> <lparen> <id> <lsquare> <rsquare> <id> <rparen>


Presentation Transcript


  1. Lexical Analysis (4.2) Programming Languages Hiram College Ellen Walker

  2. Lexical Analysis is Pattern Matching • From a sequence of characters to a sequence of lexemes, each labeled with its token type, e.g. • “public static void main(char[] args)” -> • <id> <id> <id> <id> <lparen> <id> <lsquare> <rsquare> <id> <rparen> • The patterns are simpler than full syntax and have easy grammars, e.g. <id> -> <letter> <id> | <letter> and <letter> -> a | b | c | … | z

  3. Regular Grammars • Subset of Context-Free Grammars • Every rule contains at most one non-terminal symbol on its right-hand side (or can be rewritten so it does…)

  4. Rewritten Grammar for ID • Original: <id> -> <letter> <id> | <letter> and <letter> -> a | b | c | … | z • Rewrite: <id> -> (a | b | c | … | z) <id> | (a | b | c | … | z) • Fully expanded (52 rules): <id> -> a <id> | b <id> | c <id> | … | z <id> | a | b | c | … | z
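
As an illustration of the rewritten grammar, here is a minimal Java sketch of a recognizer that mirrors the two alternatives of <id> directly. The class and method names (IdGrammar, isLetter, matchesId, isId) are hypothetical and do not come from the slides; it assumes the alphabet is the lowercase letters a to z, as in the grammar above.

```java
// Sketch of a recognizer that follows the rewritten grammar directly:
//   <id> -> (a | ... | z) <id> | (a | ... | z)
// All names here are illustrative assumptions, not part of the slides.
final class IdGrammar {
    // True if ch is one of the terminals a..z.
    static boolean isLetter(char ch) {
        return ch >= 'a' && ch <= 'z';
    }

    // True if the suffix of s starting at pos can be derived from <id>.
    static boolean matchesId(String s, int pos) {
        if (pos >= s.length() || !isLetter(s.charAt(pos))) {
            return false;                // both alternatives begin with a letter
        }
        if (pos == s.length() - 1) {
            return true;                 // <id> -> letter
        }
        return matchesId(s, pos + 1);    // <id> -> letter <id>
    }

    // True if the whole string is an <id>.
    static boolean isId(String s) {
        return matchesId(s, 0);
    }
}
```

Under these assumptions, IdGrammar.isId("main") is true, while the empty string and "a1" are rejected. Each recursive call corresponds to one application of the rule <id> -> letter <id>.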

  5. Parsing using a Regular Grammar • Transform the grammar into a state machine • Implement the state machine in a computer program • By hand • Automatically, using table-lookup • Run this program on input strings

  6. What is a State Machine? • State machine abstraction • At any time, the process is in a “state” • Each time an “event” happens, the process takes an “action” and goes to the next state • We can describe the entire algorithm as a diagram where each state has an arrow for each event/action pair to the next appropriate state

  7. State Machine for a Kitten • [Diagram: states Happy, Sleeping, Hungry; transition labels (event / action): Food available / Eat, Toys available / Play, X hrs passed / Awaken]

  8. State Machine for a Language • Each “event” processes an input symbol • Two important special states • Initial state: state the machine is in before the first symbol • Final state: state the machine is in whenever the sequence of symbols up to now is in the language

  9. Transforming a Regular Grammar to a State Machine • Put the grammar into a form where every rule is either <nonterm1> -> symbol <nonterm2> or <nonterm1> -> symbol • Make a state for each nonterminal • Make a transition (arrow) for each rule of the first form; it goes from <nonterm1> to <nonterm2> on the symbol • The state for the grammar's start symbol is the initial state • Add one final state; every rule with no nonterminal on its right-hand side becomes a transition from <nonterm1> to that final state on the symbol

  10. State Machine Example • <id> -> a <id> | b <id> | a | b • Two states: id (initial) and f (final) • Example: aabba
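
The two-state machine for this example can be simulated by tracking the set of states the machine might be in, since the construction gives state id two possible next states on each symbol (id itself and f). The following Java sketch shows one way to do that; the class IdNfa and its methods are hypothetical names chosen for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch: simulate the machine built from
//   <id> -> a <id> | b <id> | a | b
// by tracking every state it could be in. Names are illustrative assumptions.
final class IdNfa {
    enum State { ID, F }   // ID is initial, F is final

    // Transitions derived from the four rules: from ID on 'a' or 'b' the
    // machine may stay in ID (rules with <id> on the right) or move to F
    // (rules without a nonterminal). F has no outgoing transitions.
    static Set<State> step(State from, char symbol) {
        if (from == State.ID && (symbol == 'a' || symbol == 'b')) {
            return EnumSet.of(State.ID, State.F);
        }
        return EnumSet.noneOf(State.class);
    }

    // True if some path through the machine ends in the final state.
    static boolean accepts(String input) {
        Set<State> current = EnumSet.of(State.ID);
        for (char symbol : input.toCharArray()) {
            Set<State> next = EnumSet.noneOf(State.class);
            for (State s : current) {
                next.addAll(step(s, symbol));
            }
            current = next;
        }
        return current.contains(State.F);
    }
}
```

With this sketch, IdNfa.accepts("aabba") is true, and the empty string is rejected because the machine is still only in state id.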

  11. Simpler State Machine • This is a cleaner version of the previous machine. Each (character, state) combination has only one next state. • It is called a DFA (deterministic finite automaton)
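
The slide's diagram is not reproduced in this transcript, so the following Java sketch is only one plausible deterministic machine for the same language (non-empty strings over a and b). The state names START, IN_ID, and DEAD, and the class IdDfa, are assumptions made for illustration.

```java
// Sketch of a DFA for non-empty strings over {a, b}.
// State names and layout are assumptions; the slide's own diagram is not available.
final class IdDfa {
    static final int START = 0;   // initial state, no symbol read yet
    static final int IN_ID = 1;   // final state, at least one a or b read
    static final int DEAD  = 2;   // trap state for any other symbol

    // Exactly one next state for every (state, symbol) pair.
    static int next(int state, char symbol) {
        boolean inAlphabet = (symbol == 'a' || symbol == 'b');
        if (state == DEAD || !inAlphabet) {
            return DEAD;
        }
        return IN_ID;
    }

    // Run the machine over the whole input and test for the final state.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = next(state, input.charAt(i));
        }
        return state == IN_ID;
    }
}
```

Because every pair has a single successor, the simulation needs only one integer of state rather than a set of states.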

  12. Lexical Analysis for Integer Expressions

  13. From DFA to Program • Method doScan() reads tokens from an input stream (assume System.in for now) and creates a list of them in order. • Method lex(s) scans and returns a single Token from a stream. • A Token consists of a type (e.g. INT) and a string (e.g. “1234”)
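
A Token as described here could be sketched as a small immutable Java class. The accessor getType() appears later in the slides; getLexeme() and toString() are assumptions added for illustration.

```java
// Sketch of the Token described on the slide: a type code plus the matched text.
// getLexeme() and toString() are illustrative additions, not from the slides.
final class Token {
    private final int type;       // e.g. the INT state number
    private final String lexeme;  // e.g. "1234"

    Token(int type, String lexeme) {
        this.type = type;
        this.lexeme = lexeme;
    }

    int getType() {
        return type;
    }

    String getLexeme() {
        return lexeme;
    }

    @Override
    public String toString() {
        return "<" + type + ", \"" + lexeme + "\">";
    }
}
```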

  14. Defining Constants • //Number all the states • public static final int NUMSTATES = 5; //states 0 through 4 • public static final int START = 0; • public static final int INT = 1; • public static final int ID = 2; • public static final int UNK = 3; • public static final int ERR = 4;

  15. Constructing Transition Table (in constructor)
      String chars = "01234abcdef+-()";
      int[][] tt = new int[NUMSTATES][chars.length()]; // indexed [state][character index]
      tt[ID][5] = ID; // 'a'
      tt[ID][6] = ID; // 'b'
      tt[START][5] = ID; // 'a'
      tt[START][1] = INT; // '1'
      // … etc …
      tt[ID][0] = ERR; // '0'
      // … etc …

  16. Recognizing Final States • //For this grammar, all states but ERR are final • //Usually, this method is a bit more complex • //Named isFinal because final is a reserved word in Java • boolean isFinal(int state){ • return (state != ERR); • }

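  17. Lex Method • //Read one token from the input (any Scanner) • //Scanner is assumed to be a class that provides nextChar() and peek(); java.util.Scanner has neither • public static Token lex(Scanner s){ • //initialize variables • StringBuilder lexeme = new StringBuilder(); • int state = START; • int oldstate = START; • char ch = s.nextChar(); • …

  18. Lex Method (cont’d) • //loop through characters, updating state • while (state != ERR){ • oldstate = state; • int col = chars.indexOf(ch); //-1 if ch is not in the alphabet • state = (col < 0) ? ERR : tt[oldstate][col]; • if (state != ERR){ //ch extends the token • lexeme.append(ch); • ch = s.nextChar(); • } • }

  19. Lex Method (cont’d) • //return the token • if (isFinal(oldstate)) //valid token; isFinal is the final-state test (final is reserved in Java) • return new Token(oldstate, lexeme.toString()); • else //not a valid token – return the chars • return new Token(ERR, lexeme.toString()); • } //end of lex()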

  20. From DFA to Program (cont’d)
      public static boolean doScan(){
        Scanner s = new Scanner(System.in);
        while (s.peek()){ //not EOF
          eatWhitespace(s); //removes whitespace
          Token token = lex(s);
          tokens.add(token);
          if (token.getType() == ERR) return false;
        }
        return true;
      }
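
Putting the pieces from slides 14 to 20 together, the following is a self-contained Java sketch of a table-driven scanner. It makes several assumptions that are not in the slides: it scans a String rather than the slides' Scanner (whose API is not shown), it fills the table so that digits lead to INT, letters to ID, and each of + - ( ) to a one-character UNK token, and it reports an ERR token when no character can be consumed from START. The names TableLexer and scan are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of a table-driven lexer in the style of slides 14-20.
// Assumptions: String input, and a guessed transition table (see constructor).
final class TableLexer {
    static final int NUMSTATES = 5;
    static final int START = 0, INT = 1, ID = 2, UNK = 3, ERR = 4;
    static final String CHARS = "01234abcdef+-()";

    static final class Token {
        final int type;
        final String lexeme;
        Token(int type, String lexeme) { this.type = type; this.lexeme = lexeme; }
    }

    // tt[state][index of character in CHARS] = next state
    private final int[][] tt = new int[NUMSTATES][CHARS.length()];

    TableLexer() {
        for (int[] row : tt) {
            Arrays.fill(row, ERR);               // anything not listed below is an error
        }
        for (int col = 0; col < CHARS.length(); col++) {
            char ch = CHARS.charAt(col);
            if (Character.isDigit(ch)) {         // assumed: INT is one or more digits
                tt[START][col] = INT;
                tt[INT][col] = INT;
            } else if (Character.isLetter(ch)) { // assumed: ID is one or more letters
                tt[START][col] = ID;
                tt[ID][col] = ID;
            } else {                             // assumed: operators are one-character UNK tokens
                tt[START][col] = UNK;
            }
        }
    }

    // Scan the longest token starting at pos, without consuming the
    // character that would move the machine to ERR.
    Token lex(String input, int pos) {
        int state = START;
        int i = pos;
        while (i < input.length()) {
            int col = CHARS.indexOf(input.charAt(i));
            int next = (col < 0) ? ERR : tt[state][col];
            if (next == ERR) {
                break;
            }
            state = next;
            i++;
        }
        if (state == START) {                    // no progress: report the offending character
            return new Token(ERR, input.substring(pos, Math.min(pos + 1, input.length())));
        }
        return new Token(state, input.substring(pos, i));
    }

    // Collect tokens in order, skipping whitespace; stops after the first ERR token.
    List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            Token t = lex(input, pos);
            tokens.add(t);
            if (t.type == ERR) {
                break;
            }
            pos += t.lexeme.length();
        }
        return tokens;
    }
}
```

Under these assumptions, new TableLexer().scan("ab+12 (c)") yields six tokens: ID "ab", UNK "+", INT "12", UNK "(", ID "c", UNK ")". Indexing the input by position avoids losing the lookahead character, which a stream-based version has to handle by peeking or pushing the character back.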

  21. Another Program (pp. 176-181) • Programmed in C (no classes) • Global variables instead of class variables (used in many functions, e.g. charClass) • Token (int) and lexeme (string) unconnected • States and transitions are implicit • Lex() is a big case statement • Many special purpose functions, e.g. getChar(), addChar(), lookup() executing portions of DFA
