
Lecture 4 Lexical Analysis

Xiaoyin Wang, CS5363 Programming Languages and Compilers.


Presentation Transcript


  1. Lecture 4 Lexical Analysis Xiaoyin Wang CS5363 Programming Languages and Compilers

  2. Last Class • Finite State Machines • From Grammar to FSM • DFA vs. NFA

  3. Today’s Class • Scanner Concepts • Tokens • Basic Strategy • Implementation • Lexical Errors

  4. Scanner Example • Input text • // this statement does very little • if (x >= y) y = 42; • Token stream • IF LPAREN ID(x) GEQ ID(y) RPAREN ID(y) BECOMES INT(42) SCOLON • Note: tokens are atomic items, not character strings
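The scanner example on slide 4 can be reproduced with a short sketch. This is only an illustration, not the course's implementation: the table `TOKEN_SPEC`, the `KEYWORDS` map, and the function `scan` are hypothetical names, and the patterns cover just the characters that appear in the slide's input.

```python
import re

# Hypothetical token table; the names follow the token stream on the slide.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),      # comment runs to the end of the line
    ("WS",      r"\s+"),           # whitespace separates tokens
    ("INT",     r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("GEQ",     r">="),
    ("BECOMES", r"="),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("SCOLON",  r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"if": "IF"}            # assumed reserved word for this example


def scan(text):
    """Return the token stream for text as a list of strings."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("COMMENT", "WS"):
            continue               # the scanner does not return these
        if kind == "ID" and lexeme in KEYWORDS:
            tokens.append(KEYWORDS[lexeme])
        elif kind in ("ID", "INT"):
            tokens.append(f"{kind}({lexeme})")   # token plus its attribute
        else:
            tokens.append(kind)
    return tokens
```

Calling `scan` on the two input lines from the slide yields the ten tokens listed there, with the comment dropped.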

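  5. Lexical Analyzer in Perspective • [Diagram: source program → lexical analyzer → parser; the parser sends “get next token” to the lexical analyzer, which returns a token; both boxes use the symbol table] • Important issue: what are the responsibilities of each box? • Focus on lexical analyzer and parser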

  6. The Input • Read string input • Might be sequence of characters (Unix) • Might be sequence of lines (VMS) • Character set • ASCII • ISO Latin-1 • ISO 10646 (Unicode; originally a 16-bit code, now up to 21 bits) • Others (EBCDIC, JIS, etc)

  7. The Output • A series of tokens • Punctuation ( ) ; , [ ] • Operators + - ** := • Keywords begin end if • Identifiers Square_Root • String literals “hello this is a string” • Character literals ‘x’ • Numeric literals 123 4_5.23e+2 16#ac#

  8. Lexical Analysis: Terminology • token: a name for a set of input strings with related structure • Example: “identifier,” “integer constant” • pattern: a rule describing the set of strings associated with a token • Example: “a letter followed by zero or more letters, digits, or underscores” • lexeme: the actual input string that matches a pattern • Example: count

  9. Examples • Input: count = 123 • Tokens: • identifier: rule “letter followed by …”, lexeme count • assg_op: rule =, lexeme = • integer_const: rule “digit followed by …”, lexeme 123

  10. Attributes for Tokens • If more than one lexeme can match the pattern for a token, the scanner must indicate the actual lexeme that matched • This information is given using an attribute associated with the token • Example: the program statement count = 123 yields the following token-attribute pairs: • (identifier, pointer to the string “count”) • (assg_op, no attribute) • (integer_const, the integer value 123)
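The token-attribute pairs for `count = 123` can be written down directly. The sketch below is one possible representation, not the one used in the lecture: the `Token` class, the `string_table` list, and the `intern` helper are hypothetical, and a list index stands in for the "pointer to the string".

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Token:
    kind: str                                    # token name, e.g. "identifier"
    attribute: Optional[Union[int, str]] = None  # None when one lexeme is possible


string_table = []  # hypothetical storage for identifier spellings


def intern(lexeme):
    """Store lexeme once and return its index, which plays the role of a pointer."""
    if lexeme not in string_table:
        string_table.append(lexeme)
    return string_table.index(lexeme)


# Token-attribute pairs for the statement: count = 123
tokens = [
    Token("identifier", intern("count")),  # attribute: index of "count"
    Token("assg_op"),                      # only one lexeme, so no attribute
    Token("integer_const", 123),           # attribute: the integer value
]
```

The parser only needs `kind` to check syntax; the attribute is carried along for later phases.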

  11. Today’s Class • Scanner Concepts • Tokens • Basic Strategy • Implementation • Lexical Errors

  12. Introducing Basic Terminology • Columns: Token (classifies the pattern) | Sample Lexemes | Informal Description of Pattern • const | const | const • if | if | if • relation | <, <=, =, <>, >, >= | < or <= or = or <> or >= or > • id | pi, count, D2 | letter followed by letters and digits • num | 3.1416, 0, 6.02E23 | any numeric constant • literal | “core dumped” | any characters between “ and ” except “ • Actual values are critical. Info is: 1. stored in symbol table, 2. returned to parser

  13. Punctuation • Typically individual special characters • Such as ‘(’, ‘)’ • Sometimes double characters • E.g. (* treated as a kind of bracket • Returned just as identity of token • And perhaps location: for error messages and debugging purposes

  14. Operators • Like punctuation • No real difference for lexical analyzer • Typically single or double special chars • Operators + - • Operations := • Returned just as identity of token • And perhaps location

  15. Keywords • Reserved identifiers • E.g. BEGIN END in Pascal, if in C • Returned just as token identity • With possible location information • Unreserved keywords (e.g. PL/1) • Handled as identifiers (parser distinguishes)
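One common way to handle reserved words is to scan every word as an identifier and then look it up in a keyword table. The sketch below assumes that approach; `RESERVED` and `classify_word` are hypothetical names, and the table holds only the keywords mentioned on the slide.

```python
# Hypothetical reserved-word table (a language would have many more entries).
RESERVED = {"begin": "BEGIN", "end": "END", "if": "IF"}


def classify_word(lexeme, reserved=RESERVED):
    """Return (token, attribute) for a word the scanner has already delimited."""
    token = reserved.get(lexeme)
    if token is not None:
        return (token, None)   # keyword: the token identity is enough
    return ("ID", lexeme)      # ordinary identifier: keep the spelling
```

For a language with unreserved keywords, passing an empty table (`reserved={}`) makes every word come back as `ID`, leaving the distinction to the parser as the slide describes.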

  16. Identifiers • Rules differ • Length, allowed characters, separators • Need to build table • So that a1 is recognized as a1 • Typical structure: hash table • Lexical analyzer returns token type • And key to table entry • Table entry includes location information
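The identifier table can be sketched with a Python `dict`, which is itself a hash table. The class name `SymbolTable`, its methods, and the shape of each entry are assumptions made for illustration only.

```python
class SymbolTable:
    """Minimal identifier table: a spelling maps to a small integer key."""

    def __init__(self):
        self._index = {}     # spelling -> key (dict is a hash table)
        self._entries = []   # key -> entry record

    def intern(self, name, line, column):
        """Return the key for name, creating an entry on first sight."""
        key = self._index.get(name)
        if key is None:
            key = len(self._entries)
            self._index[name] = key
            # The entry records where the identifier was first seen.
            self._entries.append({"name": name, "first_seen": (line, column)})
        return key

    def entry(self, key):
        return self._entries[key]
```

The scanner would return the token type `ID` together with the key from `intern`, so that a later occurrence of `a1` maps to the same entry as the first one.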

  17. Numeric Literals • Also need a table • Typically record value • E.g. 123 = 0123 = 01_23 (Ada) • But usually do not use the host machine’s int for values • Because host and target may have different characteristics • Float stuff much more complex • Denormal numbers, correct rounding • Very delicate stuff
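Recording the value rather than the spelling means that `123`, `0123`, and `01_23` all map to the same entry. The sketch below assumes Ada-style underscores between digits; `integer_value` and `real_value` are hypothetical helpers. Python integers are arbitrary precision and `Fraction` is exact, so neither depends on the host machine's integer or floating-point format, which is one way to sidestep the concerns on the slide.

```python
import re
from fractions import Fraction

# Digits with optional single underscores between them (assumed Ada-like rule).
_INTEGER = re.compile(r"[0-9]+(?:_[0-9]+)*")


def integer_value(lexeme):
    """Value of a decimal integer literal, ignoring underscores and leading zeros."""
    if not _INTEGER.fullmatch(lexeme):
        raise ValueError(f"malformed integer literal: {lexeme!r}")
    return int(lexeme.replace("_", ""))


def real_value(lexeme):
    """Exact rational value of a decimal literal such as 3.1416 or 4_5.23e+2."""
    return Fraction(lexeme.replace("_", ""))
```

Keeping real literals exact defers rounding until the target format is known.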

  18. String Literals • Text must be stored • Actual characters are important • Not like identifiers • Character set issues • Table needed • Lexical analyzer returns key to table • May or may not be worth hashing

  19. Character Literals • Similar issues to string literals • Lexical Analyzer returns • Token type • Identity of character • Note, cannot assume character set of host machine, may be different

  20. Handling Comments • Comments have no effect on program • Can therefore be eliminated by scanner • But may need to be retrieved by tools • Error detection issues • E.g. unclosed comments • Scanner does not return comments
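Comment removal with detection of unclosed comments might look like the sketch below. It assumes Pascal-style `(* ... *)` comments that do not nest, and it ignores the complication of comment delimiters inside string literals; `LexicalError` and `strip_comments` are hypothetical names.

```python
class LexicalError(Exception):
    """Raised for errors the scanner itself can detect."""


def strip_comments(text):
    """Replace each (* ... *) comment with one space; reject unclosed comments."""
    out = []
    pos = 0
    while pos < len(text):
        start = text.find("(*", pos)
        if start == -1:
            out.append(text[pos:])       # no more comments
            break
        out.append(text[pos:start])
        end = text.find("*)", start + 2)
        if end == -1:
            line = text.count("\n", 0, start) + 1
            raise LexicalError(f"unclosed comment starting on line {line}")
        out.append(" ")                  # keep neighbouring tokens separated
        pos = end + 2
    return "".join(out)
```

A tool that needs the comments (a pretty-printer, for example) would collect the removed spans instead of discarding them.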

  21. Case Sensitiveness • Some languages have case equivalence • Pascal, Ada • Some do not • C, Java • Lexical analyzer ignores case if needed • This_Routine = THIS_RouTine • Error analysis may need exact casing
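For a language with case equivalence, the scanner can look identifiers up under a case-folded key while still remembering the spelling the programmer wrote, which error messages may need. The class `CaseInsensitiveNames` below is a hypothetical illustration of that idea.

```python
class CaseInsensitiveNames:
    """Maps identifiers to case-folded keys and keeps the first spelling seen."""

    def __init__(self):
        self._spelling = {}   # folded key -> spelling as first written

    def intern(self, lexeme):
        key = lexeme.casefold()                 # This_Routine and THIS_RouTine agree
        self._spelling.setdefault(key, lexeme)  # remember exact casing once
        return key

    def spelling(self, key):
        return self._spelling[key]
```

A case-sensitive language such as C or Java would simply use the lexeme itself as the key.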

  22. Today’s Class • Scanner • Tokens • Basic Strategy • Implementation • Lexical Errors

  23. Basic Strategy • Parse input to generate the token stream • Model the whole language as a regular language • Parse the input!

  24. Basic Strategy • Each token can be modeled with a finite state machine • Language = Token*

  25. Example of Token Models <=

  26. Example of Token Models Number literal

  27. Example of Token Models Number literal

  28. Combine All Models Token* ...

  29. Basic Strategy • However, this will not work … • The reason • The grammar is ambiguous • One token can be prefix of the other • ‘abc’ can be • ‘a’ and ‘bc’ | ‘ab’ and ‘c’ | ‘abc’ • ‘<=’ can be • ‘<’ and ‘=’ | ‘<=’
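The ambiguity can be made concrete by enumerating every way to split a string into tokens. The function `segmentations` below is a hypothetical helper, and the token sets passed to it are tiny examples rather than a real language.

```python
def segmentations(text, tokens):
    """Return every way to split text into a sequence of strings from tokens."""
    if not text:
        return [[]]                       # one way to split the empty string
    result = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in tokens:
            for rest in segmentations(text[end:], tokens):
                result.append([head] + rest)
    return result
```

With the tokens `<`, `=`, and `<=`, the input `<=` has two segmentations, so the plain Token* model does not determine a unique token stream.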

  30. Basic Strategy • The rule to remove ambiguity • Option 1: always identify the shortest token • The basic strategy works, but programs are hard to write • Option 2: always identify the longest token • We need to revise the basic strategy • Add a ‘backspace’ operation after finding a token (push the extra character read back onto the input)
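The longest-token rule can be sketched by trying every token pattern at the current position and keeping the longest match. This is a simplification that uses regular expressions instead of an explicit state machine with a backspace step; `PATTERNS`, `longest_match`, and `tokenize` are hypothetical names, and ties are broken in favor of the pattern listed first.

```python
import re

# Hypothetical token patterns; earlier entries win ties (so "if" is IF, not ID).
PATTERNS = [
    ("IF",     re.compile(r"if")),
    ("ID",     re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUM",    re.compile(r"[0-9]+")),
    ("LEQ",    re.compile(r"<=")),
    ("LT",     re.compile(r"<")),
    ("ASSIGN", re.compile(r"=")),
]


def longest_match(text, pos):
    """Return (kind, lexeme, end) for the longest token starting at pos, or None."""
    best = None
    for kind, pattern in PATTERNS:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[2]):
            best = (kind, m.group(), m.end())
    return best


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1                      # whitespace only separates tokens
            continue
        best = longest_match(text, pos)
        if best is None:
            raise ValueError(f"no token starts at offset {pos}")
        kind, lexeme, pos = best
        tokens.append((kind, lexeme))
    return tokens
```

Under this rule `<=` is always one token and `abc` is always one identifier, which removes the ambiguity in the plain Token* model.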

  31. Combine All Models with Backspace Token* ...

  32-48. An Example • [State diagram, repeated on each slide as an animation: states q0 to q7 and qx, edges labeled input/output] • 1-9, then 0-9 repeated, then non-digit/number • 0, then non-digit/0 • >, then =, then any/>= • >, then other/> • A-Z, then A-Z0-9 repeated, then other/id • (, then any/( • ), then any/) • After each output: to q0 • Tokens emitted as the animation advances: IF, (, id: A2, >=, Num: 35, )
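The machine in this example can be approximated by a hand-written loop with an explicit backspace. The sketch below rests on several assumptions: the state names are descriptive stand-ins because the transcript does not say which of q1 to q7 is which, the input is taken to be `IF (A2 >= 35)` judging from the emitted tokens, whitespace is skipped in the start state, and `IF` is recognized by a keyword lookup.

```python
DIGITS = "0123456789"
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
KEYWORDS = {"IF"}   # assumption: keywords are found by lookup after scanning a word


def make_token(state, lexeme):
    if state in ("number", "zero"):
        return f"Num: {lexeme}"
    if state == "word":
        return lexeme if lexeme in KEYWORDS else f"id: {lexeme}"
    return lexeme    # ">", ">=", "(" and ")" stand for themselves


def scan(text):
    tokens = []
    chars = text + "\n"   # trailing lookahead so the last token gets emitted
    pos, state, lexeme = 0, "q0", ""
    while pos < len(chars):
        ch = chars[pos]
        pos += 1                          # consume one character
        if state == "q0":
            if ch.isspace():
                continue
            lexeme = ch
            if ch in "123456789":
                state = "number"
            elif ch == "0":
                state = "zero"
            elif ch == ">":
                state = "gt"
            elif ch in UPPER:
                state = "word"
            elif ch in "()":
                state = "paren"
            else:
                raise ValueError(f"unexpected character {ch!r}")
            continue
        if state == "number" and ch in DIGITS:
            lexeme += ch
            continue
        if state == "word" and (ch in UPPER or ch in DIGITS):
            lexeme += ch
            continue
        if state == "gt" and ch == "=":
            lexeme += ch
            state = "geq"
            continue
        if state == "zero" and ch in DIGITS:
            raise ValueError("the diagram has no digit transition after 0")
        # Any other character ends the token: emit, then back up one character.
        tokens.append(make_token(state, lexeme))
        pos -= 1                          # the 'backspace' operation
        state = "q0"
    return tokens
```

Each emit-and-back-up step corresponds to an edge of the form other/token leading back to q0 in the diagram.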

  49. Transition diagrams Transition diagram for relop

  50. Transition diagrams (cont.) Transition diagram for reserved words and identifiers
