1 / 12

Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt)

Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) Given tokens specified as regular expressions, Lex automatically generates a routine (yylex) that recognizes the tokens (and performs corresponding actions). Lex source program {definition} %% {rules} %% {user subroutines}

jensen
Download Presentation

Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) • Given tokens specified as regular expressions, Lex automatically generates a routine (yylex) that recognizes the tokens (and performs corresponding actions).

  2. Lex source program {definition} %% {rules} %% {user subroutines} Rules: <regular expression> <action> Each regular expression specifies a token. Default action for anything that is not matched: copy to the output Action: C/C++ code fragment specifying what to do when a token is recognized.

  3. lex program examples: ex1.l and ex2.l • ‘lex ex1.l’ produces the lex.yy.c file that contains a routine yylex(). • The int yylex() routine is the scanner that finds all the regular expressions specified. • yylex() returns a non-zero value (usually token id) normally. • yylex() returns 0 when end of file is reached. • Need a drive to test the routine. Main.cpp is an example. • You need to have a yywrap() function in the lex file (return 1), see the function in ex1.l. • Something to do with compiling multiple files.

  4. Lex regular expression: contains text characters and operators. • Letters of alphabet and digits are always text characters. • Regular expression integer matches the string “integer” • Operators: “\[]^-?.*+|()$/{}%<> • When these characters happen in a regular expression, they have special meanings • Lex regular expressions cannot have space in them!!!

  5. operators (characters that have special meanings): “\[]^-?.*+|()$/{}%<> • ‘*’, ‘+’, ‘|’, ‘(‘,’)’ -- used in regular expression • ‘ “ ‘ -- any character in between quote is a text character. • E.g.: “xyz++” == xyz”++” • ‘\’ -- escape character, • To get the operators back: “xyz++” == ?? • To specify special characters: \40 == “ “ • ‘[‘ and ‘]’ -- used to specify a set of characters • e.g: [a-z], [a-zA-Z], • Every character in it except ^, - and \ is a text character • [-+0-9], [\40-\176] • ‘^’ -- not, used as the first character after the left bracket • E.g [^abc] – any character except ‘a’, ‘b’ or ‘c’. • [^a-zA-Z] -- ??

  6. operators (characters that have special meanings): “\[]^-?.*+|()$/{}%<> • ‘.’ -- every character • ‘?’ -- optional ab?c matches ‘ac’ or ‘abc’ • ‘/’ -- used in character lookahead: • e.g. ab/cd -- matches ab only if it is followed by cd • ‘{‘’}’ -- enclose a regular definition • ‘%’ -- has special meaning in lex • ‘$’ -- match the end of a line, ‘^’ -- match the beginning of a line • ab$ == ab/\n • ‘<‘ ‘>’: start condition (more context sensitivity support, see the paper for details).

  7. Order of pattern matching: • Always matches the longest pattern. • When multiple patterns matches, use the first pattern. • To override, add “REJECT” in the action. ... %% Ab {printf(“rule 1\n”);} Abc {printf(“rule 2\n”);} {letter}{letter|digit}* {printf(“rule 3\n”);} %% Input: Abc What happened when at ‘.*’ as a pattern? Should regular expressions for reserved words happen before or after the regular expression for identifier?

  8. Manipulate the lexeme and/or the input stream: • yytext -- a char pointer pointing to the matched C string • yyleng -- the length of the matched string • I/O routines to manipulate the input stream: • yyinput() -- get a character from the input character, return <=0 when reaching the end of the input stream, the character otherwise • yyinput() for c++, input() for c. • unput( c ) -- put c back onto the input stream • Deal with comments: (/* ….. */ • Why is pattern “/*”.*”*/” a problem? %% … “/*” {char c1; c2 = yyinput(); if (c2 <=0) {lex_error(“unfinished comment” …} else { c1 = c2; c2 = yyinput(); while (((c1!=‘*’) || (c2 != ‘/’)) && (c2 > 0)) {c1 = c2; c2 = yyinput();} if (c2 <= 0) {lex_error( ….) }

  9. Reporting errors: • What kind of errors? Not too many. • Characters that cannot lead to a token • unended comments (can we do it in later phases?) • unended string constants.

  10. Reporting errors: • How to keep track of current position (which line, which column)? • Use to global variable for this: yyline, yycolumn %{ int yyline = 1, yycolumn = 1; %} ... %% “\n” {yyline++; yycolumn = 0;} [ \t]+ {/* do nothing*/ yycolumn += yyleng;} If {yycolumn += yyleng; return (IFNumber);} “+” {yycolumn += 1; return (PLUSNumber);} {letter}{letter|digit}* {yylval = idtable_insert(yytext); yycolumn += yyleng; return(IDNumber);} ... %%

  11. Dealing with identifiers, string constants. • Data structures: • Put the lexeme in a string table: e.g. vector of C strings. • See token.l • Recognizing constant strings with special characters • Assuming string cannot pass line boundary. • Use yymore() “[^”\n]* {char c; c = yyinput(); if (c != ‘”’) error else if (yytext[yyleng-1] == ‘\\’) { unput( c ); yymore(); } else {/* find the whole string, normal process*/}

  12. Put it all together • Checkout token.l and main.cpp in lex2.tar

More Related