Using JavaCC Professor Yihjia Tsai Tamkang University
Automating Lexical Analysis: Overall picture
[Figure: a scanner generator takes regular expressions through the steps RE, NFA, DFA, minimize DFA, simulate DFA and produces a Java scanner program, which turns a string stream into tokens.]
Building Faster Scanners from the DFA
Table-driven recognizers waste a lot of effort on every character:
• Read (& classify) the next character
• Find the next state
• Assign to the state variable
• Branch back to the top
We can do better:
• Encode state & actions in the code
• Do transition tests locally
• Generate ugly, spaghetti-like code (this is OK, because the code is generated automatically)
• Takes (many) fewer operations per input character
Table-driven skeleton:
    state = s0;
    string = ε;
    char = get_next_char();
    while (char != eof) {
        state = δ(state, char);
        string = string + char;
        char = get_next_char();
    }
    if (state in Final) then report acceptance;
    else report failure;
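To make the contrast concrete, here is a minimal sketch of both styles for a single token, one or more digits. The class name NumeralScanners, the three-state table, and the two-way character classification are assumptions made for illustration; a real generator would emit something larger and uglier.

```java
final class NumeralScanners {
    // States of a tiny DFA for (["0"-"9"])+ : S0 = start, S1 = accepting, ERR = dead state.
    private static final int S0 = 0, S1 = 1, ERR = 2;

    // Transition table indexed by [state][character class]; class 0 = digit, class 1 = other.
    private static final int[][] DELTA = {
        {S1, ERR},  // from S0
        {S1, ERR},  // from S1
        {ERR, ERR}  // from ERR
    };

    private static int classify(char c) {
        return (c >= '0' && c <= '9') ? 0 : 1;
    }

    // Table-driven: classify, look up the next state, assign, loop.
    static boolean tableDriven(String input) {
        int state = S0;
        for (int i = 0; i < input.length(); i++) {
            state = DELTA[state][classify(input.charAt(i))];
        }
        return state == S1;
    }

    // Direct-coded: each state becomes a piece of code, and the transition
    // tests are done locally, so there is no table lookup and no state variable.
    static boolean directCoded(String input) {
        int i = 0;
        int n = input.length();
        // Code for state S0: at least one digit is required.
        if (i == n || input.charAt(i) < '0' || input.charAt(i) > '9') {
            return false;
        }
        i++;
        // Code for state S1: stay here while digits keep coming.
        while (i < n) {
            char c = input.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
            i++;
        }
        return true;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"2024", "", "12a"}) {
            System.out.println("\"" + s + "\": table=" + tableDriven(s) + " direct=" + directCoded(s));
        }
    }
}
```

Both methods accept the same strings; the direct-coded version simply does less work per character, which is the point of the slide.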
Inside a lexical analyzer generator
• How does a lexical analyzer generator work?
    • Get input from the user, who defines tokens in a form equivalent to a regular grammar
    • Turn the regular grammar into an NFA
    • Convert the NFA into a DFA
    • Generate the code that simulates the DFA
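The NFA-to-DFA step can be sketched with the subset construction. The example below is an illustration only: the NFA is hard-coded for (a|b)*abb, it has no epsilon moves (so no epsilon-closure is needed), and the names SubsetConstruction, nfaMove, buildDfa, and accepts are all made up for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstruction {
    static final char[] ALPHABET = {'a', 'b'};
    static final int NFA_START = 0;
    static final int NFA_ACCEPT = 3;

    // Hard-coded NFA for (a|b)*abb with states 0..3 and no epsilon moves.
    static Collection<Integer> nfaMove(int state, char c) {
        switch (state) {
            case 0:
                if (c == 'a') return Arrays.asList(0, 1);
                if (c == 'b') return Arrays.asList(0);
                return Collections.emptyList();
            case 1:
                return c == 'b' ? Arrays.asList(2) : Collections.<Integer>emptyList();
            case 2:
                return c == 'b' ? Arrays.asList(3) : Collections.<Integer>emptyList();
            default:
                return Collections.emptyList();
        }
    }

    // Each DFA state is a set of NFA states; build them with a worklist.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> buildDfa() {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = new TreeSet<>(Arrays.asList(NFA_START));
        dfa.put(start, new HashMap<>());
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            for (char c : ALPHABET) {
                // Union of the NFA moves of every member of the current subset.
                Set<Integer> target = new TreeSet<>();
                for (int s : current) {
                    target.addAll(nfaMove(s, c));
                }
                if (target.isEmpty()) {
                    continue;
                }
                dfa.get(current).put(c, target);
                if (!dfa.containsKey(target)) {
                    dfa.put(target, new HashMap<>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    // Simulate the DFA; a subset is accepting if it contains the NFA's accepting state.
    static boolean accepts(Map<Set<Integer>, Map<Character, Set<Integer>>> dfa, String input) {
        Set<Integer> state = new TreeSet<>(Arrays.asList(NFA_START));
        for (char c : input.toCharArray()) {
            state = dfa.get(state).get(c);
            if (state == null) {
                return false; // no transition on this character
            }
        }
        return state.contains(NFA_ACCEPT);
    }

    public static void main(String[] args) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = buildDfa();
        System.out.println("DFA states: " + dfa.keySet());
        System.out.println("abb accepted: " + accepts(dfa, "abb"));
    }
}
```

A generator would go one step further and emit code or tables from the resulting DFA instead of interpreting the map at run time.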
Flow for Using JavaCC
[Figure extracted from http://www.cs.unb.ca/profs/nickerson/courses/cs4905/Labs/L1_2006.pdf]
Structure of a JavaCC File
• A JavaCC file is composed of 3 portions:
    • Options
    • Class declaration
    • Specification for lexical analysis (tokens), and specification for syntax analysis
• For the very first example of JavaCC, let's recognize two tokens: "+" and numerals.
• Use an editor to create the file and save it as numeral.jj
Using JavaCC for lexical analysis
• JavaCC is a "top-down" parser generator.
• Some parser generators (such as yacc, bison, and JavaCUP) need a separate lexical-analyzer generator.
• With JavaCC, you can specify the tokens within the parser generator.
Example File

    /* main class definition */
    PARSER_BEGIN(Numeral)
    public class Numeral {
        public static void main(String[] args) throws ParseException, TokenMgrError {
            Numeral numeral = new Numeral(System.in);
            // Pull tokens until end of input; the empty loop body is intentional.
            while (numeral.getNextToken().kind != EOF);
        }
    }
    PARSER_END(Numeral)

    /* token definitions */
    TOKEN:
    {
        <ADD: "+">
      | <NUMERAL: (["0"-"9"])+>
    }
Options
• The options portion is optional and is omitted in the previous example.
• STATIC is a boolean option whose default value is true. If true, all methods and class variables are specified as static in the generated parser and token manager.
• This allows only one parser object to be present, but it improves the performance of the parser.
• To perform multiple parses during one run of your Java program, you will have to call the ReInit() method to reinitialize your parser if it is static.
• If the parser is non-static, you may use the "new" operator to construct as many parsers as you wish. These can all be used simultaneously from different threads.
Simple Loop Getting Tokens
• The main method of the example file is a simple loop: it constructs the parser on System.in and calls getNextToken() until the token kind is EOF.

    Numeral numeral = new Numeral(System.in);
    while (numeral.getNextToken().kind != EOF);

• Each call returns the next token (ADD or NUMERAL); the tokens themselves are discarded.
After calling javacc to compile numeral.jj, seven files are generated if no error messages occur. They are Numeral.java, NumeralConstants.java, NumeralTokenManager.java, ParseException.java, SimpleCharStream.java, Token.java, and TokenMgrError.java.

    bash-2.05$ javacc numeral.jj
    Java Compiler Compiler Version 3.2 (Parser Generator)
    (type "javacc" with no arguments for help)
    Reading from file numeral.jj . . .
    File "TokenMgrError.java" does not exist. Will create one.
    File "ParseException.java" does not exist. Will create one.
    File "Token.java" does not exist. Will create one.
    File "SimpleCharStream.java" does not exist. Will create one.
    Parser generated successfully
JavaCC specification of a lexer
• Note the need for ( )!
Defining Whitespace
A Full Example See the sample file
Dealing with errors
• Error reporting: 123e+q
• Could consider it an invalid token (lexical error), or
• return a sequence of valid tokens
    • 123, e, +, q
• and let the parser deal with the error.
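The second option can be sketched as a small hand-written scanner. This is an illustration under assumptions, not JavaCC output: the class name LenientScanner is made up, and it assumes that runs of letters form an identifier-like token in addition to the numeral and "+" tokens of the earlier example.

```java
import java.util.ArrayList;
import java.util.List;

final class LenientScanner {
    // Splits the input into maximal runs of digits, maximal runs of letters,
    // and single "+" characters. Anything else is reported as a lexical error.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int start = i;
            if (Character.isDigit(c)) {
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
            } else if (Character.isLetter(c)) {
                // Assumption: letters form an identifier-like token.
                while (i < input.length() && Character.isLetter(input.charAt(i))) {
                    i++;
                }
            } else if (c == '+') {
                i++;
            } else {
                throw new IllegalArgumentException(
                    "Lexical error at position " + i + ": '" + c + "'");
            }
            tokens.add(input.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The scanner does not complain about 123e+q; the parser would have to.
        System.out.println(tokenize("123e+q"));
    }
}
```

With this design the scanner stays simple and the malformed number surfaces later as a syntax error, which may give a less precise message than rejecting 123e+q as one invalid token.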
Lexical error correction?
• Sometimes interaction between the scanner and the parser can help
    • especially in a top-down (predictive) parse
• The parser, when it calls the scanner, can pass as an argument the set of allowable tokens.
• Suppose the scanner sees "calss" in a context where only a top-level definition is allowed.
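One way this interaction might look is sketched below. Everything here is hypothetical: the class ContextAwareScanner, the method repair, and the rule "try swapping two adjacent characters" are assumptions chosen to fit the calss example, not a technique the slides prescribe.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Optional;

final class ContextAwareScanner {
    // Given a lexeme and the keywords the parser would currently accept,
    // return the lexeme itself if it is allowed, or an allowed keyword that
    // differs from it by one swap of adjacent characters, if there is one.
    static Optional<String> repair(String lexeme, Collection<String> allowed) {
        if (allowed.contains(lexeme)) {
            return Optional.of(lexeme);
        }
        char[] chars = lexeme.toCharArray();
        for (int i = 0; i + 1 < chars.length; i++) {
            swap(chars, i, i + 1);
            String candidate = new String(chars);
            swap(chars, i, i + 1); // restore the original order
            if (allowed.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    private static void swap(char[] chars, int i, int j) {
        char tmp = chars[i];
        chars[i] = chars[j];
        chars[j] = tmp;
    }

    public static void main(String[] args) {
        // Hypothetical set of keywords allowed at the top level.
        Collection<String> topLevel = Arrays.asList("class", "interface", "import");
        System.out.println(repair("calss", topLevel));
    }
}
```

Because the candidate set comes from the parser, the scanner only proposes repairs that are syntactically possible at that point; without that context, calss would simply look like an ordinary identifier.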
Same symbol, different meaning
• How can the scanner distinguish between binary minus and unary minus?
• x = -a; vs. x = 3 - a;
Scanner "troublemakers"
• Unclosed strings
• Unclosed comments
JavaCC Overview
• Generates a top-down parser.
    • Could be used to generate a parser for Prolog, whose grammar is LL.
• Generates a parser in Java.
    • Hence, to continue our example, it can be integrated with any Java-based Prolog compiler/interpreter.
• Token specification and grammar specification are in the same file => easier to debug.
Types of Productions in JavaCC
There can be four different kinds of productions.
• Javacode
    • For something that is not context free or is difficult to write a grammar for, e.g. recognizing matching braces and error processing.
• Regular Expressions
    • Used to describe the tokens (terminals) of the grammar.
• BNF
    • Standard way of specifying the productions of the grammar.
• Token Manager Declarations
    • The declarations and statements are written into the generated Token Manager (lexer) and are accessible from within lexical actions.
JavaCC Look-ahead mechanism
• Look-ahead explores tokens further ahead in the input stream.
• Backtracking is unacceptable because of the performance hit.
• By default JavaCC uses 1 token of look-ahead; any number can be specified.
• Two types of look-ahead mechanisms
    • Syntactic: a particular token is looked ahead in the input stream.
    • Semantic: any arbitrary Boolean expression can be specified as a look-ahead parameter.
• e.g. A -> aBc and B -> b ( c )?
    • Valid strings: "abc" and "abcc"
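The grammar above shows why one token of look-ahead is not always enough. The sketch below is a hand-written recursive-descent recognizer, not code generated by JavaCC; the class name LookaheadDemo and its lookahead parameter are assumptions for illustration. With one token, the optional c is taken greedily and "abc" is rejected; with two tokens, the choice is made correctly.

```java
final class LookaheadDemo {
    private final String input;
    private final int lookahead; // 1 or 2 tokens; each character is one token here
    private int pos = 0;

    private LookaheadDemo(String input, int lookahead) {
        this.input = input;
        this.lookahead = lookahead;
    }

    // Returns the k-th upcoming token (1-based), or '\0' past the end of input.
    private char peek(int k) {
        int index = pos + k - 1;
        return index < input.length() ? input.charAt(index) : '\0';
    }

    // Consumes the next token if it matches; never backtracks.
    private boolean expect(char c) {
        if (peek(1) == c) {
            pos++;
            return true;
        }
        return false;
    }

    // B -> b ( c )?
    private boolean parseB() {
        if (!expect('b')) {
            return false;
        }
        // Choice point: take the optional c or leave it for A.
        boolean takeOptional = (lookahead == 1)
            ? peek(1) == 'c'                    // greedy: any c is taken
            : peek(1) == 'c' && peek(2) == 'c'; // take it only if another c follows
        if (takeOptional) {
            pos++;
        }
        return true;
    }

    // A -> a B c, followed by end of input
    private boolean parseA() {
        return expect('a') && parseB() && expect('c') && pos == input.length();
    }

    static boolean parse(String input, int lookahead) {
        return new LookaheadDemo(input, lookahead).parseA();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "abcc"}) {
            System.out.println(s + ": k=1 " + parse(s, 1) + ", k=2 " + parse(s, 2));
        }
    }
}
```

Looking one extra token ahead at the single ambiguous choice point avoids both the wrong greedy decision and any backtracking.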