
The technology to automatically generate a parser from such a syntax specification has existed for around 20 years and is now mature enough for use in a production setting. A parser generator is a software program that accepts a syntax specification as input and generates a parser for that syntax as output.

JavaCC (Java Compiler Compiler) is a parser generator that embodies the state of the art in parser generation technology and generates parsers in Java. Sreeni Viswanadha and I jointly developed JavaCC while we were at Sun Microsystems. Robert Duncan also made some important contributions. Although JavaCC is owned by Sun Microsystems and Sreeni and I are currently at WebGain, we continue to maintain JavaCC. JavaCC is available for a free download from WebGain’s Web site, as well as from Sun Microsystems.

The use of parsers and parser generators is ubiquitous. There’s always a need for parsers in any large software system. Products such as WebLogic, SQL/JDBC, JPython, Cloudscape, Velocity, BeanShell, JFactor, and Dynamo all use parsers generated by JavaCC or its main competitor – ANTLR.

Especially now, with so many text processing applications being built, it’s quite valuable to possess the skills to use parser generators – the time saved by using a parser generator can be immense. As an extreme example, I recently had to write an application that extracted all the type names from a Java program. What took a couple of hours to do with JavaCC would’ve taken a month to do from scratch.

This article provides a background on parser generation technology, a summary of how JavaCC evolved, and a tutorial on its use.

Parsing, Semanticization, et al.
The processing of files typically requires parsing them and then performing some additional work. The term parsing is used differently by different groups, so I’d like to precisely define this term through examples using a Java compiler.

Consider the following Java class:

class Interval_1 {
int left, right;
}

This is a correct Java class, which is accepted by the Java compiler, and bytecode will be produced. Let’s modify this class as follows:

class Interval_2 {
nosuchtype left, right;
}

This is no longer correct because a type name is used (nosuchtype) that doesn’t correspond to any declared type. However, this class is syntactically correct (i.e., it conforms to the syntax of the Java language). Let’s make another modification:

class Interval_3 {
int left; right;
}

This isn’t a correct Java class either because it uses a semicolon where it should have used a comma. This class is wrong in a manner different from that of Interval_2 in that it isn’t syntactically correct.

Parsing catches errors such as the one in Interval_3; it won't catch the error in Interval_2. Catching that error requires processing beyond parsing – an analysis of the data structures produced by the parsing process. The term semanticization describes this post-parsing analysis performed on programs to catch errors such as the one in Interval_2.

Typically, parsing can detect errors that can be described by a grammar for the language. Depending on the sophistication of the parser and the complexity of the grammar, the delineation of responsibility between parsing and semanticization can vary.
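To make the division of labor concrete, a semanticization pass can be sketched as a walk over the data produced by parsing. The sketch below (class and method names are hypothetical, not part of JavaCC) flags uses of type names that were never declared – exactly the kind of error in Interval_2 that a parser alone cannot catch.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SemanticCheck {
    // Given the set of declared type names and the list of type names actually
    // used (both collected during parsing), return the undeclared ones.
    public static List<String> undeclaredTypes(Set<String> declared, List<String> used) {
        return used.stream()
                   .filter(t -> !declared.contains(t))
                   .collect(Collectors.toList());
    }
}
```

Running this on Interval_2's data – declared types containing "int" but not "nosuchtype" – reports "nosuchtype" as the semantic error.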

The Evolution of Parsing Technologies
In the early days of language processing, parsers were written by hand. Typically, the syntactic (parser) checks and the semantic checks were all mixed together in one action, leading to highly complex code that was very difficult to maintain. In the ’60s and early ’70s research in computer language theory led to various important categorizations and algorithms that have since permitted the automation of building parsers. Three of the important language categories are described below, from most complex to simplest.

• Turing languages/Turing machines: The Turing machine is an abstraction of a computer program. Every computer program has a corresponding Turing machine. Turing languages are those that can be recognized by Turing machines. For the purposes of this article, this definition is mainly theoretical; it simply provides an upper limit to the complexity of computer languages. All programming languages fit under the umbrella of Turing languages.

• Context-free languages: Those languages that can be described by grammars, which consist of a set of productions, each describing a particular aspect of the language. You've probably seen such grammars in your Java texts. An example of a production is:

methodcall ::= name "(" parameterlist ")"

This says that a method call starts with a name followed by an open parenthesis, then a list of parameters, and then the closing parenthesis. To complete this definition, other productions must define name and parameterlist. Note: Java is not in itself context-free; it’s just that a large portion of the language can be described using a context-free grammar. Java language rules such as “A variable must be declared before it can be used” aren’t context-free (and can’t be described by a grammar). Such rules are processed in the semanticization phase.

• Regular expressions: The simplest of the three language forms described here, regular expressions can be recognized by finite-state machines. A finite-state machine is a graph where the nodes are the different states that the machine may be in, and the directed edges between the states are the characters allowed as the next character in the input to transition from one state to the other. Regular expressions (and finite-state automata) can be described by a simple grammar (which is less expressive than a context-free grammar). The Java identifier is an example of a regular expression that can be described by the following grammar:

identifier ::= firstcharacter othercharacter*

This says that an identifier is composed of a first character followed by 0 or more (this is what the asterisk indicates) other characters. To complete this definition, firstcharacter and othercharacter need to be defined.

A main simplification in a regular grammar as compared to a context-free grammar is that the regular grammar can’t be recursive (i.e., a production can’t use itself directly or indirectly in its description).
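The identifier grammar above can be implemented directly as a two-state finite-state machine. The sketch below uses the JDK's own character classifications to stand in for firstcharacter and othercharacter (the class name is made up for illustration):

```java
public class IdentifierFSM {
    // identifier ::= firstcharacter othercharacter*
    // State 0: expect a firstcharacter; state 1: accept othercharacter*.
    public static boolean isIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false; // state 0 failed: no valid first character
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false; // state 1 failed: invalid subsequent character
            }
        }
        return true; // ended in the accepting state
    }
}
```

Note there's no recursion here – just a loop over states – which is exactly the simplification that separates regular grammars from context-free ones.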

A significant step in the evolution of parsing technologies was the development of algorithms and tools that could convert context-free grammars and regular grammars into programs that could check for the correctness of an input with respect to the grammar.

Since then, most computer languages have been designed so that programs are a sequence of tokens, each of which is a regular expression, and the tokens in turn are structured according to a context-free grammar. Not all of the tokens participate in the context-free structure; some are discarded. In the case of Java, discarded tokens include comments and spaces. The following Java program example illustrates this:

// This is a simple Java class
class Simple {
int dummy;
}

The tokens in this Java program are listed below:

1. Comment: // This is a simple Java class
2. Reserved word: class
3. Single space
4. Identifier: Simple
5. Single space
6. Delimiter: {
7. Newline
8. Double space
9. Reserved word: int
10. Single space
11. Identifier: dummy
12. Delimiter: ;
13. Newline
14. Delimiter: }

Before the context-free grammar is matched, some tokens are discarded. In this example tokens 1, 3, 5, 7, 8, 10, and 13 are discarded (the discarded tokens are referred to as “white space”), leaving behind the tokens “class”, “Simple”, “{”, “int”, “dummy”, “;”, and “}”.

These tokens are then matched with the context-free grammar for Java.
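The tokenize-and-discard step can be sketched as a small hand-written lexer. The version below (a simplification, handling only words, single-character delimiters, // comments, and white space) produces exactly the surviving token list from the example above:

```java
import java.util.ArrayList;
import java.util.List;

public class TinyLexer {
    // Scan the source, discarding white space and // comments, and return
    // only the tokens that participate in context-free parsing.
    public static List<String> tokens(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; } // discard white space
            if (c == '/' && i + 1 < src.length() && src.charAt(i + 1) == '/') {
                while (i < src.length() && src.charAt(i) != '\n') i++; // discard comment
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) { // word token (keyword or identifier)
                int j = i;
                while (j < src.length() && Character.isJavaIdentifierPart(src.charAt(j))) j++;
                out.add(src.substring(i, j));
                i = j;
                continue;
            }
            out.add(String.valueOf(c)); // single-character delimiter
            i++;
        }
        return out;
    }
}
```

Feeding it the Simple class yields the seven tokens "class", "Simple", "{", "int", "dummy", ";", "}".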

Using JavaCC – A Simple Example
Listing 1 is a simple grammar file written in the JavaCC grammar syntax. It specifies that the input file should contain a sequence of left braces followed by another sequence of matching (equal in number) right braces. The file contains three parts:

1. The class specification: The Java class into which JavaCC generates the parser. This class will have the fields and methods described in the class specification. In addition, a method is generated corresponding to each production in the grammar. In this example the class specification states that the class is called Simple and contains a main method that simply creates a parser object (of type Simple) and invokes it on the standard input. The method Input, which is called by the main method, is generated by JavaCC to correspond to one of the productions in the grammar specification.

2. The lexical specification: Describes the tokens present in the input file using regular expressions. In this example the tokens are quite simple and can therefore be expressed as strings (the simplest form of regular expressions). Six tokens are described: the space, tab, newline, carriage return, left brace, and right brace. The first four are specified in a SKIP declaration, which classifies them as white space (tokens to be discarded), while the left and right braces are specified as the tokens that participate in the (context-free) parsing process. Tokens that participate in this process are called terminals. The brace tokens are given the names LBRACE and RBRACE to make it possible to use them in the grammar. An input file matches this lexical specification only if it contains a sequence of tokens from this list of six tokens.

3. The grammar specification: This is the context-free grammar that describes the input. In this example there are two productions – the first defines Input; the second, MatchedBraces. These are called nonterminals. The first production states that Input is a MatchedBraces followed by the end of the file. <EOF> is a special token (and therefore a terminal) that's matched by the end of the file. This states that the input file can contain only MatchedBraces, with nothing following. The second production defines MatchedBraces recursively in terms of itself: it starts with a left brace (<LBRACE>), followed optionally by another MatchedBraces, and ends with a right brace (<RBRACE>). The brackets [...] indicate that their content is optional.

Now consider the following input file:

{
   {
   }
}

This file has the following tokens: <LBRACE>, newline, space, <LBRACE>, newline, space, <RBRACE>, newline, <RBRACE>, <EOF>. The process of matching the lexical specification is successful and the white space is discarded, leaving behind only:

<LBRACE> <LBRACE> <RBRACE> <RBRACE> <EOF>

The parsing process then successfully matches these tokens with the grammar as follows:

         Input
          / \
         /   \
MatchedBraces <EOF>
          / | \
         /  |   \
       /    |     \
     /      |       \
<LBRACE> MatchedBraces <RBRACE>
       / \
      /   \
     /     \
<LBRACE> <RBRACE>

The diagram shown above is called a parse tree.
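Conceptually, the parser JavaCC generates from this grammar is a recursive-descent matcher. The hand-written sketch below (not the actual generated code) is equivalent: white space is skipped up front, and each production becomes a method:

```java
public class MatchedBracesByHand {
    private final String in;
    private int pos;

    public MatchedBracesByHand(String source) {
        this.in = source.replaceAll("\\s", ""); // the SKIP declaration: discard white space
    }

    // Input ::= MatchedBraces() <EOF>
    public boolean input() {
        return matchedBraces() && pos == in.length();
    }

    // MatchedBraces ::= <LBRACE> [ MatchedBraces() ] <RBRACE>
    private boolean matchedBraces() {
        if (pos >= in.length() || in.charAt(pos) != '{') return false; // <LBRACE>
        pos++;
        if (pos < in.length() && in.charAt(pos) == '{') {              // [ MatchedBraces() ]
            if (!matchedBraces()) return false;
        }
        if (pos >= in.length() || in.charAt(pos) != '}') return false; // <RBRACE>
        pos++;
        return true;
    }
}
```

The recursion in matchedBraces mirrors the recursion in the production – each nested call consumes one matching inner pair, building the same parse tree shape as the diagram.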

You can try this example yourself. First create a grammar file with the above contents. Call the file Simple.jj (you can call it anything, but by convention you should give it the jj extension). Download JavaCC from www.webgain.com and install it on your computer. Add the bin directory within this installation into your path. Now you should be able to run “javacc” from your command prompt (DOS window or UNIX xterm). Type:

javacc Simple.jj

It produces a bunch of Java files as output. Compile these Java files by typing:

javac *.java

Now create different kinds of input files – both good and bad inputs – and then parse them using the generated parser by typing:

java Simple < InputFile

At this point you’ve taken the biggest and most important step toward learning JavaCC. The rest of the learning process will happen very quickly.

Actions
The previous example shows the use of a JavaCC parser in determining whether an input matches its grammar. If it matches, the parser quietly returns; if it fails, the parser prints an error message detailing the reason for failure.

It’s also necessary to be able to make the parser do something useful on a successful parse beyond simply declaring success. For this purpose JavaCC provides actions, Java code that’s inserted at an appropriate location in the grammar and executed as the parser matches that portion of the grammar. Actions may be arbitrarily complex and may invoke library methods. There are two general kinds of actions: lexical and parser.

Lexical actions are executed after a successful match of a token. Parser actions are embedded within the grammar and are executed as the parser goes past a point in the parsing process. The grammar file in Listing 2 extends the previous one with both lexical and parser actions. These actions are simply print statements; try running the parser generated from this grammar file with the inputs you created for the previous exercise to see how the actions are executed.

Once you complete this exercise, you should be able to go through the sequence of examples in the JavaCC download and systematically learn more details about JavaCC. Once you’ve gone through the examples, you should be ready to embark on any complex parsing application.

Lookahead and Ambiguity Resolution
Both lexical and grammar specifications can be ambiguous. In such situations standard ambiguity resolution schemes take effect.

Consider the lexical specification for the Java language. It defines all the Java reserved words as well as Java identifiers. What should the lexical analyzer do when it encounters the sequence “interface”? The obvious answer is that it should recognize it as the token corresponding to the reserved word “interface”. However, there are other ways to recognize it, such as the identifier “interface”, or the reserved word “int” followed by the identifier “erface”.

The ambiguity resolution scheme for lexical specifications is that the longest match is preferred (so, in the above example, “interface” is preferred over “int”), and when there are multiple matches of the same length, the match that occurs first in the grammar file is used. In the example above, the reserved word “interface” is selected over the identifier “interface” if the lexical specification for the reserved word occurs earlier than that of identifier.

Ambiguities in grammar specifications are resolved by looking at the next available token and proceeding with the first successful match. This resolution scheme is referred to as using a lookahead of one token. Consider the grammar for demand import declarations in Java.

void DemandImportDeclaration() :
{}
{
"import" Name() "." "*" ";"
}
void Name() :
{}
{
<IDENTIFIER> ( "." <IDENTIFIER> )*
}

On seeing the input “import x.*;”, the parser will match the "." with the one in the production for Name. It will then fail the parse because it expects an identifier, not a "*". This grammar is therefore wrong because the ambiguity is resolved incorrectly. The parse tree describing this parse process is shown in the following diagram.

DemandImportDeclaration
   /      |
   /      |
"import" Name
   / |
  /  |
"x" "."

To correct this problem, we must specify that the parser looks ahead more tokens before it decides how to parse. In this example a lookahead of two tokens will work. This can be specified as a global setting (so that a lookahead of two tokens is used for all choices) or a lookahead of two tokens can be specified for this particular point only. The global setting causes the parser to become rather inefficient – typically, most choices can be made correctly with a lookahead of one – so the latter approach is the correct one. The grammar, modified with the additional lookahead specification, is shown below.

void DemandImportDeclaration() :
{}
{
"import" Name() "." "*" ";"
}

void Name() :
{}
{
<IDENTIFIER> ( LOOKAHEAD(2) "." <IDENTIFIER> )*
}

There are other more complex ways to specify lookaheads, but we won’t go into them here. You can study the JavaCC documentation and examples for more information.
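The effect of the LOOKAHEAD(2) declaration can be sketched by hand. In the loop inside Name, a one-token parser commits on seeing "."; a two-token parser also peeks at the token after the "." and enters the loop only if an identifier follows (the class below is illustrative, not JavaCC's generated code):

```java
import java.util.List;

public class NameParser {
    private final List<String> toks;
    private int pos;

    public NameParser(List<String> toks) { this.toks = toks; }

    private String peek(int k) {
        return pos + k < toks.size() ? toks.get(pos + k) : "";
    }

    // Name ::= <IDENTIFIER> ( LOOKAHEAD(2) "." <IDENTIFIER> )*
    public boolean name() {
        if (!isIdent(peek(0))) return false;
        pos++;
        // The two-token lookahead: take the loop only if "." AND an identifier follow.
        while (peek(0).equals(".") && isIdent(peek(1))) {
            pos += 2;
        }
        return true;
    }

    private static boolean isIdent(String t) {
        return !t.isEmpty() && Character.isJavaIdentifierStart(t.charAt(0));
    }

    public int position() { return pos; }
}
```

On the token stream for “import x.*;”, this version stops Name after “x”, leaving the "." and "*" for DemandImportDeclaration to consume – exactly the fix the LOOKAHEAD(2) specification achieves.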

Advanced Topics
Other aspects of JavaCC that aren't described here but are still very important and of significant value to JavaCC users are listed below:

• Special tokens: Sometimes it’s preferable to process some white space tokens during the parsing process. JavaCC provides a way to specify tokens so they don’t get discarded but still don’t participate in the parsing process.

• Lexical states: This allows you to split your lexical specification into different categories within which matches and ambiguity resolution take place independently of the other states. This allows you to conveniently write lexical specifications for (say) mail files, where the string “From” may be a reserved word in one context and plain text in another.

• Java code productions: This allows you to write actual Java code for a production instead of a grammar. Sometimes it isn’t possible to write a production using the context-free syntax, and Java code productions become invaluable. However, this feature must be used sparingly.

• JJTree: This is a JavaCC preprocessor that inserts parse tree building actions into a grammar based on a parse tree specification. The result of the parsing process is a tree data structure that captures the essence of the parsing process. Additional processing is then performed on this tree (no additional actions are added by hand into the grammar).

Conclusion
Parsers in general, and JavaCC in particular, are powerful tools for enhancing developer productivity. There’s always some need in any large software system for parsers, and the use of parsers and parser generators is ubiquitous.

Like other parser generators, JavaCC reads a grammar specification and converts it into a Java program that recognizes matches to the grammar. In addition to the parser generator itself, JavaCC also provides other standard capabilities related to parser generation, such as tree building (via a tool called JJTree), actions, and debugging.

JavaCC strongly supports the industry trend toward standardizing the syntax of important data formats such as XML. By giving developers a powerful tool for automating the creation of Java code that can recognize a standardized language structure, JavaCC will play a valuable role in the evolution of open systems.

Author Bio
Sriram Sankar is VP of engineering at WebGain, a Java developer products company. He came to WebGain through the acquisition of Metamata, where he was founder and CEO. Sriram previously worked for Sun Microsystems and Stanford University. He holds a PhD in computer science from Stanford University.

[email protected]

Listing 1

// 1. Class specification

PARSER_BEGIN(Simple)

public class Simple {

 public static void main(String args[]) throws ParseException {
  Simple parser = new Simple(System.in);
  parser.Input();
 }

}

PARSER_END(Simple)

// 2. Lexical specification

SKIP :
{
 " "
| "\t"
| "\n"
| "\r"
}

TOKEN :
{
 <LBRACE: "{">
| <RBRACE: "}">
}

// 3. Grammar specification

void Input() :
{}
{
 MatchedBraces() <EOF>
}

void MatchedBraces() :
{}
{
 <LBRACE> [ MatchedBraces() ] <RBRACE>
}


Listing 2

// 1. Class specification

PARSER_BEGIN(Simple)

public class Simple {

 public static void main(String args[]) throws ParseException {
  Simple parser = new Simple(System.in);
  parser.Input();
 }

}

PARSER_END(Simple)

// 2. Lexical specification

SKIP :
{
 " "
| "\t"
| "\n"
| "\r"
}

TOKEN :
{
 <LBRACE: "{"> { System.out.println("Lexical action: Encountered LBRACE"); }
| <RBRACE: "}"> { System.out.println("Lexical action: Encountered RBRACE"); }
}

// 3. Grammar specification

void Input() :
{}
{
 MatchedBraces() <EOF>
}

void MatchedBraces() :
{}
{
 <LBRACE> 
 { System.out.println("Parser action: Successfully parsed LBRACE"); }
 [ MatchedBraces() ]
 <RBRACE>
 { System.out.println("Parser action: Successfully parsed RBRACE"); }
}

  
 

All Rights Reserved
Copyright ©  2004 SYS-CON Media, Inc.
  E-mail: [email protected]

Java and Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. SYS-CON Publications, Inc. is independent of Sun Microsystems, Inc.