The technology to automatically generate a parser from this syntax specification
has existed for around 20 years and is now mature enough to use in a product
setting. A parser generator is a software program that accepts a syntax specification
as input and generates a parser for that syntax as output.
JavaCC (Java Compiler Compiler) is a parser generator that embodies
the state of the art in parser generation technology and generates parsers
in Java. Sreeni Viswanadha and I jointly developed JavaCC while we were at
Sun Microsystems. Robert Duncan also made some important contributions. Although
JavaCC is owned by Sun Microsystems and Sreeni and I are currently at WebGain,
we continue to maintain JavaCC. JavaCC is available for a free download from
WebGain’s Web site, as well as from Sun Microsystems.
The use of parsers and parser generators is ubiquitous. There’s always
a need for parsers in any large software system. Products such as WebLogic,
SQL/JDBC, JPython, Cloudscape, Velocity, BeanShell, JFactor, and Dynamo all
use parsers generated by JavaCC or its main competitor – ANTLR.
Especially now, with so many text processing applications being built,
it’s quite valuable to possess the skills to use parser generators – the
time saved by using a parser generator can be immense. As an extreme example,
I recently had to write an application that extracted all the type names
from a Java program. What took a couple of hours to do with JavaCC would’ve
taken a month to do from scratch.
This article provides a background on parser generation technology,
a summary of how JavaCC evolved, and a tutorial on its use.
Parsing, Semanticization, et al.
The processing of files typically requires parsing them and then performing
some additional work. The term parsing is used differently by different
groups, so I’d like to precisely define this term through examples using
a Java compiler.
Consider the following Java class:
class Interval_1 {
int left, right;
}
This is a correct Java class, which is accepted by the Java compiler,
and bytecode will be produced. Let’s modify this class as follows:
class Interval_2 {
nosuchtype left, right;
}
This is no longer correct because a type name is used (nosuchtype) that
doesn’t correspond to any declared type. However, this class is syntactically
correct (i.e., it conforms to the syntax of the Java language). Let’s make
another modification:
class Interval_3 {
int left; right;
}
This isn’t a correct Java class either because it uses a semicolon where
it should have used a comma. This class is wrong in a manner different from
that of Interval_2 in that it isn’t syntactically correct.
Parsing catches errors such as the one in Interval_3. Parsing won’t
catch the error in Interval_2. To catch that error, processing beyond parsing
is required. This requires analyzing the data structures produced by the
parsing process. The term semanticization is used to describe the
analysis that takes place after parsing on programs to catch errors such
as in Interval_2.
Typically, parsing can detect errors that can be described by a grammar
for the language. Depending on the sophistication of the parser and the complexity
of the grammar, the delineation of responsibility between parsing and semanticization
can vary.
The Evolution of Parsing Technologies
In the early days of language processing, parsers were written by hand.
Typically, the syntactic (parser) checks and the semantic checks were all
mixed together in one action, leading to highly complex code that was very
difficult to maintain. In the ’60s and early ’70s research in computer language
theory led to various important categorizations and algorithms that have
since permitted the automation of building parsers. Three of the important
language categories are described below, from most complex to simplest.
• Turing languages/Turing machines: The Turing machine
is an abstraction of a computer program. Every computer program has a corresponding
Turing machine. Turing languages are those that can be recognized by Turing
machines. For the purposes of this article, this definition is mainly theoretical;
it simply provides an upper limit to the complexity of computer languages.
All programming languages fit under the umbrella of Turing languages.
• Context-free languages: Those languages that can be described
by grammars, which consist of a set of productions, each describing a particular
aspect of the language. You must have seen such grammars in your Java texts.
An example of a production is:
methodcall ::= name "(" parameterlist ")"
This says that a method call starts with a name followed by an open
parenthesis, then a list of parameters, and then the closing parenthesis.
To complete this definition, other productions must define name and parameterlist.
Note: Java is not in itself context-free; it’s just that a large portion
of the language can be described using a context-free grammar. Java language
rules such as “A variable must be declared before it can be used” aren’t
context-free (and can’t be described by a grammar). Such rules are processed
in the semanticization phase.
• Regular expressions: The simplest of the three language
forms described here, regular expressions can be recognized by finite-state
machines. A finite-state machine is a graph where the nodes are the different
states that the machine may be in, and the directed edges between the states
are the characters allowed as the next character in the input to transition
from one state to the other. Regular expressions (and finite-state automata)
can be described by a simple grammar (which is less expressive than a context-free
grammar). The Java identifier is an example of a regular expression that
can be described by the following grammar:
identifier ::= firstcharacter othercharacter*
This says that an identifier is composed of a first character followed
by 0 or more (this is what the asterisk indicates) other characters. To complete
this definition, firstcharacter and othercharacter need to be defined.
A main simplification in a regular grammar as compared to a context-free
grammar is that the regular grammar can’t be recursive (i.e., a production
can’t use itself directly or indirectly in its description).
A significant step in the evolution of parsing technologies was the
development of algorithms and tools that could convert context-free grammars
and regular grammars into programs that could check for the correctness of
an input with respect to the grammar.
Since then, most computer languages have been designed so that programs
are a sequence of tokens, each of which is a regular expression, and the
tokens in turn are structured according to a context-free grammar. Not all
of the tokens participate in the context-free structure; some are discarded.
In the case of Java, discarded tokens include comments and spaces. The following
Java program example illustrates this:
// This is a simple Java class
class Simple {
int dummy;
}
The tokens in this Java program are listed below:
1. Comment: // This is a simple Java class
2. Reserved word: class
3. Single space
4. Identifier: Simple
5. Single space
6. Delimiter: {
7. Newline
8. Double space
9. Reserved word: int
10. Single space
11. Identifier: dummy
12. Delimiter: ;
13. Newline
14. Delimiter: }
Before the context-free grammar is matched, some tokens are discarded.
In this example tokens 1, 3, 5, 7, 8, 10, and 13 are discarded (the discarded
tokens are referred to as “white space”), leaving behind the tokens “class”,
“Simple”, “{“, “int”, “dummy”, “;”, and “}”.
These tokens are then matched with the context-free grammar for Java.
Using JavaCC – A Simple Example
Listing 1 is a simple grammar file written in the JavaCC grammar syntax.
It specifies that the input file should contain a sequence of left braces
followed by another sequence of matching (equal in number) right braces.
The file contains three parts:
1. The class specification: A Java class into which the
parser is generated by JavaCC, this class will have the fields and methods
described in the class specification. In addition, a method is generated
that corresponds to each production in the grammar. In this example the class
specification states that the class is called Simple and it contains a main
method that simply creates a parser object (of type Simple) and invokes it
on the standard input. The method Input, which is called by the main method,
is generated by JavaCC to correspond to one of the productions in the grammar
specification.
2. The lexical specification: Describes the tokens present
in the input file using regular expressions. In this example the tokens are
quite simple and can therefore be expressed as strings (the simplest form
of regular expressions). Six tokens are described: the space, tab, newline,
carriage return, left brace, and right brace. The first four are specified
in a SKIP declaration, which classifies them as white space (tokens to be
discarded), while the left and right braces are specified as the tokens that
participate in the (context-free) parsing process. Tokens that participate
in this process are called terminals. The brace tokens are given the
names LBRACE and RBRACE to make it possible to use them in the grammar. An
input file matches this lexical specification only if it contains a sequence
of tokens from this list of six tokens.
3. The grammar specification: It’s the context-free grammar
that describes the input. In this example there are two productions – the
first defines Input; the second, MatchedBraces. These are called nonterminals
. The first production states that Input is a MatchedBraces followed by the
end of the file. <EOF> is a special token (and therefore a terminal)
that is matched by the end of the file. This states that the input file can
only contain MatchedBraces, with nothing following. The second production
defines MatchedBraces recursively in terms of itself. It says that it starts
with a left brace (<LBRACE>), followed optionally by another MatchedBraces,
and ends with a right brace (<RBRACE>). The brackets [...] indicate
that their content is optional.
Now consider the following input file:
{
{
}
}
This file has the following tokens: <LBRACE>, newline, space,
<LBRACE>, newline, space, <RBRACE>, newline, <RBRACE>,
<EOF>. The process of matching the lexical specification is successful
and the white space is discarded, leaving behind only:
<LBRACE> <LBRACE> <RBRACE> <RBRACE> <EOF>
The parsing process then successfully matches these tokens with the
grammar as follows:
Input
/ \
/ \
MatchedBraces <EOF>
/ | \
/ | \
/ |
\
/ |
\
<LBRACE> MatchedBraces <RBRACE>
/ \
/ \
/ \
<LBRACE> <RBRACE>
The diagram shown above is called a parse tree.
You can try this example yourself. First create a grammar file with
the above contents. Call the file Simple.jj (you can call it anything, but
by convention you should give it the jj extension). Download JavaCC from
www.webgain.com and install it on your computer. Add the bin directory
within this installation into your path. Now you should be able to run “javacc”
from your command prompt (DOS window or UNIX xterm). Type:
javacc Simple.jj
It produces a bunch of Java files as output. Compile these Java files
by typing:
javac *.java
Now create different kinds of input files – both good and bad inputs
– and then parse them using the generated parser by typing:
java Simple < InputFile
At this point you’ve taken the biggest and most important step toward
learning JavaCC. The rest of the learning process will happen very quickly.
Actions
The previous example shows the use of a JavaCC parser in determining
whether an input matches its grammar. If it matches, the parser quietly returns;
if it fails, the parser prints an error message detailing the reason for
failure.
It’s also necessary to be able to make the parser do something useful
on a successful parse beyond simply declaring success. For this purpose JavaCC
provides actions, Java code that’s inserted at an appropriate location
in the grammar and executed as the parser matches that portion of the grammar.
Actions may be arbitrarily complex and may invoke library methods. There
are two general kinds of actions: lexical and parser.
Lexical actions are executed after a successful match of a token.
Parser actions are embedded within the grammar and are executed as
the parser goes past a point in the parsing process. The grammar file in
Listing 2 extends the previous one with both lexical and parser actions.
These actions are simply print statements; try running the parser generated
from this grammar file with the inputs you created for the previous exercise
to see how the actions are executed.
Once you complete this exercise, you should be able to go through the
sequence of examples in the JavaCC download and systematically learn more
details about JavaCC. Once you’ve gone through the examples, you should be
ready to embark on any complex parsing application.
Lookahead and Ambiguity Resolution
Both lexical and grammar specifications can be ambiguous. In such situations
standard ambiguity resolution schemes take effect.
Consider the lexical specification for the Java language. It defines
all the Java reserved words as well as Java identifiers. What should the
parser do when it encounters the sequence “interface”? The obvious answer
is that it should recognize it as the token corresponding to the reserved
word “interface”. However, there are other ways to recognize it, such as
an identifier “interface” or the reserved word “int” followed by the identifier
“erface”.
The ambiguity resolution scheme for lexical specifications is that the
longest match is preferred (so, in the above example, “interface” is preferred
over “int”), and when there are multiple matches of the same length, the
match that occurs first in the grammar file is used. In the example above,
the reserved word “interface” is selected over the identifier “interface”
if the lexical specification for the reserved word occurs earlier than that
of identifier.
Ambiguities in grammar specifications are resolved by looking at the
next available token and proceeding with the first successful match. This
resolution scheme is referred to as using a lookahead of one token. Consider
the grammar for demand import declarations in Java.
void DemandImportDeclaration() :
{}
{
"import" Name() "." "*" ";"
}
void Name() :
{}
{
<IDENTIFIER> ( "." <IDENTIFIER> )*
}
On seeing the input “import x.*;”, the parser wil match the "." with
the one in the production for Name. It will fail the parse because it expects
an identifier and not a "*". This grammar is therefore wrong because the
ambiguity is resolved incorrectly. The parse tree describing this parse process
is shown in the following diagram.
DemandImportDeclaration
/ |
/ |
"import" Name
/ |
/ |
"x" "."
To correct this problem, we must specify that the parser looks ahead
more tokens before it decides how to parse. In this example a lookahead of
two tokens will work. This can be specified as a global setting (so that
a lookahead of two tokens is used for all choices) or a lookahead of two
tokens can be specified for this particular point only. The global setting
causes the parser to become rather inefficient – typically, most choices
can be made correctly with a lookahead of one – so the latter approach is
the correct one. The grammar, modified with the additional lookahead specification,
is shown below.
void DemandImportDeclaration() :
{}
{
"import" Name() "." "*" ";"
}
void Name() :
{}
{
<IDENTIFIER> ( LOOKAHEAD(2) "." <IDENTIFIER> )*
}
There are other more complex ways to specify lookaheads, but we won’t
go into them here. You can study the JavaCC documentation and examples for
more information.
Advanced Topics
Other aspects of JavaCC not described here but still very important
and of significant value to JavaCC are listed below:
• Special tokens: Sometimes it’s preferable to process
some white space tokens during the parsing process. JavaCC provides a way
to specify tokens so they don’t get discarded but still don’t participate
in the parsing process.
• Lexical states: This allows you to split your lexical
specification into different categories within which matches and ambiguity
resolution take place independently of the other states. This allows you
to conveniently write lexical specifications for (say) mail files, where
the string “From” may be a reserved word in one context and a simple text
in another.
• Java code productions: This allows you to write actual
Java code for a production instead of a grammar. Sometimes it isn’t possible
to write a production using the context-free syntax, and Java code productions
become invaluable. However, this feature must be used sparingly.
• JJTree: This is a JavaCC preprocessor that inserts parse
tree building actions into a grammar based on a parse tree specification.
The result of the parsing process is a tree data structure that captures
the essence of the parsing process. Additional processing is then performed
on this tree (no additional actions are added by hand into the grammar).
Conclusion
Parsers in general, and JavaCC in particular, are powerful tools for
enhancing developer productivity. There’s always some need in any large software
system for parsers, and the use of parsers and parser generators is ubiquitous.
Like other parser generators, JavaCC reads a grammar specification and
converts the code into a Java program that recognizes matches to the grammar.
In addition to the parser generator itself, JavaCC also provides other standard
capabilities related to parser generation, such as tree building (via a tool
called JJTree), actions, and debugging.
JavaCC strongly supports the industry trend toward standardizing the
syntax of important data formats such as XML. By giving developers a powerful
tool for automating the creation of Java code that can recognize a standardized
language structure, JavaCC will play a valuable role in the evolution of
open systems.
Author Bio
Sriram Sankar is VP of engineering at WebGain, a Java developer products
company. He came to WebGain through the acquisition of Metamata, where he
was founder and CEO. Sriram previously worked for Sun Microsystems and Stanford
University. He holds a PhD in computer science from Stanford University.
ssankar@webgain.com
Listing 1
// 1. Class specification
PARSER_BEGIN(Simple)
public class Simple {
public static void main(String args[]) throws ParseException {
Simple parser = new Simple(System.in);
parser.Input();
}
}
PARSER_END(Simple)
// 2. Lexical specification
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
TOKEN :
{
<LBRACE: "{">
| <RBRACE: "}">
}
// 3. Grammar specification
void Input() :
{}
{
MatchedBraces() <EOF>
}
void MatchedBraces() :
{}
{
<LBRACE> [ MatchedBraces() ] <RBRACE>
}
Listing 2
// 1. Class specification
PARSER_BEGIN(Simple)
public class Simple {
public static void main(String args[]) throws ParseException {
Simple parser = new Simple(System.in);
parser.Input();
}
}
PARSER_END(Simple)
// 2. Lexical specification
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
TOKEN :
{
<LBRACE: "{"> { System.out.println("Lexical action: Encountered LBRACE"); }
| <RBRACE: "}"> { System.out.println("Lexical action: Encountered RBRACE"); }
}
// 3. Grammar specification
void Input() :
{}
{
MatchedBraces() <EOF>
}
void MatchedBraces() :
{}
{
<LBRACE>
{ System.out.println("Parser action: Successfully parsed LBRACE"); }
[ MatchedBraces() ]
<RBRACE>
{ System.out.println("Parser action: Successfully parsed RBRACE"); }
}