Regular expressions (regexes) are an essential tool in software
development. Indeed, the programming language Perl is, in effect, a
language written around a regex parser.
This article focuses on Sun's implementation of a regular
expression package, java.util.regex. It assumes you have some
familiarity with regular expressions (if not, see the sidebar for a
brief introduction). Each regular expression used here, however, is
fully described, so even a regex neophyte should have no trouble.
The regex package is built around two major classes: Pattern
and Matcher. Before going into the details of these classes, it's
important to understand their relationship.
A pattern contains a specific regular expression that's
created by compiling a regex string. If the string doesn't compile,
the pattern isn't valid and a PatternSyntaxException is thrown. You
should always have a try/catch block when compiling regexes to catch
this exception. For reasons of brevity, exception handling is
excluded from the examples here.
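For reference, here is a minimal sketch of the exception handling
that the listings leave out. The class name SafeCompile and the
helper compileOrNull are hypothetical names chosen for this
illustration; Pattern.compile, PatternSyntaxException, and its
getDescription() and getIndex() methods are part of java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class SafeCompile {
    // Hypothetical helper: returns the compiled pattern, or null if the
    // regex string does not compile.
    static Pattern compileOrNull(String regex) {
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            // getDescription() says what went wrong; getIndex() says where.
            System.err.println("Bad regex \"" + regex + "\": "
                    + e.getDescription() + " near index " + e.getIndex());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(compileOrNull("fish") != null);   // valid regex
        System.out.println(compileOrNull("(fish") != null);  // unclosed group
    }
}
```

Returning null is only one possible policy; in real code you might
prefer to rethrow the exception or report the error to the user.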
A matcher is an object that does the real "grunt work." It
holds a reference to the input sequence (of type CharSequence, like
String or StringBuffer) and keeps stateful information about the
results of a pattern search, such as the start and end locations of
the most recent match. A matcher object can only be created through
an instance of a pattern.
The most basic creation sequence is:
- Compile a regex: If it's valid, it will return a pattern,
otherwise it will throw a PatternSyntaxException.
- Create a matcher: Give the pattern instance an input sequence
to match against.
Once you have a matcher, there are three Boolean methods
you can use to determine if the input conforms to the pattern:
matches(), lookingAt(), and find(). The one you will typically use is
"find()", which returns true as soon as it locates a match for the
regex. The matcher also remembers where that match ended and picks up
from that point on the next call to "find()" (for those of you
familiar with Perl, this behavior is similar to the "g" modifier at
the end of your regex).
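To make the difference between the three Boolean methods concrete,
here is a small sketch that runs matches(), lookingAt(), and find()
against the same inputs. The class name ThreeChecks and the helper
check are hypothetical names for this illustration; the three Matcher
methods themselves are part of java.util.regex.

```java
import java.util.regex.Pattern;

final class ThreeChecks {
    // Hypothetical helper: reports what each Boolean method says for one input.
    static String check(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        // A fresh matcher per call keeps the three checks independent.
        boolean whole = p.matcher(input).matches();     // entire input must match
        boolean prefix = p.matcher(input).lookingAt();  // input must start with a match
        boolean anywhere = p.matcher(input).find();     // a match anywhere will do
        return "matches=" + whole + " lookingAt=" + prefix + " find=" + anywhere;
    }

    public static void main(String[] args) {
        System.out.println(check("fish", "fish"));
        System.out.println(check("fish", "fish and chips"));
        System.out.println(check("fish", "one fish, two fish"));
    }
}
```

Only the first input satisfies all three methods. The second starts
with a match but is not a match as a whole, and the third contains
matches only in the middle, so find() alone returns true.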
Our first example performs the basic creation sequence
mentioned above, then calls "find()" repeatedly to count the number
of times the word "fish" exists in the input sequence (see Listing 1).
While the example itself isn't interesting, this "code
pattern" occurs frequently when writing software that uses the regex
package. Make a point to remember it!
Now we'll see how we can replace a matched pattern with
something else. Two Matcher methods work together to build a modified
copy of the input string: appendReplacement() and appendTail().
Understanding how these two cooperate is tricky, but mastering this
relationship is critical. Let's look at how they perform a simple
replacement.
Let's use the pattern "fish" and the input "The fish in the
hat". We want to replace the pattern with "cat". We first create a
pattern and matcher, then do a find. When the find returns true, we
create a new StringBuffer to hold our modified string (never modify
the matcher's input string directly!), then begin changing our output
string.
At this point, it's important to remember that our Matcher
instance, "m", now knows the start and end point of the most recent
match. It also knows the end point of the previous match (which is
"0" if there is no previous match). When appendReplacement is called,
it uses this information to perform the following two steps:
- It appends everything from the end of the previous match up
to the beginning of the current match (in our example, that's simply
"The ") to the output string.
- It then appends the replacement string onto the output
string. Our output string is now "The cat".
Finally, we call appendTail, which appends the remainder of
the original input string (" in the hat") to the output string,
yielding the expected result (see Listing 2).
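The two steps are easier to follow if we capture the output buffer
after each call. The following sketch does that for the same pattern
and input; the class name AppendTrace and the helper trace are
hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AppendTrace {
    // Hypothetical helper: returns the buffer contents after
    // appendReplacement (index 0) and after appendTail (index 1).
    static String[] trace(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        String afterReplacement = "";
        if (m.find()) {
            // Copies the text before the match, then the replacement.
            m.appendReplacement(out, replacement);
            afterReplacement = out.toString();
        }
        // Copies everything after the last match.
        m.appendTail(out);
        return new String[] { afterReplacement, out.toString() };
    }

    public static void main(String[] args) {
        String[] steps = trace("fish", "The fish in the hat", "cat");
        System.out.println("After appendReplacement: \"" + steps[0] + "\"");
        System.out.println("After appendTail:        \"" + steps[1] + "\"");
    }
}
```

The first line printed shows "The cat" and the second shows
"The cat in the hat".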
The preceding example is only useful for single replacements,
which isn't very realistic. Let's modify this example to replace all
occurrences of a pattern by using a while loop (see Listing 3).
Conveniently, the Matcher class has a "replaceAll"
method that does exactly what the preceding code does. It is,
however, only useful for simple string replacements.
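As a quick check of that claim, the sketch below runs the loop-based
replacement and Matcher.replaceAll side by side on the same input.
The class name ReplaceAllShortcut and both helper names are
hypothetical; replaceAll, appendReplacement, and appendTail are the
Matcher methods discussed in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplaceAllShortcut {
    // The manual loop: one appendReplacement per match, then appendTail.
    static String viaLoop(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    // The shortcut: replaceAll performs the whole loop in one call.
    static String viaReplaceAll(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "One fish, two fish, red fish, blue fish";
        System.out.println(viaLoop("fish", input, "cat"));
        System.out.println(viaReplaceAll("fish", input, "cat"));
    }
}
```

Both calls print "One cat, two cat, red cat, blue cat".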
Now that you know the basics of searching and replacing,
let's look at the more powerful features of the regex package - using
groups and quantifiers.
The simple pattern, "one.*two", is typically read
(comprehended) as: "The sequence 'one', followed by any number of any
kind of character, followed by 'two'". Because we have used the
"greedy" quantifier, "*", this regex will get the largest match it
can find. Be careful using quantifiers, as they can yield strange
results. Mastery of quantifiers, however, is essential to writing good
regular expressions. Jeffrey Friedl's
book, Mastering Regular Expressions, (O'Reilly) contains numerous
examples of the many different forms of quantifiers, and I recommend
reading it to learn more.
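To see what "largest match" means in practice, consider an input
that contains "two" more than once. The sketch below compares the
greedy "*" with its reluctant form "*?", which this article does not
otherwise cover; it is shown here only for contrast. The class name
GreedyDemo, the helper firstMatch, and the longer input string are
assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyDemo {
    // Hypothetical helper: returns the first match of regex in input,
    // or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "one if by Java, two if by C, two if by Perl";
        // Greedy: ".*" runs as far as it can, so the match ends at the LAST "two".
        System.out.println(firstMatch("one.*two", input));
        // Reluctant: ".*?" stops as early as it can, at the FIRST "two".
        System.out.println(firstMatch("one.*?two", input));
    }
}
```

The greedy pattern prints "one if by Java, two if by C, two", while
the reluctant one prints "one if by Java, two".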
Back to our example, calling find would result in the matcher
"marking" these places (noted here by the underscores): "_one if by
Java, two_ if by C". This is useful if we want to replace the entire
match, but what if we want to replace only the characters between the
"one" and the "two"? This is where grouping comes in.
Grouping is the most powerful feature of the regex package
because it allows us to manipulate sets of substrings. If we change
the regex to include the grouping markers (open/closed parentheses),
we create capturing groups, which may also be nested. These groups
are numbered starting at 1; group(0) is always the entire matched
pattern. If we used the regex "one(.*)two", we would generate a
"group list" in our match.
Let's look at a simple example now, as seen in Listing 4.
Note that the for loop uses "<=", rather than the traditional "<",
because groupCount() does not include group(0) in its count.
Our results are:
>java ShowGroups
Group(0) is "one if by Java, two"
Group(1) is " if by Java, "
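With the group boundaries available, we can answer the earlier
question of replacing only the characters between "one" and "two".
The sketch below is one possible approach, not the only one: it uses
Matcher.start(int) and Matcher.end(int) to locate group 1 and builds
a new string around it. The class name ReplaceGroup, the helper
replaceBetween, and the replacement text " and " are assumptions
made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplaceGroup {
    // Hypothetical helper: replaces only the text captured by group 1,
    // leaving "one", "two", and the rest of the input untouched.
    static String replaceBetween(String input, String replacement) {
        Matcher m = Pattern.compile("one(.*)two").matcher(input);
        if (!m.find()) {
            return input; // no match: return the input unchanged
        }
        // start(1) and end(1) are the boundaries of group 1 in the input.
        return input.substring(0, m.start(1))
                + replacement
                + input.substring(m.end(1));
    }

    public static void main(String[] args) {
        System.out.println(replaceBetween("one if by Java, two if by C", " and "));
    }
}
```

For the input used in Listing 4, this prints "one and two if by C".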
Let's wrap this up by looking at how you would write a simple
XML-style tag parser.
First, study the regex string in Figure 1. This regex
introduces a "backreference", which refers back to the text matched
by an earlier group in the same regex. In this example, the first
group (which matches a tag name) is later used as a backreference,
expressed as a backslash followed by a group number. This is a
powerful and useful feature of regular expressions.
Figure 1: "<(.*)>(.*)</\\1>" (the regex string as written in Listing 5)
The first group is the first "(.*)" that you see. It is used
to initially "guess" that it sees a tag. The regex, however, says
it's not a tag unless that same string is on the tail end, which is
where you see the backreference "\1".
At this point, you should notice a strange quirk of Java. In
normal regex strings, the backreference would just look like "\1",
but since escaped characters are interpreted by the Java compiler
(e.g., "\n"), we must use a "double backslash" to signify a regex
"escape". You must test your Java regex strings carefully, as this
single quirk can cause you hours of grief (to which this author can
testify!).
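The quirk is easy to demonstrate. In the sketch below, "\\1" in Java
source is a two-character string (a backslash and the digit 1), which
the regex engine reads as a backreference. By contrast, "\1" still
compiles, because Java treats it as an octal escape for a single
control character, so the mistake goes unnoticed until the regex
fails to match. The class name BackslashQuirk, the helper isTag, and
the sample inputs are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

final class BackslashQuirk {
    // Hypothetical helper: true if the whole string is one open/close tag pair.
    static boolean isTag(String s) {
        // "\\1" reaches the regex engine as \1, a backreference to group 1.
        return Pattern.compile("<(.*)>(.*)</\\1>").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("\\1".length()); // backslash plus '1': two characters
        System.out.println("\1".length());  // octal escape: a single character
        System.out.println(isTag("<b>x</b>")); // closing tag repeats group 1
        System.out.println(isTag("<b>x</i>")); // closing tag differs
    }
}
```

The program prints 2, 1, true, and false, in that order.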
Now let's look at the code. It's really a simple example of
recursion. The findTag method is simply handed an initial input
string. When the pattern matches, everything inside the tag
boundaries (Group 2) is again handed to findTag, and the parsing
starts again (see Listing 5).
When we run the program, we get:
>java tagParser
Found tag: bold, inner string = <italic>bold-italic</italic>
Found tag: italic, inner string = bold-italic
I encourage you to look at other regular expression packages
if you intend to do extremely complex regex work. The most notable
example is the ORO package freely available through the Apache
Jakarta Project (http://jakarta.apache.org/oro). It's more full-featured than the JDK regex
package, but the usage is similar. Sadly, Sun decided not to
implement the regex package through interfaces, making it (currently)
impossible to freely switch between the JDK and ORO regex packages.
On the brighter side, the Sun regex package is full-featured enough
for places where a typical regex engine is needed.
In conclusion, I hope I've helped you to understand how to
use the power of the regex classes. The graceful combination of the
pattern and matcher classes helps maintain a separation of concerns.
This addition to Java has been long overdue. Have fun, and happy
parsing!
Acknowledgment
The author would like to acknowledge the gracious feedback of
Roger Moore of Valtech Technologies (Dallas) and Tom Wood of Valtech
Technologies (Houston).
Author Bio
David Weller is a principal managing consultant at Valtech Technologies, Inc., an international consulting firm specializing in .NET/J2EE/Unified Process development, skills transfer, and training. He holds a computer science degree from the
University of Houston at Clear Lake.
dgweller@despammed.com
Listing 1
import java.util.regex.*;

public final class Simple {
    public static void main(String[] args) {
        // Compile the regex, then bind the pattern to an input sequence.
        Pattern p = Pattern.compile("fish");
        Matcher m = p.matcher("one fish, two fish, red fish, blue fish");
        int count = 0;
        // Each find() resumes where the previous match ended.
        while (m.find()) {
            count++;
        }
        System.out.println("Count = " + count);
    }
}
Running this program yields:
D:\projects\regex>java Simple
Count = 4
Listing 2
D:\projects\regex>java Replace
The cat in the hat
Here's the code itself:
import java.util.regex.*;

public final class Replace {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("fish");
        Matcher m = p.matcher("The fish in the hat");
        // Build the result in a separate buffer; the input is never modified.
        StringBuffer output = new StringBuffer();
        if (m.find()) {
            // Appends "The " followed by the replacement "cat".
            m.appendReplacement(output, "cat");
        }
        // Appends whatever follows the last match (" in the hat").
        m.appendTail(output);
        System.out.println(output.toString());
    }
}
Listing 3
import java.util.regex.*;

public final class ReplaceAll {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("fish");
        Matcher m = p.matcher("One fish, two fish, red fish, blue fish");
        StringBuffer output = new StringBuffer();
        // One appendReplacement per match found.
        while (m.find()) {
            m.appendReplacement(output, "cat");
        }
        // Copy the text that follows the last match.
        m.appendTail(output);
        System.out.println(output.toString());
    }
}
Listing 4
import java.util.regex.*;

public final class ShowGroups {
    public static void main(String[] args) {
        StringBuffer input = new StringBuffer("one if by Java, two if by C");
        Pattern p = Pattern.compile("one(.*)two");
        Matcher m = p.matcher(input);
        while (m.find()) {
            // groupCount() excludes group(0), hence "<=" rather than "<".
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println("Group(" + i + ") is \"" + m.group(i) + "\"");
            }
        }
    }
}
Listing 5
import java.util.regex.*;

public final class tagParser {
    public static void main(String[] args) {
        // "\\1" in Java source is the regex backreference \1 (the tag name).
        Pattern p = Pattern.compile("<(.*)>(.*)</\\1>");
        String input = "This is <bold><italic>bold-italic</italic></bold>";
        findTag(p, input);
    }

    // Prints each tag found in the input, then recurses into its contents.
    private static void findTag(Pattern p, String in) {
        Matcher m = p.matcher(in);
        while (m.find()) {
            System.out.println("Found tag: " + m.group(1)
                + ", inner string = " + m.group(2));
            findTag(p, m.group(2));
        }
    }
}