
A regular expression (regex) is an essential part of software development. Indeed, the programming language Perl is, in effect, a language written around a regex parser.

This article focuses on Sun's implementation of a regular expression package, java.util.regex. It assumes you have some familiarity with regular expressions (if not, see the sidebar for a brief introduction). Each regular expression used here, however, is fully described, so even a regex neophyte should have no trouble.

The regex package is built around two major classes: Pattern and Matcher. Before going into the details of these classes, it's important to understand their relationship.

A pattern contains a specific regular expression that's created by compiling a regex string. If the string doesn't compile, the pattern isn't valid and a PatternSyntaxException is thrown. You should always have a try/catch block when compiling regexes to catch this exception. For reasons of brevity, exception handling is excluded from the examples here.

A matcher is an object that does the real "grunt work." It holds a reference to the input sequence (of type CharSequence, such as String or StringBuffer) and keeps stateful information about the results of a pattern search, such as the start and end locations of the most recent match. A matcher object can only be created through an instance of a pattern.

The most basic creation sequence is:

  • Compile a regex: If it's valid, it will return a pattern, otherwise it will throw a PatternSyntaxException.
  • Create a matcher: Give the pattern instance an input set to match against.
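The two steps above, including the exception handling that the listings omit for brevity, might be sketched as follows. The class name CreationSequence and the helper matcherFor are hypothetical names chosen for illustration; returning null on a bad regex is just one possible policy, not a recommendation from the package itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, used only to illustrate the creation sequence.
final class CreationSequence {

    // Returns a matcher for the regex over the input,
    // or null if the regex string does not compile.
    static Matcher matcherFor(String regex, CharSequence input) {
        try {
            Pattern p = Pattern.compile(regex); // step 1: compile the regex
            return p.matcher(input);            // step 2: create the matcher
        } catch (PatternSyntaxException e) {
            System.err.println("Bad regex: " + e.getDescription());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(matcherFor("fish", "one fish") != null);  // valid regex
        System.out.println(matcherFor("(fish", "one fish") != null); // unclosed group
    }
}
```

PatternSyntaxException is an unchecked exception, so the compiler will not remind you to catch it; that is why the habit is worth forming when the regex string comes from outside your own code.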

Once you have a matcher, there are three Boolean methods you can use to determine whether the input conforms to the pattern: matches(), lookingAt(), and find(). The one you will typically use is find(), which returns true as soon as it finds the next match for the regex. The matcher then remembers where that match ended, and the next call to find() picks up from that point (for those of you familiar with Perl, this behavior is similar to the "g" modifier at the end of your regex).
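To make the difference between the three Boolean methods concrete, here is a small sketch that runs matches(), lookingAt(), and find() against the same inputs. The class name MatchMethods and the helper check are hypothetical; a fresh matcher is created for each call so that one method's state cannot affect the next.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the three Boolean match methods.
final class MatchMethods {

    // Returns { matches(), lookingAt(), find() } for the given regex and input.
    static boolean[] check(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        return new boolean[] {
            p.matcher(input).matches(),   // the whole input must match
            p.matcher(input).lookingAt(), // the input must start with a match
            p.matcher(input).find()       // a match may occur anywhere
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(check("fish", "fish")));        // [true, true, true]
        System.out.println(Arrays.toString(check("fish", "fish sticks"))); // [false, true, true]
        System.out.println(Arrays.toString(check("fish", "one fish")));    // [false, false, true]
    }
}
```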

Our first example performs the basic creation sequence mentioned above, then calls "find()" repeatedly to count the number of times the word "fish" exists in the input sequence (see Listing 1).

While the example itself isn't interesting, the above "code pattern" occurs frequently when writing software that uses the regex packages. Make a point to remember it!

Now we'll see how we can replace a matched pattern with something else. There are two mechanisms to change an input string: appendReplacement() and appendTail(). Understanding how these two work together is tricky, but mastering this relationship is critical. Let's look at how the two work together to perform a simple replacement.

Let's use the pattern "fish" and the input "The fish in the hat". We want to replace the pattern with "cat". We first create a pattern and matcher, then do a find. When the find returns true, we create a new StringBuffer to hold our modified string (never modify the matcher's input string directly!), then begin changing our output string.

At this point, it's important to remember that our Matcher instance, "m", now knows the start and end point of the most recent match. It also knows the end point of the previous match (which is "0" if there is no previous match). When appendReplacement is called, it uses this information to perform the following two steps:

  1. It appends everything from the end of the previous match up to the beginning of the current match (in our example, that's simply "The ") to the output string.
  2. It then appends the replacement string onto the output string. Our output string is now "The cat".

Finally, we call appendTail, which appends the remainder of the original input string (" in the hat") to the output string, yielding the expected result (see Listing 2).
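The intermediate state of the output buffer is easy to inspect. The following sketch (the class name ReplaceTrace and the method trace are hypothetical) captures the buffer once after appendReplacement and once after appendTail, using the same pattern and input as the walkthrough above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that records the output buffer at each stage.
final class ReplaceTrace {

    // Returns { buffer after appendReplacement, buffer after appendTail }.
    static String[] trace() {
        Matcher m = Pattern.compile("fish").matcher("The fish in the hat");
        StringBuffer out = new StringBuffer();
        String afterReplacement = "";
        if (m.find()) {
            // Copies "The " (text before the match), then the replacement.
            m.appendReplacement(out, "cat");
            afterReplacement = out.toString();
        }
        // Copies whatever follows the last match.
        m.appendTail(out);
        return new String[] { afterReplacement, out.toString() };
    }

    public static void main(String[] args) {
        String[] stages = trace();
        System.out.println(stages[0]); // The cat
        System.out.println(stages[1]); // The cat in the hat
    }
}
```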

The preceding example is only useful for single replacements, which isn't very realistic. Let's modify this example to replace all occurrences of a pattern by using a while loop (see Listing 3).

Conveniently, the matcher class has a "replaceAll" method that will do exactly what the preceding code will do. It is, however, only useful for simple string replacements.
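As a quick check of that equivalence, the sketch below runs the hand-written loop and Matcher.replaceAll side by side. The class and method names (ReplaceAllDemo, viaLoop, viaReplaceAll) are hypothetical and exist only for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the manual loop with replaceAll.
final class ReplaceAllDemo {

    // The appendReplacement/appendTail loop, written out by hand.
    static String viaLoop(Pattern p, String input, String replacement) {
        Matcher m = p.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    // The same operation using the built-in convenience method.
    static String viaReplaceAll(Pattern p, String input, String replacement) {
        return p.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("fish");
        String input = "One fish, two fish, red fish, blue fish";
        System.out.println(viaLoop(p, input, "cat"));
        System.out.println(viaReplaceAll(p, input, "cat"));
    }
}
```

The loop form remains useful when the replacement has to be computed per match, for example from the matched text itself.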

Now that you know the basics of searching and replacing, let's look at the more powerful features of the regex package - using groups and quantifiers.

The simple pattern, "one.*two", is typically read (comprehended) as: "The sequence 'one', followed by any number of any kind of character, followed by 'two'". Because we have used the "greedy" quantifier, "*", this regex will get the largest match it can find. Be careful using quantifiers, as they can yield strange results. Mastery of quantifiers, however, is essential to writing good regular expressions. Jeffrey Friedl's book, Mastering Regular Expressions, (O'Reilly) contains numerous examples of the many different forms of quantifiers, and I recommend reading it to learn more.
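One way to see the "strange results" that greedy quantifiers can produce is to compare "*" with its reluctant form, "*?", on an input that contains "two" more than once. This is a minimal sketch; the class name GreedyDemo, the helper firstMatch, and the sample input are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class contrasting greedy and reluctant quantifiers.
final class GreedyDemo {

    // Returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "one a two b two";
        // Greedy: ".*" grabs as much as possible, so the match runs to the LAST "two".
        System.out.println(firstMatch("one.*two", input));  // one a two b two
        // Reluctant: ".*?" grabs as little as possible, so the match stops at the FIRST "two".
        System.out.println(firstMatch("one.*?two", input)); // one a two
    }
}
```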

Back to our example: given the input "one if by Java, two if by C", calling find would result in the matcher "marking" these places (shown here with underscores): "_one if by Java, two_ if by C". This is interesting if we want to replace the entire match, but what if we want to replace only the characters between the "one" and the "two"? This is where grouping comes in.

Grouping is the most powerful feature of the regex package because it allows us to manipulate sets of substrings. If we change the regex to include the grouping markers (open/closed parentheses), we create capturing groups. These groups are numbered starting at 1, in the order of their opening parentheses; group(0) is always the entire matched pattern. If we used the regex "one(.*)two", we would generate a "group list" in our match.

Let's look at a simple example now, as seen in Listing 4.

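Note that the for loop uses "<=", rather than the traditional "<". This is because groupCount() does not include group 0 (the entire match), so the highest valid group number is equal to groupCount().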

Our results are:

>java ShowGroups
Group(0) is "one if by Java, two"
Group(1) is " if by Java, "
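Returning to the earlier question of replacing only the characters between "one" and "two": since the matcher records where each group starts and ends, one possible approach is to splice the replacement in using start(1) and end(1). This is a sketch rather than the only way to do it; the class name ReplaceGroup and the method replaceInner are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that replaces only the text captured by group 1.
final class ReplaceGroup {

    // Replaces the text matched by group 1 of the first match; returns
    // the input unchanged if the regex does not match.
    static String replaceInner(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return input;
        }
        // start(1) and end(1) are the boundaries of group 1 within the input.
        return new StringBuilder(input)
                .replace(m.start(1), m.end(1), replacement)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(
            replaceInner("one(.*)two", "one if by Java, two if by C", " or "));
        // prints: one or two if by C
    }
}
```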

Let's wrap this up by looking at how you would write a simple XML-style tag parser.

First, study the regex string in Figure 1. This regex introduces a "backreference", which is a reference, later in the regex, to the text that an earlier group matched. In this example, the first group (which matches a tag name) is later used as a backreference, expressed as a backslash followed by a group number. This is a powerful and useful feature of regular expressions.

Figure 1: The tag-matching regex, <(.*)>(.*)</\1>

The first group is the first "(.*)" that you see. It is used to initially "guess" that it sees a tag. The regex, however, says it's not a tag unless that same string is on the tail end, which is where you see the backreference "\1".

At this point, you should notice a strange quirk of Java. In normal regex strings, the backreference would just look like "\1", but since escaped characters are interpreted by the Java compiler (e.g., "\n"), we must use a "double backslash" to signify a regex "escape". You must test your Java regex strings carefully, as this single quirk can cause you hours of grief (to which this author can testify!).
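The quirk is easy to demonstrate. In the sketch below (the class name BackslashDemo and the constants are hypothetical), the correctly escaped string reaches the regex engine as a backreference, while the single-backslash version still compiles, because Java reads "\1" as an octal escape for the character with code 1, and then quietly fails to match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing why the backslash must be doubled.
final class BackslashDemo {

    // Two characters reach the regex engine here: a backslash and '1'.
    static final String GOOD = "<(.*)>(.*)</\\1>";

    // Here the Java compiler turns "\1" into the single character U+0001,
    // so the regex engine never sees a backreference at all.
    static final String BAD = "<(.*)>(.*)</\1>";

    public static void main(String[] args) {
        System.out.println(GOOD);                              // <(.*)>(.*)</\1>
        System.out.println("\\1".length());                    // 2
        System.out.println("\1".length());                     // 1
        System.out.println(Pattern.matches(GOOD, "<b>x</b>")); // true
        System.out.println(Pattern.matches(GOOD, "<b>x</i>")); // false: tags differ
        System.out.println(Pattern.matches(BAD, "<b>x</b>"));  // false: no backreference
    }
}
```

Printing the regex string before compiling it, as the first line of main does, is a cheap way to see exactly what the regex engine will receive.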

Now let's look at the code. It's really a simple example of recursion. The findTag method is simply handed an initial input string. When the pattern matches, everything inside the tag boundaries (Group 2) is again handed to findTag, and the parsing starts again (see Listing 5).

When we run the program, we get:

>java tagParser
Found tag: bold, inner string = <italic>bold-italic</italic>
Found tag: italic, inner string = bold-italic

I encourage you to look at other regular expression packages if you intend to do extremely complex regex work. The most notable example is the ORO package freely available through the Apache Jakarta Project (http://jakarta.apache.org/oro). It's more full-featured than the JDK regex package, but the usage is similar. Sadly, Sun decided not to implement the regex package through interfaces, making it (currently) impossible to freely switch between the JDK and ORO regex packages. On the brighter side, the Sun regex package is full-featured enough for places where a typical regex engine is needed.

In conclusion, I hope I've helped you to understand how to use the power of the regex classes. The graceful combination of the pattern and matcher classes helps maintain a separation of concerns. This addition to Java has been long overdue. Have fun, and happy parsing!

Acknowledgment
The author would like to acknowledge the gracious feedback of Roger Moore of Valtech Technologies (Dallas) and Tom Wood of Valtech Technologies (Houston).

Author Bio
David Weller is a principal managing consultant at Valtech Technologies, Inc., an international consulting firm specializing in .NET/J2EE/Unified Process development, skills transfer, and training. He holds a computer science degree from the University of Houston at Clear Lake. [email protected]

	


Listing 1


import java.util.regex.*;
public final class Simple {
     public static final void main(String args[]) {
         // Compile the regex, then bind it to the input to get a matcher.
         Pattern p = Pattern.compile("fish");
         Matcher m = p.matcher("one fish, two fish, red fish, blue fish");
         int count = 0;
         // Each call to find() resumes where the previous match ended.
         while (m.find()) {
             count++;
         }
         System.out.println("Count = " + count);
     }
}

Running this program yields:
D:\projects\regex>java Simple
Count = 4

Listing 2

D:\projects\regex>java Replace
The cat in the hat

Here's the code itself:
import java.util.regex.*;
public final class Replace {
     public static final void main(String args[]) {
         Pattern p = Pattern.compile("fish");
         Matcher m = p.matcher("The fish in the hat");
         // Build the result in a separate buffer; the input is never modified.
         StringBuffer output = new StringBuffer();
         if (m.find()) {
             // Appends "The " (text before the match), then "cat".
             m.appendReplacement(output, "cat");
         }
         // Appends the rest of the input; with no match, that is the whole input.
         m.appendTail(output);
         System.out.println(output.toString());
     }
}


Listing 3

import java.util.regex.*;
public final class ReplaceAll {
     public static final void main(String args[]) {
         Pattern p = Pattern.compile("fish");
         Matcher m = p.matcher("One fish, two fish, red fish, blue fish");
         StringBuffer output = new StringBuffer();
         // One appendReplacement per match; each call copies the text since
         // the previous match, then the replacement.
         while (m.find()) {
             m.appendReplacement(output, "cat");
         }
         // Copy whatever follows the last match.
         m.appendTail(output);
         System.out.println(output.toString());
     }
}

Listing 4

import java.util.regex.*;
public final class ShowGroups {
     public static final void main(String args[]) {
         StringBuffer input = new StringBuffer("one if by Java, two if by C");
         Pattern p = Pattern.compile("one(.*)two");
         Matcher m = p.matcher(input);
         while (m.find()) {
             // groupCount() excludes group 0, hence "<=" rather than "<".
             for (int i = 0; i <= m.groupCount(); i++) {
                 System.out.println("Group("+i+") is \"" + m.group(i) + "\"");
             }
         }
     }
}

Listing 5

import java.util.regex.*;
public final class tagParser {
     public static final void main(String args[]) {
         // "\\1" is the backreference \1, escaped for a Java string literal.
         Pattern p = Pattern.compile("<(.*)>(.*)</\\1>");
         String input = "This is <bold><italic>bold-italic</italic></bold>";
         findTag(p,input);
     }

     private static final void findTag(Pattern p, String in) {
         Matcher m = p.matcher(in);
         boolean result = m.find();
         while (result) {
             System.out.println("Found tag: " + m.group(1) + ", inner string = " + m.group(2) );
             // Recurse into the text between the opening and closing tags.
             findTag(p,m.group(2));
             result = m.find();
         }
     }
}

  
 

All Rights Reserved
Copyright ©  2004 SYS-CON Media, Inc.
  E-mail: [email protected]

Java and Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. SYS-CON Publications, Inc. is independent of Sun Microsystems, Inc.