How fast should our Java code be to be considered fast? After all, speed is a relative concept. I'll compare CPU performance results for the following JVMs: Sun's J2SE 1.4.1, 1.4.0, 1.3.1, and Jikes. These results can be used to make a number of educated decisions, such as choosing a JVM, deciding on algorithmic designs, and selecting the right method from the API. They provide an overall assessment of performance that isn't tied to any custom code, since the code used is quite common and drawn directly from Sun's Java APIs.
This article studies the Java APIs for an extra boost in performance. It's not a new idea, and is often referred to as micro-benchmarking (MBM). However, a systematic and thorough performance analysis at that level is still missing. Herein, I'll address performance as speed, measured in wall clock time. I'll cover various Java Virtual Machines (JVMs), and show that the results differ significantly. A study of memory consumption is also warranted, but it will not be addressed here; for such an analysis, visit www.marmanis.com. See the Resources section for references to performance studies.
When dealing with performance, one of the major difficulties is the many "scales" or layers that are usually involved in Java applications, especially in enterprise Java applications. I categorize performance problems based on their scale:
1. System architecture
2. Algorithm selection
3. Code implementation
4. System configuration
5. System infrastructure
Only categories 1, 2, and 3 are directly related to the Java programming language. Problems in category 1 can be dealt with or, even better, prevented, by the proper use of J2EE Blueprints, the Java version of design patterns for enterprise applications (see http://java.sun.com/blueprints/enterprise and www.theserverside.com/patterns/index.jsp). Categories 4 and 5 involve handling and testing components that may be irrelevant to Java per se. These categories are given in order of decreasing importance. Experience shows that you will get most of your performance increase from improvements in categories 1 and 2, regardless of how you measure performance! Nevertheless, if you want to squeeze as much performance as possible out of your infrastructure, it's worth knowing what the performance of your fundamental APIs is. This part of performance is aptly called micro-performance and it belongs to category 3.
This article addresses the performance of Java applications with respect to the underlying JVM, and is based only on the standard API classes and algorithms that are implemented by them. We do this in order to establish results that are widespread in their applicability. No matter what code you write, you can decompose it into parts that can be studied as individual units. The performance of the whole is equal to the performance of its parts plus the overhead of the interaction between the parts. The simplest parts that you can decompose are the classes that are offered by the Java API. Thus, knowing how well these parts perform can be crucial to the overall performance of your product.
Before we go any further, I'll justify why it's a good idea to know about micro-performance. It may be considered unnecessary to examine the performance at its finest granularity. However, most performance-tuning strategies neglect the fact that the right choice, at the micro level of the code, can squeeze out speed without making the code more complicated or error prone. Moreover, it is something that everyone can do; no special training is needed to choose the method that is the fastest of a variety of possible methods and equally effective for the task at hand.
Typically, an engineering team will employ a profiling tool that will pinpoint the location of "hotspots" (memory- and/or CPU-intensive code fragments). This is certainly a valid way of improving performance, but it says nothing about global performance. If your code throughout is slow, then a profiling tool won't help.
Global performance excellence stems from a global enforcement of best practices, i.e., fast implementations throughout the code. We would like to see extremely fast Java applications, especially enterprise applications, and our maxim is that optimal code should be used everywhere in the source code. A fitting analogy here is the saying, "A water tower can be filled one teaspoonful at a time." Best practices can be revealed by a detailed study of the available APIs and documentation of the findings. Implementing the fastest code doesn't necessarily mean the code will be more error prone; you can be fast and accurate simultaneously. I hope this article contributes toward that end.
For the purpose of illustrating micro-performance benchmarks, we will use four Java Virtual Machines. Three of them (J2SE 1.3.1_07-b02, 1.4.0_03-b04, and 1.4.1-b21) are provided by Sun Microsystems, Inc. (http://java.sun.com/j2se/), and the other (Jikes 1.3.0) is provided by IBM (http://oss.software.ibm.com/developerworks/opensource/jikes/). One of the design guidelines for Sun's version 1.4 was to improve the performance and scalability of the Java platform; you can read more about their specific rationale at http://java.sun.com/j2se/1.4/performance.guide.html.
The bytecode for each run was created by the compiler that comes with each distribution. (The source code can be downloaded from below.) All the compilers were invoked as follows (see also the scripts that are provided):
%JVM_HOME%\bin\javac -g:none -O Test[i].java
where %JVM_HOME% is the path to the distribution that we target, and Test[i].java is the Java class that corresponds to one of our tests (e.g., Test2.java). The -g:none flag (together with -O, where the compiler honors it) eliminates optional tables in the class files, such as line number and local variable tables. This provides only a small performance improvement on the generated code, although if our class files had been sent across a network, it could have helped significantly. IBM and Sun have no plans for bytecode optimization; they'd rather focus on runtime optimization (see www.nejug.org/2000/sept00_slides/javaperf.htm).
Once the bytecodes were created, they were executed by the target runtime:
%JVM_HOME%\bin\java -server %JVM_XMX% %JVM_XMS% Test[i]
where %JVM_HOME% is again the path to the distribution that we target, and Test[i] is the Java bytecode that corresponds to one of our tests (e.g., Test2.class). The JVMs by Sun offer the following options: client, server, and hotspot.
For the 1.4.x versions, the hotspot option is a synonym for the client JVM. I've chosen to use the server JVM, although it should be easy for the user to experiment with the client JVM by changing the respective flag in the scripts. In general, the differences between the server and the client versions are related to the JVM tuning. The client JVM is tuned to reduce application startup time and memory footprint, which is important when running desktop applications. The server JVM is intended for use in server applications where the JVM will run for long periods and peak performance is more important than footprint and rapid startup. Both options are of interest, although Java is clearly more prevalent on the server side.
Last, I've also chosen to fix the size of the heap in order to remove the burden of resizing the memory, which is one of the garbage collection responsibilities. This doesn't prohibit the garbage collector from doing its work. The question that we really ask is this: Given a fixed amount of memory for each JVM, which JVM performs the exact same code faster? During that time - not longer than a few minutes in the worst case - the JVM that spends the least time dealing with garbage collection will have an advantage over the other JVMs.
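A quick way to confirm that the heap really is fixed is to ask the runtime what it was given. The sketch below is an illustration rather than part of the benchmark suite: the class name HeapCheck and the 1MB tolerance are assumptions, and it presumes the JVM was started with equal minimum and maximum heap flags (the values the scripts pass through %JVM_XMS% and %JVM_XMX%).

```java
final class HeapCheck {
    /** Converts a byte count to whole megabytes. */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** True when committed and maximum heap differ by no more than the tolerance. */
    static boolean looksFixed(long committedBytes, long maxBytes, long toleranceBytes) {
        return Math.abs(maxBytes - committedBytes) <= toleranceBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long committed = rt.totalMemory(); // heap currently reserved by the JVM
        long max = rt.maxMemory();         // upper bound the heap may grow to
        System.out.println("committed=" + toMegabytes(committed) + "MB, max=" + toMegabytes(max) + "MB");
        System.out.println("fixed heap? " + looksFixed(committed, max, 1024L * 1024L));
    }
}
```

If the two values differ noticeably, the JVM is still free to resize the heap during the run, and part of the measured time may be spent doing so.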
I'll employ only one operating system platform, namely Windows 2000 Professional. However, it should be clear that for a complete and useful assessment of micro-performance, the same benchmarks should be run for other operating systems as well, such as Linux, Solaris, and AIX. The Windows system that I'll use runs on a Dell Inspiron 4100, with total physical memory of 654,776KB; BIOS PLUS Version 1.10 A09; and x86 Family 6 Model 11 Intel CPU at 1,100MHz. Results for a Linux system that runs on a Micro PC, with total physical memory of 772,856KB and an AMD Athlon Processor at 1,134MHz, should be available on my Web site. For the Linux platform, there is also a JVM that's offered by the Blackdown open source project (see www.blackdown.org). There are more JVM implementations available and I'll make an effort to include as many of them as possible on my Web site.
I'll present 19 benchmarks that cover some of the following: basic arithmetic operations, java.lang package, java.io package, java.util package, and java.security package. I refer to each test by concatenating the character "T" and the respective enumeration of the test. Thus T1 will refer to Test 1, T2 will refer to Test 2, and so on. Obviously, this is not an exhaustive list and the choices are based on what I think are popular method calls.
The code for the benchmarks was written with simplicity in mind. The theme is the same for all benchmarks. They all consist of some setup code and some code inside a for loop whose timing is the goal of each benchmark. Thus, each benchmark repeats for a "reasonable" time a call to a small piece of Java code. I use System.currentTimeMillis() to measure the wall clock time (in ms). To take into consideration the time spent for the loop, I always run a baseline test first to obtain the reference time, i.e., the time spent in a loop without code in it. Rather than subtracting the reference time from the reported value, I report both. I'll also report the number of iterations since this varies among our benchmark tests. I'll often include several method calls inside the loop so we can examine the efficacy of some API classes in an aggregate fashion; see, for example, the benchmark Test5.java. A more granular approach is, of course, what this article proposes and I'll publish and maintain on the Web more fine-grained results.
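The structure described above can be sketched as follows. This is a minimal illustration, not the downloadable test code: the class name LoopTimer, the iteration count, and the Math.sqrt workload are assumptions, and it is written for a current JDK (lambdas) rather than for the 1.3/1.4 compilers used in the study. Routing the loop body through a Runnable adds a call that the original tests, which place the code directly inside the for loop, would not have; the baseline run pays the same cost.

```java
final class LoopTimer {
    /** Runs the body the given number of times and returns the elapsed wall-clock time in ms. */
    static long time(Runnable body, int iterations) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            body.run();
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        int iterations = 10_000_000; // hypothetical loop size
        double[] sink = {2.0};       // keeps the result live so the work is not discarded

        // Baseline: the same loop with nothing in it.
        long reference = time(() -> { }, iterations);
        // Measured: the loop with the code under test.
        long measured = time(() -> sink[0] = Math.sqrt(sink[0] + 2.0), iterations);

        // Report both values, as the article does, instead of subtracting.
        System.out.println("iterations=" + iterations
                + " reference=" + reference + "ms measured=" + measured + "ms");
    }
}
```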
Let's now see what each test measures and analyze its results. All values refer to ms and the loop size was chosen so that I'd weed out any fluctuations of the CPU due to unrelated processes. Figure 1 shows a snapshot of the list of services that run during the test and Figure 2 shows a snapshot of the process list from the TaskManager of the Windows OS. Inside parentheses I include the ratio of performance for each JVM when compared to the Sun 1.4.1 JVM. Hence, a value smaller than 1 means that the JVM is faster than Sun's 1.4.1 JVM, and a value larger than 1 means that the JVM is slower than Sun's 1.4.1 JVM; therefore the value inside the parentheses will always be equal to 1 for the last column.
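The parenthesized ratio is simply a JVM's time divided by the Sun 1.4.1 time for the same test. A small sketch of that arithmetic follows; the class name RelativeTiming and the sample timings in main are made up for illustration and are not taken from the tables.

```java
import java.util.Locale;

final class RelativeTiming {
    /** Ratio of a JVM's time to the reference JVM's time; below 1 means faster than the reference. */
    static double ratio(long jvmMillis, long referenceMillis) {
        if (referenceMillis <= 0) {
            throw new IllegalArgumentException("reference time must be positive");
        }
        return (double) jvmMillis / referenceMillis;
    }

    /** Formats a table cell as "time (ratio)". */
    static String cell(long jvmMillis, long referenceMillis) {
        return String.format(Locale.ROOT, "%d (%.2f)", jvmMillis, ratio(jvmMillis, referenceMillis));
    }

    public static void main(String[] args) {
        long reference = 800; // hypothetical Sun 1.4.1 timing in ms
        System.out.println(cell(1200, reference)); // a JVM slower than the reference
        System.out.println(cell(800, reference));  // the reference column itself
    }
}
```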
1. T1 measures the performance for typical numerical operations. I use a long and two double numbers, and perform an addition, a multiplication, and a division with constant numbers.
2. T2 defines a number of variables of type String and uses the method equals to compare them. Inside the loop I use several different string comparisons, since the speed of the algorithm is not uniform across all possible strings. Hence, my results will give a good estimate of the method's performance for strings that are equal in length or vary by one character.
3. T3 has the same setup as T2 but uses the method equalsIgnoreCase.
4. T4 has the same setup as T2 and T3 but uses the method compareTo.
5. T5 tests the performance of some commonly used mathematical functions. The class java.lang.Math has various useful mathematical functions. We test the method that creates a random number, random(); the method that calculates the cosine of an angle, cos; the method round that returns the closest long to its argument of type double; the method that calculates the absolute value of a number, abs; the methods that return the exponential and the logarithm of an argument, exp and log, respectively; the method that gives us the maximum between two numbers, max; the method that gives us the square root of a number, sqrt; and finally the method that raises one number to a certain power, pow.
6. Another common task for many applications is the reading of a properties file. This is usually done by reading the file via the java.io API and loading the values on a Properties class that's provided in the java.util package. I do just that in T6.
7. As you may very well know by now, string concatenation is much slower than the append method of a StringBuffer. T7 tests how quickly this works for the various JVMs under study. Some time is spent to find the length of the StringBuffer each time and delete all its content so that the StringBuffer is empty at the beginning of each iteration. In total, we have six append calls, one length, and one delete.
8. In T8 I measure the performance of the various classes that are needed to encrypt and decrypt a 128-character string. I use the SunJCE provider and a triple DES key spec. In particular, I initialize the Cipher with "DESede/ECB/PKCS5Padding." Jikes does not run with exactly the same code in this case, so an N/A appears in the corresponding position of the table.
9. In T9 I serialize, write to the disk, read from the disk, and deserialize a Vector object.
10. In T10 I test the performance of the method add in an ArrayList.
11. In T11 I test the performance of the method add in a Vector.
12. In T12 I test the performance of the method add in a HashSet.
13. Since the above three tests add Random objects to the various collections, I run a test (T13) that measures the time that the generation of these objects takes. These times will be referenced side by side with the times that I obtain from T10, T11, and T12.
14. In T14 I test the performance of the method remove in an ArrayList.
15. In T15 I test the performance of the method remove in a Vector.
16. In T16 I test the performance of the method remove in a HashSet.
17. A typical way of iterating through the elements of a collection is by using the Iterator object. In T17, I obtain the Iterator of an ArrayList and iterate through all its elements.
18. In T18 I obtain the Iterator of a Vector and iterate through all its elements.
19. In T19 I obtain the Iterator of a HashSet and iterate through all its elements.
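To make one of these descriptions concrete, the loop body of T7 (six append calls, one length, and one delete) might look like the sketch below. The class name AppendLoop, the appended literals, and the loop size are hypothetical; the actual test source is in the download.

```java
final class AppendLoop {
    /** Repeats six appends, one length call, and one delete; returns the last length observed. */
    static int run(int iterations) {
        StringBuffer buffer = new StringBuffer();
        int lastLength = 0;
        for (int i = 0; i < iterations; i++) {
            buffer.append("a");
            buffer.append("bb");
            buffer.append("ccc");
            buffer.append("dddd");
            buffer.append("eeeee");
            buffer.append("ffffff");
            lastLength = buffer.length();  // 1 + 2 + 3 + 4 + 5 + 6 characters
            buffer.delete(0, lastLength);  // empty the buffer for the next iteration
        }
        return lastLength;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        int length = run(1_000_000); // hypothetical loop size
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("length=" + length + " elapsed=" + elapsed + "ms");
    }
}
```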
I cannot emphasize enough the importance of the data on the performance of a method call when attempting to do micro-benchmarks. The space that is usually covered by the arguments of an operation, or of a method call, is vast. Although indicative values can be obtained, the results are applicable only for the space of the arguments that they cover. With that "alert" status in mind, let's proceed and draw some conclusions.
The results for T1 (see Table 1) show that the Sun JVMs take advantage of the fact that the second operand is a constant and achieve some aggressive optimization of arithmetic operations. Nevertheless, all JVMs are quite fast and achieve billions of operations per second.
The results of T2, T3, and T4 (see Table 2) show that compareTo is two to three orders of magnitude faster than either of the equals methods. In my study (not shown here) I have found that the method equalsIgnoreCase of the class String is a lot faster than the method equals of the same class when the majority of the compared data have different lengths. There is a good reason for this, of course: equalsIgnoreCase first checks the length of the two strings. The point is that if you know that piece of information and you happen to use string comparisons extensively, you can take advantage of it without paying a penalty. If case does matter, then obviously this is not appropriate!
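For reference, the three calls compared in T2 through T4 can be exercised side by side as in the sketch below. The class name StringCompareLoop and the sample strings (equal in length or differing by one character, as described for T2) are assumptions, and the counts it returns only illustrate the semantics of each method, not its speed.

```java
final class StringCompareLoop {
    /** Counts candidates matching the reference under equals, equalsIgnoreCase, and compareTo == 0. */
    static int[] countMatches(String reference, String[] candidates) {
        int equal = 0;
        int equalIgnoringCase = 0;
        int comparedEqual = 0;
        for (String candidate : candidates) {
            if (reference.equals(candidate)) equal++;
            if (reference.equalsIgnoreCase(candidate)) equalIgnoringCase++;
            if (reference.compareTo(candidate) == 0) comparedEqual++;
        }
        return new int[] {equal, equalIgnoringCase, comparedEqual};
    }

    public static void main(String[] args) {
        // Hypothetical data: same length with a case change, or one character longer.
        String[] candidates = {"benchmark", "benchmarK", "Benchmark", "benchmarks"};
        int[] counts = countMatches("benchmark", candidates);
        System.out.println("equals=" + counts[0]
                + " equalsIgnoreCase=" + counts[1]
                + " compareTo==0=" + counts[2]);
    }
}
```

Wrapping each call in its own timed loop, with its own baseline, would reproduce the structure of the three tests.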
The results of T5 (see Table 3) show that the mathematical functions are two orders of magnitude faster with Jikes than with any of the Sun JVMs. Hence, if you rely heavily on computing cosines and logarithms, you should definitely take that into consideration when picking your JVM. However, the results of T6 and T7 show that there is not really a difference between the JVMs when it comes to loading a properties file and using the append method, respectively.
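An aggregate loop in the spirit of T5 might look like the following sketch. The class name MathMix, the way the results are summed, and the argument range are assumptions made so the example is self-contained; the article's Test5.java may combine the calls differently.

```java
final class MathMix {
    /** Applies the functions listed for T5 (other than random) to x and sums the results. */
    static double mix(double x) {
        return Math.cos(x) + Math.round(x) + Math.abs(-x) + Math.exp(x)
                + Math.log(x) + Math.max(x, 1.0) + Math.sqrt(x) + Math.pow(x, 2.0);
    }

    /** Calls mix on random arguments in [1, 2) and returns a running sum so the work stays live. */
    static double run(int iterations) {
        double sum = 0.0;
        for (int i = 0; i < iterations; i++) {
            sum += mix(Math.random() + 1.0); // the offset keeps log's argument positive
        }
        return sum;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        double sum = run(1_000_000); // hypothetical loop size
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("sum=" + sum + " elapsed=" + elapsed + "ms");
    }
}
```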
The results of T8 (see Table 4) show that encryption-related methods in the Sun 1.4.x JVMs are an order of magnitude faster than the same methods in Sun 1.3.1. Thus, if encrypting and decrypting is bread and butter for you, you have one more reason to upgrade your JVM!
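The operation timed in T8 can be approximated by the round trip below. It is a sketch under assumptions: the class name TripleDesRoundTrip and the sample text are hypothetical, and the key is generated with KeyGenerator for brevity rather than built from the triple DES key spec the test uses. The transformation string is the one quoted for T8.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

final class TripleDesRoundTrip {
    /** Builds a string of the requested length from a repeating alphabet. */
    static String sample(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + i % 26));
        }
        return sb.toString();
    }

    /** Encrypts and then decrypts the text with a freshly generated triple DES key. */
    static String roundTrip(String plainText) {
        try {
            SecretKey key = KeyGenerator.getInstance("DESede").generateKey();
            Cipher cipher = Cipher.getInstance("DESede/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key);
            byte[] encrypted = cipher.doFinal(plainText.getBytes(StandardCharsets.UTF_8));
            cipher.init(Cipher.DECRYPT_MODE, key);
            return new String(cipher.doFinal(encrypted), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("triple DES round trip failed", e);
        }
    }

    public static void main(String[] args) {
        String text = sample(128); // 128 characters, as in T8
        long start = System.currentTimeMillis();
        boolean same = roundTrip(text).equals(text);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("round trip ok=" + same + " elapsed=" + elapsed + "ms");
    }
}
```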
The results of T9 (see Table 5) show that serialization to (upper readings) and deserialization from (lower readings) a file with the Sun 1.3.1 JVM is faster than any other JVM. Jikes is faster than both of the Sun 1.4.x JVMs. However, all the JVMs are in the same order of magnitude in terms of the time spent to accomplish the task.
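The two readings correspond to two separately timed steps, which can be sketched as below. The class name VectorFileRoundTrip, the temporary file, and the Integer payload are assumptions; the sketch uses current Java syntax (generics, try-with-resources) that the compilers in the study would not accept.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Vector;

final class VectorFileRoundTrip {
    /** Creates a temporary file that is deleted when the JVM exits. */
    static File tempFile() {
        try {
            File file = File.createTempFile("vector", ".ser");
            file.deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Serializes the Vector and writes it to the file. */
    static void write(File file, Vector<Integer> data) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file and deserializes the Vector stored in it. */
    @SuppressWarnings("unchecked")
    static Vector<Integer> read(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Vector<Integer>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Vector<Integer> data = new Vector<>();
        for (int i = 0; i < 100_000; i++) data.add(i); // hypothetical payload
        File file = tempFile();
        long t0 = System.currentTimeMillis();
        write(file, data);
        long t1 = System.currentTimeMillis();
        Vector<Integer> copy = read(file);
        long t2 = System.currentTimeMillis();
        System.out.println("write=" + (t1 - t0) + "ms read=" + (t2 - t1)
                + "ms equal=" + copy.equals(data));
    }
}
```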
The results of T10, T11, and T12 (see Table 6) show that the add method is equally fast for an ArrayList and a Vector. However, Jikes is faster by a factor of at least two, regardless of the Collection class that's used. A somewhat disturbing result is that the Sun 1.4.x JVMs seem to be slower than the Sun 1.3.1 JVM for the ArrayList and the Vector classes. This result consistently appeared in the runs that were made in preparation for this article, so there should be a reason for it. As we'll see later, that doesn't happen with the remove method or the iterator. It would be nice if the Sun engineers would take a look at it.
As expected, the add method in the HashSet class is slower than the same method for the ArrayList and the Vector classes. It's extremely slow in Sun 1.3.1, by three orders of magnitude when compared to all other JVMs. The removal and the iteration in the HashSet class are also slow with the Sun 1.3.1 JVM, by an order of magnitude compared to all the other JVMs.
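The shape of T10 through T13 can be sketched as follows: the same add loop is run against each collection, and a separate loop measures the cost of creating the Random objects alone. The class name CollectionAddLoop and the element count are assumptions, and the sketch uses generics, so it targets a current JDK rather than the JVMs in the study.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Random;
import java.util.Vector;

final class CollectionAddLoop {
    /** Time in ms to create the Random objects alone; the counterpart of T13. */
    static long timeCreation(int count) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            new Random();
        }
        return System.currentTimeMillis() - start;
    }

    /** Time in ms to create count Random objects and add each one to the target. */
    static long timeAdds(Collection<Random> target, int count) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            target.add(new Random());
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        int count = 1_000_000; // hypothetical loop size
        System.out.println("creation only: " + timeCreation(count) + "ms");
        System.out.println("ArrayList.add: " + timeAdds(new ArrayList<>(), count) + "ms");
        System.out.println("Vector.add:    " + timeAdds(new Vector<>(), count) + "ms");
        System.out.println("HashSet.add:   " + timeAdds(new HashSet<>(), count) + "ms");
    }
}
```

Random does not override equals or hashCode, so every object added to the HashSet is a distinct element and all three collections end up the same size.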
The results of T14 and T15 (see Table 7) show that the remove method is equally fast regardless of the JVM. T16 shows that the same method is faster for a HashSet than for an ArrayList or a Vector; however, it is an order of magnitude slower on the Sun 1.3.1 JVM than on the other tested JVMs.
Finally, the results of T17, T18, and T19 (see Table 8) show that iteration with the Sun JVMs is faster than iteration with Jikes by at least a factor of two; the caveat here is the problematic iteration of the HashSet on Sun 1.3.1, mentioned earlier.
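The removal and iteration tests (T14 through T19) follow the same pattern for each collection, roughly as in the sketch below. The class name RemoveAndIterate, the Integer elements, and the element count are assumptions for illustration; only the use of remove and of an explicit Iterator comes from the test descriptions.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Vector;

final class RemoveAndIterate {
    /** Fills the collection with the Integers 0..count-1. */
    static void fill(Collection<Integer> target, int count) {
        for (int i = 0; i < count; i++) {
            target.add(i);
        }
    }

    /** Walks the collection with an explicit Iterator and returns the number of elements visited. */
    static int iterate(Collection<Integer> source) {
        int visited = 0;
        for (Iterator<Integer> it = source.iterator(); it.hasNext(); ) {
            it.next();
            visited++;
        }
        return visited;
    }

    /** Removes 0..count-1 by value and returns how many removals succeeded. */
    static int removeEach(Collection<Integer> target, int count) {
        int removed = 0;
        for (int i = 0; i < count; i++) {
            if (target.remove(Integer.valueOf(i))) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        int count = 20_000; // hypothetical collection size
        List<Collection<Integer>> targets = new ArrayList<>();
        targets.add(new ArrayList<>());
        targets.add(new Vector<>());
        targets.add(new HashSet<>());
        for (Collection<Integer> target : targets) {
            fill(target, count);
            long t0 = System.currentTimeMillis();
            iterate(target);
            long t1 = System.currentTimeMillis();
            removeEach(target, count);
            long t2 = System.currentTimeMillis();
            System.out.println(target.getClass().getSimpleName()
                    + ": iterate=" + (t1 - t0) + "ms remove=" + (t2 - t1) + "ms");
        }
    }
}
```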
The study presented here should not be considered complete. The purpose of this article is to distinguish the many scales that may affect the performance of Java applications, pay particular attention to what I call micro-performance, and to suggest one way to tackle the problem through detailed benchmarks of the APIs. The article demonstrates these ideas by employing a very small, but quite popular, portion of the APIs.
The performance of the Java API method calls clearly depends on the JVM. However, the differences are not uniform - you can't claim that JVM-1 is always faster than JVM-2. That in itself is not news, of course, but it is important to quantify the differences because you may find that for your own application JVM-1 is better than JVM-2, and you may want to instill micro-benchmarking in your own or your team's coding practice.
In the competitive market of enterprise applications, it's worthwhile to get as much performance as you can out of the standard APIs. To know how to get that advantage, we need to quantify the performance of the Java language at the level of its APIs. The point is that if you can get faster code without "collateral damage" - to use a, regrettably, quite fashionable term at the time of this writing - why not do it?
It was my initial desire to include the analysis of the GC output in this article; however, this would double, if not triple, its size. Nonetheless, I strongly recommend you collect such output (e.g., by using -verbose:gc as a flag to the JVM) and observe how the various JVMs collect their garbage over time. That's quite instructive. In addition, if you feel the urge to truly understand your JVM, use its optional flags as new parameters in your analysis of the results.
For any application of substantial size and complexity, proper micro-performance tuning may produce significant speedups. These are not comparable to the speedups that you can get by choosing a better architecture or a better algorithmic approach; at those first stages of performance tuning, even a factor of 10 is possible in some cases. But when your architecture is appropriate and your algorithms optimal, micro-tuning your application is a nice-to-have weapon in your arsenal. If you have already adopted micro-tuning during your code implementation, you're probably grinning right now in a self-satisfied manner!
Resources
Shirazi, J. (2003). Java Performance Tuning, 2nd Edition. O'Reilly & Associates, Inc.
Wilson, S., and Kesselman, J. (2000). Java Platform Performance: Strategies and Tactics. Addison-Wesley Pub Co. http://java.sun.com/docs/books/performance/
BEA WebLogic JRockit Virtual Machine: www.bea.com/products/weblogic/jrockit/index.shtml
Java 2 Platform, Standard Edition (J2SE): http://java.sun.com/j2se/
Blackdown Project: www.blackdown.org
Java BluePrints: http://java.sun.com/blueprints/enterprise/
Java 2 Platform, Standard Edition (J2SE): http://java.sun.com/j2se/1.4/performance.guide.html
Haggar, P. "Improving Java Code Performance": www.nejug.org/2000/sept00_slides/javaperf.htm
About The Author
Haralambos Marmanis is a software architect at Zeborg. He has more than 12 years of software development experience in academia and the industry. He received his PhD in applied mathematics and scientific computing from Brown University. His interest is in multitier, high performance, enterprise software.
"Performance of Java Compilers: An Empirical Study"
Vol. 8, Issue 6, p. 60