In early 2002 Intel became the first chip manufacturer to release a processor incorporating a new technology known as Simultaneous Multithreading, or SMT. Intel's SMT implementation (dubbed Hyper-Threading or HT) has been available in their Xeon processor line for over a year, with little fanfare. In April 2003, Intel announced that HT technology will be added to its desktop-focused Pentium 4 line of processors. With HT enabled on one of these new systems, the BIOS will present a single processor to the operating system as two logical processors.

As Java developers, we should all be excited about this new feature of Intel processors. The java.lang.Thread object was one of the key factors driving Java to the strong position it enjoys in the server-side applications market. Both client and server applications written in Java often make heavy use of threads. Indeed, even if an application does not use threads explicitly, all JVMs will use at least one background thread: the garbage collector. SMT holds the promise of significantly increasing Java's server-side performance by more completely utilizing existing processor cycles in multithreaded applications.

This article attempts to explain the concepts of Simultaneous Multithreading in layman's terms, presents the development of an n-thread benchmarking suite, and uses that suite to produce concrete results of multithreaded benchmarks on HT and non-HT systems. We'll investigate various operation types to determine the factors that affect Java performance enhancements on Hyper-Threaded processors. Finally, a series of conclusions and speculations is derived from the data collected.

Understanding Simultaneous Multithreading on Intel Processors
Intel processors with HT technology carry two copies of the processor's architectural state on the same chip. This second architectural state stores a second thread context. Conceptually, this type of processor architecture splits each physical processor into two or more logical processors. Physical SMT processors present themselves to the operating system as separate logical processors. As we'll see later, it can then become important for the operating system to be aware of and to differentiate between logical and physical processors. Figure 1 illustrates the difference between SMT and non-SMT processors.

Figure 1

What is the benefit of SMT? As it turns out, the more expensive processor resources can find themselves underutilized while an active thread performs long latency operations. A cache miss, for instance, will require the processor to make a request to main memory. The majority of the processor's resources remain idle for this period of time; however, the processor presents itself to the operating system as busy. SMT systems use this slice of time to execute the operations of another on-chip thread context.

SMT processors contain an onboard scheduler to interleave multiple threads operating on the physical processor. If a thread encounters a long latency, the processor will immediately execute the instructions of the second on-chip processor state. For two threads accessing the same processor resources, the onboard scheduler will interleave the threads much the same as a software thread scheduler. This interleaving has a small amount of overhead, which can decrease the efficiency of the processor in certain situations. On an aggregate basis, however, processor performance is increased.

With SMT, performance can increase or decrease depending on the work that each thread is doing on adjacent logical processors. Various papers (see references) studying multithreaded performance report generally positive results, with some research indicating perceived performance gains as high as 50%.

HT-Enabled Systems
Intel Hyper-Threading requires support from three fundamental components of a system:

  1. The processor
  2. The chipset
  3. The operating system
Processors Supporting HT
Hyper-Threading was incorporated into the Xeon class processors in early 2002. Xeon is not to be confused with Pentium III Xeon. When Intel changed the Xeon's core to P4, it dropped the P4 designation, calling the processor simply Xeon. Recently, HT has found its way to the desktop P4 processor. Not all processors in each of these processor classes are capable of Hyper-Threading, however.

Table 1 indicates which processors support Hyper-Threading. The table also indicates factors that you can use to determine whether a given Intel processor supports HT.

Table 1

With the release of the 3.06GHz Pentium 4, Intel changed the P4 logo, incorporating the letters H and T to indicate that it's a Hyper-Threading processor.

All recent Xeon processors support Hyper-Threading, but again, be sure to watch out for the 256KB L2 Cache version, which does not.

Chipset Support for HT
Not all chipsets support HT. Check with your chipset manufacturer to ensure that you can enable and disable HT support via the BIOS.

All HT chipsets interleave processor numbering to help less sophisticated thread schedulers make complete use of available physical processors. The chipset will present the logical processors to the OS as follows:

Logical CPU0 = Physical CPU0, Logical CPU0
Logical CPU1 = Physical CPU1, Logical CPU0
Logical CPU2 = Physical CPU0, Logical CPU1
Logical CPU3 = Physical CPU1, Logical CPU1
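
The interleaved numbering above follows a simple pattern, which the following minimal Java sketch decodes. The class and method names are hypothetical and exist only for illustration. The sketch assumes every physical processor exposes the same number of thread contexts, as in the four-row listing above.

```java
/** Hypothetical helper that decodes the interleaved logical CPU numbering. */
final class LogicalCpuMap {
    private LogicalCpuMap() {}

    /** Physical processor that hosts the given OS-visible logical CPU. */
    static int physicalOf(int logicalCpu, int physicalCount) {
        return logicalCpu % physicalCount;
    }

    /** Thread context (0 or 1 with HT) within that physical processor. */
    static int contextOf(int logicalCpu, int physicalCount) {
        return logicalCpu / physicalCount;
    }

    /** Formats one row in the same style as the listing in the article. */
    static String describe(int logicalCpu, int physicalCount) {
        return "Logical CPU" + logicalCpu
                + " = Physical CPU" + physicalOf(logicalCpu, physicalCount)
                + ", Logical CPU" + contextOf(logicalCpu, physicalCount);
    }
}
```

Because the low-numbered logical CPUs land on different physical processors, a scheduler that simply fills CPU0 and CPU1 first still keeps both physical processors busy.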

Operating Systems Supporting HT
Given a processor and chipset that support Hyper-Threading, the operating system must also be HT aware. Table 2 shows the OS support for several currently available operating systems commonly run on Intel-based hardware.

Table 2

The Windows 2000 operating systems do not differentiate between logical and physical processors. Therefore a 32-processor HT system will support only 32 logical processors. It will work; however, the additional processor resources will not be utilized.

Windows users should check software licensing agreements to confirm that they recognize logical processors. Generally XP will support licensing on a per physical CPU basis, while Windows 2000 will see logical processors as physical processors for licensing purposes.

Figure 2 shows a Windows XP Pro task manager on a dual-processor HT system. Note the four distinct "CPU Usage History" charts depicting the four logical processors.

Figure 2

The Linux 2.4 kernel began supporting Hyper-Threading on the Intel Xeon processor as of version 2.4.18. Like the Windows 2000 family of products, however, the thread scheduler in 2.4 does not distinguish between logical and physical processors, and it lacks many other SMT scheduler optimizations. This can lead to degraded performance when two threads are scheduled concurrently on one physical processor while the other physical processor is left idle.

As of kernel version 2.5.32, the thread scheduler was updated with advanced features to support Hyper-Threading. The 2.5.x kernel is the development branch that will become the 2.6 kernel. The exact release schedule for 2.6 is unknown, but in a recent interview Linus Torvalds indicated that 2.6 would likely be released in Q4 2003.

Figure 3 shows a Red Hat 7.3 installation running the 2.4.18 kernel with Hyper-Threading enabled on the system. Note the four CPU states indicated as CPU0-CPU3 on top. Also note that CPU0 is running at 100.1% utilization. Wow, Hyper-Threading is cool!

Figure 3

Threaded Benchmarking on HT and Non-HT Systems
Our goal here is to understand the effects of Hyper-Threading processors on the performance of multithreaded Java applications. To do this, we need a test bed that will allow us to execute heavily threaded operations and track performance variations against thread count in HT and non-HT systems.

Thread Bench Design
At a basic level, the test bed should be able to execute multiple operations across n threads, observing the total throughput of operations per unit of time for a run. On a dual-processor system, we should see nearly double the performance on a CPU-intensive operation using two threads instead of one. The performance of CPU-intensive threaded operations on HT systems will vary based on the operations and the level of concurrency possible on a single physical processor.

Our focus here is to explore which types of operations will and will not benefit from HT technology. Given this we need to be able to quickly implement and test multiple types of operations.

There are several Java benchmarking systems available on the market. Many are older and focused on applet performance. Some newer benchmark systems like VolanoMark or SPECjbb2000 test the threaded performance of systems; however, they don't allow us to customize and focus on specific individual operations that could affect performance on an HT system.

These requirements drove the design and coding of an n-thread Java benchmark framework. The framework supports pluggable operation classes and produces plottable results for a range of thread counts from a single test suite execution.

Figure 4 presents a functional/UML diagram for the system design.

Figure 4

The resulting benchmarking framework has the following features:

  • Initialization of operations on the JIT: Modern JIT compilers will optimize "hot spots" in the code. The performance of any given operation will improve over the life of the VM, so the ThreadBench framework gives operations a chance to initialize on the JIT before the tests commence.
  • Operation abstraction: By developing a generic operation interface and using dynamic class loading and initialization of the operation to be tested, we can quickly prototype and test various processor-intensive operations.
  • Test suites: Using test suites, ThreadBench runs a given operation configuration through several iterations of the test with different numbers of threads. This allows a series of tests to be repeatedly run on several machine configurations with minimal effort.
  • Multiple runs: To smooth out anomalies in the test, each data point is created by averaging data from several runs. This is configurable; some tests have a larger standard deviation than others.

    The code for this article can be downloaded from below.
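
Independently of the article's downloadable code, the features listed above can be illustrated with a much smaller sketch. Everything here is an assumption made for illustration: the Operation interface, the SqrtOperation example, and the MiniThreadBench class are hypothetical names and are not the ThreadBench implementation. The sketch shows a pluggable operation, optional loading by class name, a warm-up pass for the JIT, and averaging over several runs per thread count.

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.Supplier;

/** Hypothetical operation contract; the article's actual interface may differ. */
interface Operation {
    void setup();    // untimed preparation
    void execute();  // one timed unit of work
}

/** Illustrative CPU-bound operation: a short floating point loop. */
class SqrtOperation implements Operation {
    private double sink;
    public void setup() { sink = 0.0; }
    public void execute() {
        for (int i = 1; i <= 10_000; i++) sink += Math.sqrt(i);
    }
}

final class MiniThreadBench {
    private MiniThreadBench() {}

    /** Builds a factory that instantiates the named Operation class reflectively. */
    static Supplier<Operation> byName(String className) {
        return () -> {
            try {
                return (Operation) Class.forName(className)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("cannot load " + className, e);
            }
        };
    }

    /** Runs opsPerThread executions on each thread; returns operations per second. */
    static double runOnce(Supplier<Operation> factory, int threads, int opsPerThread) {
        CountDownLatch startGate = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            Operation op = factory.get();
            op.setup();
            workers[i] = new Thread(() -> {
                try {
                    startGate.await(); // all workers start together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int k = 0; k < opsPerThread; k++) op.execute();
            });
            workers[i].start();
        }
        long begin = System.nanoTime();
        startGate.countDown();
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - begin);
        return (double) threads * opsPerThread * 1e9 / elapsedNanos;
    }

    /** Warms up once, then returns the mean throughput for 1..maxThreads threads. */
    static double[] suite(Supplier<Operation> factory, int maxThreads,
                          int opsPerThread, int runs) {
        runOnce(factory, 1, opsPerThread); // warm-up pass, result discarded
        double[] mean = new double[maxThreads];
        for (int n = 1; n <= maxThreads; n++) {
            double sum = 0.0;
            for (int r = 0; r < runs; r++) sum += runOnce(factory, n, opsPerThread);
            mean[n - 1] = sum / runs;
        }
        return mean;
    }
}
```

A call such as MiniThreadBench.suite(SqrtOperation::new, 4, 1000, 5) returns one averaged throughput figure per thread count, which can then be plotted. Passing MiniThreadBench.byName("SqrtOperation") instead selects the operation by class name at run time.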

    Factors Affecting Performance
    Use of Threads

    This seems obvious; however, it needs to be mentioned: single-threaded applications (often client applications) will see little performance gain. Server-side Java applications make extensive use of threads, making them excellent candidates for performance improvement from SMT.

    Nonthreaded applications may still see some benefit. Java's garbage collection and background JIT compilers operate as daemon threads in the local JVM. In addition, concurrent processes could make use of the additional processor resources.
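
One way to see that a nominally single-threaded program is not alone in its JVM is to list the live threads. The sketch below uses the standard Thread.getAllStackTraces() call; the class name is hypothetical. What appears varies by JVM: internal garbage collection and compiler threads may be hidden from this view, so the list usually shows helper threads rather than every background worker.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that inspects the threads visible to Java code. */
final class BackgroundThreads {
    private BackgroundThreads() {}

    /** Names of all live threads the JVM exposes through the Thread API. */
    static List<String> names() {
        List<String> result = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            result.add(t.getName());
        }
        return result;
    }

    /** Number of live threads that are marked as daemon threads. */
    static int daemonCount() {
        int count = 0;
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isDaemon()) count++;
        }
        return count;
    }
}
```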

    The Operating System's Thread Scheduler
    In an HT system, a single physical processor is presented to the OS as two logical processors. This requires the OS to differentiate between physical and logical processors and make intelligent decisions about thread scheduling.

    The thread scheduler on a dual-processor HT system will see four logical processors. A poor thread scheduler could schedule two CPU-intensive threads onto separate logical processors representing the same physical processor. This would result in a perceived performance decrease on an HT-based system.
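
From inside Java, the processor count reported by the runtime is generally the number of logical processors the operating system exposes, so on a dual-processor HT system it would normally be four. The standard library offers no portable way to tell logical from physical processors. The sketch below reads that count and builds a list of thread counts to test; the class name and the choice to sweep up to twice the processor count are assumptions for illustration.

```java
/** Hypothetical helper for choosing thread counts from the reported CPU count. */
final class ProcessorCount {
    private ProcessorCount() {}

    /** Processors available to the JVM, as reported by the operating system. */
    static int logicalProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    /** Thread counts 1..2n, where n is the reported processor count. */
    static int[] threadSweep() {
        int[] counts = new int[2 * logicalProcessors()];
        for (int i = 0; i < counts.length; i++) counts[i] = i + 1;
        return counts;
    }
}
```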

    CPU Resource Utilization
    Hyper-Threaded processors do not duplicate all available resources. Two threads performing fundamentally similar operations on separate logical processors will likely see little performance gain. For HT to be a benefit, the two threads coexisting on a physical CPU must perform a variety of operations to allow the processor to make better use of latency.

    Performance of Threaded Benchmarks on HT and Non-HT Systems
    Tests were run on two HT-capable dual-processor systems (see Table 3).

    Table 3

    Hyper-Threading requires BIOS support, making it easy to enable and disable the feature in the boot setup program for various runs.

    Each test was run with the Sun JDK 1.4.1_02, using the -server flag on the Linux and XP systems. Tests were also run with the IBM 1.4.0 JVM, with no command-line flags, on the Linux system.

    The tests devised are by no means comprehensive. The goal was to stress the processor, using different processor resources, to try to gain some insight into the effects of SMT processing. The series of tests was run on each of the above systems, with and without HT enabled. Each of the operation algorithms tested is briefly described, followed by results and some discussion and interpretation.

    Note: To save space, the XP and Linux tests are shown on the same plots. The data should not be directly compared, however. The tests were run on different physical hardware; indeed, the processor speeds on the XP machine were higher than on the Linux machine.

    Test 1: Gaussian Elimination, 500x500 matrix (Floating point intensive)
    Gaussian elimination is a very common algorithm used to solve systems of linear equations, a common task in finite element applications, weather simulation, coordinate transformations, and economic modeling among other things. Algorithmic optimizations are often done for sparse/banded matrices; however, the core of the work is fundamentally the same: large numbers of floating point calculations are required.

    To simulate this, a Gaussian elimination algorithm with scaled partial pivoting and back substitution is used (see Figure 5). A full matrix is constructed of random doubles using Math.random(). The population of the matrix is carried out in the setup() method and is not considered part of the operation.
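
For readers unfamiliar with the algorithm, the following is a minimal sketch of Gaussian elimination with scaled partial pivoting and back substitution. It is not the article's benchmark code; the class and method names are hypothetical, and the inputs are overwritten in place to keep the example short.

```java
/** Hypothetical solver illustrating scaled partial pivoting. */
final class GaussSolver {
    private GaussSolver() {}

    /** Solves a*x = b for a square matrix a; both a and b are overwritten. */
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        // Scale factor for each row: its largest absolute coefficient.
        double[] scale = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                scale[i] = Math.max(scale[i], Math.abs(a[i][j]));
            }
            if (scale[i] == 0.0) throw new ArithmeticException("singular matrix");
        }
        for (int k = 0; k < n - 1; k++) {
            // Pick the row whose pivot candidate is largest relative to its scale.
            int p = k;
            double best = Math.abs(a[k][k]) / scale[k];
            for (int i = k + 1; i < n; i++) {
                double ratio = Math.abs(a[i][k]) / scale[i];
                if (ratio > best) { best = ratio; p = i; }
            }
            if (best == 0.0) throw new ArithmeticException("singular matrix");
            // Swap rows k and p, together with their right-hand sides and scales.
            double[] row = a[k]; a[k] = a[p]; a[p] = row;
            double t = b[k]; b[k] = b[p]; b[p] = t;
            t = scale[k]; scale[k] = scale[p]; scale[p] = t;
            // Eliminate column k from every row below the pivot.
            for (int i = k + 1; i < n; i++) {
                double m = a[i][k] / a[k][k];
                for (int j = k; j < n; j++) a[i][j] -= m * a[k][j];
                b[i] -= m * b[k];
            }
        }
        if (a[n - 1][n - 1] == 0.0) throw new ArithmeticException("singular matrix");
        // Back substitution on the upper-triangular system.
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = b[i];
            for (int j = i + 1; j < n; j++) sum -= a[i][j] * x[j];
            x[i] = sum / a[i][i];
        }
        return x;
    }
}
```

A benchmark in the spirit of the one described would fill a 500x500 array with Math.random() values during setup and time only the call to solve.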

    Figure 5

    This operation carries out large numbers of simple floating point operations on doubles. All calculations are done in the Java call stack, though it's highly likely that the code was optimized by the JIT before the tests were run.

    It seems that this operation does not scale well into threads on any JVM. The Sun VM on Windows XP with Hyper-Threading does significantly worse than the Linux JVMs with or without Hyper-Threading. There are no synchronizations in the operation whatsoever. Poor scaling into threads could be due to memory barriers, or contention for a bus or main memory.

    Test 2: Calculation of 2000! (Integer intensive)
    Calculation of factorial (! operator) is used often in probability calculations. It's used as a portion of the formula for combinations and permutations. Factorial is defined as follows:

    N! = 1 x 2 x 3 x 4 x ... x N

    Combinations are an interesting calculation in poker, and illustrate a potential use of the factorial operator. To calculate the number of five-card combinations in a 52-card deck, we use the combinations formula:

    Possible poker hands = 52C5 = 52! / (5! x (52-5)!)

    Factorial calculations of even small integers grow rapidly, requiring the use of the java.math.BigInteger class. Calculations of factorials result in a large number of integer multiplications.

    The factorial calculations shown in Figure 6 do show some consistent, limited benefit from Hyper-Threading. Indeed, for four threads the IBM JVM shows a 17% increase in performance using an HT-enabled system.

    Figure 6

    Incidentally, there are 2,598,960 five-card combinations in a 52-card deck.
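
The factorial and combinations formulas translate directly into java.math.BigInteger arithmetic. The sketch below is illustrative only, with hypothetical class and method names, and is not the article's benchmark operation. BigInteger is needed because factorials outgrow a 64-bit long at 21!.

```java
import java.math.BigInteger;

/** Hypothetical helper for factorials and combinations with BigInteger. */
final class Factorials {
    private Factorials() {}

    /** Computes n! by repeated BigInteger multiplication. */
    static BigInteger factorial(int n) {
        if (n < 0) throw new IllegalArgumentException("n must be non-negative");
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    /** Computes nCk = n! / (k! x (n-k)!) for 0 <= k <= n. */
    static BigInteger combinations(int n, int k) {
        if (k < 0 || k > n) throw new IllegalArgumentException("need 0 <= k <= n");
        return factorial(n).divide(factorial(k).multiply(factorial(n - k)));
    }
}
```

Calling Factorials.combinations(52, 5) yields 2598960, matching the figure quoted above, and Factorials.factorial(2000) performs the multiplication chain that the benchmark exercises.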

    Test 3: 150K calculations of Math.tan() (Floating point, mixed stack)
    This test simply calculates the tangent of an angle 150,000 times in a tight loop (see Figure 7).
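
A tight loop of this kind might look like the sketch below. The class name, the starting angle, and the step size are assumptions; the article does not state which angle it used. The results are summed and returned so that the JIT cannot discard the calls as dead code.

```java
/** Hypothetical tangent loop in the spirit of Test 3. */
final class TanLoop {
    private TanLoop() {}

    /** Calls Math.tan once per iteration and returns the accumulated sum. */
    static double run(int iterations) {
        double sum = 0.0;
        for (int i = 0; i < iterations; i++) {
            // Vary the angle slightly so each call does real work.
            sum += Math.tan(0.5 + i * 1e-6);
        }
        return sum;
    }
}
```

One benchmark operation would then correspond to a single call to TanLoop.run(150_000).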

    Figure 7

    All Java threads have two call stacks: one for Java calls, the other for C calls. The java.lang.Math.tan(double) function is native, calculating an approximation of tangent with a 27th order polynomial. It's likely that the reason this operation scales so well into Hyper-Threading is the constant call stack switching, giving the processor time to utilize its secondary thread context.

    Test 4: Prime number search
    A prime number search operation was created using the BigInteger class and a very simplistic direct search factorization. The poor algorithm is not as important as the type of calculations being performed. This class performs a large number of BigInteger divisions.
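
A deliberately simplistic direct search of this kind can be sketched as follows. The class and method names are hypothetical and the code is not the article's operation class; it only shows how trial division turns into a long series of BigInteger divisions.

```java
import java.math.BigInteger;

/** Hypothetical trial-division prime test built on BigInteger. */
final class NaivePrimeSearch {
    private NaivePrimeSearch() {}

    /** Tests primality by dividing by every candidate d with d*d <= n. */
    static boolean isPrime(BigInteger n) {
        BigInteger two = BigInteger.valueOf(2);
        if (n.compareTo(two) < 0) return false;
        for (BigInteger d = two; d.multiply(d).compareTo(n) <= 0; d = d.add(BigInteger.ONE)) {
            // Each iteration costs one BigInteger division.
            if (n.mod(d).signum() == 0) return false;
        }
        return true;
    }

    /** Counts the primes in the range 2..limit using the test above. */
    static int countPrimesUpTo(int limit) {
        int count = 0;
        for (int i = 2; i <= limit; i++) {
            if (isPrime(BigInteger.valueOf(i))) count++;
        }
        return count;
    }
}
```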

    It is difficult to tell what is going on in Figure 8, beyond the fact that the IBM JVM is beating Sun's. The IBM JVM scales well into threading this operation. It does even better when Hyper-Threading is enabled. The Sun VM scales poorly into threads, and it becomes worse with additional thread contexts. You could speculate that this behavior is characteristic of a low-level synchronization contention issue in the Sun JVM.

    Figure 8

    Testing Summary
    The plots above give some general idea of how these various operations scale into threads. In most cases, the HT performance gains are modest. The following is a summary of performance differences seen with Hyper-Threading enabled versus disabled for each of the tested JVMs.

    IBM 1.4.0, Linux 2.4.18
    8-0.69%  2.41%

    Sun 1.4.1, Linux 2.4.18
    8-4.73%  -28.28%

    Sun 1.4.1, Windows XP Pro
    8-23.66%  -23.36%

    When I began this project, I fully expected to see marked performance gains using Hyper-Threading over identical hardware not using HT. In the course of testing, I've learned quite a bit about performance differences for Java on various platforms, hardware configurations, and virtual machines. Hyper-Threading is not the boon I had expected. In some situations, performance gains for HT reached the 75% mark, which is considerable. There was little significant performance degradation using HT, so using it seems to be largely on the upside.

    Perhaps the more important finding is that the IBM JVMs perform significantly better than the Sun JVMs. In addition, the IBM JVMs scaled far better with threads than did Sun's offering. If performance is of key concern, and you're not using some of the more esoteric features of the Sun JVM, IBM JVMs deserve serious consideration.

    Most server-side Java applications are not doing computationally intensive tasks. The tasks focus more heavily on socket IO ­ communicating with databases, clients via HTTP, RMI, Web services, and the like. Processors will be given plenty of socket IO wait time to schedule parallel tasks. For socket-IO-bound applications, be sure to consider the relative skill of your operating system in the IP arena.

    The introduction of Hyper-Threading on desktop P4 systems is also exciting. Java developers often develop on Windows or Linux-based desktop systems and deploy onto larger SMP and potentially SMT systems. HT will allow a desktop developer and user to see some of the benefits of threaded applications long before deployment to the higher-end systems.

    SMT technology is here to stay. Intel's Hyper-Threading implementation is sure to be the first of many. Chip industry watchers speculate that Simultaneous Multithreading and thread-level parallelism will spell the ultimate end of the "megahertz wars." A chip's performance will be tied less to its internal clock speed and more to the bells and whistles it incorporates. Other chip manufacturers are sure to follow suit, and all implementations will improve in quality over time.

    Operating systems are also continually improving their support for Hyper-Threading. It does seem strange that the performance on an XP system, which should be HT optimized, was often less HT friendly than the 2.4.18 Linux kernel, which is HT ignorant. As more sophisticated support for HT is built into operating systems, we should see more significant performance gains using HT in the Java world.

    The combination of Java and Linux in the datacenter is rapidly gaining ground on the Solaris/Java platform. The majority of these new Linux servers are running high-end Intel-based hardware. Hyper-Threading will give this trend a further push in the Linux direction.

    For now, given a piece of hardware that's HT capable, the configuration that offers the best performance under most conditions is the IBM 1.4.0 JVM on Linux with Hyper-Threading enabled.

    References


  • Microsoft license clarification for SMT systems: www.microsoft.com/nz/licensing/downloads/hyper_threading_processors_licensing_brief.doc

    Intel Processor Specsheets

  • Xeon: www.intel.com/products/server/processors/server/xeon/index.htm
  • Xeon DP: www.intel.com/design/xeon/prodbref/index.htm
  • Xeon MP: www.intel.com/products/server/processors/server/xeon_mp/index.htm
  • Pentium 4: http://developer.intel.com/design/pentium4/datashts/298643.htm
  • P4 Chipset matrix indicating HT support: www.intel.com/design/chipsets/linecard.htm
  • IBM Whitepaper on Linux and Hyper-Threading: www-106.ibm.com/developerworks/linux/library/l-htl/?dwzone=linux
  • LinuxWorld article indicating Q4 2003 release of 2.6 Kernel: www.linuxworld.com/story/33805.htm


  • Physical processor: A silicon-based hardware processor
  • Logical processor: A hardware/software system making pseudo-parallel use of a single physical processor
  • Simultaneous Multithreading (SMT): The use of logical processors to increase processing throughput on a single physical processor
  • Symmetric Multiprocessing (SMP): The use of multiple physical processors in parallel, each running separate threads of execution
  • Hyper-Threading: Intel's marketing name for its SMT technology on Xeon and Pentium 4 processors

    About The Author
    Paul Bemowski is an independent consultant, focusing on Java solutions to enterprise computing problems.

    "Hyper-Threading Java"
    Vol. 8, Issue 8, p. 45

    Source Code for this Article zip file ~227 KB

    All Rights Reserved
    Copyright ©  2004 SYS-CON Media, Inc.