In early 2002 Intel became the first chip manufacturer to release a
processor incorporating a new technology known as Simultaneous
Multithreading, or SMT. Intel's SMT implementation (dubbed Hyper-Threading
or HT) has been available in their Xeon processor line for over a year, with
little fanfare. In April 2003, Intel announced that HT technology will be
added to its desktop-focused Pentium 4 line of processors. With HT enabled
on one of these new systems, the BIOS will present a single processor to the
operating system as two logical processors.
As Java developers, we should all be excited about this new feature of
Intel processors. The java.lang.Thread object was one of the key factors
driving Java to the strong position it enjoys in the server-side
applications market. Both client and server applications written in Java
often make heavy use of threads. Indeed, even if an application does not use
threads explicitly, all JVMs will use at least one background thread: the
garbage collector. SMT holds the promise of significantly increasing Java's
server-side performance by more completely utilizing existing processor
cycles in multithreaded applications.
This article attempts to explain the concepts of Simultaneous
Multithreading in layman's terms, presents the development of an n-thread
benchmarking suite, and uses that suite to produce concrete results of
multithreaded benchmarks on HT and non-HT systems. We'll investigate various
operation types to determine the factors that affect Java performance
enhancements on Hyper-Threaded processors. Finally, a series of conclusions
and speculations is derived from the data collected.
Understanding Simultaneous Multithreading on Intel Processors
Intel processors with HT technology carry two copies of the processor's
architectural state on the same chip. This second architectural state stores
a second thread context. Conceptually, this type of processor architecture
splits each physical processor into two or more logical processors. Physical
SMT processors present themselves to the operating system as separate
logical processors. As we'll see later, it can then become important for the
operating system to be aware of and to differentiate between logical and
physical processors. Figure 1 illustrates the difference between SMT and
non-SMT processors.
What is the benefit of SMT? As it turns out, the more expensive
processor resources can find themselves underutilized while an active thread
performs long latency operations. A cache miss, for instance, will require
the processor to make a request to main memory. The majority of the
processor's resources remain idle for this period of time; however, the
processor presents itself to the operating system as busy. SMT systems use
this slice of time to execute the operations of another on-chip thread
context.
SMT processors contain an onboard scheduler to interleave multiple
threads operating on the physical processor. If a thread encounters a long
latency, the processor will immediately execute the instructions of the
second on-chip processor state. For two threads accessing the same processor
resources, the onboard scheduler will interleave the threads much the same
as a software thread scheduler. This interleaving has a small amount of
overhead, which can decrease the efficiency of the processor in certain
situations. On an aggregate basis, however, processor performance is
increased.
With SMT, it becomes apparent that, depending on the work that each
thread is doing on adjacent logical processors, we could see performance
increases or decreases. Various papers (see Resources) studying
multithreaded performance indicate generally positive results, with some
research indicating perceived performance gains as high as 50%.
HT-Enabled Systems
Intel Hyper-Threading requires support from three fundamental components
of a system:
- The processor
- The chipset
- The operating system
Processors Supporting HT
Hyper-Threading was incorporated into the Xeon class processors in early
2002. Xeon is not to be confused with Pentium III Xeon. When Intel changed
the Xeon's core to P4, it dropped the P4 designation, calling the processor
simply Xeon. Recently, HT has found its way to the desktop P4 processor. Not
all processors in each of these processor classes are capable of
Hyper-Threading, however.
Table 1 indicates which processors support Hyper-Threading. The table
also indicates factors that you can use to determine whether a given Intel
processor supports HT.
With the release of the 3.06GHz Pentium 4, Intel changed the P4 logo,
incorporating the letters H and T to indicate that it's a Hyper-Threading
processor.
All recent Xeon processors support Hyper-Threading, but again, be sure
to watch out for the 256KB L2 Cache version, which does not.
Chipset Support for HT
Not all chipsets support HT. Check with your chipset manufacturer to
ensure that you can enable and disable HT support via the BIOS.
All HT chipsets interleave processor numbering to help less
sophisticated thread schedulers make complete use of available physical
processors. The chipset will present the logical processors to the OS as
follows:
Logical CPU0 = Physical CPU0, Logical CPU0
Logical CPU1 = Physical CPU1, Logical CPU0
Logical CPU2 = Physical CPU0, Logical CPU1
Logical CPU3 = Physical CPU1, Logical CPU1
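The interleaved numbering listed above can be expressed as simple arithmetic. The following is a minimal sketch, assuming two logical processors per physical processor and exactly the ordering shown in the listing; the class and method names are hypothetical, and a given chipset or operating system may enumerate processors differently.

```java
/** Hypothetical helper illustrating the interleaved numbering listed above. */
class LogicalCpuMap {
    /** Index of the physical processor behind an OS-visible logical CPU. */
    static int physicalCpu(int logicalCpu, int physicalCount) {
        return logicalCpu % physicalCount;
    }

    /** Which on-chip thread context (0 or 1) the OS-visible CPU maps to. */
    static int threadContext(int logicalCpu, int physicalCount) {
        return logicalCpu / physicalCount;
    }

    public static void main(String[] args) {
        int physicalCount = 2; // dual-processor system, as in the listing above
        for (int cpu = 0; cpu < 2 * physicalCount; cpu++) {
            System.out.println("Logical CPU" + cpu
                + " = Physical CPU" + physicalCpu(cpu, physicalCount)
                + ", Logical CPU" + threadContext(cpu, physicalCount));
        }
    }
}
```

With this ordering, a scheduler that simply fills CPU0 and CPU1 first ends up using both physical processors before it touches a second thread context.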
Operating Systems Supporting HT
Given a processor and chipset that support Hyper-Threading, the
operating system must also be HT aware. Table 2 shows the OS support for
several currently available operating systems commonly run on Intel-based
hardware.
Windows
The Windows 2000 operating systems do not differentiate between logical
and physical processors. Therefore a 32-processor HT system will support
only 32 logical processors. It will work; however, the additional processor
resources will not be utilized.
Windows users should check software licensing agreements to confirm that
they recognize logical processors. Generally XP will support licensing on a
per physical CPU basis, while Windows 2000 will see logical processors as
physical processors for licensing purposes.
Figure 2 shows a Windows XP Pro task manager on a dual-processor HT
system. Note the four distinct "CPU Usage History" charts depicting the four
logical processors.
Linux
The 2.4 kernel began supporting Hyper-Threading on the Intel Xeon
processor as of version 2.4.18. The thread scheduler in 2.4, however, like
the Windows 2000 family of products, does not understand the difference
between logical and physical processors, and it lacks many other SMT
scheduler optimizations. This can lead to degraded performance in situations
where two threads are scheduled concurrently on one physical processor,
while the other physical processor is left idle.
As of kernel version 2.5.32, the thread scheduler was updated with
advanced features to support Hyper-Threading. The 2.5.x kernel is the
development branch that will become the 2.6 kernel. The exact release
schedule for 2.6 is unknown, but in a recent interview Linus Torvalds
indicated that 2.6 would likely be released in Q4 2003.
Figure 3 shows a Red Hat 7.3 installation running the 2.4.18 kernel with
Hyper-Threading enabled on the system. Note the four CPU states indicated as
CPU0-CPU3 on top. Also note that CPU0 is running at 100.1% utilization.
Wow, Hyper-Threading is cool!
Threaded Benchmarking on HT and Non-HT Systems
Our goal here is to understand the effects of Hyper-Threading processors
on the performance of multithreaded Java applications. To do this, we need a
test bed that will allow us to execute heavily threaded operations and track
performance variations against thread count in HT and non-HT systems.
Thread Bench Design
At a basic level, the test bed should be able to execute multiple
operations across n threads, observing the total throughput of operations
per unit of time for a run. On a dual-processor system, we should see nearly
double the performance on a CPU-intensive operation using two threads
instead of one. The performance of CPU-intensive threaded operations on HT
systems will vary based on the operations and the level of concurrency
possible on a single physical processor.
Our focus here is to explore which types of operations will and will not
benefit from HT technology. Given this we need to be able to quickly
implement and test multiple types of operations.
There are several Java benchmarking systems available on the market.
Many are older and focused on applet performance. Some newer benchmark systems like VolanoMark or SPECjbb2000 test the threaded performance of systems; however, they don't allow us to customize and focus on specific individual operations that could affect performance on an HT system.
These requirements drove the design and coding of an n-thread Java
benchmark framework. The framework supports pluggable operation classes and
produces plottable results for a range of thread counts from a single test
suite execution.
Figure 4 presents a functional/UML diagram for the system design.
The resulting benchmarking framework has the following features:
- Initialization of operations on the JIT: Modern JIT compilers will optimize
"hot spots" in the code. The performance of any given operation will improve
over the life of the VM, so the ThreadBench framework gives operations a
chance to initialize on the JIT before the tests commence.
- Operation abstraction: By developing a generic operation interface and
using dynamic class loading and initialization of the operation to be
tested, we can quickly prototype and test various processor-intensive
operations.
- Test suites: Using test suites, ThreadBench runs a given operation
configuration through several iterations of the test with different numbers
of threads. This allows a series of tests to be repeatedly run on several
machine configurations with minimal effort.
- Multiple runs: To smooth out anomalies in the test, each data point is
created by averaging data from several runs. This is configurable; some
tests have a larger standard deviation than others.
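The features above can be sketched in a few dozen lines. What follows is not the ThreadBench source; it is a minimal illustration under stated assumptions. The Operation interface, its execute() method, and the SpinOperation and MiniBench classes are hypothetical names (only setup() is named in this article), and the code is written against a current JDK rather than the 1.4 JVMs tested here.

```java
/** Hypothetical operation contract; only setup() is named in the article. */
interface Operation {
    void setup();   // prepare inputs; not timed
    void execute(); // one unit of timed work
}

/** Placeholder CPU-bound operation used only to exercise the harness. */
class SpinOperation implements Operation {
    private double seed;
    volatile double sink; // keeps the JIT from discarding the loop

    public void setup() { seed = 1.0; }

    public void execute() {
        double x = seed;
        for (int i = 1; i <= 10000; i++) {
            x += Math.sqrt(i);
        }
        sink = x;
    }
}

class MiniBench {
    /** Runs opsPerThread executions on each of threadCount threads; returns ops/second. */
    static double run(String opClass, int threadCount, final int opsPerThread)
            throws Exception {
        Thread[] workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            // Dynamic loading: the operation is chosen by class name at run time.
            final Operation op = (Operation) Class.forName(opClass)
                    .getDeclaredConstructor().newInstance();
            op.setup();
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < opsPerThread; i++) op.execute();
                }
            });
        }
        long start = System.nanoTime();
        for (Thread w : workers) w.start();
        for (Thread w : workers) w.join();
        double seconds = Math.max(System.nanoTime() - start, 1L) / 1e9;
        return (double) threadCount * opsPerThread / seconds;
    }

    /** Averages several runs for one thread count to smooth out anomalies. */
    static double averagedRun(String opClass, int threadCount, int opsPerThread, int runs)
            throws Exception {
        double total = 0.0;
        for (int r = 0; r < runs; r++) total += run(opClass, threadCount, opsPerThread);
        return total / runs;
    }

    public static void main(String[] args) throws Exception {
        String opClass = SpinOperation.class.getName();
        run(opClass, 1, 2000); // warm-up so the JIT can compile the hot path first
        int[] suite = {1, 2, 3, 4, 6, 8}; // thread counts used in this article's summary tables
        for (int threads : suite) {
            System.out.println(threads + "\t" + averagedRun(opClass, threads, 2000, 3));
        }
    }
}
```

Each output line pairs a thread count with an averaged throughput, which is the kind of plottable series described above. Running the same suite with HT enabled and then disabled in the BIOS gives the two curves to compare.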
The code for this article can be downloaded from below.
Factors Affecting Performance
Use of Threads
This seems obvious; however, it needs to be mentioned: single-threaded
applications (often client applications) will see little performance gain.
Server-side Java applications make extensive use of threads, making them
excellent candidates for performance improvement from SMT.
Nonthreaded applications may still see some benefit. Java's garbage
collection and background JIT compilers operate as daemon threads in the
local JVM. In addition, concurrent processes could make use of the
additional processor resources.
The Operating System's Thread Scheduler
In an HT system, a single physical processor is presented to the OS as
two logical processors. This requires the OS to differentiate between
physical and logical processors and make intelligent decisions about thread
scheduling.
The thread scheduler on a dual-processor HT system will see four logical
processors. A poor thread scheduler could schedule two CPU-intensive threads
onto separate logical processors representing the same physical processor.
This would result in a perceived performance decrease on an HT-based system.
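From inside the JVM, the processor count can be read with Runtime.availableProcessors(). The minimal sketch below assumes the operating system exposes every logical processor to the JVM, in which case a dual-processor HT system would report four. The call does not distinguish logical from physical processors, so the class name and the interpretation of the number are illustrative only.

```java
/** Hypothetical example: report how many processors the JVM can see. */
class ProcessorCount {
    static int visibleProcessors() {
        // Standard API (since JDK 1.4): processors available to this JVM.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Processors visible to the JVM: " + visibleProcessors());
    }
}
```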
CPU Resource Utilization
Hyper-Threaded processors do not duplicate all available resources. Two
threads performing fundamentally similar operations on separate logical
processors will likely see little performance gain. For HT to be a benefit,
the two threads coexisting on a physical CPU must perform a variety of
operations to allow the processor to make better use of latency.
Performance of Threaded Benchmarks on HT and Non-HT Systems
Tests were run on two HT-capable dual-processor systems (see Table 3).
Hyper-Threading requires BIOS support, making it easy to enable and
disable the feature in the boot setup program for various runs.
Each test was run with the Sun JDK 1.4.1_02, using the server flag on
the Linux and XP systems. Tests were also run with the IBM 1.4.0 JVM, with
no command-line flags, on the Linux system.
The tests devised are by no means comprehensive. The goal was to stress
the processor, using different processor resources, to try to gain some
insight into the effects of SMT processing. The series of tests was run on
each of the above systems, with and without HT enabled. Each of the
operation algorithms tested is briefly described, followed by results and
some discussion and interpretation.
Note: To save space, the XP and Linux tests are shown on the same plots.
The data should not be directly compared, however. The tests were run on
different physical hardware; indeed, the processor speeds on the XP machine
were higher than on the Linux machine.
Test 1: Gaussian Elimination, 500x500 matrix (Floating point intensive)
Gaussian elimination is a very common algorithm used to solve systems of
linear equations, a common task in finite element applications, weather
simulation, coordinate transformations, and economic modeling, among other
things. Algorithmic optimizations are often done for sparse/banded matrices;
however, the core of the work is fundamentally the same: large numbers of
floating point calculations are required.
To simulate this, a Gaussian elimination algorithm with scaled partial
pivoting and back substitution is used (see Figure 5). A full matrix is
constructed of random doubles using Math.random(). The population of the
matrix is carried out in the setup() method and is not considered part of
the operation.
This operation carries out large numbers of simple floating point
operations on doubles. All calculations are done in the Java call stack,
though it's highly likely that the code was optimized by the JIT before the
tests were run.
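For readers unfamiliar with the algorithm, here is a minimal sketch of Gaussian elimination with scaled partial pivoting and back substitution. It is an illustration rather than the code used for the benchmark: the GaussSolver class and its solve() method are hypothetical names, and the random right-hand side vector is an assumption, since only the matrix contents are described above.

```java
/** Hypothetical sketch of the kind of operation described in Test 1. */
class GaussSolver {
    /** Solves a*x = b (modifying a and b) with scaled partial pivoting. */
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        // Scale factor per row: the largest absolute coefficient in that row.
        double[] scale = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                scale[i] = Math.max(scale[i], Math.abs(a[i][j]));
            }
            if (scale[i] == 0.0) throw new ArithmeticException("singular matrix");
        }
        for (int k = 0; k < n - 1; k++) {
            // Pick the pivot row with the largest scaled entry in column k.
            int pivot = k;
            for (int i = k + 1; i < n; i++) {
                if (Math.abs(a[i][k]) / scale[i] > Math.abs(a[pivot][k]) / scale[pivot]) {
                    pivot = i;
                }
            }
            if (a[pivot][k] == 0.0) throw new ArithmeticException("singular matrix");
            // Swap rows k and pivot (matrix, right-hand side, and scale factors).
            double[] row = a[k]; a[k] = a[pivot]; a[pivot] = row;
            double tb = b[k]; b[k] = b[pivot]; b[pivot] = tb;
            double ts = scale[k]; scale[k] = scale[pivot]; scale[pivot] = ts;
            // Eliminate column k from the rows below.
            for (int i = k + 1; i < n; i++) {
                double factor = a[i][k] / a[k][k];
                for (int j = k; j < n; j++) a[i][j] -= factor * a[k][j];
                b[i] -= factor * b[k];
            }
        }
        if (a[n - 1][n - 1] == 0.0) throw new ArithmeticException("singular matrix");
        // Back substitution.
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = b[i];
            for (int j = i + 1; j < n; j++) sum -= a[i][j] * x[j];
            x[i] = sum / a[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        int n = 500; // matrix size used in Test 1
        double[][] a = new double[n][n];
        double[] b = new double[n]; // assumed: random right-hand side
        for (int i = 0; i < n; i++) {
            b[i] = Math.random();
            for (int j = 0; j < n; j++) a[i][j] = Math.random();
        }
        double[] x = solve(a, b);
        System.out.println("x[0] = " + x[0]);
    }
}
```

The triple-nested elimination loop dominates the run time and consists almost entirely of double multiplications and subtractions, which is what makes the operation floating point intensive.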
It seems that this operation does not scale well into threads on any
JVM. The Sun VM on Windows XP with Hyper-Threading does significantly worse
than the Linux JVMs with or without Hyper-Threading. There are no
synchronizations in the operation whatsoever. Poor scaling into threads
could be due to memory barriers, or contention for a bus or main memory.
Test 2: Calculation of 2000! (Integer intensive)
Calculation of factorial (! operator) is used often in probability
calculations. It's used as a portion of the formula for combinations and
permutations. Factorial is defined as follows:
N! = 1 x 2 x 3 x 4 x ... x N
Combinations are an interesting calculation in poker, and illustrate a
potential use of the factorial operator. To calculate the number of
five-card combinations in a 52-card deck, we use the combinations formula:
Possible poker hands = 52C5 = 52! / (5! x (52-5)!)
Factorial calculations of even small integers grow rapidly, requiring
the use of the java.math.BigInteger class. Calculations of factorials result
in a large number of integer multiplications.
The factorial calculations shown in Figure 6 do show some consistent,
limited benefit from Hyper-Threading. Indeed, for four threads the IBM JVM
shows a 17% increase in performance using an HT-enabled system.
Incidentally, there are 2,598,960 five-card combinations in a 52-card
deck.
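The factorial and combinations formulas translate directly into BigInteger arithmetic. The sketch below is a minimal illustration; the FactorialOp class and its method names are hypothetical and are not taken from the benchmark code.

```java
import java.math.BigInteger;

/** Hypothetical sketch of a factorial operation built on BigInteger. */
class FactorialOp {
    /** Computes n! by repeated BigInteger multiplication. */
    static BigInteger factorial(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    /** Computes nCr = n! / (r! x (n-r)!). */
    static BigInteger combinations(int n, int r) {
        return factorial(n).divide(factorial(r).multiply(factorial(n - r)));
    }

    public static void main(String[] args) {
        System.out.println("52C5 = " + combinations(52, 5)); // prints 52C5 = 2598960
        System.out.println("2000! has " + factorial(2000).toString().length() + " digits");
    }
}
```

Every pass through the loop multiplies an ever-larger BigInteger by a small one, so the work is dominated by integer multiplication.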
Test 3: 150K calculations of Math.tan() (Floating point, mixed stack)
This test simply calculates the tangent of an angle 150,000 times in a
tight loop (see Figure 7).
All Java threads have two call stacks: one for Java calls, the other for
C calls. The java.lang.Math.tan(double) function is native, calculating an
approximation of tangent with a 27th order polynomial. It's likely that the
reason this operation scales so well into Hyper-Threading is the constant
call stack switching, giving the processor time to utilize its secondary
thread context.
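A tight loop of this kind might look like the following minimal sketch. The TanOp class is a hypothetical name. Nudging the angle on each iteration and summing the results are assumptions added here so that a JIT compiler cannot hoist or discard the call; the original test may simply have reused one angle.

```java
/** Hypothetical sketch of a Math.tan() loop like the one in Test 3. */
class TanOp {
    /** Calls Math.tan() the given number of times and returns the sum. */
    static double run(double angle, int iterations) {
        double sum = 0.0;
        for (int i = 0; i < iterations; i++) {
            // Assumption: vary the input slightly so each call does real work.
            sum += Math.tan(angle + i * 1e-6);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("sum = " + run(0.5, 150000)); // 150,000 calls, as in Test 3
    }
}
```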
Test 4: Prime number search
A prime number search operation was created using the BigInteger class
and a very simplistic direct search factorization. The poor algorithm is not
as important as the type of calculations being performed. This class
performs a large number of BigInteger divisions.
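A direct search of this kind can be sketched as plain trial division. The code below is an illustration only; the PrimeSearchOp class and its methods are hypothetical names, and using remainder() as the division step is an assumption about how the original class was written.

```java
import java.math.BigInteger;

/** Hypothetical sketch of a simplistic BigInteger prime search. */
class PrimeSearchOp {
    private static final BigInteger TWO = BigInteger.valueOf(2);

    /** Direct search: try every candidate divisor d with d*d <= n. */
    static boolean isPrime(BigInteger n) {
        if (n.compareTo(TWO) < 0) return false;
        for (BigInteger d = TWO; d.multiply(d).compareTo(n) <= 0; d = d.add(BigInteger.ONE)) {
            if (n.remainder(d).signum() == 0) return false; // d divides n
        }
        return true;
    }

    /** Counts the primes in the inclusive range [from, to]. */
    static int countPrimes(long from, long to) {
        int count = 0;
        for (long v = from; v <= to; v++) {
            if (isPrime(BigInteger.valueOf(v))) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("Primes up to 100: " + countPrimes(1, 100)); // prints Primes up to 100: 25
    }
}
```

Each candidate allocates new BigInteger objects for the divisor, its square, and the remainder, so besides division this style of code also puts steady pressure on the allocator and garbage collector.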
It is difficult to tell what is going on in Figure 8, beyond the fact
that the IBM JVM is beating Sun's. The IBM JVM scales well into threading
this operation. It does even better when Hyper-Threading is enabled. The Sun
VM scales poorly into threads, and it becomes worse with additional thread
contexts. You could speculate that this behavior is characteristic of a
low-level synchronization contention issue in the Sun JVM.
Testing Summary
The plots above give some general idea of how these various operations
scale into threads. In most cases, the HT performance gains are modest. The
following is a summary of performance differences seen with Hyper-Threading
enabled versus disabled for each of the tested JVMs.
IBM 1.4.0, Linux 2.4.18
| Threads | Gauss | Factorial | Math.tan() | Prime |
| 1 | 4.13% | 3.92% | -0.10% | 3.06% |
| 2 | 1.92% | 7.39% | 1.62% | -2.42% |
| 3 | 0.21% | 11.45% | 34.99% | 1.96% |
| 4 | -2.58% | 16.98% | 75.84% | 9.84% |
| 6 | -3.56% | 13.33% | 60.96% | 4.53% |
| 8 | -0.69% | | | 2.41% |
Sun 1.4.1, Linux 2.4.18
| Threads | Gauss | Factorial | Math.tan() | Prime |
| 1 | 0.99% | 0.28% | -0.75% | 0.30% |
| 2 | -1.20% | 0.35% | -1.76% | 6.10% |
| 3 | -2.20% | 8.21% | 23.76% | 6.30% |
| 4 | -3.63% | 8.28% | 62.74% | -30.08% |
| 6 | -4.13% | 7.71% | 62.96% | -27.50% |
| 8 | -4.73% | | | -28.28% |
Sun 1.4.1, Windows XP Pro
| Threads | Gauss | Factorial | Math.tan() | Prime |
| 1 | -0.51% | 0.93% | 0.62% | -1.32% |
| 2 | -1.18% | 0.98% | -6.17% | 14.07% |
| 3 | -12.90% | 3.53% | 7.85% | -0.74% |
| 4 | -23.96% | 4.61% | 11.74% | -24.14% |
| 6 | -23.23% | 6.35% | 11.79% | -23.46% |
| 8 | -23.66% | | | -23.36% |
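One plausible reading of the percentages above is the relative change in throughput with HT enabled, measured against the same run with HT disabled. That formula is an assumption, as is everything in the sketch below: the HtDelta class is a hypothetical name and the sample throughput figures are made up for illustration, not taken from the test data.

```java
/** Hypothetical helper: relative throughput change with HT enabled. */
class HtDelta {
    /** Percentage change of htEnabled relative to htDisabled (assumed formula). */
    static double percentChange(double htDisabled, double htEnabled) {
        return (htEnabled - htDisabled) / htDisabled * 100.0;
    }

    public static void main(String[] args) {
        // Made-up throughputs (operations per second), for illustration only.
        System.out.println(percentChange(200.0, 250.0) + "%"); // prints 25.0%
        System.out.println(percentChange(200.0, 150.0) + "%"); // prints -25.0%
    }
}
```

Under this reading, a positive entry means the HT-enabled run completed more operations per unit of time, and a negative entry means enabling HT slowed the test down.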
Conclusion
When I began this project, I fully expected to see marked performance
gains using Hyper-Threading over identical hardware not using HT. In the
course of testing, I've learned quite a bit about performance differences
for Java on various platforms, hardware configurations, and virtual
machines. Hyper-Threading is not the boon I had expected. In some
situations, performance gains for HT reached the 75% mark, which is
considerable. There was little significant performance degradation using
HT, so enabling it seems to carry mostly upside.
Perhaps the more important finding is that the IBM JVMs perform
significantly better than the Sun JVMs. In addition, the IBM JVMs scaled far
better with threads than did Sun's offering. If performance is of key
concern, and you're not using some of the more esoteric features of the Sun
JVM, IBM JVMs deserve serious consideration.
Most server-side Java applications are not doing computationally
intensive tasks. The tasks focus more heavily on socket IO communicating
with databases, clients via HTTP, RMI, Web services, and the like.
Processors will be given plenty of socket IO wait time to schedule parallel
tasks. For socket-IO-bound applications, be sure to consider the relative
skill of your operating system in the IP arena.
The introduction of Hyper-Threading on desktop P4 systems is also
exciting. Java developers often develop on Windows or Linux-based desktop
systems and deploy onto larger SMP and potentially SMT systems. HT will
allow a desktop developer and user to see some of the benefits of threaded
applications long before deployment to the higher-end systems.
SMT technology is here to stay. Intel's Hyper-Threading implementation
is sure to be the first of many. Chip industry watchers speculate that
Simultaneous Multithreading and thread-level parallelism will spell the
ultimate end of the "megahertz wars." A chip's performance will be tied less
to its internal clock speed and more to the bells and whistles it
incorporates. Other chip manufacturers are sure to follow suit, and all
implementations will improve in quality over time.
Operating systems are also continually improving their support for
Hyper-Threading. It does seem strange that the performance on an XP system,
which should be HT optimized, was often less HT friendly than the 2.4.18
Linux kernel, which is HT ignorant. As more sophisticated support for HT is
built into operating systems, we should see more significant performance
gains using HT in the Java world.
The combination of Java and Linux in the datacenter is rapidly gaining
ground on the Solaris/Java platform. The majority of these new Linux servers
are running high-end Intel-based hardware. Hyper-Threading will give this
trend a further push in the Linux direction.
For now, given a piece of hardware that's HT capable, the configuration
that offers the best performance under most conditions is the IBM 1.4.0 JVM
on Linux with Hyper-Threading enabled.
Resources
Microsoft license clarification for SMT systems:
www.microsoft.com/nz/licensing/downloads/hyper_threading_processors_licensing_brief.doc
Intel Processor Specsheets
Xeon:
www.intel.com/products/server/processors/server/xeon/index.htm
Xeon DP:
www.intel.com/design/xeon/prodbref/index.htm
Xeon MP:
www.intel.com/products/server/processors/server/xeon_mp/index.htm
Pentium 4:
http://developer.intel.com/design/pentium4/datashts/298643.htm
P4 Chipset matrix indicating HT support:
www.intel.com/design/chipsets/linecard.htm
IBM Whitepaper on Linux and Hyper-Threading:
www-106.ibm.com/developerworks/linux/library/l-htl/?dwzone=linux
LinuxWorld article indicating Q4 2003 release of 2.6 Kernel:
www.linuxworld.com/story/33805.htm
Glossary
Physical processor: A silicon-based hardware processor
Logical processor: A hardware/software system making pseudo-parallel use of a single physical processor
Simultaneous Multithreading (SMT): The use of logical processors to increase processing throughput on a single physical processor
Symmetric Multiprocessing (SMP): The use of multiple physical processors in parallel, each running separate threads of execution
Hyper-Threading: Intel's marketing name for its SMT technology on Xeon and Pentium 4 processors
About The Author
Paul Bemowski is an independent consultant, focusing on Java solutions to
enterprise computing problems.
bemowski@yahoo.com
"Hyper-Threading Java"
Vol. 8, Issue 8, p. 45