Portable.NET is an implementation of the Common Language Infrastructure
(CLI). Its primary design goal is portability to as many platforms as
possible, which it achieves through the use of interpretation rather than
just-in-time compilation.
The bytecode format of the CLI presents some challenges to efficient
interpreter implementation. Rather than directly interpret, we translate the
bytecode into a simpler abstract machine, the Converted Virtual Machine (CVM). This machine is then interpreted using a high-performance engine.
Traditionally, abstract machines have used the same bytecode
representation "on the wire" as for execution. Our work shows that there are
definite performance advantages to using different bytecode representations
internally and externally.
Introduction
The Common Language Infrastructure (CLI) is a set of specifications that describe a bytecode-based development and runtime environment. Portable.NET is an implementation of
the CLI, whose primary design goal is portability to as many platforms as
possible.
Portable.NET consists of three major components to support the CLI: a
bytecode-based Common Language Runtime (CLR), a C# compiler, and a C# base
class library. This article will concentrate on the runtime engine.
The runtime engine achieves portability primarily through the use of
interpretation rather than just-in-time compilation. However, the bytecode
format of the CLI presents some challenges to efficient interpreter
implementation. This article discusses how we have overcome these challenges
to build a high-performance interpreter for Common Intermediate Language
(CIL) programs.
We compare the performance of Portable.NET with that of Mono to
demonstrate how our approach differs from direct polymorphic interpretation
and full just-in-time compilation.
Polymorphic CIL
The CIL instruction set contains instructions to perform arithmetic,
logical operations, branching, method calls, object accesses, and pointer
manipulation.
A unique feature of CIL, compared to other similar abstract machines, is
that its instructions are polymorphic. The add instruction can be used on
integers, longs, and floating-point values, for example. Other virtual
machines, such as the Java Virtual Machine, use separate instructions for
each type.
The polymorphic nature makes direct interpretation very inefficient, as
demonstrated by the fragment from Mono's interpreter shown in Listing 1.
(All of the code for this article can be downloaded from
www.sys-con.com/dotnet/sourcec.cfm.) The types of all stack values must be tracked explicitly, leading to significant runtime overhead.
The challenge for Portable.NET was finding a way to implement a
high-performance engine without writing a full just-in-time compiler.
The approach we took was very similar to a JIT: the CIL bytecode is
translated into instructions for a simpler abstract machine, dubbed CVM (for
"Converted Virtual Machine"). The CVM instructions are then interpreted
using a high-performance interpreter, written in C.
As each method is entered, the following process occurs:
- Look for a cached CVM version of the method, and use it if found.
- Perform bytecode verification and convert the CIL into CVM.
- Record the CVM version in the cache.
- Jump into the interpreter to execute the CVM code.
Eventually the application's working set of methods ends up in the CVM
method cache, and execution proceeds quickly.
Instead of a single add instruction, the CVM instruction set has
several: iadd, ladd, fadd, etc. The conversion process chooses the most
appropriate variant based on the operand types reported by the bytecode
verifier.
Listing 2 shows a simplified form of the CVM interpreter code for the
converted instructions. The interpreter executes more efficiently because it
can assume that the values on the stack are of the correct type (bytecode
verification having already been performed).
Items on the CVM execution stack are a uniform size of one word: 64-bit
and larger types straddle multiple words. The CVM conversion process takes
care of laying out the stack according to the types of local variables and
stack items.
This isn't necessarily a new approach it is normally known as
"threaded interpretation" in the Forth community.
The complete list of CVM instructions is given in Listing 3.
Ramping Up
Conventional wisdom says that one should write a handcrafted assembly
code loop to get a fast interpreter. However, there are some simple tricks
that can be used to speed up an interpreter, even in C code.
- Register variables
- Computed gotos
- Direct threading
- CPU-specific unrolling
C compilers aren't terribly good at determining which values are most
used in switch-loop interpreter code. The compiler invariably guesses wrong,
favoring temporaries over important variables like the program counter and
stack pointer. So it is necessary to "help" the compiler a little.
The gcc compiler can bind variables to explicit registers, as follows:
register unsigned char *pc __asm__ ("esi");
We placed the program counter, the top-of-stack pointer, and the frame
pointer into x86 registers. This produced a significant improvement in
performance compared to straight C code, for such a small change.
The next step was to change from switch statements to using computed
gotos. This is normally referred to as token threading. The break at the end
of each case is replaced with a goto statement:
goto *main_label_table[*pc];
The main_label_table contains pointers to each of the cases in the
switch statement, allowing the interpreter to jump directly to the next
case, avoiding the overhead of jumping back to the top of the switch-loop.
More information on computed gotos can be found in the gcc documentation.
The result of these two changes (use of explicit registers and token
threading) was an interpreter that was so close to a handcrafted assembly
loop that there was little point writing one by hand.
The third step involved a change in representation. The switch loop and
token-threaded versions select instruction handlers based on CVM bytecode.
Instead of storing the single-byte opcodes, we can store the actual
addresses of the opcode handlers in the CVM instruction stream. This is
known as direct threading.
goto **((void **)pc);
Direct threading increases the size of the CVM code by a factor of 4,
because instructions are now pointers to handlers rather than bytecodes. But
it avoids the overhead of looking up values in main_label_table. On RISC
platforms, this can give a significant increase in engine performance, but
on x86 it isn't too impressive. If memory is an issue, token threading gives
better results.
Unrolling
Direct threading really shines when combined with some native JIT
techniques. We implemented a "mini JIT" that converted simple CVM
instruction sequences into x86 machine code on the fly. We call this an
unroller because it essentially unrolls the interpreter loop into
straight-line machine code.
The unroller uses simple register allocation techniques on the basic
blocks of a method. Complex instructions, especially those involving method
calls, are not unrolled. It isn't possible for unrolling to achieve the same
performance as a JIT, but it can get very close.
The primary advantage of the unroller compared to a JIT is that it is
vastly simpler to implement. Portable.NET's x86 unroller took about two
weeks to write, and we expect that other CPUs would require a similar amount
of effort.
Anything that is too complicated to convert is replaced with a jump back
into the interpreter core. This allows unrollers to be developed in stages,
replacing one instruction at a time and then retesting. This made
development a lot easier than the "all or nothing" approach required for a
JIT.
PInvoke
The "platform invoke" (or PInvoke) feature is a very powerful mechanism
that CLI programs can use to call legacy native code.
When Portable.NET encounters a PInvoke method reference, it compiles a
small CVM stub that performs any necessary parameter marshaling and then
calls the underlying native function. Upon return, the CVM stub demarshals
the return value.
A similar process is used for "internalcall" methods within the runtime
engine that implement built-in features for the C# class library.
Using CVM to perform marshaling operations simplifies native function
invocations quite considerably. Only a small amount of platform-specific
code is needed to perform the native call, for which we use the standard
"libffi" library.
Inlining
Method call overhead is an issue for all interpreter-based abstract
machines because method calls are more complicated than the equivalent
native code.
CVM addresses this by selectively inlining some of the more commonly
used methods in the C# class library. The method call is replaced with a
special-purpose opcode during code conversion.
The major groups of inlineable methods within Portable.NET are currently
the string, monitor, and 2D array operations.
Inlining common methods can have a dramatic impact on performance. The
PNetMark "Float" benchmark improved by a factor of 12 when 2D array
operations were inlined. Such operations are normally very expensive in CLRs
because a method must be called for every element get or set operation.
Alternate Back Ends
The construction of the interpreter itself was made as modular as
possible. The interface between the metadata handling and the execution was
kept separate using a standard interface: the coder API.
Coders are an interface between the CIL front end and the execution
engine. The design allows for the current CVM back end to be replaced by a
fully fledged JIT without any modification to the other components in the
system. The same front end could be used with the CVM engine, a native JIT,
or even a polymorphic interpreter.
Performance Summary
Table 1 compares the CVM interpreter variants with the Mint polymorphic
interpreter and the Mono x86 JIT. All tests were done on an 866 MHz Pentium
III, running Red Hat Linux 7.1. Version 0.17 of Mono and version 0.5.0 of
Portable.NET were used for these comparisons.
As can be seen, the simple techniques described in this article produce
very good results, with the unrolled version achieving 71% overall compared
to Mono's JIT.
Future Work
Portable.NET's interpreter remains a work in progress. More
optimizations are possible by introducing new CVM instructions for special
cases.
We are also investigating selective inlining as an alternative to
writing a handcrafted unroller for each CPU. The authors of that paper also
reported performance of up to 70% of optimized C code using simple
techniques. Their engine, for Objective Caml, has a single line of
platform-specific code, to perform a flush of the CPU's instruction cache.
Selective inlining doesn't work very well for x86, but it should do well for
RISC CPUs like the PowerPC.
In the near future, we will investigate fully fledged JIT coders for
Portable.NET, as well as front ends for other instruction sets such as the
Java Virtual Machine.
Conclusion
Using a variety of well-known, yet simple, techniques, Portable.NET is able to achieve adequate performance for most application-oriented tasks.
At the same time, the code is highly portable. Ports to new platforms
take a matter of days, sometimes hours (e.g., the author ported the code to
Mac OSX in a single day).
Acknowledgments
Portable.NET would not have been possible without the generous
assistance of volunteers from the DotGNU community.
References
Southern Storm Software:
www.southern-storm.com.au/portable_net.html
Common Language Infrastructure, Partitions I to IV, ECMA 335:
ftp://ftp.ecma.ch/ecma-st/Ecma-335-part-i-iv.pdf
The Mono Project:
www.go-mono.com
Lindholm, T. and Yellin. F, (1999). The Java Virtual Machine
Specification, Second Edition. Addison-Wesley.
A Portable Forth Engine:
www.complang.tuwien.ac.at/forth/threaded-code
gcc documentation:
http://gcc.gnu.org/onlinedocs/gcc-3.2.1/
gcc/Labels-as-Values.html
Puimarta, I. and Riccardi, F. "Optimizing direct threaded code by
selective inlining," SIGPLAN '98, pages 291-300, ACM Press.
DotGNU: www.dotgnu.org
Copyright ©2003 Rhys Weatherley and Gopal V. Permission is granted to copy,
distribute, and/or modify this document under the terms of the GNU Free
Documentation License, Version 1.2 or any later version published by the
Free Software Foundation; with no Invariant Sections, no Front-Cover Texts,
and no Back-Cover Texts.
Author Bios
Rhys Weatherley is a member of the DotGNU Steering Committee and the primary
author of Portable.NET. He graduated from the University of Queensland in
1990 and has worked at a number of universities and companies in Australia
and the U.S.
rweather@southern-storm.com.au
Gopal V is a final-year computer science student in India and has been
involved with DotGNU since August 2001. He's the pnetlib lead developer and
has worked on the compiler and runtime as well. Gopal's current pet project
is the Java language front-end for the compiler.
gopalv82@symonds.net
Listing 1: Polymorphic interpretation
case CEE_LDC_I4_0: case CEE_ADD:
sp->type = VAL_I32; ++ip;
sp->data.i = 0; --sp;
++sp; if(sp->type == VAL_I32) {
++ip; sp[-1].data.i += sp->data.i;
break; } else if(sp->type == VAL_I64) {
sp[-1].data.l += sp->data.l;
case CEE_LDC_I8: } else if(sp[-2].type == VAL_DOUBLE) {
++ip; sp[-1].data.f += sp->data.f;
sp->type = VAL_I64; } else {
sp->data.l = read64(ip); ...
ip += 8; }
++sp; break;
break;
Listing 2: Interpreting converted instructions
case COP_IADD:
sp[-2].intval += sp[-1].intval;
--sp;
++pc;
break;
case COP_LADD:
*((ILInt64 *)(&(sp[-(WORDSPERLONG * 2])))) +=
*((ILInt64 *)(&(sp[-WORDSPERLONG]))));
sp -= WORDSPERLONG;
++pc;
break;
case COP_FADD:
*((ILNativeFloat *)(&(sp[-(WORDSPERFLOAT * 2])))) +=
*((ILNativeFloat *)(&(sp[-WORDSPERFLOAT]))));
sp -= WORDSPERFLOAT;
++pc;
break;
Listing 3: Complete CVM instruction set
ansi2str array2ptr array\_len beq bfixup bge bge\_un bgt
bgt\_un ble ble\_un bload blt blt\_un bne box box\_ptr br
bread bread\_elem bread\_field break brfalse br\_long brnonnull
brnull br\_peq br\_pne brtrue bstore bwrite bwrite\_elem
bwrite\_field bwrite\_r call call\_ctor call\_interface
call\_native call\_native\_raw call\_native\_void
call\_native\_void\_raw call\_virtual castclass castinterface
cctor\_once ckarray\_load\_i4 ckarray\_load\_i8 ckarray\_store\_i8
ckfinite ckheight ckheight\_n cknull cknull\_n delegate2fnptr
dfixup dread dread\_elem dup dup2 dup\_n dup\_word\_n dwrite
dwrite\_elem dwrite\_r enter\_try f2d f2d\_aligned f2f f2f\_aligned
f2i f2i\_ovf f2iu f2iu\_ovf f2l f2l\_ovf f2lu f2lu\_ovf fadd fcmpg
fcmpl fdiv ffixup fix\_i4\_i fix\_i4\_u fmul fneg fread fread\_elem
frem fsub fwrite fwrite\_elem fwrite\_r get2d get\_static i2b
i2b\_aligned i2b\_ovf i2f i2iu\_ovf i2l i2p\_lower i2s i2s\_aligned
i2s\_ovf i2ub i2ub\_ovf i2ul\_ovf i2us i2us\_ovf iadd iadd\_ovf
iadd\_ovf\_un iand icmp icmp\_un idiv idiv\_un iload iload\_<n>
imul imul\_ovf imul\_ovf\_un ineg inot ior iread iread\_elem
iread\_field iread\_this irem irem\_un ishl ishr ishr\_un isinst
isinterface istore istore\_<n> isub isub\_ovf isub\_ovf\_un iu2b\_ovf
iu2f iu2i\_ovf iu2l iu2s\_ovf iu2ub\_ovf iu2us\_ovf iwrite iwrite\_elem
iwrite\_field iwrite\_r ixor jsr l2f l2i l2i\_ovf l2ui\_ovf l2ul\_ovf
ladd ladd\_ovf ladd\_ovf\_un land lcmp lcmp\_un ldc\_i4 ldc\_i4\_<n>
ldc\_i4\_s ldc\_i8 ldc\_r4 ldc\_r8 ldftn ldinterfftn ldiv ldiv\_un
ldnull ldstr ldtoken ldvirtftn lmul lmul\_ovf lmul\_ovf\_un lneg
lnot local\_alloc lor lread\_elem lrem lrem\_un lshl lshr lshr\_un
lsub lsub\_ovf lsub\_ovf\_un lu2f lu2i\_ovf lu2iu\_ovf lu2l\_ovf
lwrite\_elem lxor maddr memcpy memmove memset memzero mk\_local\_1
mk\_local\_2 mk\_local\_3 mk\_local\_n mkrefany mload monitor\_enter
monitor\_exit mread mstore mwrite mwrite\_r new new\_value nop
pack\_varargs padd\_i4 padd\_i4\_r padd\_i8 padd\_i8\_r padd\_offset
padd\_offset\_n pcmp pload pload\_<n> pop pop2 pop\_n pread pread\_elem
pread\_field pread\_this prefix pstore pstore\_<n> psub psub\_i4
psub\_i8 pushdown push\_thread push\_thread\_raw pwrite pwrite\_elem
pwrite\_field pwrite\_r refanytype refanyval refarray2ansi refarray2utf8
ret\_jsr return return\_1 return\_2 return\_n set2d seteq setge setgt
setle setlt setne set\_num\_args sfixup squash sread sread\_elem sread\_field
str2ansi str2utf8 string\_concat\_2 string\_concat\_3 string\_concat\_4
string\_eq string\_get\_char string\_ne switch swrite swrite\_elem
swrite\_field swrite\_r tail\_call throw throw\_caller type\_from\_handle
ubread ubread\_elem ubread\_field unroll\_method usread usread\_elem
usread\_field utf82str waddr waddr\_native\_<n> wide}
All Rights Reserved
Copyright © 2004 SYS-CON Media, Inc.
E-mail:
info@sys-con.com