HomeDigital EditionSearch Dotnet Cd
ASP.NET C# Certification Exams The CLI Data Access Editorials Extending .NET Fundamentals Interoperability Interviews Migrate Mobile .NET Mono .NET Interface Object-Oriented Programming Open Source Optimization Product/Book Reviews Security Source Code UML Visual Studio .NET

Portable.NET is an implementation of the Common Language Infrastructure (CLI). Its primary design goal is portability to as many platforms as possible, which it achieves through the use of interpretation rather than just-in-time compilation.

The bytecode format of the CLI presents some challenges to efficient interpreter implementation. Rather than directly interpret, we translate the bytecode into a simpler abstract machine, the Converted Virtual Machine (CVM). This machine is then interpreted using a high-performance engine.

Traditionally, abstract machines have used the same bytecode representation "on the wire" as for execution. Our work shows that there are definite performance advantages to using different bytecode representations internally and externally.

Introduction
The Common Language Infrastructure (CLI) is a set of specifications that describe a bytecode-based development and runtime environment. Portable.NET is an implementation of the CLI, whose primary design goal is portability to as many platforms as possible.

Portable.NET consists of three major components to support the CLI: a bytecode-based Common Language Runtime (CLR), a C# compiler, and a C# base class library. This article will concentrate on the runtime engine.

The runtime engine achieves portability primarily through the use of interpretation rather than just-in-time compilation. However, the bytecode format of the CLI presents some challenges to efficient interpreter implementation. This article discusses how we have overcome these challenges to build a high-performance interpreter for Common Intermediate Language (CIL) programs.

We compare the performance of Portable.NET with that of Mono to demonstrate how our approach differs from direct polymorphic interpretation and full just-in-time compilation.

Polymorphic CIL
The CIL instruction set contains instructions to perform arithmetic, logical operations, branching, method calls, object accesses, and pointer manipulation.

A unique feature of CIL, compared to other similar abstract machines, is that its instructions are polymorphic. The add instruction can be used on integers, longs, and floating-point values, for example. Other virtual machines, such as the Java Virtual Machine, use separate instructions for each type.

The polymorphic nature makes direct interpretation very inefficient, as demonstrated by the fragment from Mono's interpreter shown in Listing 1. (All of the code for this article can be downloaded from www.sys-con.com/dotnet/sourcec.cfm.) The types of all stack values must be tracked explicitly, leading to significant runtime overhead.

The challenge for Portable.NET was finding a way to implement a high-performance engine without writing a full just-in-time compiler.

The approach we took was very similar to a JIT: the CIL bytecode is translated into instructions for a simpler abstract machine, dubbed CVM (for "Converted Virtual Machine"). The CVM instructions are then interpreted using a high-performance interpreter, written in C.

As each method is entered, the following process occurs:

  1. Look for a cached CVM version of the method, and use it if found.
  2. Perform bytecode verification and convert the CIL into CVM.
  3. Record the CVM version in the cache.
  4. Jump into the interpreter to execute the CVM code.
Eventually the application's working set of methods ends up in the CVM method cache, and execution proceeds quickly.

Instead of a single add instruction, the CVM instruction set has several: iadd, ladd, fadd, etc. The conversion process chooses the most appropriate variant based on the operand types reported by the bytecode verifier.

Listing 2 shows a simplified form of the CVM interpreter code for the converted instructions. The interpreter executes more efficiently because it can assume that the values on the stack are of the correct type (bytecode verification having already been performed).

Items on the CVM execution stack are a uniform size of one word: 64-bit and larger types straddle multiple words. The CVM conversion process takes care of laying out the stack according to the types of local variables and stack items.

This isn't necessarily a new approach ­ it is normally known as "threaded interpretation" in the Forth community.

The complete list of CVM instructions is given in Listing 3.

Ramping Up
Conventional wisdom says that one should write a handcrafted assembly code loop to get a fast interpreter. However, there are some simple tricks that can be used to speed up an interpreter, even in C code.

  1. Register variables
  2. Computed gotos
  3. Direct threading
  4. CPU-specific unrolling
C compilers aren't terribly good at determining which values are most used in switch-loop interpreter code. The compiler invariably guesses wrong, favoring temporaries over important variables like the program counter and stack pointer. So it is necessary to "help" the compiler a little.

The gcc compiler can bind variables to explicit registers, as follows:

register unsigned char *pc __asm__ ("esi");

We placed the program counter, the top-of-stack pointer, and the frame pointer into x86 registers. This produced a significant improvement in performance compared to straight C code, for such a small change.

The next step was to change from switch statements to using computed gotos. This is normally referred to as token threading. The break at the end of each case is replaced with a goto statement:

goto *main_label_table[*pc];

The main_label_table contains pointers to each of the cases in the switch statement, allowing the interpreter to jump directly to the next case, avoiding the overhead of jumping back to the top of the switch-loop. More information on computed gotos can be found in the gcc documentation.

The result of these two changes (use of explicit registers and token threading) was an interpreter that was so close to a handcrafted assembly loop that there was little point writing one by hand.

The third step involved a change in representation. The switch loop and token-threaded versions select instruction handlers based on CVM bytecode. Instead of storing the single-byte opcodes, we can store the actual addresses of the opcode handlers in the CVM instruction stream. This is known as direct threading.

goto **((void **)pc);

Direct threading increases the size of the CVM code by a factor of 4, because instructions are now pointers to handlers rather than bytecodes. But it avoids the overhead of looking up values in main_label_table. On RISC platforms, this can give a significant increase in engine performance, but on x86 it isn't too impressive. If memory is an issue, token threading gives better results.

Unrolling
Direct threading really shines when combined with some native JIT techniques. We implemented a "mini JIT" that converted simple CVM instruction sequences into x86 machine code on the fly. We call this an unroller because it essentially unrolls the interpreter loop into straight-line machine code.

The unroller uses simple register allocation techniques on the basic blocks of a method. Complex instructions, especially those involving method calls, are not unrolled. It isn't possible for unrolling to achieve the same performance as a JIT, but it can get very close.

The primary advantage of the unroller compared to a JIT is that it is vastly simpler to implement. Portable.NET's x86 unroller took about two weeks to write, and we expect that other CPUs would require a similar amount of effort.

Anything that is too complicated to convert is replaced with a jump back into the interpreter core. This allows unrollers to be developed in stages, replacing one instruction at a time and then retesting. This made development a lot easier than the "all or nothing" approach required for a JIT.

PInvoke
The "platform invoke" (or PInvoke) feature is a very powerful mechanism that CLI programs can use to call legacy native code.

When Portable.NET encounters a PInvoke method reference, it compiles a small CVM stub that performs any necessary parameter marshaling and then calls the underlying native function. Upon return, the CVM stub demarshals the return value.

A similar process is used for "internalcall" methods within the runtime engine that implement built-in features for the C# class library.

Using CVM to perform marshaling operations simplifies native function invocations quite considerably. Only a small amount of platform-specific code is needed to perform the native call, for which we use the standard "libffi" library.

Inlining
Method call overhead is an issue for all interpreter-based abstract machines because method calls are more complicated than the equivalent native code.

CVM addresses this by selectively inlining some of the more commonly used methods in the C# class library. The method call is replaced with a special-purpose opcode during code conversion.

The major groups of inlineable methods within Portable.NET are currently the string, monitor, and 2D array operations.

Inlining common methods can have a dramatic impact on performance. The PNetMark "Float" benchmark improved by a factor of 12 when 2D array operations were inlined. Such operations are normally very expensive in CLRs because a method must be called for every element get or set operation.

Alternate Back Ends
The construction of the interpreter itself was made as modular as possible. The interface between the metadata handling and the execution was kept separate using a standard interface: the coder API.

Coders are an interface between the CIL front end and the execution engine. The design allows for the current CVM back end to be replaced by a fully fledged JIT without any modification to the other components in the system. The same front end could be used with the CVM engine, a native JIT, or even a polymorphic interpreter.

Performance Summary
Table 1 compares the CVM interpreter variants with the Mint polymorphic interpreter and the Mono x86 JIT. All tests were done on an 866 MHz Pentium III, running Red Hat Linux 7.1. Version 0.17 of Mono and version 0.5.0 of Portable.NET were used for these comparisons.

Table 1

As can be seen, the simple techniques described in this article produce very good results, with the unrolled version achieving 71% overall compared to Mono's JIT.

Future Work
Portable.NET's interpreter remains a work in progress. More optimizations are possible by introducing new CVM instructions for special cases.

We are also investigating selective inlining as an alternative to writing a handcrafted unroller for each CPU. The authors of that paper also reported performance of up to 70% of optimized C code using simple techniques. Their engine, for Objective Caml, has a single line of platform-specific code, to perform a flush of the CPU's instruction cache. Selective inlining doesn't work very well for x86, but it should do well for RISC CPUs like the PowerPC.

In the near future, we will investigate fully fledged JIT coders for Portable.NET, as well as front ends for other instruction sets such as the Java Virtual Machine.

Conclusion
Using a variety of well-known, yet simple, techniques, Portable.NET is able to achieve adequate performance for most application-oriented tasks.

At the same time, the code is highly portable. Ports to new platforms take a matter of days, sometimes hours (e.g., the author ported the code to Mac OSX in a single day).

Acknowledgments
Portable.NET would not have been possible without the generous assistance of volunteers from the DotGNU community.

References

  • Southern Storm Software: www.southern-storm.com.au/portable_net.html
  • Common Language Infrastructure, Partitions I to IV, ECMA 335: ftp://ftp.ecma.ch/ecma-st/Ecma-335-part-i-iv.pdf
  • The Mono Project: www.go-mono.com
  • Lindholm, T. and Yellin. F, (1999). The Java Virtual Machine Specification, Second Edition. Addison-Wesley.
  • A Portable Forth Engine: www.complang.tuwien.ac.at/forth/threaded-code
  • gcc documentation: http://gcc.gnu.org/onlinedocs/gcc-3.2.1/ gcc/Labels-as-Values.html
  • Puimarta, I. and Riccardi, F. "Optimizing direct threaded code by selective inlining," SIGPLAN '98, pages 291-300, ACM Press.
  • DotGNU: www.dotgnu.org

    Copyright ©2003 Rhys Weatherley and Gopal V. Permission is granted to copy, distribute, and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

    Author Bios
    Rhys Weatherley is a member of the DotGNU Steering Committee and the primary author of Portable.NET. He graduated from the University of Queensland in 1990 and has worked at a number of universities and companies in Australia and the U.S. rweather@southern-storm.com.au

    Gopal V is a final-year computer science student in India and has been involved with DotGNU since August 2001. He's the pnetlib lead developer and has worked on the compiler and runtime as well. Gopal's current pet project is the Java language front-end for the compiler. gopalv82@symonds.net

    	
    
    
    
    
    Listing 1: Polymorphic interpretation
    
    case CEE_LDC_I4_0:          case CEE_ADD:
      sp->type = VAL_I32;         ++ip;
      sp->data.i = 0;             --sp;
      ++sp;                       if(sp->type == VAL_I32) {
      ++ip;                         sp[-1].data.i += sp->data.i;
      break;                      } else if(sp->type == VAL_I64) {
                                    sp[-1].data.l += sp->data.l;
    case CEE_LDC_I8:              } else if(sp[-2].type == VAL_DOUBLE) {
      ++ip;                         sp[-1].data.f += sp->data.f;
      sp->type = VAL_I64;         } else {
      sp->data.l = read64(ip);      ...
      ip += 8;                    }
      ++sp;                       break;
      break;
    
    
    
    Listing 2: Interpreting converted instructions
    
    case COP_IADD:
      sp[-2].intval += sp[-1].intval;
      --sp;
      ++pc;
      break;
    
    case COP_LADD:
      *((ILInt64 *)(&(sp[-(WORDSPERLONG * 2])))) +=
          *((ILInt64 *)(&(sp[-WORDSPERLONG]))));
      sp -= WORDSPERLONG;
      ++pc;
      break;
    
    case COP_FADD:
      *((ILNativeFloat *)(&(sp[-(WORDSPERFLOAT * 2])))) +=
          *((ILNativeFloat *)(&(sp[-WORDSPERFLOAT]))));
      sp -= WORDSPERFLOAT;
      ++pc;
      break;
    
    
    
    Listing 3: Complete CVM instruction set
    
    ansi2str array2ptr array\_len beq bfixup bge bge\_un bgt
    bgt\_un ble ble\_un bload blt blt\_un bne box box\_ptr br
    bread bread\_elem bread\_field break brfalse br\_long brnonnull
    brnull br\_peq br\_pne brtrue bstore bwrite bwrite\_elem
    bwrite\_field bwrite\_r call call\_ctor call\_interface
    call\_native call\_native\_raw call\_native\_void
    call\_native\_void\_raw call\_virtual castclass castinterface
    cctor\_once ckarray\_load\_i4 ckarray\_load\_i8 ckarray\_store\_i8
    ckfinite ckheight ckheight\_n cknull cknull\_n delegate2fnptr
    dfixup dread dread\_elem dup dup2 dup\_n dup\_word\_n dwrite
    dwrite\_elem dwrite\_r enter\_try f2d f2d\_aligned f2f f2f\_aligned
    f2i f2i\_ovf f2iu f2iu\_ovf f2l f2l\_ovf f2lu f2lu\_ovf fadd fcmpg
    fcmpl fdiv ffixup fix\_i4\_i fix\_i4\_u fmul fneg fread fread\_elem
    frem fsub fwrite fwrite\_elem fwrite\_r get2d get\_static i2b
    i2b\_aligned i2b\_ovf i2f i2iu\_ovf i2l i2p\_lower i2s i2s\_aligned
    i2s\_ovf i2ub i2ub\_ovf i2ul\_ovf i2us i2us\_ovf iadd iadd\_ovf
    iadd\_ovf\_un iand icmp icmp\_un idiv idiv\_un iload iload\_<n>
    imul imul\_ovf imul\_ovf\_un ineg inot ior iread iread\_elem
    iread\_field iread\_this irem irem\_un ishl ishr ishr\_un isinst
    isinterface istore istore\_<n> isub isub\_ovf isub\_ovf\_un iu2b\_ovf
    iu2f iu2i\_ovf iu2l iu2s\_ovf iu2ub\_ovf iu2us\_ovf iwrite iwrite\_elem
    iwrite\_field iwrite\_r ixor jsr l2f l2i l2i\_ovf l2ui\_ovf l2ul\_ovf
    ladd ladd\_ovf ladd\_ovf\_un land lcmp lcmp\_un ldc\_i4 ldc\_i4\_<n>
    ldc\_i4\_s ldc\_i8 ldc\_r4 ldc\_r8 ldftn ldinterfftn ldiv ldiv\_un
    ldnull ldstr ldtoken ldvirtftn lmul lmul\_ovf lmul\_ovf\_un lneg
    lnot local\_alloc lor lread\_elem lrem lrem\_un lshl lshr lshr\_un
    lsub lsub\_ovf lsub\_ovf\_un lu2f lu2i\_ovf lu2iu\_ovf lu2l\_ovf
    lwrite\_elem lxor maddr memcpy memmove memset memzero mk\_local\_1
    mk\_local\_2 mk\_local\_3 mk\_local\_n mkrefany mload monitor\_enter
    monitor\_exit mread mstore mwrite mwrite\_r new new\_value nop
    pack\_varargs padd\_i4 padd\_i4\_r padd\_i8 padd\_i8\_r padd\_offset
    padd\_offset\_n pcmp pload pload\_<n> pop pop2 pop\_n pread pread\_elem
    pread\_field pread\_this prefix pstore pstore\_<n> psub psub\_i4
    psub\_i8 pushdown push\_thread push\_thread\_raw pwrite pwrite\_elem
    pwrite\_field pwrite\_r refanytype refanyval refarray2ansi refarray2utf8
    ret\_jsr return return\_1 return\_2 return\_n set2d seteq setge setgt
    setle setlt setne set\_num\_args sfixup squash sread sread\_elem sread\_field
    str2ansi str2utf8 string\_concat\_2 string\_concat\_3 string\_concat\_4
    string\_eq string\_get\_char string\_ne switch swrite swrite\_elem
    swrite\_field swrite\_r tail\_call throw throw\_caller type\_from\_handle
    ubread ubread\_elem ubread\_field unroll\_method usread usread\_elem
    usread\_field utf82str waddr waddr\_native\_<n> wide}
    

    All Rights Reserved
    Copyright ©  2004 SYS-CON Media, Inc.

      E-mail: info@sys-con.com