An interpreter can dynamically compile virtual machine byte code to OpenCL1 for execution on CPU, with an improvement on execution performance relative to an efficient interpreter implementation.
Compiling via OpenCL at run-time provides the benefits of JIT compilation with implementation simplicity and portability.
A JIT compiler can be invoked on a sub-routine, loop or other hot-spot in program execution. This creates mixed-mode execution; infrequently executed code is interpreted and frequently executed code is JIT compiled for performance. The flow of control passes between between interpreted and JIT compiled code.
It may be feasible in some JIT compilers to compile all loaded byte code, and then to discard byte code entirely as redundant. In compilation to OpenCL this is unlikely to be the case. OpenCL restricts the permitted and possible operations in OpenCL code e.g. prohibiting I/O calls 2 and calling external library routines. Therefore, mixed mode execution is required in order to not prohibit useful program operations such as I/O.
The interpreter and JIT code together implement a virtual machine. The VM performs computation on data in memory; a conventional JIT system permits both interpreted code and JIT compiled code equal access to memory. In contrast, the OpenCL memory model is more restrictive. An OpenCL program (a kernel) executes on data in buffers, subsets of memory specified by the host application for the kernel. An OpenCL kernel cannot (certainly, should not!) access memory outside the buffers given. Buffers, and copying of buffer contents, permits kernel execution on GPU's and systems with multiple memory spaces, but complicates use of OpenCL as a compilation target.
A conventional approach is to require the programmer to annotate programs intended to be compiled to OpenCL, specifying explicitly the data to be made visible to kernel code. This approach is awkward for complex data structures, or complex access patterns.
For a VM system, the memory ranges that a computation may access are always known if computations may access only memory areas allocated by the VM itself, typically a stack, heap, and static data area.
The preceding diagram shows the major VM components. There are two sets of program code (byte code, and OpenCL). The VM heap and stack form a contiguous memory range.
The interpreter tracks the current byte code instruction in the program being executed via the program counter (PC) variable. Additional state associated with a thread of control e.g. the stack pointer (SP) and possibly a heap pointer indicating the next available allocation point may be updated during the execution of a VM instruction.
In mixed mode execution, the state of the program (e.g. memory) can be altered by the JIT compiled OpenCL code which may perform heap allocations, push or pop stack values, and write to the VM heap or stack.
Program state may include status codes indicating conditions such as 'Out of Memory' or 'Stack Overflow' to indicate an error and indicate execution cannot proceed or that remedial action is required. 3
The solution is to ensure program state is packed into OpenCL buffers when transfers of control are performed. Happily, on the CPU, this transfer of control does not require copying the heap and stack but merely the creation of an OpenCL buffer wrapping them to serve as a parameter to the kernel. The program counter, stack pointer, and ancillary state variables can be packed into a buffer on each transfer, in order to update the state in the kernel or interpreter respectively when transferring the thread of execution from one to the other.
On entering the OpenCL code, the control buffer is consulted to set the stack pointer and to jump to the label representing the instruction to execute. Likewise, on leaving OpenCL, the SP and PC are packed back into the buffer for the interpreter to resume at the correct point with the correct sized stack.
If VM code requires to access memory allocated outside the VM's own heap and stack ranges, control should be transferred back to the interpreter to perform this.
OpenCL prohibits recursive control flow; this is quite a severe constraint for a direct translation of a potentially recursive input program. In such a case, a compilation process down to the VM would have to handle recursive and mutually recursive functions, or reject that input program.
The interpreter and VM here does not use the native C or OpenCL stack to implement function call and return. Therefore, the problematic recursion does not arise as the function call stack on which recursion occurs is not the OpenCL call stack4.
Not using the native call stack directly also permits control transfers between interpreted code and OpenCL at arbitrary points in the program.
This is a non-rigorous comparison of a single, simple benchmark. The only variation is the execution strategy. There are the following strategies:
Two interpreters, varying only in instruction dispatch:
One dynamic compiler to OpenCL:
Mixed mode interpretation:
The interpreters are implemented in C++ and compiled with the Clang compiler on OSX. The OpenCL implementation is Apple's 1.1 CPU implementation of OpenCL.
The benchmark programs are written in a small, simple imperative language with an s-expr syntax akin to Scheme. The compiler to byte code is a simple, non-optimizing single pass compiler to stack machine assembly language; an assembler, with optional peephole optimisations (well, 5 of them), translates the assembly language into byte code.
The interpreter virtual machine is a simple stack machine - not necessarily a 'good' VM design or an ideal VM instruction set, but simple and at least semi-realistic.
Many possible language and VM design parameters impact on performance. The VM design here is kept constant in order to compare the distinct execution methods.
A sufficiently advanced compiler will optimise this down to a print of a constant value; I did not write such a compiler, and the OpenCL compiler tested did not achieve that on the generated OpenCL either. Thus, the loops are performed the requisite number of times, and a comparison can be made of the speeds of the programs performing the iterations.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | |
This is the same program as before, with the single additional of an enter-jit scope around the two inner loops. When execution reaches the enter-jit scope, an interpreter transfers control into compiled OpenCL code. Control is returned to the interpreter at the end of the scope. If the program is entirely compiled to OpenCL, the enter-jit is a no-op.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | |
The corresponding stack machine assembly language for this benchmark, after peephole optimisations have been applied. These peephole optimisations essentially cleanup some especially naive code output by the single-pass compiler.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 | |
Don't Repeat Yourself aka DRY: an interpreter has implementations of each instruction e.g. instruction bodies, for operations such as Integer Addition. Generating OpenCL source code requires creating a textual representation in OpenCL code for the operations, too. To avoid unnecessary duplication, it is desirable to be able to reuse the interpreters instruction implementations, where possible or desirable, in generating source code to compile. Given that C and OpenCL are closely related, this is often practical.
Instruction bodies can be defined as C macros; these macros can in turn be 'stringified' for emission in textual source code e.g.
1 2 3 4 5 6 7 8 9 10 11 12 | |
The byte code is translated, instruction by instruction, into OpenCL source code as below:
The __OPENCL_ENTRY: label handles jumping to the appropriate point to begin execution, based on the PC value passed into the kernel. This is either program start (instruction 0) or an ENTER_JIT instruction.
When compiling all program byte code to OpenCL for execution entirely within OpenCL, as in the example below, LEAVE_JIT instructions are transformed into NOOP. Otherwise in mixed mode, a goto to the done: label is performed, exiting from the kernel.
The translation of ENTER_JIT merely defines a label, and adds an entry to the __OPENCL_ENTRY: section.
Instructions referencing constant values encoded in byte code are translated with the constant embedded directly in the OpenCL source code e.g. as is performed for PUSH_WORD.
Instructions which are the target of a jump, or the start of a function, are translated with OpenCL labels to permit the use of 'goto' to transfer control in the OpenCL source. Straight line control flow is translated directly as a sequence of successive statements.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 | |
The interpreters were built at -O3 optimisations. 10 timing runs were taken per benchmark, the highest and lowest times discarded, and the remainder averaged to get a 'representative' time for execution of the program, excluding compilation to byte code and OpenCL.
Shorter is better; times are in seconds.
Some trends from the results: as expected, OpenCL execution out-performs interpretation, and switch dispatch is the slowest mechanism. The optimisations had a large effect, somewhat surprising, except that the peephole opts were chosen to improve exactly the most common sequences of byte code instructions in the generated code: the loop increments are specialised to an incr instruction rather than generic addition, and accesses to the x,y,z variables use the compile time known address directly rather than load addresses onto the stack as a runtime constant.
In mixed mode, the instruction dispatch mechanism is not significant: the performance is dependent on the OpenCL executed part of the program, and outperforms the whole program compilation to OpenCL from unoptimized byte code.
The performance of the code when compiled to OpenCL seems to depend on the optimisations applied to the byte code. It seems likely that the OpenCL compiler cannot as effectively optimise the more naive byte code translation: perhaps due to increased code size, or an inability to optimise away operations on the stack.
Ideally, this approach would be validated in the context of a substantial real world interpreter, such as Lua, OCaml or the JVM. (I'm not going to do that, certainly not yet!)
Doubtless, the generated OpenCL source code is not optimal for compilation or optimisation. One large OpenCL kernel function will probably not scale to large program sizes.
It seems logical to investigate the related, if different, use case of compilation to GPU via OpenCL. Making (more) efficient use of the CPU versus making (efficient, parallel) use of the GPU are two different use-cases for OpenCL in an interpreter. The latter is likely to require some VM level support for parallelism, explicitly parallel code, or serious program analysis, while the former is generic to an interpreter VM executing byte code.
The compilation of byte codes to OpenCL is motivated largely by the presence of libraries providing dynamic compilation (JIT) to CPU machine code, ideally achieving high performance (relative to an interpreter or simple JIT) for minimal effort.
The same technique can be performed via static compilers, if a dependency on the presence of a C or C++ compiler is not problematic. Rather than generating an OpenCL source string for dynamic compilation, generated C or C++ source would be written to a temporary file. This in turn would be input to a static compiler, such as g++ to compile and link into a dynamic shared object (a .so) for runtime loading into the interpreter via dlopen.
This approach seems to work. The execution of byte code via interpretation is slower than executing the same code, compiled into OpenCL targeting the CPU. Additionally, it seems feasible to call into and return from OpenCL code in this manner for hotspots in mixed mode interpretation.
Don't read too much into this, the methodology is not sufficiently robust to draw wider conclusions than it showed a speedup on these small programs, written in this simple programming language!
At least, it seems a promising approach!
Some interesting reading:
http://www.khronos.org/opencl/↩
Apple's CPU OpenCL 1.1 supports printf, as do others, and OpenCL 1.2 standardises printf in OpenCL. It's still different from C, though↩
All such error handling is currently omitted…↩
The implementation of the instructions and any support routines required in OpenCL must not recurse.↩