OpenCL for Interpreter Implementation

An interpreter can dynamically compile virtual machine byte code to OpenCL [1] for execution on the CPU, improving execution performance relative to an efficient interpreter implementation.

Compiling via OpenCL at run-time provides the benefits of JIT compilation with implementation simplicity and portability.

Mixed mode execution

A JIT compiler can be invoked on a sub-routine, loop, or other hot-spot in program execution. This creates mixed-mode execution: infrequently executed code is interpreted, and frequently executed code is JIT compiled for performance. The flow of control passes between interpreted and JIT compiled code.

It may be feasible in some JIT compilers to compile all loaded byte code, and then to discard the byte code entirely as redundant. In compilation to OpenCL this is unlikely to be the case. OpenCL restricts the operations permitted in kernel code, e.g. prohibiting I/O calls [2] and calls to external library routines. Therefore, mixed mode execution is required so that useful program operations such as I/O are not prohibited.

Control and Data Transfers

The interpreter and JIT code together implement a virtual machine. The VM performs computation on data in memory; a conventional JIT system permits both interpreted code and JIT compiled code equal access to memory. In contrast, the OpenCL memory model is more restrictive. An OpenCL program (a kernel) executes on data in buffers, subsets of memory specified by the host application for the kernel. An OpenCL kernel cannot (certainly, should not!) access memory outside the buffers given. Buffers, and the copying of buffer contents, permit kernel execution on GPUs and systems with multiple memory spaces, but complicate use of OpenCL as a compilation target.

A conventional approach is to require the programmer to annotate programs intended to be compiled to OpenCL, specifying explicitly the data to be made visible to kernel code. This approach is awkward for complex data structures, or complex access patterns.

For a VM system, the memory ranges a computation may access are always known, provided computations access only memory areas allocated by the VM itself: typically a stack, heap, and static data area.

The major VM components are: two sets of program code (byte code, and OpenCL), and the VM heap and stack, which form a contiguous memory range.

The interpreter tracks the current byte code instruction in the program being executed via the program counter (PC) variable. Additional state associated with a thread of control, e.g. the stack pointer (SP) and possibly a heap pointer indicating the next available allocation point, may be updated during the execution of a VM instruction.
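As an illustration, the core of such an interpreter is a dispatch loop over the byte code, with the PC and SP as ordinary variables. This minimal sketch uses switch dispatch; the opcodes and the tiny instruction set are invented for the example and are far simpler than the VM described here:

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Invented, minimal opcode set for illustration only; the real VM has
// instructions such as LOAD_WORD_STATIC, JMPT, and ENTER_JIT.
enum Op : unsigned { PUSH_WORD, ADD, PRINT, HALT };

// Switch-dispatch interpreter loop: pc indexes into the byte code, and sp
// points at the top of the VM's own stack (not the native C++ stack).
unsigned run(const std::vector<unsigned>& code) {
    std::vector<unsigned> stack(64);
    unsigned* sp = stack.data();       // stack[0] is an unused sentinel slot
    unsigned pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case PUSH_WORD: *++sp = code[pc++]; break;  // operand follows the opcode
        case ADD:   { unsigned op1 = *sp--; *sp = *sp + op1; } break;
        case PRINT: std::printf("%u\n", *sp); break;
        case HALT:  return *sp;        // result is left on top of the stack
        }
    }
}
```

A program such as (+ 10 32) would compile to PUSH_WORD 10, PUSH_WORD 32, ADD; the loop executes it instruction by instruction, updating the PC and SP as each VM instruction runs.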

In mixed mode execution, the state of the program (e.g. memory) can be altered by the JIT compiled OpenCL code which may perform heap allocations, push or pop stack values, and write to the VM heap or stack.

Program state may include status codes indicating conditions such as 'Out of Memory' or 'Stack Overflow', signalling an error so that execution cannot proceed or that remedial action is required [3].

The solution is to ensure program state is packed into OpenCL buffers when transfers of control are performed. Happily, on the CPU, this transfer of control does not require copying the heap and stack but merely the creation of an OpenCL buffer wrapping them to serve as a parameter to the kernel. The program counter, stack pointer, and ancillary state variables can be packed into a buffer on each transfer, in order to update the state in the kernel or interpreter respectively when transferring the thread of execution from one to the other.

On entering the OpenCL code, the control buffer is consulted to set the stack pointer and to jump to the label representing the instruction to execute. Likewise, on leaving OpenCL, the SP and PC are packed back into the buffer for the interpreter to resume at the correct point with the correct sized stack.
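A host-side sketch of this packing, mirroring the ret[] layout used by the generated kernel shown later (ret[0] = top-of-stack value, ret[1] = PC, ret[2] = stack pointer offset from the stack base). The struct and its field names are illustrative, not taken from the implementation:

```cpp
#include <cassert>

// Control buffer layout, matching the generated kernel's use of ret[]:
// ret[0] = top-of-stack value, ret[1] = pc, ret[2] = stack-pointer offset.
// The struct and field names are illustrative.
struct VMState { unsigned tos, pc, sp_offset; };

// Pack interpreter state into the buffer before entering the kernel.
void pack(const VMState& s, unsigned ret[3]) {
    ret[0] = s.tos; ret[1] = s.pc; ret[2] = s.sp_offset;
}

// Unpack kernel-updated state when control returns to the interpreter.
VMState unpack(const unsigned ret[3]) {
    return VMState{ ret[0], ret[1], ret[2] };
}
```

On the CPU the heap and stack themselves would be wrapped without copying, e.g. via an OpenCL buffer created over the existing host allocation (CL_MEM_USE_HOST_PTR), with only this small control buffer packed and unpacked on each transfer.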

If VM code needs to access memory allocated outside the VM's own heap and stack ranges, control should be transferred back to the interpreter to do so.

Control Flow Constraints

OpenCL prohibits recursive control flow; this is quite a severe constraint for a direct translation of a potentially recursive input program. In such a case, a compilation process down to the VM would have to handle recursive and mutually recursive functions, or reject that input program.

The interpreter and VM here do not use the native C or OpenCL stack to implement function call and return. Therefore, the problematic recursion does not arise: the function call stack on which recursion occurs is not the OpenCL call stack [4].

Not using the native call stack directly also permits control transfers between interpreted code and OpenCL at arbitrary points in the program.
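The frame discipline implied by the generated MARK_STK/CALL/RETURN code shown later can be sketched in plain C++. A frame is three words, [return-value slot, saved frame pointer, return PC], living in the VM's own stack array, so source-level recursion never consumes the native (or OpenCL) call stack. This is a simplified reading of the generated code, not the actual implementation:

```cpp
#include <cassert>
#include <vector>

// VM state: the stack lives in VM memory; sp and fp are indices into it.
struct VM {
    std::vector<unsigned> mem = std::vector<unsigned>(256);
    unsigned sp = unsigned(-1);  // top-of-stack index; wraps to 0 on first push
    unsigned fp = 0;             // index of current frame's return-value slot
    unsigned pc = 0;
};

// MARK_STK: reserve a three-word frame and point fp at its base.
void mark_stk(VM& vm) {
    vm.mem[++vm.sp] = unsigned(-1);  // return-value placeholder
    vm.mem[++vm.sp] = vm.fp;         // saved caller frame pointer
    vm.mem[++vm.sp] = unsigned(-1);  // return-pc placeholder
    vm.fp = vm.sp - 2;
}

// CALL: record where to resume in the frame, then jump to the callee.
void call(VM& vm, unsigned target, unsigned return_pc) {
    vm.mem[vm.fp + 2] = return_pc;
    vm.pc = target;
}

// RETURN: store the result in the frame's slot, restore pc, sp, and fp.
void ret(VM& vm) {
    unsigned retval = vm.mem[vm.sp--];
    vm.mem[vm.fp] = retval;          // caller finds the result on top of stack
    vm.pc = vm.mem[vm.fp + 2];
    vm.sp = vm.fp;
    vm.fp = vm.mem[vm.sp + 1];
}
```

Recursive calls simply push further frames into the VM's memory; the native call depth stays constant no matter how deep the VM-level recursion goes.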

Experimental Benchmark

This is a non-rigorous comparison of a single, simple benchmark. The only variation is the execution strategy. The following strategies are compared:

Two interpreters, varying only in instruction dispatch:

One dynamic compiler to OpenCL:

Mixed mode interpretation:

The interpreters are implemented in C++ and compiled with the Clang compiler on OSX. The OpenCL implementation is Apple's 1.1 CPU implementation of OpenCL.

Example Toy Language and Virtual Machine

The benchmark programs are written in a small, simple imperative language with an s-expr syntax akin to Scheme. The compiler to byte code is a simple, non-optimizing single pass compiler to stack machine assembly language; an assembler, with optional peephole optimisations (well, 5 of them), translates the assembly language into byte code.

The interpreter virtual machine is a simple stack machine - not necessarily a 'good' VM design or an ideal VM instruction set, but simple and at least semi-realistic.

Many possible language and VM design parameters impact performance. The VM design here is kept constant in order to compare the distinct execution methods.

Benchmark 1, Triple Nested Loop

A sufficiently advanced compiler will optimise this down to a print of a constant value; I did not write such a compiler, and the OpenCL compiler tested did not achieve that on the generated OpenCL either. Thus, the loops are performed the requisite number of times, and a comparison can be made of the speeds of the programs performing the iterations.

    (define (looptest)
        (define x 0)
        (define y 0)
        (define z 0)
        (define sum 0)
          (while (< x 1000)
              (set! x (+ x 1))
              (set! y 0)
              (while (< y 1000)
                  (set! y (+ y 1))
                  (set! z 0)
                  (while (< z 100)
                      (set! z (+ z 1))
                      (set! sum (+ sum 1))
                  )
              )
          )
        (print! sum)
        (sum)
    )

    (looptest)

Benchmark 2, Triple Nested Loop and "Enter JIT" block

This is the same program as before, with the single addition of an enter-jit scope around the two inner loops. When execution reaches the enter-jit scope, an interpreter transfers control into compiled OpenCL code. Control is returned to the interpreter at the end of the scope. If the program is entirely compiled to OpenCL, the enter-jit is a no-op.

    (define (looptest)
        (define x 0)
        (define y 0)
        (define z 0)
        (define sum 0)
        (while (< x 1000)
            (set! x (+ x 1))
            (set! y 0)
            (enter-jit (
                (while (< y 1000)
                    (set! y (+ y 1))
                    (set! z 0)
                    (while (< z 100)
                        (set! z (+ z 1))
                        (set! sum (+ sum 1))
                    )
                )
            ))
        )
        (print! sum)
        (sum)
    )

    (looptest)

Stack Machine Assembly Code

This is the corresponding stack machine assembly language for the benchmark, after peephole optimisations have been applied. These optimisations essentially clean up some especially naive code output by the single-pass compiler.

JMP SKIPFUN0
:looptest
PUSH_WORD
0
STORE_WORD_STATIC
0
PUSH_WORD
0
STORE_WORD_STATIC
1
PUSH_WORD
0
STORE_WORD_STATIC
2
PUSH_WORD
0
STORE_WORD_STATIC
3
:LOOPCOND1
LOAD_WORD_STATIC
0
PUSH_WORD
1000
LT
JMPT LOOPBODY3
JMP LOOPEND2
:LOOPBODY3
LOAD_WORD_STATIC
0
INCR
STORE_WORD_STATIC
0
PUSH_WORD
0
STORE_WORD_STATIC
1
ENTER_JIT
:LOOPCOND4
LOAD_WORD_STATIC
1
PUSH_WORD
1000
LT
JMPT LOOPBODY6
JMP LOOPEND5
:LOOPBODY6
LOAD_WORD_STATIC
1
INCR
STORE_WORD_STATIC
1
PUSH_WORD
0
STORE_WORD_STATIC
2
:LOOPCOND7
LOAD_WORD_STATIC
2
PUSH_WORD
100
LT
JMPT LOOPBODY9
JMP LOOPCOND4
:LOOPBODY9
LOAD_WORD_STATIC
2
INCR
STORE_WORD_STATIC
2
LOAD_WORD_STATIC
3
INCR
STORE_WORD_STATIC
3
JMP LOOPCOND7
JMP LOOPCOND4
:LOOPEND5
LEAVE_JIT
JMP LOOPCOND1
:LOOPEND2
LOAD_WORD_STATIC
3
PRINT
LOAD_WORD_STATIC
3
RETURN
:SKIPFUN0
MARK_STK
CALL looptest
STOP

OpenCL generated from byte-code

Don't Repeat Yourself, aka DRY: an interpreter has an implementation of each instruction (an instruction body) for operations such as integer addition. Generating OpenCL source code requires creating a textual representation of these operations in OpenCL, too. To avoid unnecessary duplication, it is desirable to reuse the interpreter's instruction implementations, where possible, when generating source code to compile. Given that C and OpenCL are closely related, this is often practical.

Instruction bodies can be defined as C macros; these macros can in turn be 'stringified' for emission as textual source code, e.g.

#include <sstream>

// Two-level macro so the argument is expanded before being stringified.
#define MACRO_TEXT(M) MACRO_TEXT2(M)
#define MACRO_TEXT2(M) #M

// Instruction body, used directly by the interpreter...
#define ADD_INS {        \
unsigned op1 = *sp--;    \
unsigned op2 = *sp;      \
unsigned r   = op1 + op2;\
*sp = r; }

// ...and 'stringified' when writing OpenCL source code:
std::stringstream ss;
ss << MACRO_TEXT(ADD_INS);

The byte code is translated, instruction by instruction, into OpenCL source code as below:

The __OPENCL_ENTRY: label handles jumping to the appropriate point to begin execution, based on the PC value passed into the kernel. This is either program start (instruction 0) or an ENTER_JIT instruction.

When compiling all program byte code to OpenCL for execution entirely within OpenCL, as in the example below, LEAVE_JIT instructions are transformed into NOOP. Otherwise in mixed mode, a goto to the done: label is performed, exiting from the kernel.

The translation of ENTER_JIT merely defines a label, and adds an entry to the __OPENCL_ENTRY: section.

Instructions referencing constant values encoded in byte code are translated with the constant embedded directly in the OpenCL source code e.g. as is performed for PUSH_WORD.

Instructions which are the target of a jump, or the start of a function, are translated with OpenCL labels to permit the use of 'goto' to transfer control in the OpenCL source. Straight line control flow is translated directly as a sequence of successive statements.
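A much-reduced sketch of such a translator: one pass collects jump targets so they receive labels, and a second emits one OpenCL statement per instruction, embedding constants directly and using goto for jumps. The three-opcode instruction set is invented for the sketch; the real translator covers the full VM instruction set and emits the stringified macro bodies:

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Invented mini instruction set for this sketch.
enum Op : unsigned { PUSH_WORD, ADD, JMP };

std::string translate(const std::vector<unsigned>& code) {
    // Pass 1: collect jump targets so they receive __INS_n labels.
    std::set<unsigned> targets;
    for (unsigned i = 0; i < code.size(); ) {
        switch (code[i]) {
        case JMP:       targets.insert(code[i + 1]); i += 2; break;
        case PUSH_WORD: i += 2; break;
        default:        i += 1; break;
        }
    }
    // Pass 2: emit a label where needed, then the instruction body.
    std::ostringstream out;
    for (unsigned i = 0; i < code.size(); ) {
        if (targets.count(i)) out << "__INS_" << i << ":\n";
        switch (code[i]) {
        case PUSH_WORD:  // constant operand embedded directly in the source
            out << "*++sp = " << code[i + 1] << ";\n"; i += 2; break;
        case ADD:        // in the real system this body is a stringified macro
            out << "{ unsigned op1 = *sp--; *sp = *sp + op1; }\n"; i += 1; break;
        case JMP:        // control transfer becomes a goto to the target label
            out << "goto __INS_" << code[i + 1] << ";\n"; i += 2; break;
        }
    }
    return out.str();
}
```

The full generated kernel for the benchmark, produced by the real translator, follows below.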

kernel void program(__global unsigned* ret, __global unsigned* MEM, __global unsigned* CODE) {
unsigned pc = ret[1];
unsigned memidx = 0;
#define MEMSIZE 32
__global unsigned * sp = &MEM[MEMSIZE/2]+ret[2];
__global unsigned * fp = sp;
goto __OPENCL_ENTRY;
__INS_0: /* JMP */
{  pc = 87; goto __INS_87;}
__INS_2: /* PUSH_WORD */
{  *++sp = 0;}
/* __INS_4: */ /* STORE_WORD_STATIC */
{ MEM[0] = *sp--  ;}
/* __INS_6: */ /* PUSH_WORD */
{  *++sp = 0;}
/* __INS_8: */ /* STORE_WORD_STATIC */
{ MEM[1] = *sp--  ;}
/* __INS_10: */ /* PUSH_WORD */
{  *++sp = 0;}
/* __INS_12: */ /* STORE_WORD_STATIC */
{ MEM[2] = *sp--  ;}
/* __INS_14: */ /* PUSH_WORD */
{  *++sp = 0;}
/* __INS_16: */ /* STORE_WORD_STATIC */
{ MEM[3] = *sp--  ;}
__INS_18: /* LOAD_WORD_STATIC */
{ *++sp = MEM[0] ;}
/* __INS_20: */ /* PUSH_WORD */
{  *++sp = 1000;}
/* __INS_22: */ /* LT */
{ unsigned op1 = *sp--; unsigned op2 = *sp; unsigned r = op2 < op1; *sp = r;} ;
/* __INS_23: */ /* JMPT */
{ unsigned cond = *sp--;
 if (cond) { pc = 27; goto __INS_27;}}
/* __INS_25: */ /* JMP */
{  pc = 81; goto __INS_81;}
__INS_27: /* LOAD_WORD_STATIC */
{ *++sp = MEM[0] ;}
/* __INS_29: */ /* INCR */
*sp +=1 ;
/* __INS_30: */ /* STORE_WORD_STATIC */
{ MEM[0] = *sp--  ;}
/* __INS_32: */ /* PUSH_WORD */
{  *++sp = 0;}
/* __INS_34: */ /* STORE_WORD_STATIC */
{ MEM[1] = *sp--  ;}
/* __INS_36: */ /* ENTER_JIT */
__INS_37: /* LOAD_WORD_STATIC */
{ *++sp = MEM[1] ;}
/* __INS_39: */ /* PUSH_WORD */
{  *++sp = 1000;}
/* __INS_41: */ /* LT */
{ unsigned op1 = *sp--; unsigned op2 = *sp; unsigned r = op2 < op1; *sp = r;} ;
/* __INS_42: */ /* JMPT */
{ unsigned cond = *sp--;
 if (cond) { pc = 46; goto __INS_46;}}
/* __INS_44: */ /* JMP */
{  pc = 78; goto __INS_78;}
__INS_46: /* LOAD_WORD_STATIC */
{ *++sp = MEM[1] ;}
/* __INS_48: */ /* INCR */
*sp +=1 ;
/* __INS_49: */ /* STORE_WORD_STATIC */
{ MEM[1] = *sp--  ;}
/* __INS_51: */ /* PUSH_WORD */
{  *++sp = 0;}
/* __INS_53: */ /* STORE_WORD_STATIC */
{ MEM[2] = *sp--  ;}
__INS_55: /* LOAD_WORD_STATIC */
{ *++sp = MEM[2] ;}
/* __INS_57: */ /* PUSH_WORD */
{  *++sp = 100;}
/* __INS_59: */ /* LT */
{ unsigned op1 = *sp--; unsigned op2 = *sp; unsigned r = op2 < op1; *sp = r;} ;
/* __INS_60: */ /* JMPT */
{ unsigned cond = *sp--;
 if (cond) { pc = 64; goto __INS_64;}}
/* __INS_62: */ /* JMP */
{  pc = 37; goto __INS_37;}
__INS_64: /* LOAD_WORD_STATIC */
{ *++sp = MEM[2] ;}
/* __INS_66: */ /* INCR */
*sp +=1 ;
/* __INS_67: */ /* STORE_WORD_STATIC */
{ MEM[2] = *sp--  ;}
/* __INS_69: */ /* LOAD_WORD_STATIC */
{ *++sp = MEM[3] ;}
/* __INS_71: */ /* INCR */
*sp +=1 ;
/* __INS_72: */ /* STORE_WORD_STATIC */
{ MEM[3] = *sp--  ;}
/* __INS_74: */ /* JMP */
{  pc = 55; goto __INS_55;}
/* __INS_76: */ /* JMP */
{  pc = 37; goto __INS_37;}
__INS_78: /* NOOP */
/* __INS_79: */ /* JMP */
{  pc = 18; goto __INS_18;}
__INS_81: /* LOAD_WORD_STATIC */
{ *++sp = MEM[3] ;}
/* __INS_83: */ /* PRINT */
{ unsigned op1 = *sp--; printf((char const *)"%d",op1);}
/* __INS_84: */ /* LOAD_WORD_STATIC */
{ *++sp = MEM[3] ;}
/* __INS_86: */ /* RETURN */
{ unsigned retval = *sp--;  fp[0] = retval; pc = fp[2]; sp = fp; fp = &MEM[MEMSIZE/2] + sp[1]; } ;
{ goto __CALL_RETURN;}
__INS_87: /* MARK_STK */
{ *++sp = -1; *++sp = (fp - &MEM[MEMSIZE/2]); *++sp = -1; fp = sp - 2; }
/* __INS_88: */ /* CALL */
{ fp[2] = 90;
  {  pc = 2; goto __INS_2;}}
__INS_90: /* STOP */
goto done;
__CALL_RETURN:
switch (pc) {
case 90: goto __INS_90;
}
__OPENCL_ENTRY:
switch (pc) {
case 0: goto __INS_0;
case 37: goto __INS_37;
}
done:
ret[0] = *sp; ret[1] = pc; ret[2] = sp - &MEM[MEMSIZE/2];
}

Results

The interpreters were built with -O3 optimisation. 10 timing runs were taken per benchmark, the highest and lowest times discarded, and the remainder averaged to get a 'representative' time for execution of the program, excluding compilation to byte code and OpenCL.

Shorter is better; times are in seconds.

Some trends from the results: as expected, OpenCL execution outperforms interpretation, and switch dispatch is the slowest mechanism. The peephole optimisations had a surprisingly large effect, though perhaps this should be expected: they were chosen to improve exactly the most common sequences of byte code instructions in the generated code. The loop increments are specialised to an INCR instruction rather than generic addition, and accesses to the x, y, z variables use the compile-time-known address directly rather than loading addresses onto the stack as runtime constants.

In mixed mode, the instruction dispatch mechanism is not significant: performance is dominated by the OpenCL-executed part of the program, and outperforms whole-program compilation to OpenCL from unoptimised byte code.

The performance of the code when compiled to OpenCL seems to depend on the optimisations applied to the byte code. It seems likely that the OpenCL compiler cannot as effectively optimise the more naive byte code translation: perhaps due to increased code size, or an inability to optimise away operations on the stack.

Further work?

Ideally, this approach would be validated in the context of a substantial real world interpreter, such as Lua, OCaml or the JVM. (I'm not going to do that, certainly not yet!)

Doubtless, the generated OpenCL source code is not optimal for compilation or optimisation. One large OpenCL kernel function will probably not scale to large program sizes.

It seems logical to investigate the related, if different, use case of compilation to GPU via OpenCL. Making (more) efficient use of the CPU versus making (efficient, parallel) use of the GPU are two different use-cases for OpenCL in an interpreter. The latter is likely to require some VM level support for parallelism, explicitly parallel code, or serious program analysis, while the former is generic to an interpreter VM executing byte code.

Static and Dynamic Compilation

The compilation of byte code to OpenCL is motivated largely by the presence of libraries providing dynamic compilation (JIT) to CPU machine code, ideally achieving high performance (relative to an interpreter or simple JIT) for minimal effort.

The same technique can be performed via static compilers, if a dependency on the presence of a C or C++ compiler is not problematic. Rather than generating an OpenCL source string for dynamic compilation, generated C or C++ source would be written to a temporary file. This in turn would be input to a static compiler, such as g++, to compile and link into a dynamic shared object (a .so) for runtime loading into the interpreter via dlopen.
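A sketch of that pipeline, under the assumption that a cc driver is on the PATH; the file paths and the one-line 'generated' function stand in for real translated byte code, and error handling is minimal:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

// Write generated C source to a temporary file, shell out to the system C
// compiler to build a shared object, then dlopen it and call the entry
// point. Paths and the generated function are illustrative only.
unsigned run_generated() {
    // 1. Emit the "generated" source (a stand-in for translated byte code).
    std::FILE* f = std::fopen("/tmp/gen_ins.c", "w");
    if (!f) return 0;
    std::fprintf(f, "unsigned add_ins(unsigned a, unsigned b) { return a + b; }\n");
    std::fclose(f);

    // 2. Compile and link into a dynamic shared object.
    if (std::system("cc -shared -fPIC -o /tmp/gen_ins.so /tmp/gen_ins.c") != 0)
        return 0;

    // 3. Load into the running process and look up the entry point.
    void* lib = dlopen("/tmp/gen_ins.so", RTLD_NOW);
    if (!lib) return 0;
    typedef unsigned (*AddFn)(unsigned, unsigned);
    AddFn add_ins = reinterpret_cast<AddFn>(dlsym(lib, "add_ins"));
    unsigned result = add_ins ? add_ins(40, 2) : 0;
    dlclose(lib);
    return result;
}
```

On older systems the program must also link against libdl (-ldl) for dlopen and dlsym.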

Conclusions

This approach seems to work. The execution of byte code via interpretation is slower than executing the same code, compiled into OpenCL targeting the CPU. Additionally, it seems feasible to call into and return from OpenCL code in this manner for hotspots in mixed mode interpretation.

Don't read too much into this: the methodology is not sufficiently robust to draw wider conclusions beyond a speedup on these small programs, written in this simple programming language!

At least, it seems a promising approach!

References

Some interesting reading:


  1. http://www.khronos.org/opencl/

  2. Apple's CPU OpenCL 1.1 supports printf, as do others, and OpenCL 1.2 standardises printf in OpenCL. It's still different from C, though

  3. All such error handling is currently omitted…

  4. The implementation of the instructions and any support routines required in OpenCL must not recurse.