Running kernels on the GPU

We will now see how we can run kernels on the GPU. The following is typical CUDA code that adds the elements of two arrays of a given size:

// file: add_gpu.cu

#include "gpu.h"

__global__
void add(int n, float* x, float* y, float* r)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        r[i] = x[i] + y[i];
}

void add_gpu(int n, float* x, float* y, float* r)
{
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(n, x, y, r);
    cudaDeviceSynchronize();
}

The add function is a CUDA kernel (not to be confused with the gumath kernels!). This is what will actually run on the GPU. A GPU can be much faster than a CPU because it massively parallelizes computations, and this is why we have the index and stride variables: each thread starts at its own index and advances by stride, so the kernel processes different parts of the data at the same time.

Our gumath kernel, however, will use the add_gpu function, which internally launches add with the special <<<numBlocks, blockSize>>> CUDA syntax; these extra parameters specify how much the computation is parallelized (the number of blocks and the number of threads per block). The cudaDeviceSynchronize() call blocks the CPU execution until all GPU computations are done.
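To make the index/stride scheme more concrete, here is a small pure-Python sketch (not part of the CUDA code, with hypothetical sizes) that mimics what the threads do: together they cover every element exactly once, even when n is not a multiple of the total number of threads.

# Pure-Python illustration of the grid-stride loop (hypothetical sizes)
n = 1000
blockSize = 256
numBlocks = (n + blockSize - 1) // blockSize   # 4 blocks, i.e. 1024 threads in total
stride = blockSize * numBlocks

covered = []
for blockIdx in range(numBlocks):              # on the GPU, all of this runs in parallel
    for threadIdx in range(blockSize):
        index = blockIdx * blockSize + threadIdx
        for i in range(index, n, stride):
            covered.append(i)

assert sorted(covered) == list(range(n))       # every element is handled exactly once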

The GPU has its own memory, which is separate from the CPU memory. When we want to do a computation on the GPU, we first have to copy the data from the CPU to the GPU, and when we want to retrieve the results, we have to copy the data back to the CPU. This can be taken care of by the so-called “unified memory”, which provides a single memory space accessible from both the GPU and the CPU. The following file contains functions to allocate and free unified memory, here again through special CUDA functions:

// file: gpu.cu

#include "gpu.h"

float* get_array(int n)
{
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));
    return x;
}

void del_array(float* x)
{
    cudaFree(x);
}

Now let’s see what the gpu.h file looks like:

// file: gpu.h

extern "C" void add_gpu(int n, float* x, float* y, float *r);
extern "C" float* get_array(int n);
extern "C" void del_array(float* x);

It consists of the prototypes of the add_gpu function, for which we want to make a kernel, and of the get_array and del_array functions, which we will use to manage the memory for our data. Note the extern "C" declarations: because nvcc (the CUDA compiler) is a C++ compiler, we need them to prevent name mangling, so that the functions can be looked up by their plain C names from Python.
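To see why this matters, remember that ctypes (used below) looks up symbols by their exact name. A minimal sketch, assuming libgpu.so has already been compiled as shown in the next step:

import ctypes

gpu = ctypes.CDLL('./libgpu.so')
print(gpu.get_array)   # found, because the symbol is exported with C linkage
# Without extern "C", nvcc would export a mangled name (something like _Z9get_arrayi)
# and the attribute lookup above would raise an AttributeError instead.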

Since gpu.cu only manages the GPU memory, and is independent of the gumath kernel generation, we will simply access its functions through a shared library. This is how we compile it:

$ nvcc -o libgpu.so --compiler-options "-fPIC" --shared gpu.cu

This gives us a libgpu.so library that we can interface with in Python using ctypes. The following code wraps the C functions in Python functions:

# file: gpu.py

import ctypes
import numpy as np
from xnd import xnd

gpu = ctypes.CDLL('./libgpu.so')
gpu.get_array.restype = ctypes.POINTER(ctypes.c_float)
gpu.del_array.argtypes = [ctypes.POINTER(ctypes.c_float), ]

def xnd_gpu(size):
    addr = gpu.get_array(size)
    a = np.ctypeslib.as_array(addr, shape=(size,))
    x = xnd.from_buffer(a)
    return x, addr

def del_gpu(addr):
    gpu.del_array(addr)

xnd_gpu returns an XND container (and its data pointer) whose data lives in unified memory, and del_gpu frees the memory referenced by such a pointer.
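Since every xnd_gpu call must eventually be matched by a del_gpu call, it can be convenient to pair them in a context manager. Here is a minimal sketch (gpu_array is a hypothetical helper, not part of the code above):

# file: gpu_context.py (hypothetical helper)

from contextlib import contextmanager
from gpu import xnd_gpu, del_gpu

@contextmanager
def gpu_array(size):
    x, addr = xnd_gpu(size)
    try:
        yield x
    finally:
        del_gpu(addr)  # always free the unified memory, even if an exception occurred

With this, a with gpu_array(size) as x: block guarantees that the unified memory is released when the container is no longer needed.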

Now we need to generate the gumath kernel for our add_gpu function. We save its prototype in the following file:

// file: add_gpu.h

extern void add_gpu(int n, float* x, float* y, float *r);

The corresponding configuration file looks like this:

# file: add_gpu-kernels.cfg

[MODULE add_gpu]
typemaps =
     float: float32
     int: int32
includes =
     add_gpu.h
include_dirs =
sources =
     add_gpu.c

libraries =

library_dirs =

header_code =
kinds = C
ellipses = none

[KERNEL add_gpu]
prototypes =
     void add_gpu(int n, float* x, float* y, float* r);
description =
dimension = x(n), y(n), r(n)
input_arguments = x, y
inplace_arguments = r
hide_arguments = n = len(x)

We can now generate the kernel:

$ xnd_tools kernel add_gpu-kernels.cfg
$ xnd_tools module add_gpu-kernels.cfg

These two commands generate add_gpu-kernels.c (the kernel source) and add_gpu-python.c (the Python extension source that setup.py will use below). We then compile everything into a static library:

$ nvcc --compiler-options '-fPIC' -c add_gpu.cu
$ gcc -fPIC                                               \
  -c add_gpu-kernels.c                                    \
  -c $SITE_PACKAGES/xndtools/kernel_generator/xndtools.c  \
  -I$SITE_PACKAGES/xndtools/kernel_generator              \
  -I$SITE_PACKAGES/xnd                                    \
  -I$SITE_PACKAGES/ndtypes                                \
  -I$SITE_PACKAGES/gumath
$ ar rcs libadd_gpu-kernels.a add_gpu.o add_gpu-kernels.o xndtools.o

In the commands above, $SITE_PACKAGES points to your Python site-packages directory. Finally, launch python setup.py install with this setup.py file:

# file: setup.py

from distutils.core import setup, Extension
from distutils.sysconfig import get_python_lib

site_packages = get_python_lib()
lib_dirs = [f'{site_packages}/{i}' for i in ['ndtypes', 'gumath', 'xnd']]

module1 = Extension('add_gpu',
                    include_dirs = lib_dirs,
                    libraries = ['add_gpu-kernels', 'ndtypes','gumath', 'xnd', 'cudart', 'stdc++'],
                    library_dirs = ['.', '/usr/local/cuda-9.2/lib64'] + lib_dirs,
                    sources = ['add_gpu-python.c'])

setup (name = 'add_gpu',
       version = '1.0',
       description = 'This is a gumath kernel extension that adds two XND containers on the GPU',
       ext_modules = [module1])

If everything went well (you may need to adjust the CUDA path in library_dirs to match your installation), you should be able to run the kernel on the GPU:

>>> from gpu import xnd_gpu, del_gpu
>>> from add_gpu import add_gpu
>>> size = 1 << 20
>>> x0, a0 = xnd_gpu(size)
>>> x1, a1 = xnd_gpu(size)
>>> x2, a2 = xnd_gpu(size)
>>> for i in range(size):
...     x0[i] = i
...     x1[i] = 1
>>> x0
xnd([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...], type='1048576 * float32')
>>> x1
xnd([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...], type='1048576 * float32')
>>> add_gpu(x0, x1, x2)
>>> x2
xnd([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...], type='1048576 * float32')
>>> del_gpu(a0)
>>> del_gpu(a1)
>>> del_gpu(a2)
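Note that filling a million elements one by one from Python is slow. Since the data lives in unified memory, the buffers can also be filled in one shot through NumPy views built from the returned pointers, exactly like the view created inside gpu.py; a sketch, reusing a0, a1 and size from the session above:

# Fill the unified-memory buffers in one shot through NumPy views
import numpy as np

v0 = np.ctypeslib.as_array(a0, shape=(size,))
v1 = np.ctypeslib.as_array(a1, shape=(size,))
v0[:] = np.arange(size, dtype=np.float32)
v1[:] = 1.0

x0 and x1 see the new values immediately, since they wrap the same memory.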