Because CUDA is the most popular GPU programming environment, we can use it as a reference for defining terminology in KA. A workgroup is called a block in NVIDIA CUDA and designates a group of threads acting in parallel, preferably in lockstep. For the GPU, the workgroup size is typically around 256, while for the CPU, it is usually a multiple of the natural vector-width. An ndrange is called a grid in NVIDIA CUDA and designates the total number of work items. If using a workgroup of size 1 (non-parallel execution), the ndrange is the number of items to iterate over in a loop.
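The relationship between workgroups and the ndrange can be made concrete with a small sketch. It uses the @kernel and @index macros described in the next section and runs on the CPU backend; the kernel name record_indices! and the sizes n and wg are illustrative choices, not part of KA.

```julia
using KernelAbstractions

# Hypothetical kernel: each work item records which workgroup it belongs to
# and its position inside that workgroup.
@kernel function record_indices!(groups, locals)
    i = @index(Global, Linear)            # position within the whole ndrange
    groups[i] = @index(Group, Linear)     # index of the workgroup (CUDA: block)
    locals[i] = @index(Local, Linear)     # index inside the workgroup (CUDA: thread)
end

backend = CPU()
n, wg = 16, 4                             # ndrange of 16 items, workgroups of 4
groups = zeros(Int, n)
locals = zeros(Int, n)

record_indices!(backend, wg)(groups, locals, ndrange=n)
KernelAbstractions.synchronize(backend)

println(groups)
println(locals)
```

For a one-dimensional ndrange like this one, each global index should decompose as (group - 1) * wg + local, mirroring the block/thread decomposition in CUDA.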

Writing your first kernel

Kernel functions are marked with the @kernel macro. Inside a function marked with @kernel you can use the kernel language. As an example, the mul2_kernel kernel below multiplies each element of the array A by 2. It uses the @index macro to obtain the global linear index of the current work item.

@kernel function mul2_kernel(A)
  I = @index(Global)  # global linear index of the current work item
  A[I] = 2 * A[I]
end
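
Besides the linear index, @index can also return the index as a tuple, which is convenient for multi-dimensional arrays. The following is a minimal sketch under that assumption; the kernel scale_rows! and its arguments are hypothetical names chosen for illustration, and the launch uses the CPU backend.

```julia
using KernelAbstractions

# Hypothetical kernel: scale row i of the matrix A by the factor s[i].
@kernel function scale_rows!(A, s)
    i, j = @index(Global, NTuple)   # one (row, column) pair per work item
    A[i, j] = s[i] * A[i, j]
end

backend = CPU()
A = ones(128, 4)
s = collect(1.0:128.0)

# One work item per matrix element, workgroup size 64.
scale_rows!(backend, 64)(A, s, ndrange=size(A))
KernelAbstractions.synchronize(backend)
```

After the launch has completed, every entry in row i of A should equal s[i].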

Launching kernel on the host

You can construct a kernel for a specific backend by calling the kernel function, for example mul2_kernel(CPU(), 16). The first argument is a backend of type KA.Backend and the second is the workgroup size. The call returns a generated kernel executable, which is then executed with the input argument A and the additional ndrange argument.

using KernelAbstractions

dev = CPU()
A = ones(1024, 1024)
mul2_kernel(dev, 64)(A, ndrange=size(A))
synchronize(dev)  # wait for the asynchronous launch before reading A
all(A .== 2.0)

All kernels are launched asynchronously. The synchronize function blocks the host until the kernel has completed on the backend.
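
Because constructing the kernel and launching it are separate steps, the generated kernel executable can be kept and reused. The sketch below assumes the CPU backend and repeats the mul2_kernel definition so that it is self-contained; the name kernel! is only a local variable chosen for illustration.

```julia
using KernelAbstractions

@kernel function mul2_kernel(A)
    I = @index(Global)
    A[I] = 2 * A[I]
end

backend = CPU()
kernel! = mul2_kernel(backend, 16)   # construct once for this backend and workgroup size

A = ones(64)
B = ones(32, 32)

# Launch the same kernel object on two arrays with different ndranges.
kernel!(A, ndrange=size(A))
kernel!(B, ndrange=size(B))

# Both launches are asynchronous; wait before inspecting the results.
KernelAbstractions.synchronize(backend)
```

A single synchronize call after the two launches is enough here, since it waits for all work previously submitted to the backend.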

Launching kernel on the backend

To launch the kernel on a GPU backend, that is, one for which isa(backend, KA.GPU) holds (e.g., CUDABackend(), ROCBackend(), oneBackend()), we generate the kernel for the backend provided by CUDAKernels, ROCKernels, or oneAPIKernels.

First, we initialize the array using the Array constructor of the chosen backend with one of the following:

using CUDAKernels # Required to access CUDABackend
A = CuArray(ones(1024, 1024))

using ROCKernels # Required to access ROCBackend
A = ROCArray(ones(1024, 1024))

using oneAPIKernels # Required to access oneBackend
A = oneArray(ones(1024, 1024))

The kernel generation and execution are then

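backend = CUDABackend()  # or ROCBackend() / oneBackend(), matching the array type
mul2_kernel(backend, 64)(A, ndrange=size(A))
synchronize(backend)     # wait for the asynchronous launch to finish
all(A .== 2.0)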

For simplicity, we stick with the case of backend=CUDABackend().

All kernel launches are asynchronous. Use synchronize(backend) to wait on a series of kernel launches.
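
One way to keep the launch and the wait together is to wrap them in a helper function. The following is a minimal sketch, not a prescribed pattern: the function name double! is hypothetical, the backend is looked up from the array with KernelAbstractions.get_backend, and the example runs on a plain CPU Array so that it needs no GPU.

```julia
using KernelAbstractions

@kernel function mul2_kernel(A)
    I = @index(Global)
    A[I] = 2 * A[I]
end

# Hypothetical helper: launch on whatever backend A lives on,
# then block until the kernel has finished.
function double!(A)
    backend = KernelAbstractions.get_backend(A)
    mul2_kernel(backend, 64)(A, ndrange=size(A))
    KernelAbstractions.synchronize(backend)
    return A
end

A = ones(256, 4)
double!(A)
```

Synchronizing inside the helper is the simplest choice but also the most conservative one; when several kernels are launched in a row, a single synchronize(backend) after the last launch avoids blocking the host between them.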

The code around KA may rely heavily on GPUArrays, for example, to initialize variables.

using CUDAKernels # Required to access CUDABackend

function mymul(A::CuArray)
    A .= 1.0
    mul2_kernel(CUDABackend(), 64)(A, ndrange=size(A))
    synchronize(CUDABackend())  # wait for the kernel before checking the result
    all(A .== 2.0)
end

function mymul(A::CuArray, B::CuArray)
    A .= 1.0
    B .= 3.0
    mul2_kernel(CUDABackend(), 64)(A, ndrange=size(A))
    mul2_kernel(CUDABackend(), 64)(B, ndrange=size(B))  # second launch doubles B, so A + B == 8
    synchronize(CUDABackend())  # one wait covers both launches
    all(A .+ B .== 8.0)
end

Using task programming to launch kernels in parallel.