Quickstart

Terminology

Because CUDA is the most popular GPU programming environment, we can use it as a reference for defining terminology in KA. A workgroup is called a block in NVIDIA CUDA and designates a group of threads acting in parallel, preferably in lockstep. For the GPU, the workgroup size is typically around 256, while for the CPU, it is usually a multiple of the natural vector-width. An ndrange is called a grid in NVIDIA CUDA and designates the total number of work items. If using a workgroup of size 1 (non-parallel execution), the ndrange is the number of items to iterate over in a loop.
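To make these two index spaces concrete, the sketch below (not part of the original guide; index_demo_kernel is a hypothetical name) uses the @index macro, introduced in the next section, to query each space. Launched with a workgroup size of 64 over an ndrange of 256, it runs four workgroups of 64 work items each:

using KernelAbstractions

@kernel function index_demo_kernel(out)
    g = @index(Global)  # linear index into the whole ndrange (1:256 here)
    l = @index(Local)   # linear index within the workgroup (1:64 here)
    w = @index(Group)   # linear index of the workgroup itself (1:4 here)
    out[g] = w          # record which workgroup handled each work item
end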

Writing your first kernel

Kernel functions are declared with the @kernel macro. Inside a @kernel function you can use the kernel language. As an example, the mul2_kernel below will multiply each element of the array A by 2. It uses the @index macro to obtain the global linear index of the current work item.

@kernel function mul2_kernel(A)
  I = @index(Global)
  A[I] = 2 * A[I]
end
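For multi-dimensional arrays it is often more natural to index with a CartesianIndex. A variant of the same kernel (a sketch, not from the original guide) asks @index for a Cartesian index instead of a linear one:

@kernel function mul2_kernel_cartesian(A)
  I = @index(Global, Cartesian)  # CartesianIndex spanning all dimensions of the ndrange
  A[I] = 2 * A[I]
end

Both versions behave identically for the launches shown below; the Cartesian form simply makes per-dimension indexing explicit.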

Launching a kernel on the host

You can construct a kernel for a specific backend by calling the kernel with mul2_kernel(CPU(), 64). The first argument is a device of type KA.Device and the second argument is the workgroup size. This returns a generated kernel object that is then executed with the input argument A and the keyword argument ndrange specifying the total number of work items.

A = ones(1024, 1024)
ev = mul2_kernel(CPU(), 64)(A, ndrange=size(A))  # asynchronous launch, returns an event
wait(ev)                                         # block the host until the kernel completes
all(A .== 2.0)

The launch returns an event ev. All kernels are launched asynchronously, and the event ev tracks the current state of the execution. Calling wait(ev) blocks the host until the event ev has completed on the device. This implies that the host will launch no new kernels on any device until the wait returns.
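Instead of blocking the host with wait after every launch, an event can also be passed to a subsequent launch via the dependencies keyword (described in more detail below), so that kernels are chained on the device. A minimal sketch under that assumption:

A = ones(1024, 1024)
ev1 = mul2_kernel(CPU(), 64)(A, ndrange=size(A))                    # first doubling
ev2 = mul2_kernel(CPU(), 64)(A, ndrange=size(A), dependencies=ev1)  # runs only after ev1 completes
wait(ev2)
all(A .== 4.0)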

Launching a kernel on the device

To launch the kernel on a backend-supported device (isa(device, KA.GPU), e.g. CUDADevice(), ROCDevice(), or oneDevice()), we generate the kernel for this device using the respective package: CUDAKernels, ROCKernels, or oneAPIKernels.

First, we initialize the array using the array constructor of the chosen backend:

using CUDAKernels # Required to access CUDADevice
A = CuArray(ones(1024, 1024))
using ROCKernels # Required to access ROCDevice
A = ROCArray(ones(1024, 1024))
using oneAPIKernels # Required to access oneDevice
A = oneArray(ones(1024, 1024))

The kernel generation and execution are then

ev = mul2_kernel(device, 64)(A, ndrange=size(A))
wait(ev)
all(A .== 2.0)
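Note that the final check itself runs on the device, since A .== 2.0 and all dispatch to device broadcasting and reduction. To inspect the result on the host you can copy the data back first (standard GPUArrays behavior, not specific to KA):

B = Array(A)   # copy the device array back to a host Array
all(B .== 2.0)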

For simplicity, we stick with the case of device=CUDADevice().

Synchronization

Danger

All kernel launches are asynchronous; each kernel produces an event token that must be waited upon before reading or writing memory that was passed as an argument to the kernel. See dependencies for a full explanation.

The code around KA may heavily rely on GPUArrays, for example, to initialize variables.

using CUDAKernels # Required to access CUDADevice
function mymul(A::CuArray)
    A .= 1.0
    ev = mul2_kernel(CUDADevice(), 64)(A, ndrange=size(A))
    wait(ev)
    all(A .== 2.0)
end

These statement-level generated kernels, like A .= 1.0, are executed on a different stream than the KA kernels, so mul2_kernel may start before A .= 1.0 has completed. To prevent this, we add a device-wide dependency to the kernel by passing dependencies=Event(CUDADevice()).

ev = mul2_kernel(CUDADevice(), 64)(A, ndrange=size(A), dependencies=Event(CUDADevice()))

This device-wide dependency requires all previously launched kernels on the device to complete before this kernel is launched. In the same vein, multiple events may be passed to a single wait.

using CUDAKernels # Required to access CUDADevice
function mymul(A::CuArray, B::CuArray)
    A .= 1.0
    B .= 3.0
    evA = mul2_kernel(CUDADevice(), 64)(A, ndrange=size(A), dependencies=Event(CUDADevice()))
    evB = mul2_kernel(CUDADevice(), 64)(B, ndrange=size(B), dependencies=Event(CUDADevice()))
    wait(evA, evB)
    all(A .+ B .== 8.0)
end