Compilation & Execution

CUDAnative.@cuda (Macro)
@cuda (gridDim::CuDim, blockDim::CuDim, [shmem::Int], [stream::CuStream]) func(args...)

High-level interface for calling functions on a GPU. It queues a kernel launch on the current context. The gridDim and blockDim arguments specify the launch configuration, the optional shmem parameter specifies how many bytes of dynamic shared memory to allocate (defaulting to 0), and the optional stream parameter indicates the stream on which the launch should be scheduled.

The func argument should be a valid Julia function. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically (see cudaconvert). Finally, a call to cudacall is performed, scheduling the compiled function for execution on the GPU.
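As a rough illustration of the launch flow described above, the following sketch adds two vectors on the GPU. It assumes a CUDA-capable device and the companion CUDAdrv package for the host-side pieces (CuDevice, CuContext, CuArray), none of which are documented in this section; the kernel name kernel_vadd is made up for the example. Only the tuple form of the launch configuration from the signature above is used, with shmem and stream left at their defaults.

```julia
using CUDAdrv, CUDAnative

# Hypothetical kernel: each thread adds one pair of elements.
function kernel_vadd(a, b, c)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing
end

# Assumption: device 0 exists. Creating a context makes it current,
# which is the context @cuda queues the launch on.
dev = CuDevice(0)
ctx = CuContext(dev)

len = 512
a = rand(Float32, len)
b = rand(Float32, len)

# Upload the inputs and allocate the output (CuArray is from CUDAdrv).
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a)

# gridDim = 1 block, blockDim = len threads.
# The CuArray arguments are converted for the device before the launch.
@cuda (1, len) kernel_vadd(d_a, d_b, d_c)

# Copy the result back to the host.
c = Array(d_c)
```

The kernel is compiled on the first launch, so a second call with the same argument types should reuse the compiled function.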

cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

For CuArray objects, a corresponding CuDeviceArray object in global space is returned, which implements GPU-compatible array functionality.
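The conversion hook described above follows an ordinary multiple-dispatch pattern: a generic fallback that returns its argument unchanged, plus a specialized method for the host-side array type. The sketch below mimics that pattern in plain Julia without a GPU. HostBuffer, DeviceView, and toy_convert are hypothetical stand-ins invented for this example; they are not CUDAnative types or functions, and the real cudaconvert may differ in detail.

```julia
# Hypothetical stand-ins; these are NOT CUDAnative or CUDAdrv types.
struct HostBuffer            # plays the role of CuArray
    data::Vector{Float32}
end

struct DeviceView            # plays the role of CuDeviceArray
    ptr::Ptr{Float32}
    len::Int
end

# Fallback: by default the argument is returned as-is.
toy_convert(x) = x

# Specialization: the host-side wrapper becomes a lightweight view
# holding only a pointer and a length.
toy_convert(buf::HostBuffer) = DeviceView(pointer(buf.data), length(buf.data))

# Apply the hook to every argument of a would-be kernel call.
buf = HostBuffer(Float32[1, 2, 3])
args = (42, 1.5f0, buf)
converted = map(toy_convert, args)
```

Here the integer and the float pass through untouched, while the buffer is replaced by its device-style view.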


Return the nearest number of threads that is a multiple of the warp size of a device:

nearest_warpsize(dev::CuDevice, threads::Integer)

This is a common requirement, e.g. when using shuffle intrinsics.
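The arithmetic behind such a helper can be sketched in plain Julia. The function below is a hypothetical illustration, not the CUDAnative implementation: it assumes that "nearest" means rounding up, so that no requested thread is dropped, and it takes the warp size as a plain integer (defaulting to 32, an assumption) instead of querying a CuDevice.

```julia
# Hypothetical helper, not part of CUDAnative.
# Rounds `threads` up to the next multiple of the warp size `ws`.
function round_up_to_warp(threads::Integer, ws::Integer = 32)
    # (ws - threads % ws) is the gap to the next multiple; the outer % ws
    # turns a gap of `ws` into 0 when `threads` is already a multiple.
    return threads + (ws - threads % ws) % ws
end

round_up_to_warp(1)     # 32
round_up_to_warp(32)    # 32
round_up_to_warp(33)    # 64
round_up_to_warp(100)   # 128
```

Launching with a thread count that is a whole number of warps means every warp is fully populated, which is what warp-level operations such as shuffles generally expect.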
