Compilation & Execution

CUDAnative.@cuda (Macro)
@cuda [kwargs...] func(args...)

High-level interface for calling functions on a GPU; it queues a kernel launch on the current context. The @cuda macro should prefix a kernel invocation, with any of the following keyword arguments in the kwargs position:

Affecting the kernel launch:

  • threads (defaults to 1)
  • blocks (defaults to 1)
  • shmem (defaults to 0)
  • stream (defaults to the default stream)

Affecting the kernel compilation:

  • minthreads: the required number of threads in a thread block.
  • maxthreads: the maximum number of threads in a thread block.
  • blockspersm: a minimum number of thread blocks to be scheduled on a single multiprocessor.
  • maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
  • alias: an identifier that will be used for naming the kernel in generated code (useful for profiling, debugging, ...)

Note that, contrary to CUDA C, you can invoke the same kernel multiple times with different compilation parameters; new code will be generated automatically.
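
As a brief sketch (the kernel increment! and the device array d_x below are hypothetical), the two launches below differ only in a compilation parameter, so each may trigger its own code generation:

    # Hypothetical kernel and device array; only the compilation parameter changes.
    @cuda threads=256 maxregs=32 increment!(d_x)   # compiled with a 32-register cap
    @cuda threads=256 maxregs=64 increment!(d_x)   # recompiled with a different cap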

The func argument should be a valid Julia function. Its return value is ignored by means of a wrapper. The function will be compiled to a CUDA function upon first use, and, to a certain extent, its arguments will be converted and managed automatically (see cudaconvert). Finally, a call to cudacall is performed, scheduling the compiled function for execution on the GPU.
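
A minimal end-to-end sketch (assuming the CuArrays.jl package provides the CuArray type; the exact array package may differ between versions):

    using CUDAnative, CuArrays

    # Kernel performing element-wise addition; its return value is ignored by @cuda's wrapper.
    function vadd(a, b, c)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(c)
            c[i] = a[i] + b[i]
        end
        return
    end

    a = CuArray(rand(Float32, 1024))   # cudaconvert turns these CuArrays into
    b = CuArray(rand(Float32, 1024))   # CuDeviceArrays before the kernel runs
    c = similar(a)

    @cuda threads=256 blocks=4 vadd(a, b, c)   # compiled on first use, then launched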

cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

For CuArray objects, a corresponding CuDeviceArray object in global space is returned, which implements GPU-compatible array functionality.
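
Since cudaconvert is applied to every kernel argument, it can be extended to handle custom argument types as well. A small sketch (the Wrapped type below is hypothetical):

    using CUDAnative

    # Hypothetical wrapper around some payload (e.g. a CuArray).
    struct Wrapped{T}
        inner::T
    end

    # Convert the payload instead of the wrapper, so the kernel receives a
    # GPU-friendly value (a CuDeviceArray when the payload is a CuArray).
    CUDAnative.cudaconvert(w::Wrapped) = CUDAnative.cudaconvert(w.inner)

With such a method in place, passing Wrapped(CuArray(...)) to @cuda hands the kernel the converted inner value.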


nearest_warpsize(dev::CuDevice, threads::Integer)

Return the nearest number of threads that is a multiple of the warp size of a device.

This is a common requirement, e.g. when using shuffle intrinsics.
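
A short usage sketch when picking a launch configuration (the problem size and kernel are hypothetical, and the function may need to be qualified as CUDAnative.nearest_warpsize):

    using CUDAdrv, CUDAnative

    dev = CuDevice(0)
    n = 1000                                      # hypothetical problem size
    threads = nearest_warpsize(dev, min(n, 256))  # round to a multiple of the warp size
    blocks = cld(n, threads)                      # enough blocks to cover all n elements
    # @cuda threads=threads blocks=blocks my_kernel(args...)   # hypothetical launch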
